
Ignacio Rojas
Gonzalo Joya
Andreu Català (Eds.)

LNCS 12861

Advances in Computational Intelligence

16th International Work-Conference
on Artificial Neural Networks, IWANN 2021
Virtual Event, June 16–18, 2021
Proceedings, Part I

Lecture Notes in Computer Science 12861

Founding Editors
Gerhard Goos
Karlsruhe Institute of Technology, Karlsruhe, Germany
Juris Hartmanis
Cornell University, Ithaca, NY, USA

Editorial Board Members


Elisa Bertino
Purdue University, West Lafayette, IN, USA
Wen Gao
Peking University, Beijing, China
Bernhard Steffen
TU Dortmund University, Dortmund, Germany
Gerhard Woeginger
RWTH Aachen, Aachen, Germany
Moti Yung
Columbia University, New York, NY, USA
More information about this subseries at http://www.springer.com/series/7407
Ignacio Rojas · Gonzalo Joya · Andreu Català (Eds.)

Advances in Computational Intelligence

16th International Work-Conference
on Artificial Neural Networks, IWANN 2021
Virtual Event, June 16–18, 2021
Proceedings, Part I

Editors

Ignacio Rojas
University of Granada
Granada, Spain

Gonzalo Joya
University of Málaga
Málaga, Spain

Andreu Català
Technical University of Catalonia
Barcelona, Spain

ISSN 0302-9743 ISSN 1611-3349 (electronic)


Lecture Notes in Computer Science
ISBN 978-3-030-85029-6 ISBN 978-3-030-85030-2 (eBook)
https://doi.org/10.1007/978-3-030-85030-2
LNCS Sublibrary: SL1 – Theoretical Computer Science and General Issues

© Springer Nature Switzerland AG 2021, corrected publication 2021


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now
known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are
believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors
give a warranty, expressed or implied, with respect to the material contained herein or for any errors or
omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface

We are proud to present the set of final accepted papers for the 16th edition of the
IWANN conference - the International Work-Conference on Artificial Neural
Networks - held online during June 16–18, 2021. Unfortunately, the 2021 edition of the
conference had to be carried out remotely due to the consequences of the COVID-19
pandemic, but interactive digital platforms were used to preserve the participatory
climate of previous editions.
IWANN is a biennial conference that seeks to provide a discussion forum for
scientists, engineers, educators, and students about the latest ideas and realizations in
the foundations, theory, models, and applications of hybrid systems inspired by nature
(neural networks, fuzzy logic, and evolutionary systems) as well as in emerging areas
related to these topics. As in previous editions of IWANN, this year’s conference
aimed to create a friendly environment that could lead to the establishment of scientific
collaborations and exchanges among attendees. The proceedings include all the com-
munications presented at the conference. Extended versions of selected papers will also
be published in special issues of relevant journals (such as PeerJ Computer Science and
Neural Processing Letters).
Since the first edition in Granada (LNCS 540, 1991), the conference has evolved
and matured. The list of topics in the successive Call for Papers has also evolved,
resulting in the following list for the present edition:
1. Mathematical and theoretical methods in computational intelligence: mathe-
matics for neural networks; RBF structures; self-organizing networks and methods;
support vector machines and kernel methods; fuzzy logic; and evolutionary and
genetic algorithms.
2. Neurocomputational formulations: single-neuron modeling; perceptual modeling;
system-level neural modeling; spiking neurons; and models of biological learning.
3. Learning and adaptation: adaptive systems; imitation learning; reconfigurable
systems; and supervised, non-supervised, reinforcement, and statistical algorithms.
4. Emulation of cognitive functions: decision-making; multi-agent systems; sensor
mesh; natural language; pattern recognition; perceptual and motor functions (visual,
auditory, tactile, virtual reality, etc.); robotics; and planning motor control.
5. Bio-inspired systems and neuro-engineering: embedded intelligent systems;
evolvable computing; evolving hardware; microelectronics for neural, fuzzy, and
bioinspired systems; neural prostheses; retinomorphic systems; brain-computer
interfaces (BCI) nanosystems; and nanocognitive systems.
6. Advanced topics in computational intelligence: intelligent networks;
knowledge-intensive problem-solving techniques; multi-sensor data fusion using
computational intelligence; search and meta-heuristics; soft computing; neuro-fuzzy
systems; neuro-evolutionary systems; neuro-swarm; and hybridization with novel
computing paradigms.

7. Applications: expert systems; image and signal processing; ambient intelligence;


biomimetic applications; system identification, process control, and manufacturing;
computational biology and bioinformatics; parallel and distributed computing;
human computer interaction, internet modeling, communication and networking;
intelligent systems in education; human-robot interaction; multi-agent systems; time
series analysis and prediction; and data mining and knowledge discovery.
At the end of the submission process, and after a careful peer review and evaluation
process (each submission was reviewed by at least 2, and on average 2.8, Program
Committee members or additional reviewers), 85 papers were accepted for oral pre-
sentation, according to the reviewers’ recommendations.
During IWANN 2021, several special sessions were held. Special sessions are a
very useful tool for complementing the regular program with new and emerging topics
of particular interest for the participating community. Special sessions that emphasize
multi-disciplinary and transversal aspects, as well as cutting-edge topics, are especially
encouraged and welcome, and in this edition of IWANN 2021 they comprised the
following:
– SS01: Agent-based Models for Policy Design Towards a More Sustainable World.
Organized by Amparo Alonso-Betanzos, Bertha Guijarro-Berdiñas, Noelia
Sánchez-Maroño, and Alejandro Rodríguez-Arias
– SS02: Convolutional Neural Networks: Beyond Traditional Solutions.
Organized by Irina Perfilieva, Jan Platos, and Jan Hula
– SS03: Quality Control Charts Based on Imprecise Information.
Organized by Gholamreza Hesamian
– SS04: Neural Networks for Time Series Forecasting.
Organized by Grzegorz Dudek
– SS05: Randomization in Deep Learning.
Organized by Claudio Gallicchio, Massimo Panella, and Ponnuthurai Nagaratnam
Suganthan
– SS06: Intelligent Computing Solutions for SARS-CoV-2 COVID-19 (INClutions
COVID-19).
Organized by Carmen Paz Suárez Araujo and Juan Luis Navarro Mesa
– SS07: Multi-valued Cognitive Intelligence.
Organized by Prem Kumar Singh
– SS08: Meta-learning and Other Automatic Learning Approaches in Intelligent
Systems.
Organized by Rashedur M Rahman, Ahsanur Rahman, Tanzilur Rahman, Shafin
Rahman, Luis Garcia, and Ali Cheraghian
– SS09: New Advances in Artificial Intelligence for Green Computing.
Organized by Antonello Rosato and Massimo Panella
– SS10: Attentive Models and Visual Attention in Computer Vision and AI.
Organized by Lorenzo Baraldi and Marcella Cornia
– SS11: Biosignals Processing.
Organized by Antonio Fernandez-Caballero, Roberto Sánchez-Reolid, and Beatriz
García-Martínez

– SS12: Information Fusion in Deep Learning for Biomedicine.


Organized by Miguel Atencia, Francisco Veredas, and Ruxandra Stoean
In this edition of IWANN, we were honored to have the presence of the following
invited speakers:
1. Pierre Baldi, University of California, Irvine, USA
2. Jeanna Matthews, Division of Mathematics and Computer Science, Clarkson
University, USA
3. Davide Anguita, University of Genova, Italy
It is important to note that, for the sake of consistency and readability of the book,
the presented papers are not organized as they were presented in the IWANN 2021
sessions but classified under 13 chapters. The organization of the papers is in two
volumes arranged basically following the topics list included in the Call for Papers. The
first volume (LNCS 12861), entitled Advances in Computational Intelligence, IWANN
2021, Part I, is divided into seven main parts and includes contributions on
1. Information Fusion in Deep Learning for Biomedicine
2. Intelligent Computing Solutions for SARS-CoV-2 COVID-19 (INClutions
COVID-19)
3. Advanced Topics in Computational Intelligence
4. Biosignals Processing
5. Deep Learning
6. Meta-learning and Other Automatic Learning Approaches in Intelligent Systems
7. Artificial Intelligence and Biomedicine
The second volume (LNCS 12862), entitled Advances in Computational Intelli-
gence, IWANN 2021, Part II, is divided into six main parts and includes contributions
on
1. Convolutional Neural Networks: Beyond Traditional Solutions
2. Bio-inspired Systems and Neuro-Engineering
3. Agent-based Models for Policy Design Towards a More Sustainable World
4. Randomization in Deep Learning
5. Neural Networks for Time Series Forecasting
6. Applications in Artificial Intelligence
The 16th edition of the IWANN conference was organized by the University of
Granada, the University of Malaga, and the Polytechnical University of Catalonia,
Spain.
We would also like to express our gratitude to the members of the different com-
mittees for their support, collaboration, and good work. We especially thank our
Steering Committee (Davide Anguita, Andreu Català, Marie Cottrell, Gonzalo Joya,
Kurosh Madani, Madalina Olteanu, Ignacio Rojas, and Ulrich Rückert), the Technical
Assistant Committee (Miguel Atencia, Francisco García-Lagos, Luis Javier Herrera,
and Fernando Rojas), the Program Committee, the reviewers, invited speakers, and

special session organizers. Finally, we want to thank Springer and especially Ronan
Nugent, Alfred Hofmann, and Anna Kramer for their continuous support and
cooperation.

June 2021

Ignacio Rojas
Gonzalo Joya
Andreu Català

The original version of the book was revised: the affiliations of Gonzalo Joya and
Andreu Català as well as the last name of Andreu Català were not correct. This is now
corrected. The correction to the book is available at
https://doi.org/10.1007/978-3-030-85030-2_51
Organization

Program Committee
Kouzou Abdellah Djelfa University, Algeria
Vanessa Aguiar-Pulido Cornell University, USA
Arnulfo Alanis Garza Instituto Tecnologico de Tijuana, Mexico
Amparo Alonso-Betanzos University of A Coruña, Spain
Jhon Edgar Amaya University of Tachira, Venezuela
Gabriela Andrejkova Pavol Jozef Safarik University, Slovakia
Anastassia Angelopoulou University of Westminster, UK
Davide Anguita University of Genoa, Italy
Javier Antich Tobaruela University of the Balearic Islands, Spain
Miguel Atencia University of Málaga, Spain
Jorge Azorín-López University of Alicante, Spain
Davide Bacciu University of Pisa, Italy
Antonio Bahamonde University of Oviedo at Gijon, Spain
Halima Bahi University of Annaba, Algeria
Juan Pedro Bandera Rubio University of Malaga, Spain
Oresti Banos University of Granada, Spain
Emilia Barakova Eindhoven University of Technology, The Netherlands
Lorenzo Baraldi University of Modena and Reggio Emilia, Italy
Andrzej Bartoszewicz Technical University of Lodz, Poland
Bruno Baruque University of Burgos, Spain
Lluís Belanche Universitat Politècnica de Catalunya, Spain
Sergio Bermejo Universitat Politècnica de Catalunya, Spain
Francisco Bonin-Font University of the Balearic Islands, Spain
Julio Brito University of La Laguna, Spain
Joan Cabestany Universitat Politècnica de Catalunya, Spain
Eldon Glen Caldwell University of Costa Rica
Tomasa Calvo University of Alcala, Spain
Azahara Camacho WATA Factory, Spain
Hoang-Long Cao Vrije Universiteit Brussel, Belgium
Carlos Carrascosa Universidad Politecnica de Valencia, Spain
Francisco Carrillo Pérez University of Granada, Spain
Luis Castedo University of A Coruña, Spain
Pedro Castillo University of Granada, Spain
Daniel Castillo-Secilla University of Granada, Spain
Andreu Catala Universitat Politècnica de Catalunya, Spain
Ana Rosa Cavalli Institut Mines-Telecom/Telecom SudParis, France
Miguel Cazorla University of Alicante, Spain
Pablo C. Cañizares Universidad Complutense de Madrid, Spain

Ali Cheraghian Australian National University, Australia


Zouhair Chiba Hassan II University of Casablanca, Morocco
Maximo Cobos University of Valencia, Spain
Valentina Colla Scuola Superiore S. Anna, Italy
Feijoo Colomine Universidad Nacional Experimental del Táchira,
Venezuela
Pablo Cordero University of Málaga, Spain
Marcella Cornia University of Modena and Reggio Emilia, Italy
Francesco Corona Aalto University, Finland
Marie Cottrell Université Paris 1 Panthéon-Sorbonne, France
Raúl Cruz-Barbosa Universidad Tecnológica de la Mixteca, Mexico
Miguel Damas University of Granada, Spain
Daniela Danciu University of Craiova, Romania
Luiza de Macedo Mourelle State University of Rio de Janeiro, Brazil
Angel Pascual Del Pobil Universidad Jaime I, Spain
Enrique Dominguez University of Malaga, Spain
Grzegorz Dudek Czestochowa University of Technology, Poland
Richard Duro University of A Coruna, Spain
Gregorio Díaz University of Castilla-La Mancha, Spain
Marcos Faundez-Zanuy Escola Superior Politècnica, Tecnocampus, Spain
Francisco Fernandez De Vega University of Extremadura, Spain
Enrique Fernandez-Blanco University of A Coruña, Spain
Carlos Fernandez-Lozano University of A Coruña, Spain
Antonio Fernández-Caballero University of Castilla-La Mancha, Spain
Jose Manuel Ferrandez Politecnic University of Cartagena, Spain
Leonardo Franco University of Málaga, Spain
Claudio Gallicchio University of Pisa, Italy
Luis Garcia University of Brasília, Brazil
Esther Garcia Garaluz Eneso Tecnología de Adaptación SL, Spain
Beatriz Garcia Martinez University of Castilla-La Mancha, Spain
Emilio Garcia-Fidalgo University of the Balearic Islands, Spain
Francisco Garcia-Lagos University of Malaga, Spain
Jose Garcia-Rodriguez University of Alicante, Spain
Patricio García Báez University of La Laguna, Spain
Rodolfo García-Bermúdez Universidad Técnica de Manabí, Ecuador
Patrick Garda Sorbonne Université, France
Peter Gloesekoetter Münster University of Applied Sciences, Germany
Juan Gomez Romero University of Granada, Spain
Pedro González Calero Politecnic University of Madrid, Spain
Juan Gorriz University of Granada, Spain
Karl Goser Technical University Dortmund, Germany
M. Grana Romay UPV/EHU, Spain
Jose Guerrero Universitat de les Illes Balears, Spain
Bertha Guijarro-Berdiñas University of A Coruña, Spain

Alberto Guillen University of Granada, Spain


Pedro Antonio Gutierrez University of Cordoba, Spain
Luis Herrera University of Granada, Spain
Cesar Hervas University of Cordoba, Spain
Wei-Chiang Hong Jiangsu Normal University, China
M. Dolores Jimenez-Lopez Rovira i Virgili University, Spain
Gonzalo Joya University of Málaga, Spain
Vicente Julian Universitat Politècnica de València, Spain
Nuno Lau University of Aveiro, Portugal
Otoniel Lopez Granado Miguel Hernandez University, Spain
Rafael Marcos Luque Baena University of Málaga, Spain
Fernando López Pelayo University of Castilla-La Mancha, Spain
Ezequiel López-Rubio University of Málaga, Spain
Kurosh Madani LISSI/Université PARIS-EST Creteil, France
Mario Martin Universitat Politècnica de Catalunya, Spain
Bonifacio Martin Del Brio University of Zaragoza, Spain
Jesús Medina University of Cádiz, Spain
J. J. Merelo University of Granada, Spain
Jose M. Molina Universidad Carlos III de Madrid, Spain
Miguel A. Molina-Cabello University of Málaga, Spain
Angel Mora Bonilla University of Málaga, Spain
Juan Carlos Morales Vega University of Granada, Spain
Gines Moreno University of Castilla-La Mancha, Spain
Juan Moreno Garcia University of Castilla-La Mancha, Spain
Juan L. Navarro-Mesa University of Las Palmas de Gran Canaria, Spain
Nadia Nedjah State University of Rio de Janeiro, Brazil
Alberto Núñez Universidad Complutense de Madrid, Spain
Julio Ortega University of Granada, Spain
Alberto Ortiz Universitat de les Illes Balears, Spain
Osvaldo Pacheco University of Aveiro, Portugal
Esteban José Palomo University of Malaga, Spain
Massimo Panella University of Rome “La Sapienza”, Italy
Irina Perfilieva University of Ostrava, Czech Republic
Hector Pomares University of Granada, Spain
Alberto Prieto University of Granada, Spain
Alexandra Psarrou University of Westminster, UK
Pablo Rabanal Universidad Complutense de Madrid, Spain
Md. Ahsanur Rahman North South University, Bangladesh
Shafin Rahman North South University, Bangladesh
Tanzilur Rahman North South University, Bangladesh
Sivarama Krishnan Rajaraman National Library of Medicine, USA
Mohammad Rashedur Rahman North South University, Bangladesh
Ismael Rodriguez Universidad Complutense de Madrid, Spain
Alejandro Rodríguez Arias Universidade da Coruña

Fernando Rojas University of Granada, Spain


Ignacio Rojas University of Granada, Spain
Ricardo Ron-Angevin University of Málaga, Spain
Antonello Rosato “Sapienza” University of Rome, Italy
Fabrice Rossi SAMM - Université Paris 1, France
Peter M. Roth Graz University of Technology, Austria
Fernando Rubio Universidad Complutense de Madrid, Spain
Ulrich Rueckert Bielefeld University, Germany
Addisson Salazar Universitat Politècnica de València, Spain
Roberto Sanchez Reolid Universidad de Castilla-La Mancha, Spain
Noelia Sanchez-Maroño University of A Coruña, Spain
Jorge Santos ISEP, Portugal
Jose Santos University of A Coruña, Spain
Jose A. Seoane Vall d’Hebron Institute of Oncology, Spain
Prem Singh GITAM University-Visakhapatnam, India
Jordi Solé-Casals University of Vic - Central University of Catalonia,
Spain
Ruxandra Stoean University of Craiova, Romania
Carmen Paz Suárez-Araujo University of Las Palmas de Gran Canaria, Spain
Claude Touzet Aix-Marseille University, France
Daniel Urda University of Burgos, Spain
Oscar Valero University of Islas Baleares, Spain
Francisco Velasco-Alvarez University of Málaga, Spain
Marley Vellasco Pontifical Catholic University of Rio de Janeiro
(PUC-Rio), Brazil
Alfredo Vellido Universitat Politècnica de Catalunya, Spain
Francisco J. Veredas University of Málaga, Spain
Michel Verleysen Universite catholique de Louvain, Belgium
Ivan Volosyak Rhine-Waal University of Applied Sciences, Germany
Mauricio Zamora Universidad de Costa Rica
Contents – Part I

Information Fusion in Deep Learning for Biomedicine

Deep Learning for the Detection of Frames of Interest in Fetal Heart


Assessment from First Trimester Ultrasound . . . . . . . . . . . . . . . . . . . . . . . . 3
Ruxandra Stoean, Dominic Iliescu, Catalin Stoean, Vlad Ilie,
Ciprian Patru, Mircea Hotoleanu, Rodica Nagy, Dan Ruican,
Rares Trocan, Andreea Marcu, Miguel Atencia, and Gonzalo Joya

Deep Learning Based Neural Network for Six-Class-Classification


of Alzheimer’s Disease Stages Based on MRI Images . . . . . . . . . . . . . . . . . 15
Tim Rörup, I. Rojas, H. Pomares, and P. Glösekötter

Detection of Tumor Morphology Mentions in Clinical Reports in Spanish


Using Transformers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Guillermo López-García, José M. Jerez, Nuria Ribelles, Emilio Alba,
and Francisco J. Veredas

Enforcing Morphological Information in Fully Convolutional Networks to


Improve Cell Instance Segmentation in Fluorescence Microscopy Images. . . . 36
Willard Zamora-Cárdenas, Mauro Mendez, Saul Calderon-Ramirez,
Martin Vargas, Gerardo Monge, Steve Quiros, David Elizondo,
Jordina Torrents-Barrena, and Miguel A. Molina-Cabello

Intelligent Computing Solutions for SARS-CoV-2 Covid-19


(INClutions COVID-19)

A Bayesian Classifier Combination Methodology for Early Detection


of Endotracheal Obstruction of COVID-19 Patients in ICU . . . . . . . . . . . . . 49
Francisco J. Suárez-Díaz, Juan L. Navarro-Mesa,
Antonio G. Ravelo-García, Pablo Fernández-López,
Carmen Paz Suárez-Araujo, Guillermo Pérez-Acosta,
and Luciano Santana-Cabrera

Toward an Intelligent Computing Solution for Endotracheal Obstruction


Prediction in COVID-19 Patients in ICU . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Pablo Fernández-López, Carmen Paz Suárez-Araujo,
Patricio García-Báez, Francisco Suárez-Díaz, Juan L. Navarro-Mesa,
Guillermo Pérez-Acosta, and José Blanco-López

Advanced Topics in Computational Intelligence

Features Spaces with Reduced Variables Based on Nearest Neighbor


Relations and Their Inheritances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Naohiro Ishii, Kazunori Iwata, Naoto Mukai, Kazuya Odagiri,
and Tokuro Matsuo

High-Dimensional Data Clustering with Fuzzy C-Means: Problem, Reason,


and Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Yinghua Shen, Hanyu E, Tianhua Chen, Zhi Xiao, Bingsheng Liu,
and Yuan Chen

Contrastive Explanations for Explaining Model Adaptations . . . . . . . . . . . . . 101


André Artelt, Fabian Hinder, Valerie Vaquet, Robert Feldhans,
and Barbara Hammer

Dimensionality Reduction: Is Feature Selection More Effective Than


Random Selection? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Laura Morán-Fernández and Verónica Bolón-Canedo

Classification in Non-stationary Environments Using Coresets


over Sliding Windows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
Moritz Heusinger and Frank-Michael Schleif

Deep Reinforcement Learning in VizDoom via DQN


and Actor-Critic Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
Maria Bakhanova and Ilya Makarov

Adaptive Ant Colony Optimization for Service Function Chaining


in a Dynamic 5G Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Segundo Moreno and Antonio M. Mora

On the Use of Fuzzy Metrics for Robust Model Estimation:


A RANSAC-Based Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
Alberto Ortiz, Esaú Ortiz, Juan José Miñana, and Óscar Valero

A New Detector Based on Alpha Integration Decision Fusion . . . . . . . . . . . 178


Addisson Salazar, Gonzalo Safont, Nancy Vargas, and Luis Vergara

A Safe and Effective Tuning Technique for Similarity-Based Fuzzy


Logic Programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
Ginés Moreno and José A. Riaza

Predictive Ability of Response Surface Methodology (RSM)


and Artificial Neural Network (ANN) to Approximate Biogas Yield
in a Modular Biodigester . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
Modestus O. Okwu, Lagouge K. Tartibu, Olusegun D. Samuel,
Henry O. Omoregbee, and Anna E. Ivbanikaro

Biosignals Processing

Analysis of Electroencephalographic Signals from a Brain-Computer


Interface for Emotions Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
Beatriz García-Martínez, Antonio Fernández-Caballero,
Arturo Martínez-Rodrigo, and Paulo Novais

A Fine Dry-Electrode Selection to Characterize Event-Related Potentials


in the Context of BCI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
Vinicio Changoluisa, Pablo Varona, and Francisco B. Rodriguez

Detection of Emotions from Electroencephalographic Recordings by Means


of a Nonlinear Functional Connectivity Measure . . . . . . . . . . . . . . . . . . . . . 242
Beatriz García-Martínez, Antonio Fernández-Caballero, Raúl Alcaraz,
and Arturo Martínez-Rodrigo

P300 Characterization Through Granger Causal Connectivity in the Context


of Brain-Computer Interface Technologies . . . . . . . . . . . . . . . . . . . . . . . . . 253
Vanessa Salazar, Vinicio Changoluisa, and Francisco B. Rodriguez

Feature and Time Series Extraction in Artificial Neural Networks


for Arousal Detection from Electrodermal Activity . . . . . . . . . . . . . . . . . . . 265
Roberto Sánchez-Reolid, Francisco López de la Rosa,
Daniel Sánchez-Reolid, María T. López,
and Antonio Fernández-Caballero

Deep Learning

Context-Aware Graph Convolutional Autoencoder . . . . . . . . . . . . . . . . . . . 279


Asma Sattar and Davide Bacciu

Development and Implementation of a Neural Network-Based Abnormal


State Prediction System for a Piston Pump . . . . . . . . . . . . . . . . . . . . . . . . . 291
Mauricio Andrés Gómez Zuluaga, Ahmad Ordikhani, Christoph Bauer,
and Peter Glösekötter

Iterative Adaptation to Quantization Noise . . . . . . . . . . . . . . . . . . . . . . . . . 303


Dmitry Chudakov, Sergey Alyamkin, Alexander Goncharenko,
and Andrey Denisov

A BERT Based Approach for Arabic POS Tagging. . . . . . . . . . . . . . . . . . . 311


Rakia Saidi, Fethi Jarray, and Mahmud Mansour

Facial Expression Interpretation in ASD Using Deep Learning . . . . . . . . . . . 322


Pablo Salgado, Oresti Banos, and Claudia Villalonga

Voxel-Based Three-Dimensional Neural Style Transfer . . . . . . . . . . . . . . . . 334


Timo Friedrich, Barbara Hammer, and Stefan Menzel

Rendering Scenes for Simulating Adverse Weather Conditions . . . . . . . . . . . 347


Prithwish Sen, Anindita Das, and Nilkanta Sahu

Automatic Fall Detection Using Long Short-Term Memory Network . . . . . . . 359


Carlos Magalhães, João Ribeiro, Argentina Leite, E. J. Solteiro Pires,
and João Pavão

Deep Convolutional Neural Networks with Residual Blocks for Wafer Map
Defect Pattern Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
Zemenu Endalamaw Amogne, Fu-Kwun Wang, and Jia-Hong Chou

Enhanced Convolutional Neural Network for Age Estimation . . . . . . . . . . . . 385


Idowu Aruleba and Serestina Viriri

Deep Interpretation with Sign Separated and Contribution


Recognized Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
Lucas Y. W. Hui and De Wen Soh

Deep Learning for Age Estimation Using EfficientNet . . . . . . . . . . . . . . . . . 407


Idowu Aruleba and Serestina Viriri

Towards a Deep Reinforcement Approach for Crowd Flow Management . . . . 420


Wejden Abdallah, Dalel Kanzari, and Kurosh Madani

Classification of Images as Photographs or Paintings by Using


Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 432
José Miguel López-Rubio, Miguel A. Molina-Cabello,
Gonzalo Ramos-Jiménez, and Ezequiel López-Rubio

Parallel Corpora Preparation for English-Amharic Machine Translation . . . . . 443


Yohanens Biadgligne and Kamel Smaïli

Fast Depth Reconstruction Using Deep Convolutional Neural Networks. . . . . 456


Dmitrii Maslov and Ilya Makarov

Meta-Learning and Other Automatic Learning Approaches


in Intelligent Systems

A Study of the Correlation of Metafeatures Used for Metalearning . . . . . . . . 471


Adriano Rivolli, Luís P. F. Garcia, Ana C. Lorena,
and André C. P. L. F. de Carvalho

Learning Without Forgetting for 3D Point Cloud Objects . . . . . . . . . . . . . . . 484


Townim Chowdhury, Mahira Jalisha, Ali Cheraghian,
and Shafin Rahman

Patch-Wise Semantic Segmentation of Sedimentation


from High-Resolution Satellite Images Using Deep Learning . . . . . . . . . . . . 498
Tahmid Hasan Pranto, Abdulla All Noman, Asaduzzaman Noor,
Ummeh Habiba Deepty, and Rashedur M. Rahman
Learning Image Segmentation from Few Annotations:
A REPTILE Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510
Héctor F. Satizábal and Andres Perez-Uribe

Artificial Intelligence and Biomedicine

Impacted Tooth Detection in Panoramic Radiographs . . . . . . . . . . . . . . . . . 525


James Faure and Andries Engelbrecht
Deep Learning for Diabetic Retinopathy Prediction . . . . . . . . . . . . . . . . . . . 537
Ciro Rodriguez-Leon, William Arevalo, Oresti Banos,
and Claudia Villalonga
Facial Image Augmentation from Sparse Line Features Using Small
Training Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 547
Shih-Kai Hung and John Q. Gan
Ensemble Models for Covid Prediction in X-Ray Images . . . . . . . . . . . . . . . 559
Juan Carlos Morales Vega, Francisco Carrillo-Perez,
Jesús Toledano Pavón, Luis Javier Herrera Maldonado,
and Ignacio Rojas Ruiz
Validation of a Nonintrusive Wearable Device for Distress Estimation
During Robotic Roller Assisted Gait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 570
Marta Díaz-Boladeras, Xavier Llanas, Marta Musté, Elsa Pérez,
Carlos Pérez, Alex Barco, and Andreu Català
Deep Learning for Heart Sounds Classification Using Scalograms
and Automatic Segmentation of PCG Signals . . . . . . . . . . . . . . . . . . . . . . . 583
John Gelpud, Silvia Castillo, Mario Jojoa, Begonya Garcia-Zapirain,
Wilson Achicanoy, and David Rodrigo
Skin Disease Classification Using Machine Learning Techniques . . . . . . . . . 597
Mohammad Ashraful Haque Abir, Golam Kibria Anik,
Shazid Hasan Riam, Mohammed Ariful Karim, Azizul Hakim Tareq,
and Rashedur M. Rahman
Construction of Suitable DNN-HMM for Classification Between Normal
and Abnormal Respiration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 609
Masaru Yamashita
Correction to: Advances in Computational Intelligence . . . . . . . . . . . . . . . . C1
Ignacio Rojas, Gonzalo Joya, and Andreu Català

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 621


Contents – Part II

Convolutional Neural Networks: Beyond Traditional Solutions

Error-Correcting Output Codes in the Framework of Deep


Ordinal Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Javier Barbero-Gómez, Pedro Antonio Gutiérrez,
and César Hervás-Martínez

Features as Keypoints and How Fuzzy Transforms Retrieve Them . . . . . . . . 14


Irina Perfilieva and David Adamczyk

Instagram Hashtag Prediction Using Deep Neural Networks . . . . . . . . . . . . . 28


Anna Beketova and Ilya Makarov

Bio-inspired Systems and Neuro-Engineering

Temporal EigenPAC for Dyslexia Diagnosis. . . . . . . . . . . . . . . . . . . . . . . . 45


Nicolás J. Gallego-Molina, Marco Formoso, Andrés Ortiz,
Francisco J. Martínez-Murcia, and Juan L. Luque

Autonomous Driving of a Rover-Like Robot Using


Neuromorphic Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Enrique Piñero-Fuentes, Salvador Canas-Moreno,
Antonio Rios-Navarro, Tobi Delbruck, and Alejandro Linares-Barranco

Effects of Training on BCI Accuracy in SSMVEP-based BCI. . . . . . . . . . . . 69


Piotr Stawicki, Aya Rezeika, and Ivan Volosyak

Effect of Electrical Synapses in the Cycle-by-Cycle Period and Burst


Duration of Central Pattern Generators. . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Blanca Berbel, Alicia Garrido-Peña, Irene Elices, Roberto Latorre,
and Pablo Varona

Operation of Neuronal Membrane Simulator Circuit for Tests


with Memristor Based on Graphene and Graphene Oxide. . . . . . . . . . . . . . . 93
Marina Sparvoli, Jonas S. Marma, Gabriel F. Nunes,
and Fábio O. Jorge

Agent-Based Models for Policy Design Towards a More


Sustainable World

Informing Agent-Based Models of Social Innovation Uptake . . . . . . . . . . . . 105


Patrycja Antosz, Wander Jager, Gary Polhill, and Douglas Salt

Sensitivity Analysis of an Empirical Agent-Based Model of District


Heating Network Adoption. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
Gary Polhill, Doug Salt, Tony Craig, Ruth Wilson, and Kathryn Colley

Generating a Synthetic Population of Agents Through Decision Trees


and Socio Demographic Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
Amparo Alonso-Betanzos, Bertha Guijarro-Berdiñas,
Alejandro Rodríguez-Arias, and Noelia Sánchez-Maroño

Randomization in Deep Learning

Improved Acoustic Modeling for Automatic Piano Music Transcription


Using Echo State Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
Peter Steiner, Azarakhsh Jalalvand, and Peter Birkholz

On Effects of Compression with Hyperdimensional Computing


in Distributed Randomized Neural Networks . . . . . . . . . . . . . . . . . . . . . . . 155
Antonello Rosato, Massimo Panella, Evgeny Osipov, and Denis Kleyko

Benchmarking Reservoir and Recurrent Neural Networks for Human State


and Activity Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
Davide Bacciu, Daniele Di Sarli, Claudio Gallicchio, Alessio Micheli,
and Niccolò Puccinelli

Neural Networks for Time Series Forecasting

Learning to Trade from Zero-Knowledge Using Particle


Swarm Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
Stefan van Deventer and Andries Engelbrecht

Randomized Neural Networks for Forecasting Time Series


with Multiple Seasonality. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
Grzegorz Dudek

Prediction of Air Pollution Using LSTM . . . . . . . . . . . . . . . . . . . . . . . . . . 208


Stanislaw Osowski

Applications in Artificial Intelligence

Detection of Alzheimer’s Disease Versus Mild Cognitive Impairment Using


a New Modular Hybrid Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . 223
Alberto Sosa-Marrero, Ylermi Cabrera-León,
Pablo Fernández-López, Patricio García-Báez,
Juan Luis Navarro-Mesa, Carmen Paz Suárez-Araujo,
and for the Alzheimer’s Disease Neuroimaging Initiative

Fine-Tuning of Patterns Assignment to Subnetworks Increases the Capacity


of an Attractor Network Ensemble . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
Mario González, Ángel Sánchez, David Dominguez,
and Francisco B. Rodríguez

A Combined Approach for Enhancing the Stability of the Variable


Selection Stage in Binary Classification Tasks . . . . . . . . . . . . . . . . . . . . . . 248
Silvia Cateni, Valentina Colla, and Marco Vannucci

A Convolutional Neural Network as a Proxy for the XRF Approximation


of the Chemical Composition of Archaeological Artefacts in the Presence
of Inter-microscope Variability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
Catalin Stoean, Leonard Ionescu, Ruxandra Stoean, Marinela Boicea,
Miguel Atencia, and Gonzalo Joya

Implementation of Data Stream Classification Neural Network Models


Over Big Data Platforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
Fernando Puentes-Marchal, María Dolores Pérez-Godoy,
Pedro González, and María José Del Jesus

Performance Evaluation of Classical Classifiers and Deep Learning


Approaches for Polymers Classification Based on Hyperspectral Images . . . . 281
Javier Lorenzo-Navarro, Silvia Serranti, Giuseppe Bonifazi,
and Giuseppe Capobianco

Hotel Recognition via Latent Image Embeddings . . . . . . . . . . . . . . . . . . . . 293


Boris Tseytlin and Ilya Makarov

Time Series Prediction with Autoencoding LSTM Networks. . . . . . . . . . . . . 306


Federico Succetti, Andrea Ceschini, Francesco Di Luzio,
Antonello Rosato, and Massimo Panella

Improving Indoor Semantic Segmentation with


Boundary-Level Objectives. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
Roberto Amoroso, Lorenzo Baraldi, and Rita Cucchiara

EvoMLP: A Framework for Evolving Multilayer Perceptrons . . . . . . . . . . . . 330


Luis Liñán-Villafranca, Mario García-Valdez, J. J. Merelo,
and Pedro Castillo-Valdivieso

Regularized One-Layer Neural Networks for Distributed


and Incremental Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
Oscar Fontenla-Romero, Bertha Guijarro-Berdiñas,
and Beatriz Pérez-Sánchez

Frailty Level Prediction in Older Age Using Hand Grip Strength Functions
Over Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
Elsa Pérez, Jose E. Torres Rangel, Marta Musté, Carlos Pérez,
Oscar Macho, Francisco S. del Corral Guijarro, Aris Somoano,
Cristina Gianella, Luis Ramírez, and Andreu Català

Accuracy and Intrusiveness in Data-Driven Violin Players Skill Levels


Prediction: MOCAP Against MYO Against KINECT . . . . . . . . . . . . . . . . . 367
Vincenzo D’Amato, Erica Volta, Luca Oneto, Gualtiero Volpe,
Antonio Camurri, and Davide Anguita

Features Selection for Fall Detection Systems Based on Machine Learning


and Accelerometer Signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 380
Carlos A. Silva, Rodolfo García-Bermúdez, and Eduardo Casilari

Autonomous Docking of Mobile Robots by Reinforcement Learning


Tackling the Sparse Reward Problem. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
A. M. Burgueño-Romero, J. R. Ruiz-Sarmiento, and J. Gonzalez-Jimenez

Decision Support Systems for Air Traffic Control with Self-enforcing


Networks Based on Weather Forecast and Reference Types
for the Direction of Operation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404
Dirk Zinkhan, Sven Eiermann, Christina Klüver, and Jürgen Klüver

Impact of Minority Class Variability on Anomaly Detection by Means


of Random Forests and Support Vector Machines . . . . . . . . . . . . . . . . . . . . 416
Faisal Saleem Alraddadi, Luis F. Lago-Fernández,
and Francisco B. Rodríguez

Analyzing the Land Cover Change and Degradation


in Sundarbans Mangrove Forest Using Machine Learning
and Remote Sensing Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429
Ashikur Rahman Khan, Anika Khan, Shehzin Masud,
and Rashedur M. Rahman

Correction to: Advances in Computational Intelligence . . . . . . . . . . . . . . . . C1


Ignacio Rojas, Gonzalo Joya, and Andreu Català

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439


Information Fusion in Deep Learning
for Biomedicine
Deep Learning for the Detection
of Frames of Interest in Fetal Heart
Assessment from First Trimester
Ultrasound

Ruxandra Stoean 1,2 (B), Dominic Iliescu 3, Catalin Stoean 1,2, Vlad Ilie 1,
Ciprian Patru 3, Mircea Hotoleanu 1, Rodica Nagy 3, Dan Ruican 3,
Rares Trocan 1, Andreea Marcu 3, Miguel Atencia 4, and Gonzalo Joya 4

1 Romanian Institute of Science and Technology, Cluj, Romania
  [email protected]
2 Faculty of Sciences, University of Craiova, Craiova, Romania
  {ruxandra.stoean,catalin.stoean}@inf.ucv.ro
3 University of Medicine and Pharmacy of Craiova, Craiova, Romania
4 Universidad de Málaga, Málaga, Spain
  {matencia,gjoya}@uma.es

Abstract. The current paper challenges convolutional neural networks


to address the computationally undebated task of recognizing four key
views in first trimester fetal heart scanning (the aorta, the arches, the
atrioventricular flows and the crossing of the great vessels). This is the
primary inspection of the heart of a future baby and an early recogni-
tion of possible problems is important for timely intervention. Frames
depicting the views of interest were labeled by obstetricians and given to
several deep learning architectures as a classification task against other
irrelevant scan sights. A test accuracy of 95% with an F1-score ranging
from 90.91% to 99.58% for the four key perspectives shows the potential
in supporting heart scans even from such an early fetal age, when the
heart is still quite underdeveloped.

Keywords: Deep learning · Classification · Fetal heart · Ultrasound ·


First trimester pregnancy

1 Introduction
Medicine is one of the main and most important stages where deep learning
(DL) has been playing a successful role during the last couple of years [6,16,20].
Its models have targeted the recognition of a wide range of diseases on the base
of various image types, from histopathological slides [1,14,18,24] to MRI scans
[17,22,29] and CT [4,15], to name several recent entries in the medical areas
mostly used for computer vision.
A further challenging scope has become to use DL models for a more complex
task, i.e. to analyze medical data coming from video scans [28]. A good choice
© Springer Nature Switzerland AG 2021
I. Rojas et al. (Eds.): IWANN 2021, LNCS 12861, pp. 3–14, 2021.
https://doi.org/10.1007/978-3-030-85030-2_1

for such a problem is fetal ultrasound (US), where there is a moving patient-in-
patient. An even more stimulating scenario is to additionally examine an organ
in motion, e.g. the heart.
Medically speaking, fetal echocardiography (FECG) is one of the crucial US
assessments during pregnancy towards an early detection of possible congeni-
tal heart disease (CHD) [13]. The disease accounts for 50% of childhood deaths
attributed to congenital malformations [19,21]. The US assessment can allow
immediate intervention after birth, or even intrauterine treatment, and an asso-
ciated decreased cost of care. While screening for CHD started with the high-risk
groups (genetic, environmental factors), it is now generally advocated, due to
many cases discovered in low-risk pregnancies [3]. For the first trimester, how-
ever, the guidelines do not impose the obligation of screening for CHD as the
heart structures are underdeveloped. Nevertheless, an early detection of the con-
dition is desirable to start with the first trimester [9,10], when the pregnancy
can be terminated in case of severe abnormalities, or that can be correlated with
the findings of the subsequent trimester [11].
In a FECG, the physicians look for the absence, distortion or improper posi-
tioning of different heart components per se or in relation to one another, as
well as for an incorrect blood flow in the heart. The difficulty in the analysis
of the fetal heart generally stems from several acknowledged facets [2,7]: the
low expertise in fetal echocardiography of early stage sonographers, the absence
of a standardized protocol, the susceptibility of US imaging to noise given by
the presence of shadows, the ambiguity of the structures due to the small heart
size, the variability of the disposition of the heart and its components due to the
movement of the probe, the activity of the fetus and the fetus-probe orientation,
as well as the existence of different viewing planes (apical, lateral).
In this context, DL models can serve as a virtual assistant and indicate
to the clinician the views of interest with the presence or the absence of cer-
tain characteristics of importance in the CHD diagnosis. Unlike the existing
literature entries, described in more detail in Sect. 2, all of which deal with
computational heart assessment from second trimester scans, the current paper
is the first to consider a DL architecture to recognize the elementary
structures and behavior present in a first trimester FECG, especially
given the absence of compulsory screening due to the low development of
the heart. Studies on the heart anatomy in the first trimester have highlighted
several distinct elements that should be visible in the color Doppler
US scans; the lack of any of them already implies a possible anomaly that
needs to be further investigated by the sonographer. To our knowledge, this is
the first study concerning computational support for heart screening during the
first trimester (12–14 weeks gestational age).
The paper is structured as follows. A review of the works dealing with sec-
ond trimester fetal heart inspection from the US is given in Sect. 2. The
data set is then described in Sect. 3. The employed methodology for the recog-
nition of crucial frames is explained in Sect. 4, also addressing the prior data
pre-processing step. The experiments are outlined in Sect. 5, and the conclusions
and prospective directions are given in Sect. 6.

2 State of the Art in Fetal Heart Assessment


Computational support for the complex problem of fetal cardiac video analysis
has appeared solely in the last couple of years and it is scarce, as compared to
those works approaching echo- and electrocardiography. Moreover, the state of
the art in automated analysis of the fetal heart has targeted only the second
trimester examination for CHD detection, as shown below.
The study [2] took 91 US videos from 12 subjects (with a gestational age of
20 to 35 weeks) and appointed a combination of a particle-filtering method, for
connecting the predictions from multiple frames, and a random forest predictor,
to estimate cardiac phase, viewing plane, visibility, position and orientation from
each one. However, CHD cannot be identified and the results are not integrated
into the video scan. The work progresses in [8], where a three-task learning
model is formulated towards heart view classification, localization and orien-
tation. Three heart views of second trimester scans are detected against back-
ground through classification with a VGGNet. Detection is conducted through
a circular anchor mechanism. Localization is achieved through an intersection
over union loss, while orientation through a cosine one. A bi-directional LSTM
is employed to add the temporal information over frames.
In [5], 615 cardiac US frames of fetuses with a normal heart condition at 18–
28 weeks were collected, in almost 50%–50% apical/non-apical orientation. The
task set was to accurately segment the ventricular septum. This was performed
using a combination of YOLO, U-Net and VGG-16, while also referring to pre-
and post-cropped frames through an encoder-decoder component in order to
take time into account.
The paper [12] has assigned a Yolov2 object detection model to localize and
classify 18 heart components from US scans of 363 subjects, taken at 18–34
weeks. The bounding box detected for each structure and the probability associ-
ated is displayed in real time. Additionally a binary colored barcode correspond-
ing to the detection of each heart element is represented along the scanning
timeline. The approach is however based on the assumption that the fetus scan
is always performed in a fixed direction from the stomach, through the heart, to
the vascular arches (taken mostly in the apical view); thus the heart appears in
the same position in all the US frames. Training was performed on the normal
cases only, and the 14 CHD cases were used for test, where an abnormality score
computed the total number of well detected components.
In contrast, to the best of our knowledge, this is the first study to perform first
trimester heart analysis at 12–14 weeks old gestational age from fetal echocar-
diography. Heart examination in the first trimester is yet more difficult, given
the even lower structural development of the organ components present at this
early stage. The paper therefore aims to distinguish the key frames that must
be present in the heart sweep by formulating the problem into a classification
task and solving it through DL architectures. Afterwards, the selected and clas-
sified frames will be analyzed by the physician. Additionally, if not all four key
views are detected, then the physician is alerted to look more attentively for the
missing components.

3 Data Set
The current work aims to detect the presence/absence of key frames from first
trimester US videos. The fact that some component cannot be viewed is already an
early indicator of a heart condition. The four key frames to be distinguished are
represented by the main planes of the cardiac sweep with color Doppler applied
[27]: the atrioventricular (AV) flows in four-chamber view plane (hereinafter, AV
flows), the aorta in the left ventricular outflow tract plane (hereinafter, aorta),
the crossing of the great vessels (the X sign) in the right ventricular outflow
tract plane and the arches (the V sign) in three-vessel plane.
These views that must be learnt and further discovered in a new video were
provided by the physicians in meaningful images extracted from US scans. They
are taken only in the apical plane. The US videos were recorded by distinct
machines, namely General Electric Voluson E10, E8 and E6 systems, GE Health-
care, Zipf, Austria, equipped with RAB 4-8-D (2–8 MHz), RAB 6D (2–7 MHz),
and RM6C (2–6 MHz) transducers. Also, sonographers with different degrees of
expertise have conducted the examinations. These prerequisites were taken in
order to allow for variability in the rendering performance of the machine and
the human experience and thus test the generalization ability of the model.
The data set consists of 326 US scans of fetuses aged 12–14 weeks. There are
some patients that have several examinations recorded. The collection was split
into training/validation/test subsets, making sure that the videos of a same
patient are not present at the intersection of any two subsets. All the data
have been preprocessed, following the procedure outlined in Subsect. 4.1. Figure 1
shows that the features that ought to be present vary greatly from one scan to
another, due to the conditions already stated before in the introduction, the
color or directional power Doppler approach, and the software and ultrasound
machine type.

4 Methodology

DL is applied to the given frames only after a preceding preprocessing, in order


to set the focus on the cardiac area of the fetus and ignore the background area.

4.1 Data Preprocessing

In order to achieve this goal, first the area of interest is marked with a red circle.
Recognizing it is however not straightforward, as it has an irregular shape and
positioning, a similar color to the neighbouring tissues and is not easily separable
from the surrounding areas. Nevertheless, it always covers the center of the image
and it is connected with the other tissues through narrow bridges.
Further, since most of the frames are dual, showing both the standard gray-
scale mapping US capture, on the left, and the Doppler interrogation, on the
right, the subsequent phase is to separate the two sides of an image.

Fig. 1. Examples of different frames showing the four key appearances that ought to
be present in the first trimester fetal heart (each on one row). From top to bottom: the
aorta, the V sign, the AV flows, and the X sign.

Then, a procedure in six steps was used. Firstly, the image is converted to gray scale
and a threshold is applied to eliminate background noise. Then, erosion with a
10 × 10 pixels kernel is performed in order to eliminate the connecting bridges.
Thirdly, the spots that cover the central area of the image are identified and the
spot with the highest coverage is taken. Next, dilation with the same 10 × 10

pixels kernel is applied in order to restore the dimension of the selected spot.
Subsequently, the convex hull of the dilated spot is drawn and filled. Finally, the
generated spot is used as the mask to extract the area of interest.
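
A minimal sketch of this six-step masking procedure is given below, using OpenCV and NumPy. The paper does not name the image library, so the library choice, the binarization threshold value, the size of the central window used to select the spot, and the function name are all assumptions; only the 10 × 10 kernel and the order of the steps follow the text above.

import cv2
import numpy as np

def extract_heart_roi(frame_bgr, bin_thresh=30):
    """Mask the Doppler side of a frame to the central connected tissue region."""
    # Step 1: convert to gray scale and threshold to remove background noise
    # (the threshold value 30 is an assumption, not taken from the paper).
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, bin_thresh, 255, cv2.THRESH_BINARY)

    # Step 2: erode with a 10 x 10 kernel to break the narrow connecting bridges.
    kernel = np.ones((10, 10), np.uint8)
    eroded = cv2.erode(binary, kernel)

    # Step 3: among the spots covering the central area, keep the one with the
    # highest coverage (the extent of the central window is an assumption).
    n_labels, labels = cv2.connectedComponents(eroded)
    h, w = gray.shape
    center = labels[3 * h // 8:5 * h // 8, 3 * w // 8:5 * w // 8]
    candidates = [l for l in range(1, n_labels) if np.any(center == l)]
    if not candidates:
        return frame_bgr  # nothing found; fall back to the unmasked frame
    best = max(candidates, key=lambda l: int(np.sum(center == l)))
    spot = np.uint8(labels == best) * 255

    # Step 4: dilate with the same 10 x 10 kernel to restore the spot dimension.
    dilated = cv2.dilate(spot, kernel)

    # Step 5: draw and fill the convex hull of the dilated spot (OpenCV >= 4 assumed).
    contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    mask = np.zeros_like(gray)
    hull = cv2.convexHull(max(contours, key=cv2.contourArea))
    cv2.fillConvexPoly(mask, hull, 255)

    # Step 6: use the filled hull as the mask to extract the area of interest.
    return cv2.bitwise_and(frame_bgr, frame_bgr, mask=mask)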

4.2 Frame Classification


The preprocessed frames are next investigated for the detection of the four key
views of the first trimester fetal heart.
Their identification can be further transposed into a classification problem
with five categories: the four compulsorily present structures (aorta, V sign, AV
flows and X sign) and further unimportant shapes that can be seen in the video
(referred next as “other”).
Hence a CNN architecture is appointed to independently learn the features
indicative of each view and perform the classification. Once it is able to recognize
each view, the final aim becomes that, when presented with a new video, the
model takes it frame by frame and signals those views that it found as present
in the scan.
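
As an illustration of this intended use, the following sketch applies a trained five-class model frame by frame to a new video and reports which key views were found and which are missing; the class names follow the paper, whereas the preprocessing call and the 0.5 probability cut-off are assumptions made for the example.

import cv2
import torch

CLASSES = ["aorta", "V sign", "AV flows", "X sign", "other"]

def detected_views(video_path, model, preprocess, device="cpu", min_prob=0.5):
    """Run the classifier over every frame and collect the recognized key views."""
    found = set()
    cap = cv2.VideoCapture(video_path)
    model.eval()
    with torch.no_grad():
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            # `preprocess` is assumed to crop the Doppler side, resize to 224 x 224
            # and return a normalized 3 x 224 x 224 tensor.
            x = preprocess(frame).unsqueeze(0).to(device)
            probs = torch.softmax(model(x), dim=1)[0]
            label = CLASSES[int(probs.argmax())]
            if label != "other" and float(probs.max()) >= min_prob:
                found.add(label)
    cap.release()
    # Key views that were never recognized are reported back, so the physician
    # is alerted to look more attentively for the missing components.
    missing = set(CLASSES[:4]) - found
    return found, missing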

Fig. 2. Illustration of the number of sample US images per each training-validation-test


set, as well as per class.

5 Experiments
Several models were chosen and the findings support the use of DL for this task.

5.1 Setup
Six DL architectures were fine tuned on the training frames: DenseNet-201,
Inception-V4, ResNet-152, ResNet-18, ResNet-50 and Xception. The metric that
is targeted by the models is the classification accuracy.
As mentioned in Subsect. 4.1, the input images contain only a cropped section
of the US scans, i.e. the one from the right side, where the color Doppler is
present. These images were resized to 224 × 224 pixels. The options used for data
augmentation include a random image flip with a probability of 0.5, a probability
of 0.75 for a random rotation between −10 and 10 degrees, a random zoom of
up to 1.1 times the image, a random lighting and contrast change controlled
by a threshold of 0.2 and a random symmetric warp of magnitude between −0.2
and 0.2. There are images from 192 US scans in the training set, from another
67 in the validation one and finally from the remaining 67 in the test set. In each
separate set, there may be several US scans from the same person, but taken at
different times. However, there is no patient that appears at the intersection of
any two subsets. The total number of images in all sets and all classes is 7251.
From these, 4260 are in the training set, 1495 in validation and 1496 in the test
set. The split per class is shown in Fig. 2.
A 1cycle policy [23] is used for the training process, at first with 10 epochs
and afterwards unfreezing all layers and training for another 10 epochs. The
learning rate is fine-tuned for each architecture in turn and its values are chosen
according to the strategy proposed in [23]. The batch size was taken equal to 32.
The experiments are made on Google Colab, using a Tesla T4 GPU. The source
codes are written in Python with the Pytorch and Fast.ai libraries.
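
A minimal sketch of this training protocol is shown next with the fastai v1 API. The paper names Fast.ai but not its version, so the v1-style calls, the train/valid folder layout, the ResNet-18 backbone picked here and the learning-rate slice are assumptions; the augmentation values, the 224-pixel size, the batch size of 32 and the two 10-epoch 1cycle phases follow the text above.

from fastai.vision import *  # fastai v1-style star import

# Augmentations as described: flip with p=0.5 (fastai default), rotation of +/- 10
# degrees with p=0.75, zoom up to 1.1, lighting/contrast 0.2, symmetric warp +/- 0.2.
tfms = get_transforms(do_flip=True, max_rotate=10., max_zoom=1.1,
                      max_lighting=0.2, max_warp=0.2,
                      p_affine=0.75, p_lighting=0.75)

# Assumed layout: one sub-folder per class under 'train' and 'valid'.
data = (ImageDataBunch.from_folder('us_frames', train='train', valid='valid',
                                   ds_tfms=tfms, size=224, bs=32)
        .normalize(imagenet_stats))

learn = cnn_learner(data, models.resnet18, metrics=accuracy)
learn.fit_one_cycle(10)                            # phase 1: head only, layers frozen
learn.unfreeze()                                   # phase 2: all layers trainable
learn.lr_find()                                    # inspect the curve to pick the rate
learn.fit_one_cycle(10, max_lr=slice(1e-5, 1e-4))  # illustrative learning-rate slice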

Table 1. Mean validation and test accuracy obtained by the 6 models.

Accuracy (%) DenseNet201 InceptionV4 ResNet152 ResNet18 ResNet50 Xception


Validation 97.34 97.23 96.65 97.27 97.23 96.72
Test 95.56 94.38 94.42 95.34 95.52 94.96

Table 2. Precision, recall and F1-score in a test run with confusion matrix in Fig. 3.

Measures (%) Aorta V sign AV flows X Sign Other Macro average Weighted avg
Precision 92.86 96.69 99.58 91.4 98.35 95.77 97.21
Recall 94.55 98.87 99.58 90.43 97.28 96.14 97.19
F1-score 93.69 97.77 99.58 90.91 97.81 95.95 97.19

5.2 Results

The validation and test accuracy, reported as means over 5 repeated runs, are
shown in Table 1. The confusion matrix reached in one run is depicted in Fig. 3.
Its accompanying metrics are outlined in Table 2.
Figure 4 shows that, if we put a confidence threshold for the largest output of
the last Softmax layer, the number of mislabeled samples decreases, with some
loss in accuracy. The example is run on the same run whose output is presented
in Fig. 3 and Table 2. The results are discussed in the next subsection.

Fig. 3. The confusion matrix obtained on the test samples in the most accurate run of
DenseNet201. The accuracy is 97.19%. The other related metrics are given in Table 2.

5.3 Discussion

The fastest architecture among the tried ones is, as expected, ResNet-18, which
needs on average 12 min for 10 epochs of fitting with the 1cycle policy. Note,
however, that the model is trained twice for 10 epochs, first with a large set of
layers frozen and then after unfreezing them. Additionally, after unfreezing, there is
also a process of searching for an appropriate learning rate that takes slightly
more than 5 min for ResNet-18. For the same model, the entire process of loading
the data, training, validating and then applying the model to the test set lasts
almost 35 min. Training 10 epochs with the 1cycle policy takes 13.2 min for
ResNet-50, 15.5 for Xception, 23.6 for InceptionV4, and 26.2 for ResNet152 and
DenseNet201.
The six models analyzed have achieved consistently high accuracy results,
all within a narrow margin. These results confirm the viability of our CNN proposal
to work with data from the first trimester of pregnancy. The best result obtained
in a single run on the test set is achieved by the DenseNet201 architecture and
it is detailed in Fig. 3 and Table 2. A small number of frames belonging to the
Other class are mistaken for key views, especially the Aorta and the X sign.
These two elements are described by a smaller number of frames in the data set
(see Fig. 2).
The inclusion of a large number of samples from the Other class is intended to
reflect the large proportion of frames in a US video that should not be labeled as
one of the four important classes. This class covers a relatively wide and random
range of arrangements of the components (flows in different shapes and colors,
which in some frames can even be absent). Some of them are also relatively
close to the views that correspond to key structures such as the Aorta or the X sign,
hence there are cases that may mislead the model.

Fig. 4. The confidence threshold reduces the number of samples that are mistaken at
the cost of decreasing the accuracy. Accuracy + Other reconsiders the accuracy after
assigning all samples with probability below the threshold to Other. The first row shows
the results on the validation set, while the second refers to the test set.

In an attempt to further make use of the outputs of the last Softmax layer,
we experiment with a shallow approach that uses a confidence threshold in the
following manner: a validation or test sample is assigned to a class only if the
largest output from the Softmax corresponds to that class and its value is larger
than the threshold. In the left plots in Fig. 4 we represent the accuracy when the
model simply ignores the samples whose values are below the threshold. A first
decision is to not assign them to any class, but simply consider them as misla-
belled in the computation of the metric and forward them to the obstetrician for
assessment. The second option (Accuracy + Other) attributes the class Other
to these samples. Naturally, the latter leads to a better classification accuracy
and actually represents a viable scenario, since the expert is in fact especially
interested in the first 4 classes, and not in the Other samples. However, there are
still some samples that are mislabeled and their values from the Softmax remain
larger than the threshold, and these are illustrated in the plots from the right
side. The trend is very similar for both the validation (first row) and the test
(second row) cases. For instance, with a threshold of 0.99, an accuracy of
84.28% is achieved on the validation set by discarding the samples with the
Softmax output below the threshold, while, if they are assigned to Other, the
accuracy increases to 91.51%. The corresponding values for the test set at the
same threshold are 83.42% and 88.1%, respectively. The percentage of remaining
errors at this threshold (i.e. samples with Softmax outputs greater than the
threshold, but mislabelled) is 0% for the validation set and 0.53% for the test set.
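As a minimal sketch of this thresholding logic (not the authors' code), assuming the per-frame Softmax outputs and the true labels are available as NumPy arrays and that the Other class has index 4:

import numpy as np

def threshold_metrics(probs, labels, threshold=0.99, other_idx=4):
    # probs: (N, 5) Softmax outputs; labels: (N,) true class indices.
    preds = probs.argmax(axis=1)
    conf = probs.max(axis=1)
    confident = conf >= threshold

    # Option 1: samples below the threshold count as errors and would be
    # forwarded to the obstetrician for assessment.
    acc_discard = np.mean(confident & (preds == labels))

    # Option 2 ("Accuracy + Other"): samples below the threshold are
    # reassigned to the Other class before scoring.
    acc_other = np.mean(np.where(confident, preds, other_idx) == labels)

    # Remaining errors: confident predictions that are still wrong.
    remaining_errors = np.mean(confident & (preds != labels))
    return acc_discard, acc_other, remaining_errors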

6 Conclusions
The current research consists of constructing a CNN model able to accurately
identify and classify several key frames from a first-trimester fetal heart scan.
To do this, as a preliminary step, we have built a classification data set of
US frames depicting the fetal heart from this pregnancy period. This can aid
the physician in deciding whether a normal condition can be presumed (when all
the key frames are present) or further scans should be acquired when some are
absent, since there could be a problematic situation.
Six CNN architectures were tried and fine-tuned for the given task, and the
results on the test set are around 95.5% accuracy in correctly classifying 1496
samples. These results are very encouraging: even with an error below 5%, a US
video recording of up to 5 s of a healthy fetal heart usually contains more than
one key frame for each class, and therefore it is very likely that at least one
frame per class is detected.
We also propose a shallow approach to assess the confidence of the results.
While this comes at the cost of withholding a decision on some samples, the
error rate is decreased both on the validation and on the test sets. Although this
experiment shows promising results, we intend to address uncertainty [25,26] in
future work in order to model more objective thresholds.
The data set will continue to be expanded and we expect to reach an even
better overall accuracy. We also intend to adjust the loss in such a way as to
penalize the mislabeling of Other samples as significant key frames. Another task
intended for future work is to further indicate a normal/abnormal condition in
the key frames detected as present.

Acknowledgement. This work was supported by a grant of the Romanian Ministry


of Research and Innovation, CCCDI – UEFISCDI, project number 408PED/2020, PN-
III-P2-2.1-PED-2019-2227, Learning deep architectures for the Interpretation of Fetal
Echocardiography (LIFE ), within PNCDI III, as well as the Plan Propio de Investi-
gación, Transferencia y Divulgación Cientı́fica of the Universidad de Málaga.

References
1. Benhammou, Y., Achchab, B., Herrera, F., Tabik, S.: Breakhis based breast cancer
automatic diagnosis using deep learning: taxonomy, survey and insights. Neuro-
computing 375, 9–24 (2019)
2. Bridge, C.P., Ioannou, C., Noble, J.A.: Automated annotation and quantitative
description of ultrasound videos of the fetal heart. Med. Image Anal. 36, 147–161
(2017)
3. Cara, M., Tudorache, S., Dimieru, R., Florea, M., Patru, C., Iliescu, D.: Prenatal
first trimester assessment of the heart. Ann. Cardiol. Cardiovasc. Med. 1(2), 1008
(2017)
4. Cui, S., et al.: Development and clinical application of deep learning model for
lung nodules screening on CT images. Sci. Rep. 10, 1–10 (2020)
5. Dozen, A., et al.: Image segmentation of the ventricular septum in fetal car-
diac ultrasound videos based on deep learning using time-series information.
Biomolecules 10(11), 1526 (2020)
6. Esteva, A., et al.: Deep learning-enabled medical computer vision. NPJ Digit. Med.
4, 5 (2021)
7. Garcia-Canadilla, P., Sánchez Martı́nez, S., Crispi, F., Bijnens, B.: Machine learn-
ing in fetal cardiology: what to expect. Fetal Diagn. Ther. 47, 363–372 (2020)
8. Huang, W., Bridge, C.P., Noble, J.A., Zisserman, A.: Temporal HeartNet: towards
human-level automatic analysis of fetal cardiac screening video. In: Descoteaux, M.,
Maier-Hein, L., Franz, A., Jannin, P., Collins, D.L., Duchesne, S. (eds.) MICCAI
2017. LNCS, vol. 10434, pp. 341–349. Springer, Cham (2017). https://doi.org/10.
1007/978-3-319-66185-8 39
9. Hutchinson, D., et al.: First-trimester fetal echocardiography: identification of car-
diac structures for screening from 6 to 13 weeks’ gestational age. J. Am. Soc.
Echocardiogr. 30(8), 763–772 (2017)
10. Iliescu, D., et al.: Improved detection rate of structural abnormalities in the first
trimester using an extended examination protocol. Ultrasound Obstet. Gynecol.
42(3), 300–309 (2013)
11. Jicinska, H., et al.: Does first-trimester screening modify the natural history of
congenital heart disease? Circulation 135(11), 1045–1055 (2017)
12. Komatsu, M., et al.: Detection of cardiac structural abnormalities in fetal ultra-
sound videos using deep learning. Appl. Sci. 11(1), 371 (2021)
13. Letourneau, K., Horne, D., Soni, R., McDonald, K., Karlicki, F., Fransoo, R.:
Advancing prenatal detection of congenital heart disease: a novel screening protocol
improves early diagnosis of complex congenital heart disease. Obstet. Gynecol.
Surv. 73, 557–559 (2018)
14. Lichtblau, D., Stoean, C.: Cancer diagnosis through a tandem of classifiers for
digitized histopathological slides. PLoS ONE 14(1), 1–20 (2019)
15. Lindgren Belal, S., et al.: Deep learning for segmentation of 49 selected bones in
CT scans: first step in automated PET/CT-based 3D quantification of skeletal
metastases. Eur. J. Radiol. 113, 89–95 (2019)
16. Litjens, G., et al.: A survey on deep learning in medical image analysis. Med. Image
Anal. 42, 60–88 (2017)
17. Lundervold, A.S., Lundervold, A.: An overview of deep learning in medical imaging
focusing on MRI. Zeitschrift für Medizinische Physik 29(2), 102–127 (2019). special
Issue: Deep Learning in Medical Physics
18. Mittal, S., Stoean, C., Kajdacsy-Balla, A., Bhargava, R.: Digital assessment of
stained breast tissue images for comprehensive tumor and microenvironment anal-
ysis. Front. Bioeng. Biotechnol. 7, 246 (2019)
19. Patel, N., Narasimhan, E., Kennedy, A.: Fetal cardiac us: techniques and normal
anatomy correlated with adult CT and MR imaging. RadioGraphics 37(4), 1290–
1303 (2017)
20. Piccialli, F., Somma, V.D., Giampaolo, F., Cuomo, S., Fortino, G.: A survey on
deep learning in medicine: why, how and when? Inf. Fusion 66, 111–137 (2021)
21. Pinto, N.M., Morris, S.A., Moon-Grady, A.J., Donofrio, M.T.: Prenatal cardiac
care: goals, priorities & gaps in knowledge in fetal cardiovascular disease: perspec-
tives of the fetal heart society. Prog. Pediatr. Cardiol. 59, 101312 (2020)
22. Sherkatghanad, Z., et al.: Automated detection of autism spectrum disorder using
a convolutional neural network. Front. Neurosci. 13, 1325 (2019)
23. Smith, L.N.: A disciplined approach to neural network hyper-parameters: part 1 -
learning rate, batch size, momentum, and weight decay (2018)
24. Stoean, R.: Analysis on the potential of an EA-surrogate modelling tandem for
deep learning parametrization: an example for cancer classification from medical
images. Neural Comput. Appl. 32, 313–322 (2020)
25. Stoean, R., et al.: Automated detection of presymptomatic conditions in spinocerebellar ataxia type 2 using monte-carlo dropout and deep neural network techniques with electrooculogram signals. Sensors 20(11), 3032 (2020)
26. Stoean, R., Stoean, C., Atencia, M., Rodrı́guez-Labrada, R., Joya, G.: Ranking
information extracted from uncertainty quantification of the prediction of a deep
learning model on medical time series data. Mathematics 8(7), 1078 (2020)
27. Tudorache, S., Cara, M., Iliescu, D.G., Novac, L., Cernea, N.: First trimester two-
and four-dimensional cardiac scan: intra- and interobserver agreement, compari-
son between methods and benefits of color doppler technique. Ultrasound Obstet.
Gynecol. 42(6), 659–668 (2013)
28. Wang, J., et al.: Automated interpretation of congenital heart disease from multi-
view echocardiograms. Med. Image Anal. 69, 101942 (2021)
29. Yamanakkanavar, N., Choi, J.Y., Lee, B.: MRI segmentation and classification of
human brain using deep learning for diagnosis of Alzheimer’s disease: a survey.
Sensors 20(11), 3243 (2020)
Deep Learning Based Neural Network
for Six-Class-Classification of Alzheimer’s
Disease Stages Based on MRI Images

Tim Rörup1(B), I. Rojas2, H. Pomares2, and P. Glösekötter1
1 University of Applied Science Münster, 48153 Münster, Germany
[email protected]
2 University of Granada, 108010 Granada, Spain

Abstract. State of the art classifiers split Alzheimer’s disease progres-


sion into a limited number of stages and use a comparatively small
database. For the best treatment, it is desirable to have the highest
resolution from the progression of the disease. This paper proposes a
reliable deep convolutional neural network for the classification of six
different Alzheimer’s disease stages based on Magnetic Resonance Imag-
ing (MRI). The peculiarity of this paper is the introduction of a new,
sixth, disease stage, and the large amount of data that has been taken
into account. Additionally, not only the testing accuracy is analyzed, but
also the robustness of the classifier to have feedback on how certain the
neural network makes its predictions.

Keywords: Alzheimer’s disease · Deep learning · Neural network ·


Magnetic resonance imaging · Multi-Class-classification

1 Introduction

According to the Alzheimer’s Disease Neuroimaging Initiative (ADNI),


Alzheimer’s disease is the leading cause of dementia and is currently incurable.
The disease pattern of Alzheimer’s worsens over the number of years after the
diagnosis. In the early stage of the illness, memory loss is comparatively small.
However, within the further progression of the disease, patients become inca-
pable to follow a conversation, react to their environment, and in most cases,
Alzheimer’s eventually results in death. It is the sixth most common cause of
death in the United States of America, with an average expectation of life after
the initial diagnosis ranges from four to eight years. Since the main risk factor
is aging, the disease can affect everyone. Therefore, it is essential to promote
research for this neurodegenerative disease [1].
A typical part of the medical procedure to diagnose Alzheimer’s disease is
magnetic resonance imaging. In this case, MRI images from the human brain are
used to make sure that the symptoms of the patient are caused by Alzheimer’s

and not by any other medical condition, such as tumors, strokes, or aging [2].
Another important aspect of MRI is the early detection of the illness. Medical
experts are confident that biological markers can be detected in MRI images to
discover Alzheimer's before severe impairments occur [3].
The ADNI separates Alzheimer’s disease in six stages: the study provides
data from cognitively normal (CN) participants with no signs of cognitive impair-
ment or dementia. Next, referring to the progression of the disease ADNI unites
patients with significant memory concern (SMC). The affiliation to this class is
mainly based on self-reports of the participants. It has shown that significant
memory concerns have a higher likeliness of progression which makes SMC a
substantial class due it’s early occurrence. Chronological next are patients with
cognitive impairments where early mild cognitive impairment (EMCI) is the
earliest stage followed be mild cognitive impairment (MCI) and late mild cogni-
tive impairment (LMCI). MCI participants have no signs of dementia and their
daily activities are basically preserved. They either self-reported their subjective
memory concerns or are reported by people from their immediate surrounding.
Belonging to the different MCI levels is decided by a neuropsychological test
to measure memory functions called Wechsler Memory Scale. The sixth disease
stage combines participants with Alzheimer’s disease (AD) [4].
Most state of the art approaches to multi-class classification based
on MRI images rely on a comparatively small database and separate
Alzheimer's disease into fewer than six classes. State of the art classifiers mostly
consider Alzheimer's patients, patients with mild cognitive impairment, and the
healthy control group [5,6]. Herrera et al. [6] provide a three-class classifier based
on 1350 images with accuracies of around 92%.
A few related works divide MCI into the subclasses early mild cognitive
impairment and late mild cognitive impairment [7,8]. Farooq et al. [9] considered
LMCI and implemented different four-class classifiers. These classifiers have high
accuracies of approximately 98% but only consider 355 patients. An improvement
was published by Baskar, Simon et al. [7], where different five-class classifiers,
including the class EMCI, were implemented based on 3000 images, achieving
accuracies of over 90%.
However, to detect the disease as early as possible it is desirable to have the
highest resolution of its progression. Hence, this paper introduces a new class
containing people with significant memory concern to close the gap between the
healthy control group and people with early mild cognitive impairment. This
can have a huge impact on the early detection of Alzheimer's disease.
The multi-class classifier proposed in this paper considers six different disease
stages: CN, SMC, EMCI, MCI, LMCI, and AD. With MRI data of over 4500
patients, a reliable and robust neural network is developed to predict the disease
stage of Alzheimer's patients with a cross-validation accuracy of 99.99%.

2 Methodology
2.1 Data

Three-dimensional MRI images from about 4500 participants of six classes are
obtained from the Alzheimer’s disease Neuroimaging Initiative [1]. The very first
step is pre-processing the original MRI images. Hence, each image is spatially
normalized using Statistical Parametric Mapping to have a bounding box of 157
× 189 × 136 cubic millimeter voxels. Subsequently, each image is disassembled
along its coronal axis and gray-color-normalized. The outcome is 189 color-normalized
coronal brain slices per subject.
To reduce the computational cost, a Multi-Objective Genetic Algorithm is
applied to determine which two-dimensional coronal brain slices
are the most relevant for classifying the disease stages. To this end, a Support
Vector Machine (SVM) is trained with different numbers of coronal slices. For
each number of slices, the SVM is trained several times, each time with different
coronal brain slices. The algorithm finds the SVMs with the best trade-off
between accuracy and number of used slices, called dominating individuals; in
other words, there are no other SVMs with the same accuracy that need fewer
slices, and vice versa. Hence, each dominating individual is trained with a different
number of slices. Counting the slice indices that were used to train these
dominating SVMs, some slices appear more often than others, and in this way
the most relevant slices are found. This process is described in detail in a paper
by Valenzuela O. et al. [10]. The resulting dataset for the classifier in this paper
consists of nine two-dimensional coronal brain slices per patient with the indices
114, 106, 104, 90, 82, 72, 61, 56 and 55.

Fig. 1. Three of the nine two-dimensional coronal brain slices with the corresponding
class

Some images may appear blurry after the pre-processing. Hence, a Laplacian
filter is used to remove these blurry images, which leads to a marginal reduction
in the amount of data. The actual number of patients and slices per class can be
seen in Table 1. Figure 2 visualizes the axial, coronal and sagittal axes of the
brain slices.
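The paper does not detail the exact filtering criterion; a common way to implement such a Laplacian-based blur check, sketched here purely as an assumption, is to threshold the variance of the Laplacian of each slice:

import cv2

def is_blurry(image_path, threshold=100.0):
    # A low variance of the Laplacian means few sharp edges, i.e. a likely
    # blurred slice; the threshold value would need tuning on the data.
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    return cv2.Laplacian(img, cv2.CV_64F).var() < threshold

# slice_paths is a hypothetical list of coronal-slice image files.
# sharp_slices = [p for p in slice_paths if not is_blurry(p)]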

Table 1. Number of patients and coronal slices per class

Class CN SMC EMCI MCI LMCI AD


Number of patients 1131 541 363 1094 607 856
Number of coronal slices 10176 4868 3266 9845 5458 7696

Fig. 2. Axial, coronal and sagittal axes and the corresponding slices of the brain [11]

2.2 Pipeline

Initially, only three classes CN, MCI, and AD were used to train, evaluate, and
compare different pre-trained open-source classifiers to find the best one suited
for the used data. Starting with only three classes keeps the computational cost
small. Subsequently, the best performing classifier was trained and evaluated on
all six classes. Finally, a 10-fold cross-validation is applied, and the results are
analyzed in terms of accuracy and robustness in order to make a reliable statement
about the performance of the final classifier. Additionally, the cross-validated
classifier is also evaluated on a set of spatially shifted coronal brain slices. This
pipeline is represented in Fig. 3.

Fig. 3. Pipeline for data-processing from two-dimensional coronal brain slices to the
final six-class-classifier

2.3 Model Parameters, Complexity, and Computing Time


The neural network used in this paper is ResNet50. A pre-trained version is used,
so the network already has adapted weights as a result of being trained for several
epochs on the ImageNet dataset. CrossEntropyLoss is used as the loss function
and Adam as the optimizer, both provided by PyTorch. The complexity of the
classifier is comparatively small, since there are much deeper ResNet architectures
in the literature, such as ResNet101 or ResNet152. The moderate complexity of
the model, as well as the fact that a pre-trained version is used, ensures that the
computing time for training and testing is kept reasonable.
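A minimal PyTorch sketch of the described setup (pre-trained ResNet50, CrossEntropyLoss, Adam) is given below; the learning rate and the details of the training loop are assumptions made for illustration, not values reported by the paper.

import torch
import torch.nn as nn
from torchvision import models

# ImageNet-pretrained ResNet50 with a new 6-class output layer
# (CN, SMC, EMCI, MCI, LMCI, AD).
model = models.resnet50(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 6)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_one_epoch(loader, device="cuda"):
    # loader yields batches of pre-processed coronal slices and their labels.
    model.to(device).train()
    for images, targets in loader:
        images, targets = images.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()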

3 Neural Network Structure

The six-class classifier that turned out to perform best on the data is
a Deep Residual Network (ResNet). ResNet models are designed to counteract
the vanishing gradient problem, i.e. the situation in which the calculated
gradient for very deep layers becomes so small that the training progress stagnates
[12]. In other words, adding more and more layers will most likely not improve
the neural network's performance. This problem is visualized in Fig. 4, where the
shallower network achieves better results than the deeper network.
To counteract this problem, residual blocks are implemented, which use shortcut
connections in the model where the input is passed, unaltered and without any
weights, to a deeper layer [12]. ResNets consist of many residual blocks
that enable very deep network architectures with well over 100 layers.

Fig. 4. Training and testing error on CIFAR-10 dataset of two differently deep neural
networks [12]

Fig. 5. Residual block [12]

4 Results
4.1 Finding Best Classifier
In the first step, a classifier needs to be selected. Table 2 shows the performance
of the different neural network structures that were trained and evaluated on
only three of the six classes. The amount of testing data is ten percent.

Table 2. Accuracy and robustness of different classifiers for image recognition

Models AlexNet VGG16 Densenet ResNet18 ResNet50


Accuracy [%] 99.78 99.89 99.93 99.89 99.89
Weak predictions 21 26 40 1 1

Since the accuracies on the testing data are very similar, the selection is based
on the number of weak predictions. This value is defined as the number of testing
images for which the neural network makes its prediction with a probability lower
than 90%. Hence, the ResNet models performed best. Since the complexity
of the data will be increased in the subsequent steps by using all six classes, the
more complex ResNet model is selected; in this case, the ResNet50 model,
due to its deeper network architecture.
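For illustration, counting weak predictions as defined above could be done as in the following sketch (assuming the model outputs raw logits and the loader yields image/label batches):

import torch
import torch.nn.functional as F

@torch.no_grad()
def count_weak_predictions(model, loader, device="cuda", min_confidence=0.9):
    # A prediction is "weak" if its winning Softmax probability is below 90%.
    model.to(device).eval()
    weak = 0
    for images, _ in loader:
        probs = F.softmax(model(images.to(device)), dim=1)
        weak += int((probs.max(dim=1).values < min_confidence).sum())
    return weak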

4.2 Final Six-Class Classifier

Next, the ResNet50 classifier is trained with all 41309 coronal slices of the six
disease stages and analyzed with a 10-fold cross-validation. In ten iterations,
ten different partitions of training and testing data were used to obtain an
average accuracy. The results can be seen in Table 3. Averaging over all iterations
leads to an accuracy of 99.99% and an average of 1.3 weak predictions per
iteration. These are highly promising results and have been double-checked by
the authors. They show not only an accuracy close to 100%, but also an extremely
high robustness of the classifier, reflected in the fact that the neural network made,
on average, only one to two decisions with a certainty of less than 90%.

Table 3. Accuracy and number of weak predictions per iteration of the 10-fold cross-validation

Iteration Accuracy [%] Weak predictions
1st iteration 99.95 0
2nd iteration 100 8
3rd iteration 100 0
4th iteration 100 0
5th iteration 100 0
6th iteration 100 0
7th iteration 100 0
8th iteration 100 0
9th iteration 100 1
10th iteration 99.98 4

4.3 Resistance to Deviations in the Images and Robustness

With regard to practical use, it is advisable to analyze the classifier's
performance on deviating images. Thus, the classifier is evaluated on a set of
61166 foreign two-dimensional coronal brain slices. This dataset mainly contains
images with slice numbers that differ from the most relevant slices identified by
the Multi-Objective Genetic Algorithm applied earlier. The corresponding number
of slices per class can be seen in Table 4.
The foreign slices have the indices 57, 62, 73, 83, 91, 105, 107 and 115. They
slightly differ from the previously selected slices the ResNet model was trained
with, which corresponds to a spatial shift along the coronal axis. Every model
from the 10-fold cross-validation is evaluated on the 61166 images. The results,
which lead to an average accuracy of 98.62%, can be seen in Table 5 below. This
result may be the most outstanding one, since it demonstrates that the model
also works very reliably on images that slightly differ from the primary images.

Table 4. Number of spatially shifted coronal brain slices per class and the corresponding slice index

Slice number CN SMC EMCI MCI LMCI AD


Amount of number 56 1276 576 484 2010 1237 1168
Amount of number 57 1276 576 487 2010 1239 1170
Amount of number 62 1276 578 498 2011 1251 1176
Amount of number 73 1276 584 509 2012 1249 1179
Amount of number 83 1276 587 509 2011 1251 1181
Amount of number 91 1276 586 511 2011 1253 1181
Amount of number 105 1276 585 508 2012 1258 1181
Amount of number 107 1276 582 504 2011 1251 1180
Amount of number 115 1276 582 504 2009 1250 1180

Table 5. Accuracy of the cross-validated classifier on foreign, i.e. spatially shifted, coronal brain slices

Iteration Accuracy [%]
1st iteration 98.09
2nd iteration 98.38
3rd iteration 98.31
4th iteration 98.69
5th iteration 98.90
6th iteration 98.90
7th iteration 98.65
8th iteration 98.80
9th iteration 98.66
10th iteration 98.86

5 Conclusion and Future Work

Since Alzheimer’s disease can affect everyone and is currently incurable it is


essential to detect the illness as early as possible. Therefore, it is desirable to
support and improve the diagnosis process. The six-class-classifier in this paper
does so by having a higher testing accuracy and considering more disease stages
than other state of the art classifiers. In particular, the introduction of the SMC
disease stage, which occurs at the very beginning of Alzheimer’s disease, gives
hope for an early diagnosis of the disease.
The classifier implemented in this paper can be considered highly reliable
and robust in diagnosing six different disease stages of Alzheimer's disease.
This is reflected in its very high accuracy of 99.99% and the very low number
of weak predictions. The most remarkable aspect is that the neural network
appears extremely robust on deviating images, since it predicts 98.62% of
spatially shifted coronal brain slices correctly. This is particularly valuable with
regard to using the implemented classifier in practice, where a slight deviation
in the images is quite possible. Hence, there is a huge potential in artificial
intelligence to support the diagnosis and identification of Alzheimer's disease
stages.
Since the neural network in this paper is not suited for original MRI images,
but for pre-processed ones, a combination of the pre-processing script and the
neural network implemented in this paper is desirable. A graphical user interface
(GUI) could unite these two aspects to enable application in practice. In this case,
medical experts could be supported by the GUI in the diagnosis and identification
of different Alzheimer's disease stages.

References
1. ADNI Homepage. https://www.alz.org/alzheimers-dementia/facts-figures.
Accessed 05 Oct 2020
2. ADNI Homepage. https://www.alz.org/alzheimers-dementia/diagnosis. Accessed
05 Oct 2020
3. ADNI Homepage. https://www.alz.org/alzheimers-dementia/research progress.
Accessed 05 Oct 2020
4. ADNI Homepage. http://adni.loni.usc.edu/study-design/. Accessed 31 Mar 2021
5. Mathew, N.A., Vivek, R.S., Anurenjan, P.R. (eds.): Early diagnosis of alzheimer’s
disease from MRI images using PNN. In: International CET Conference on Control,
Communication, and Computing (IC4). IEEE, Trivandrum (2018)
6. Herrera, L.J., Rojas, I., Pomares, H., et al. (eds): Classification of MRI images for
Alzheimer’s disease detection. In: International Conference on Social Computing.
IEEE, Washington (2013)
7. Baskar, D., Simon, B.C., Dr. Jayanthi, V.S. (eds): Alzheimer’s disease classifica-
tion using deep convolutional neural network. In: 9th International Conference on
Advances in Computing and Communication (ICACC). IEEE, Kochi (2019)
8. Thulasi, N.P.K., Varghese, D. (eds): A novel approach for diagnosing alzheimer’s
disease using SVM. In: 2nd International Conference on Trends in Electronics and
Informatics (ICOEI). IEEE, Tirunelveli (2018)
9. Farooq, A., Anwar, S.M., Awais, M., et al. (eds): A deep CNN based multi-class
classification of alzheimer’s disease using MRI. In: IEEE International Conference
on Imaging Systems and Techniques (IST). IEEE, Beijing (2017)
10. Valenzuela, O.: Multi-objective genetic algorithms to find most relevant volumes
of the brain related to alzheimer’s disease and mild cognitive impairment. Int. J.
Neural Syst. 28(09), 1850022 (2018)
11. Goschke, T.: VL Kognitionspsychologie: Denken, Problemlösen, Sprache - Meth-
oden der Kognitiven Neurowissenschaft: Kurze Einführung in die funktionelle
Bildgebung (Summer 2013). Technical University Dresden, Dresden (2013)
12. He, K., Zhang, X., Ren, S., et al.: Deep residual learning for image recognition. Cornell University (2015)
Detection of Tumor Morphology
Mentions in Clinical Reports in Spanish
Using Transformers

Guillermo López-García1(B), José M. Jerez1, Nuria Ribelles2, Emilio Alba2, and Francisco J. Veredas1
1 Departamento de Lenguajes y Ciencias de la Computación, Universidad de Málaga, 29071 Málaga, Spain
[email protected]
2 Unidad de Gestión Clínica Intercentros de Oncología, Instituto de Investigación Biomédica de Málaga (IBIMA), Hospitales Universitarios Regional y Virgen de la Victoria, 29010 Málaga, Spain

Abstract. The aim of this study is to systematically examine the per-


formance of transformer-based models for the detection of tumor mor-
phology mentions in clinical documents in Spanish. For this purpose, we
analyzed 3 transformer models supporting the Spanish language, namely
multilingual BERT, BETO and XLM-RoBERTa. By means of a transfer-
learning-based approach, the models were first pretrained on a collection
of real-world oncology clinical cases with the goal of adapting trans-
formers to the distinctive features of the Spanish oncology domain. The
resulting models were further fine-tuned on the Cantemist-NER task,
addressing the detection of tumor morphology mentions as a multi-class
sequence-labeling problem. To evaluate the effectiveness of the proposed
approach, we compared the obtained results by the domain-specific ver-
sion of the examined transformers with the performance achieved by
the general-domain version of the models. The results obtained in this
paper empirically demonstrated that, for every analyzed transformer, the
clinical version outperformed the corresponding general-domain model
on the detection of tumor morphology mentions in clinical case reports
in Spanish. Additionally, the combination of the transfer-learning-based
approach with an ensemble strategy exploiting the predictive capabili-
ties of the distinct transformer architectures yielded the best obtained
results, achieving a precision value of 0.893, a recall of 0.887 and an
F1-score of 0.89, which remarkably surpassed the prior state-of-the-art
performance for the Cantemist-NER task.

Keywords: Transformers · Tumor morphology mentions · Natural language processing · Deep learning · Oncology

1 Introduction
There is a significant demand for the automated analysis of the informa-
tion stored in electronic health records (EHRs) to improve patient care.
EHRs contain heterogeneous data whose volume is consistently growing, includ-


ing free-text documents that, using domain-specific vocabulary and terminol-
ogy, store crucial patient information about clinical examinations, radiology
reports, discharge summaries, etc. [2] However, the unstructured nature of the
texts makes it particularly challenging to directly extract the relevant medical
information these documents contain. In this way, there is a pressing need to
automatically transform unstructured clinical text into structured information,
which can subsequently serve as support in clinical decision-making and in opti-
mizing the administrative management of the resources of healthcare services,
improving many aspects of clinical care [3].
According to the World Health Organization (WHO), cancer is a leading
cause of mortality worldwide, producing around 10 million deaths in 2020 [23].
Diagnosis of cancer heavily relies on the pathological examination of tumor sam-
ples obtained from biopsies. The resulting observations made by physicians are
mainly reported in pathology reports, which correspond to clinical free-text doc-
uments stored in EHRs [17]. With the widespread adoption of EHRs as an essen-
tial element in oncology information systems, automatically extracting the infor-
mation contained in cancer-related EHR documents would not only facilitate
pathologists' daily clinical practice, but would also permit large-scale analysis of
the relations between a concrete tumor case and its prognosis, its response to
specific treatments, and many other medical aspects [15].
Traditionally, natural language processing (NLP) techniques have been
applied to clinical notes with the aim of extracting relevant medical informa-
tion from free-text documents [8,16]. More specifically, these techniques have
also been adapted to process oncological textual data, contributing to obtain
structured representations of the information stored in cancer-related docu-
ments [18,29]. However, the majority of the previous works focus exclusively
on medical texts written in English, owing to the limited availability of anno-
tated corpora and additional clinical linguistic resources written in non-English
languages, such as Spanish. With nearly 489 million native speakers, Spanish
is the second most spoken language in the world in terms of number of native
speakers [25]. Given the enormous amount of clinical texts produced in hospitals
from Spanish-speaking countries around the globe, there is a considerable inter-
est both in industry and academia to boost the application of NLP technologies
to medical documents in Spanish.
With the aim of overcoming this issue, last year the CANcer TExt MIning
Shared Task (CANTEMIST) was carried out [15], constituting the first shared
task specifically focused on the development of automatic systems for extract-
ing relevant clinical information from oncology texts in Spanish. In particular,
Cantemist explored the named entity recognition (NER) of tumor morphology
mentions in oncology documents in Spanish. The organizers publicly released
the Cantemist corpus, a collection of 1301 oncological clinical case reports man-
ually annotated with mentions of tumor morphology. Additionally, the tumor
morphology mentions were mapped to a standardized coding vocabulary, specif-
ically the CIE-O—which is the Spanish equivalent of the ICD-O (International
Classification of Diseases for Oncology). Within the Cantemist track, three

different shared subtasks were proposed: Cantemist-NER, Cantemist-NORM


and Cantemist-CODING. Thus, given a free-text oncology document, the
Cantemist-NER task consisted in automatically detecting the tumor morphology
mentions contained in the text, whereas the Cantemist-NORM task additionally
required assigning the corresponding CIE-O codes to the identified mentions.
For its part, the Cantemist-CODING task consisted in assigning a ranked list of
CIE-O codes to each text in the Cantemist corpus.
In this work, we have tackled the problem of automatically detecting tumor
morphology mentions in oncology cases written in Spanish. For this purpose,
we adapted several transformer-based models to the distinctive features of the
Spanish oncology domain. By means of a transfer-learning (TL) approach, the
models were firstly pretrained on a private collection of real-world oncology clin-
ical cases written in Spanish. The resulting models were further fine-tuned on
the Cantemist-NER subtask, addressing the problem as a multi-class sequence-
labeling task. Although previous preliminary works have applied BERT-based
models to the problem of identifying tumor morphology mentions in clinical
documents in Spanish [7,27], to the best of our knowledge this is the first
study that systematically analyzes the performance of different transformer-
based architectures for the problem of tumor morphology mentions detection
using medical texts in Spanish. Following the proposed TL-based strategy, the
transformers analyzed in this work achieved new state-of-the-art (SOTA) per-
formance on the Cantemist-NER subtask. For reproducibility purposes, all the
code needed to replicate our work is publicly available at https://github.com/
guilopgar/TumorMorphNER.

2 Materials and Methods

2.1 Corpora

Galén Oncology Corpus. We further pretrained the transformer-based mod-


els analyzed in this study using a private corpus of de-identified oncology doc-
uments in Spanish retrieved from the Galén Oncology Information System [20].
The corpus corresponds to a compilation of 30.9K real-world clinical cases writ-
ten by oncologists from the Hospital Regional Universitario and the Hospital
Universitario Virgen de la Victoria in Málaga, Spain, comprising a total of
64.4M words and 437.6M characters.

Tumor Morphology Mentions Corpus. We used the Cantemist-NER cor-


pus to fine-tune the models on a tumor morphology mentions detection task. The
corpus comprises 1301 oncological cases written in Spanish, which were manu-
ally annotated by clinical experts with mentions of tumor morphology [15]. The
collection of documents was split into three subsets: the training set, which con-
tains 501 documents and 6396 tumor morphology annotations, the development
set, comprising 500 clinical cases and 6001 annotations, and the test set, con-
taining 300 documents and 3633 annotations. The annotations were distributed

in BRAT standoff format [22]. Hence, for each annotated tumor morphology,
its mention string, its start character offset and its end character offset were
provided (see Fig. 1).

La semiología descrita conjuntamente con la radiología planteaban el


diagnóstico diferencial entre
hepatocarcinoma multifocal sobre hígado sano,
tumor germinal extragonadal hepático y metástasis endovesiculares frente
a metástasis hepáticas y endovesiculares sugestivos de melanoma por la
hipervascularización.
» Ante dichos hallazgos se realiza una PET-TC para estadificación y
búsqueda de un tumor primario, destacando la lesión tumoral sólida
hipermetabólica endovesicular, que se extiende hasta el hilio hepático y el
surco pancreatoduodenal y múltiples lesiones hepáticas hipermetabólicas,
siendo difícil la valoración del tumor primario, aunque dada la ausencia de
dilatación de la vía biliar y pancreática se orienta hacia un probable origen
vesicular como primera opción (menos probablemente del tipo
colangiocarcinoma o duodenal) y probables metástasis hepáticas.

Fig. 1. Illustration of the tumor morphology annotations from the Cantemist-NER


corpus distributed in BRAT format [22], using the cc onco93 clinical document from
the Cantemist-NER development subset.

2.2 Transformer-Based Models

In recent years, contextual embeddings have emerged as a new family of models
capable of creating a numerical representation of a word by considering the
particular context where the word occurs within the text. Among these new
context-aware language models, the Transformer [24] has undoubtedly stood
out as the new deep learning SOTA architecture in the field of NLP. BERT [6],
RoBERTa [13] and XLM-R [5] are examples of transformer-based models that
have become the new SOTA for question answering, text summarization or NER
tasks, also in the field of clinical NLP [1,21,28]. One of the main characteristics
of the Transformer architecture is the self-attention mechanism it uses, which
allows the model to parallelize a large part of the network architecture, increasing
computing efficiency. Additionally, another distinctive feature of transformer-
based models is that they can be pretrained on a general domain corpus and
further fine-tuned on a domain-specific corpus to resolve a particular NLP task,
following a TL approach.
In this study, we have systematically analyzed the performance of transform-
ers on the tumor morphology mentions detection problem in oncology documents
in Spanish. For this purpose, we have examined 3 transformer-based models that
support the Spanish language, namely mBERT [6], BETO [4] and XLM-R [5]. To
the best of our knowledge, the previous 3 models are the only publicly available
transformers including Spanish among their supported languages; a minimal
loading sketch for these three checkpoints is given after the list below.

– mBERT : this multilingual transformer uses the same architecture as the


BERT-Base model [6], employing a multilingual WordPiece [9] vocabulary of
∼110K subwords. The total number of trainable parameters of the model is
∼177M.

– BETO: the Spanish-BERT model uses a similar architecture to the BERT-


Base model [4]. This transformer uses a Spanish vocabulary of ∼31K subto-
kens, and the total number of trainable parameters is ∼110M.
– XLM-R: the multilingual version of the RoBERTa-Base model [13] was pre-
trained following a modified version of the XLM approach [12], using a large
multilingual SentencePiece [10] vocabulary of ∼250K subtokens, and the total
number of trainable parameters is ∼278M.
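As announced above, the following sketch shows how the three analyzed models could be loaded for token classification with the HuggingFace transformers library used in this work; the exact checkpoint identifiers are assumptions about the publicly released versions of these models, not names confirmed by the paper.

from transformers import AutoTokenizer, TFAutoModelForTokenClassification

CHECKPOINTS = {
    "mBERT": "bert-base-multilingual-cased",
    "BETO": "dccuchile/bert-base-spanish-wwm-cased",
    "XLM-R": "xlm-roberta-base",
}

def load_for_ner(name, num_labels=3):
    # Token-classification head with 3 output units for the I, O and B tags.
    # (from_pt=True may be needed if a checkpoint only publishes PyTorch weights.)
    ckpt = CHECKPOINTS[name]
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    model = TFAutoModelForTokenClassification.from_pretrained(ckpt, num_labels=num_labels)
    return tokenizer, model

tokenizer, model = load_for_ner("BETO")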

2.3 CRF
Conditional Random Fields (CRF) [11] have been extensively used for sequence-
labeling tasks, such as the tumor morphology mentions detection problem tackled
in this study. In this paper, we have used a feature-based CRF model as a
competitive baseline with the aim of comparing the performance of transformer-
based models with the results obtained by a standard machine learning (ML)
method on the Cantemist-NER task.

2.4 Transfer-Learning Approach for Automatic Tumor Morphology


Mentions Detection
In this study, we have applied a TL approach to perform the automatic detec-
tion of tumor morphology mentions in Spanish using transformers. Our TL-
based strategy consists of two consecutive phases: firstly, the domain-specific
pretraining of the transformer-based models, and then the subsequent super-
vised fine-tuning of the resulting models. In the next paragraphs, both phases
are described.

Unsupervised Pretraining. The 3 transformer-based models examined in this


work were further pretrained on a collection of unlabeled real-world oncology
clinical cases. Specifically, the two BERT-based models, namely mBERT and
BETO, were pretrained on the basis of the Next Sentence Prediction (NSP)
task and the Masked Language Model (MLM) objective with the Whole-Word
Masking (WWM) modification [6]. On the other hand, the XLM-R model was
optimized using the MLM objective with the dynamic masking modification [5].

Supervised Fine-Tuning. In this study, we tackled the automatic detection


of tumor morphology mentions in Spanish using transformers. In this way, we
addressed this supervised learning task as a multi-class sequence-labeling prob-
lem, using the IOB2 [19] tagging scheme. Since the tumor morphology annota-
tions from the Cantemist-NER corpus were distributed in BRAT standoff format
(see Fig. 1), we firstly converted them into a different format compatible with the
IOB2 scheme. Thus, for each word in a document from the Cantemist-NER cor-
pus, we assigned the label “B” (“Beginning”) if it corresponded to the first word
of a tumor morphology mention, the label “I” (“Inside”) if the word was inside
an annotated mention, or the label “O” (“Outside”) if the word was not part of

any mention. However, transformer-based models do not operate at word-level.


Instead, they further break down words into a sequence of subtokens, each model
using a specific tokenizer, e.g. XLM-R utilizes a SentencePiece tokenizer with a
vocabulary containing ∼250K subwords. In order to effectively leverage the pre-
dictive capabilities of transformers when applied to the Cantemist-NER task, we
have developed a five-phase approach that performs the supervised fine-tuning
of the transformer-based models using a sequence of subtokens annotated with
IOB2 labels as input to the models. Figure 2 shows a visual description of the
developed strategy, each of its five stages is described in the next paragraphs,
and a simplified code sketch of the label-propagation and aggregation steps is
given after the list.

Fig. 2. Workflow of the five-phases strategy developed to both fine-tune and evaluate
the performance of the transformer-based models on the Cantemist-NER task. For
illustration purposes, we used a 3-words fragment of text extracted from the cc onco93
clinical case from the Cantemist-NER development corpus (see Fig. 1) as input to
the model. The WordPiece tokenizer of the mBERT model was used to generate the
subtoken sequence from the input sequence of words. Additionally, the tokenizer added
two special tokens ([CLS] and [SEP]) at the first and last positions, respectively, of
the subwords sequence, which are further ignored by the output layer of the model at
the time of prediction.

1. Subtoken-level annotations. As it was previously specified, transformers


further segment words into a sequence of subtokens. For this reason, we con-
verted the IOB2 word-level annotations to subtoken-level. Thus, for every
word in a document from the Cantemist-NER corpus, its associated IOB2
label was assigned to all subtokens obtained from the same word.
2. Multi-class fine-tuning. Using the resulting Cantemist-NER corpus anno-
tated with IOB2 tags at the subtoken-level, each transformer was fine-tuned
on the automatic detection of tumor morphology mentions task. To perform
the supervised fine-tuning of the whole model architecture on a multi-class
sequence-labeling problem, the output representation encoded by the model
for each subtoken was fed into a final fully-connected layer with 3 softmax
units, representing the “I”, “O” and “B” tags, respectively, of the IOB2
scheme.

3. Subtoken-level predictions. Hence, at inference time, given an input


sequence of subwords as input to the model, the 3-tuple predicted for each
subtoken could be interpreted as the probability of the subtoken being part of
a word “inside” a tumor morphology mention (the “I” label), the probability
of the subword belonging to a word “outside” any tumor morphology men-
tion (the “O” label), and the probability of the subtoken being part of the
“beginning” word of a tumor morphology mention (the “B” tag), respectively.
4. Word-level predictions. From the previous step, a set of IOB2 labels prob-
abilities predicted by the model at the subtoken-level were obtained. How-
ever, in order to evaluate the predictive performance of the models on the
Cantemist-NER task, the models' predictions had to be converted into BRAT
standoff format. Consequently, with the goal of transforming the IOB2 labels
probabilities into BRAT format, we firstly converted the predictions made on
the subtoken-level into word-level predictions. For this purpose, we applied
a maximum probability criterion to the predictions made on the sequence of
subtokens generated from each word. In this way, for the predictions made for
all subtokens obtained from a single word, the criterion consisted in selecting,
for each of the 3 IOB2 labels, the maximum predicted probability across the
corresponding subtokens.
5. Word-level tags. Subsequently, considering the word-level predictions
obtained from the previous step, each word was assigned the IOB2 tag pre-
dicted with the maximum probability. Then, using the IOB2 label associ-
ated to each word, the predictions made by the model were converted into
BRAT format in order to evaluate the performance of the transformer on the
Cantemist-NER task.
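The following simplified sketch illustrates phases 1, 4 and 5 of this workflow, i.e. propagating word-level IOB2 tags to subtokens and recovering word-level tags from subtoken-level probabilities via the maximum-probability criterion; the helper names are ours and not taken from the original code.

import numpy as np

def words_to_subtoken_labels(words, word_labels, tokenizer):
    # Phase 1: every subtoken inherits the IOB2 tag of the word it comes from.
    subtokens, sub_labels, word_ids = [], [], []
    for i, (word, label) in enumerate(zip(words, word_labels)):
        pieces = tokenizer.tokenize(word)
        subtokens.extend(pieces)
        sub_labels.extend([label] * len(pieces))
        word_ids.extend([i] * len(pieces))   # remember the originating word
    return subtokens, sub_labels, word_ids

def subtoken_probs_to_word_tags(sub_probs, word_ids, tags=("I", "O", "B")):
    # Phases 4-5: take, per word and per tag, the maximum probability over
    # the word's subtokens, then keep the tag with the largest value.
    n_words = max(word_ids) + 1
    word_probs = np.zeros((n_words, len(tags)))
    for probs, wid in zip(sub_probs, word_ids):
        word_probs[wid] = np.maximum(word_probs[wid], probs)
    return [tags[int(k)] for k in word_probs.argmax(axis=1)], word_probs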

2.5 Experiments
We implemented our TL approach for tumor morphology mentions detection
in TensorFlow, using the transformers library developed by HuggingFace [26].
For all transformer-based models analyzed in this study, we set a maximum
input sequence length of 128 subwords. However, the majority of the clinical
cases from the Cantemist-NER corpus have a subtoken sequence length clearly
above 128 subwords. This represents a significant constraint when fine-tuning
transformers on the Cantemist-NER task, since, for most of the documents,
their whole sequence of subwords could not be used as input to the model.
To overcome this limitation, we have used the fragment-based segmentation
approach developed in [14]. In this way, each document from the Cantemist-NER
corpus was firstly split into sentences. Then, adjacent sentences were grouped
together in single fragments of text following a greedy strategy, in such a way that
the subtoken sequence length of each fragment did not surpass the maximum
input sequence length supported by the models. Finally, in order to fine-tune the
transformers on the Cantemist-NER subtask, each generated text fragment was
annotated with IOB2 labels at the subtoken-level, as described in the previous
section. On the other hand, in the case of the feature-based CRF model, since no
input sequence length limitation is imposed by this method, for every document

in the Cantemist-NER corpus, its whole sequence of words annotated following


the IOB2 tagging scheme was used to train the model. We used the sklearn-
crfsuite1 library to implement the CRF model, using traditional text mining
features extracted from each word as input to the model, such as suffixes of 2
and 3 characters, boolean features indicating, for example, whether the word
corresponds to a digit, and several features extracting information from nearby
words. Finally, regarding the hardware resources employed, all experiments were
conducted using a single GeForce GTX 1080 Ti GPU.
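A simplified sketch of the greedy sentence-grouping step described above is given below; this is our own rendering of the approach from [14], not the original implementation.

def split_into_fragments(sentences, tokenizer, max_len=128):
    # Greedily pack adjacent sentences into fragments whose subtoken length
    # stays within the model limit; two positions are reserved for the
    # special tokens added by the tokenizer.
    budget = max_len - 2
    fragments, current, current_len = [], [], 0
    for sentence in sentences:
        n = len(tokenizer.tokenize(sentence))
        if current and current_len + n > budget:
            fragments.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        fragments.append(" ".join(current))
    # Note: a single sentence longer than the budget would still exceed it
    # and would need additional handling (e.g. truncation).
    return fragments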

3 Results

Table 1 shows the performance of the 3 transformers on the Cantemist-NER


task, as well as the performance of the baseline feature-based CRF model. For
each transformer-based model, we compared the original general-domain version
with the domain-specific version of the model adapted to the particularities of
the Spanish oncology domain (see Sect. 2.4). The official evaluation metrics of
the Cantemist-NER task [15]—precision, recall and F1-score—were employed
to evaluate the predictive performance of the models. For each transformer,
we fine-tuned 5 distinct randomly initialized instances. Comparing the perfor-
mance of the baseline model with the results obtained by the transformer-based
models, each transformer analyzed in this study significantly outperformed the
feature-based CRF model for each of the three classification metrics described
in Table 1. Among all models, mBERT-Galén, BETO-Galén and XLM-R-Galén
achieved the best performance according to each classification metric, with
the two domain-specific multilingual transformers—mBERT-Galén and XLM-
R-Galén—obtaining identical average values for each metric, namely a mean
precision of 0.867, an average recall of 0.869 and a mean F1-score of 0.868. On
its part, the BETO-Galén model also obtained the same average F1-score of
0.868, but a slightly lower average recall (0.865) and a slightly greater average
value for precision (0.872). Compared with the general-domain transformers, the
domain-specific version of the models improved the performance for the detec-
tion of tumor morphology mentions in clinical reports in Spanish. In this way, for
each transformer-based model, the clinical-domain version of the model outper-
formed the general-domain version in terms of the average values obtained for
each classification metric. Finally, when comparing the results obtained in this
study with the previously reported SOTA results, new SOTA performance was
achieved according to the maximum values of each metric. Thus, the XLM-R-
Galén model obtained a maximum precision value of 0.881, as well as a maximum
recall of 0.878, exceeding the prior SOTA performance reported by the organiz-
ers of the Cantemist-NER task for each of the former two metrics [15]. In the
case of the F1-score, the mBERT-Galén model surpassed the previous SOTA
performance, obtaining a maximum value of 0.876.

1 https://sklearn-crfsuite.readthedocs.io/.

Table 1. Models performances on the Cantemist-NER test set. The distribution of


the precision, recall and F1-score values obtained by the 5 distinct fine-tuned instances
of each model is described, by reporting the mean, standard deviation and maximum
values. For the maximum values column of every metric, the best obtained result is
bolded, while the second best is underlined.

Model Precision Recall F1-score


Mean ± Std Max Mean ± Std Max Mean ± Std Max
Baseline-CRF – .815 – .774 – .794
mBERT .85 ± .009 .861 .854 ± .007 .862 .852 ± .004 .858
mBERT-Galén .867 ± .008 .876 .869 ± .007 .877 .868 ± .004 .876
BETO .85 ± .006 .859 .858 ± .008 .869 .854 ± .004 .856
BETO-Galén .872 ± .008 .88 .865 ± .004 .869 .868 ± .002 .87
XLM-R .846 ± .014 .861 .858 ± .006 .863 .852 ± .005 .858
XLM-R-Galén .867 ± .009 .881 .869 ± .006 .878 .868 ± .003 .874
Prior SOTA – .871 – .871 – .87

3.1 Ensemble

Additionally, we proposed an ensemble approach to combine the different IOB2
label predictions made by the models at word-level. Hence, given a sequence
of W words, as a result of fine-tuning 5 different instances of each model, the
fourth stage of our proposed workflow for performing tumor morphology mentions
detection outputted 5 distinct IOB2 label probability matrices of dimension W × 3
(see Fig. 2) for a single transformer model. To merge these matrices
into a single probability matrix, the proposed ensemble strategy consisted in
performing the element-wise product of the 5 different matrices. Furthermore,
our ensemble approach could also be employed to merge the IOB2 label predictions
made by any number of different transformers, by simply performing the
element-wise multiplication of all word-level IOB2 label probability matrices
obtained from the distinct models.
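A minimal sketch of this merging step (the variable names are illustrative):

import numpy as np

def ensemble_word_probs(prob_matrices):
    # Element-wise product of word-level IOB2 probability matrices
    # (each of shape W x 3) coming from different fine-tuned instances
    # and/or different transformer architectures.
    merged = np.ones_like(prob_matrices[0])
    for matrix in prob_matrices:
        merged = merged * matrix
    return merged

# The ensembled IOB2 tag of each word is then the column with the largest
# merged score, exactly as in the single-model case:
# word_tags = merged.argmax(axis=1)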
Table 2 describes the performance of our ensemble approach, applied both to
merge the word-level probabilities predicted by the 5 fine-tuned instances of a
single model and to merge the word-level predictions made by multiple distinct
transformers. The ensemble combining the word-level predictions of the 3
transformer-based models adapted to the Spanish oncology domain—mBERT-Galén +
BETO-Galén + XLM-R-Galén—achieved the best performance among all models
examined in this work, obtaining a precision value of 0.893, a recall of 0.887
and an F1-score of 0.89, which remarkably surpassed the prior SOTA performance
according to each classification metric.

Table 2. Ensemble model performances on the Cantemist-NER test subset, according to the precision, recall and F1-score metrics. For each metric, the best obtained result is bolded, while the second best is underlined.

Ensemble                                 | Precision | Recall | F1-score
mBERT                                    | .873      | .872   | .872
mBERT-Galén                              | .885      | .881   | .883
BETO                                     | .876      | .873   | .875
BETO-Galén                               | .883      | .873   | .878
XLM-R                                    | .868      | .874   | .871
XLM-R-Galén                              | .887      | .879   | .883
mBERT + mBERT-Galén                      | .881      | .876   | .879
BETO + BETO-Galén                        | .887      | .878   | .882
XLM-R + XLM-R-Galén                      | .883      | .88    | .882
mBERT + BETO + XLM-R                     | .882      | .876   | .879
mBERT-Galén + BETO-Galén + XLM-R-Galén   | .893      | .887   | .89
Prior SOTA                               | .871      | .871   | .87

4 Conclusion
In this work, we systematically examined the performance of 3 transformer-
based models to perform the detection of tumor morphology mentions in clinical
documents in Spanish. Using a TL-based strategy, the transformers were first
adapted to the particularities of the Spanish oncology domain by pretraining
the models on a real-world corpus of oncology clinical cases written in Spanish.
Subsequently, the resulting models were fine-tuned on the Cantemist-NER cor-
pus, following a multi-class sequence-labeling approach. For each analyzed trans-
former, the domain-specific version outperformed the general-domain version of
the model on the Cantemist-NER task. Finally, the combination of the TL-based
approach with an ensemble strategy that exploited the predictive capacities of
the 3 different transformers yielded the best results, which noticeably
improved the prior SOTA performance on the Cantemist-NER task. In future
work, given the promising results obtained in this paper, we will try to extend
the TL-based methodology to other downstream medical NLP tasks in
Spanish using transformers, such as the de-identification of a real-world clinical
corpus or the NER of cancer prognostic factors in medical records.

Acknowledgments. This work was partially supported by the project PID2020-


116898RB-I00, Ministerio de Ciencia e Innovación, Plan Nacional de I+D+i, the project
UMA-CEIATECH-01, Andalucı́a TECH, and the I Plan Propio de Investigación, Trans-
ferencia y Divulgación Cientı́fica of the Universidad de Málaga.

References
1. Alsentzer, E., et al.: Publicly available clinical BERT embeddings. In: Proceedings
of the 2nd Clinical Natural Language Processing Workshop, pp. 72–78. Association
for Computational Linguistics, Minneapolis, June 2019
2. Baumann, L.A., Baker, J., Elshaug, A.G.: The impact of electronic health record
systems on clinical documentation times: a systematic review. Health Policy
122(8), 827–836 (2018)
3. Bronnert, J.: Preparing for the CAC transition. J. AHIMA 82(7), 60–1; quiz 62
(2011)
4. Cañete, J., Chaperon, G., Fuentes, R., Ho, J.H., Kang, H., Pérez, J.: Spanish
pre-trained BERT model and evaluation data. In: Practical ML for Developing
Countries Workshop@ ICLR 2020 (2020)
5. Conneau, A., et al.: Unsupervised cross-lingual representation learning at scale.
arXiv [cs.CL], November 2019
6. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep
bidirectional transformers for language understanding. arXiv [cs.CL], October 2018
7. Garcı́a-Pablos, A., Perez, N., Cuadros, M.: Vicomtech at CANTEMIST 2020. In:
Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020), pp. 489–
498. CEUR Workshop Proceedings (2020)
8. Hughes, M., Li, I., Kotoulas, S., Suzumura, T.: Medical text classification using
convolutional neural networks. Stud. Health Technol. Inform. 235, 246–250 (2017)
9. Johnson, M., et al.: Google’s multilingual neural machine translation system:
enabling Zero-Shot translation. Trans. Assoc. Comput. Linguist. 5, 339–351 (2017)
10. Kudo, T., Richardson, J.: SentencePiece: a simple and language independent sub-
word tokenizer and detokenizer for neural text processing. arXiv [cs.CL], August
2018
11. Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: proba-
bilistic models for segmenting and labeling sequence data. In: Proceedings of the
Eighteenth International Conference on Machine Learning. ICML 2001, pp. 282–
289. Morgan Kaufmann Publishers Inc., San Francisco, June 2001
12. Lample, G., Conneau, A.: Cross-lingual language model pretraining. arXiv [cs.CL]
(2019)
13. Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv
[cs.CL] (2019)
14. López-Garcı́a, G., Jerez, J.M., Ribelles, N., Alba, E., Veredas, F.J.: ICB-UMA at
CANTEMIST 2020: automatic ICD-O coding in Spanish with BERT. In: Proceed-
ings of the Iberian Languages Evaluation Forum (IberLEF 2020), pp. 468–476.
CEUR Workshop Proceedings (2020)
15. Miranda-Escalada, A., Farré-Maduell, E., Krallinger, M.: Named entity recogni-
tion, concept normalization and clinical coding: overview of the cantemist track
for cancer text mining in Spanish, corpus, guidelines, methods and results. In:
Iberian Languages Evaluation Forum (IberLEF 2020), pp. 303–323. CEUR Work-
shop Proceedings, Málaga, Spain (2020)
16. Mujtaba, G., et al.: Clinical text classification research trends: systematic literature
review and open issues. Expert Syst. Appl. 116, 494–520 (2019)
17. National Cancer Institute: How Cancer Is Diagnosed (2019). https://www.cancer.
gov/about-cancer/diagnosis-staging/diagnosis. Accessed 23 Apr 2021
18. Qiu, J.X., Yoon, H.J., Fearn, P.A., Tourassi, G.D.: Deep learning for automated
extraction of primary sites from cancer pathology reports. IEEE J. Biomed. Health
Inform. 22(1), 244–251 (2018)

19. Ramshaw, L.A., Marcus, M.P.: Text chunking using transformation-based learn-
ing. In: Armstrong, S., Church, K., Isabelle, P., Manzi, S., Tzoukermann, E.,
Yarowsky, D. (eds.) Natural Language Processing Using Very Large Corpora. Text,
Speech and Language Technology, vol. 11, pp. 157–176. Springer, Dordrecht (1999).
https://doi.org/10.1007/978-94-017-2390-9 10
20. Ribelles, N., et al.: Galén: Sistema de información para la gestión y coordinación
de procesos en un servicio de oncologı́a. RevistaeSalud 6(21), 1–12 (2010)
21. Si, Y., Wang, J., Xu, H., Roberts, K.: Enhancing clinical concept extraction with
contextual embeddings. J. Am. Med. Inform. Assoc. 26(11), 1297–1304 (2019)
22. Stenetorp, P., Pyysalo, S., Topić, G., Ohta, T., Ananiadou, S., Tsujii, J.: BRAT: a
Web-based Tool for NLP-assisted text annotation. In: Proceedings of the Demon-
strations at the 13th Conference of the European Chapter of the Association for
Computational Linguistics, pp. 102–107. Association for Computational Linguis-
tics, Avignon, April 2012
23. Sung, H., et al.: Global cancer statistics 2020: GLOBOCAN estimates of incidence
and mortality worldwide for 36 cancers in 185 countries. CA: Cancer J. Clin. 71,
209–249 (2021)
24. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information
Processing Systems 30, pp. 5998–6008. Curran Associates, Inc. (2017)
25. Vı́tores, D.F.: El español: una lengua viva. Informe 2020. Instituto Cervantes (2020)
26. Wolf, T., et al.: HuggingFace’s transformers: state-of-the-art natural language pro-
cessing. arXiv [cs.CL], October 2019
27. Xiong, Y., Huang, Y., Chen, Q., Wang, X., Nic, Y., Tang, B.: A joint model for
medical named entity recognition and normalization. In: Proceedings of the Iberian
Languages Evaluation Forum (IberLEF 2020), pp. 499–504. CEUR Workshop Pro-
ceedings (2020)
28. Yang, X., Bian, J., Hogan, W.R., Wu, Y.: Clinical concept extraction using trans-
formers. J. Am. Med. Inform. Assoc. 27(12), 1935–1942 (2020)
29. Zhu, F., et al.: Biomedical text mining and its applications in cancer research. J.
Biomed. Inform. 46(2), 200–211 (2013)
Enforcing Morphological Information
in Fully Convolutional Networks
to Improve Cell Instance Segmentation
in Fluorescence Microscopy Images

Willard Zamora-Cárdenas1 , Mauro Mendez1 , Saul Calderon-Ramirez1,2 ,


Martin Vargas1 , Gerardo Monge1 , Steve Quiros3 , David Elizondo2 ,
Jordina Torrents-Barrena4 , and Miguel A. Molina-Cabello5,6(B)
1 Computing School, Costa Rica Institute of Technology, Cartago, Costa Rica
[email protected]
2 Department of Computer Technology, De Montfort University, Leicester, UK
3 Tropical Diseases Research Center, Microbiology Faculty, University of Costa Rica, San Jose, Costa Rica
4 Department of Computer Engineering and Mathematics, Rovira i Virgili University, Tarragona, Spain
5 Department of Computer Languages and Computer Science, University of Malaga, Málaga, Spain
[email protected]
6 Instituto de Investigación Biomédica de Málaga – IBIMA, Malaga, Spain

Abstract. Cell instance segmentation in fluorescence microscopy


images is becoming essential for cancer dynamics and prognosis. Data
extracted from cancer dynamics allows to understand and accurately
model different metabolic processes such as proliferation. This enables
customized and more precise cancer treatments. However, accurate cell
instance segmentation, necessary for further cell tracking and behav-
ior analysis, is still challenging in scenarios with high cell concentration
and overlapping edges. Within this framework, we propose a novel cell
instance segmentation approach based on the well-known U-Net archi-
tecture. To enforce the learning of morphological information per pixel,
a deep distance transformer (DDT) acts as a back-bone model. The
DDT output is subsequently used to train a top-model. The following
top-models are considered: a three-class (e.g., foreground, background
and cell border) U-net, and a watershed transform. The obtained results
suggest a performance boost over traditional U-Net architectures. This
opens an interesting research line around the idea of injecting morpho-
logical information into a fully convolutional model.

This work is partially supported by the following Spanish grants: TIN2016-75097-P,


RTI2018-094645-B-I00 and UMA18-FEDERJA-084. All of them include funds from the
European Regional Development Fund (ERDF). The authors acknowledge the fund-
ing from the Universidad de Málaga and the Instituto de Investigación Biomédica de
Málaga - IBIMA.

Keywords: Convolutional neural networks · Cell segmentation ·


Medical image processing · Deep learning

1 Introduction

The application of new image processing techniques is a burgeoning trend in


life sciences such as biology, chemistry, medicine, among others. Their imple-
mentation includes object measurement, 3D space exploration (e.g., magnetic
resonance/positron emission tomography) for surgical planning, dynamic pro-
cess analysis for time-lapse imaging in cell growth, movement and prolifera-
tion [5–7,16,31], detection and classification of blood cells [19], cancer diagnosis,
histopathology and detection of multiple diseases [1,3,8,9,20,21,30,32].
Accurate cell segmentation is crucial for robust heterogeneous cell dynamics
quantification (e.g., mitotic activity detection), tracking and classification, which
are often implemented as subsequent higher-level stages. For instance, intra-
tumoral heterogeneity contributes to drug resistance and cancer lethality [17].
In cancer research, cell biologists aim to monitor single-cell changes in response
to chemotherapies. Indeed, given the relevance of malignant cell proliferation,
the aforementioned changes need to be rigorously tracked along the progeny of
cancer cells through time-lapse microscopy. Manual cell segmentation is time-
consuming, prone to human subjective variation and biased by medical devices,
making (semi-)automatic cell segmentation approaches appealing.
Fully automatic segmentation of cell instances is widely tackled in the literature
(see Sect. 2). Image segmentation is defined as the assignment of a class or label to
a pixel (i.e., pixel-wise classification). More specifically, the problem of assigning
one out of multiple classes to a pixel is known as semantic segmentation. As for the
problem of assigning a different label to a pixel for different instances of a semantic
class, it is known as instance segmentation [15]. The latter is more challenging and
provides useful information in several application domains [2].
Cell counting solutions, a common application for automatic image analysis,
often avoid instance segmentation. For example, Weidi Xie et al. [28] used
counting estimation in cell clusters and a Convolutional Neural Network (CNN)
for cell density estimation. However, precise instance detection is essential in cell
tracking and individual cell behavior analysis [5].
Cell instance segmentation faces different challenges, such as intensity saturation
and overlapping edges [2], that can hinder segmentation accuracy. Noise and
poor contrast caused by variable molecule staining concentration are common
drawbacks in cell imaging. Furthermore, the low number of manually annotated
samples is a frequent limitation, especially for CNN-based approaches, as experts'
knowledge is crucial to generate ground-truth data. Dealing with unbalanced
datasets is another recurrent issue, since foreground and overlapping cell pixels
are much fewer than background pixels [12].
In this paper, we devise a novel CNN architecture based on the popular U-Net
[24] and the distance transform morphological operator [11] to jointly improve
instance segmentation accuracy. Section 2 presents related work in the field of
cell instance segmentation. Later, the proposed method is detailed in Sect. 3.

Fig. 1. Rectangular mask (left) and its original distance transform (right). The inverse
of the distance transform is subsequently computed to provide high intensities to border
pixels.

Section 4 briefly describes the methods and datasets used to subsequently define
the experiments. Results are discussed in Sect. 5. Finally, conclusions and future
lines of research are described in Sect. 6.

2 Related Work
Image segmentation is a well-studied and documented problem among the
image processing, pattern recognition and machine learning communities. Sev-
eral methods have been proposed for cell segmentation [18], which rely on thresh-
olding [23] or active contours [29] approaches, among others.
Additionally, the Fully Convolutional Network (FCN) [15] inspired U-net,
which feeds up-sampling layers with the output of different down-sampling layers,
enforcing global information at fixed scales [24].
In line with these approaches to mixing local and global information, in
this work we feed subsequent layers with data from previous convolutions
at different scales and inject morphological information directly at the
model input. This way, our proposal enforces both local and global information
by learning morphological transformations from the data.

3 Proposed Method
Our segmentation method is based on the well-known U-net architecture [24].
We propose to enforce cell morphological information in a CNN by estimating
the inverse of the distance transform. Note that the original distance transform
calculates the closest Euclidean distance to a background pixel for each fore-
ground pixel. Figure 1 shows this transformation applied to a rectangular mask,
where border pixels have lower intensities. Differently, the inverse of the distance
transform is thus computed to provide high intensities to border pixels.
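As an illustration of how such a training target can be derived, the sketch below (our own approximation, not the paper's code) uses scipy.ndimage.distance_transform_edt and inverts the distances within the foreground so that border pixels receive the highest values; the exact inversion formula used by the authors is not specified and is assumed here.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def inverse_distance_transform(binary_mask):
    """Compute an inverse Euclidean distance transform of a binary mask so that
    border pixels get high values and pixels deep inside the object low ones."""
    dist = distance_transform_edt(binary_mask)       # distance to the nearest background pixel
    inv = np.zeros_like(dist)
    fg = binary_mask > 0
    if fg.any():
        inv[fg] = dist[fg].max() - dist[fg] + 1      # invert inside the foreground only
    return inv

# Example on a rectangular mask (cf. Fig. 1)
mask = np.zeros((64, 64), dtype=np.uint8)
mask[16:48, 8:56] = 1
target = inverse_distance_transform(mask)
```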
Specifically, an encoder-decoder classifier with a Rectified Linear Unit
(ReLU) activation is employed. This classifier is fed with fluorescence microscopy

Table 1. Fine-tuned parameters for all models.

Parameter                  | UNet1  | DDT    | DDT + UNet1 | DDT + UNet2
Training Time (m/epoch)    | ∼3.64  | ∼3.61  | ∼10.50      | ∼7.53
Prediction Time (m/epoch)  | ∼0.35  | ∼0.34  | ∼1.42       | ∼0.66
Batch Size                 | 10     | 10     | 10          | 10
Training Size              | 3690   | 3690   | 3690        | 3690
Validation Size            | 930    | 930    | 930         | 930
Learning Rate              | 0.001  | 0.001  | 0.001       | 0.001
Loss Function              | CE     | MAE    | CE          | CE
Optimizer                  | Adam   | Adam   | Adam        | Adam

images of cells. The proposed Euclidean distance estimator, named Deep Distance
Transformer (DDT) or backbone model, learns the inverse distance transform of the
binary instance mask of the input image. The DDT encourages the U-Net
to learn morphological information from the image.
The following top-models are integrated with the DDT backbone model:
1. DDT + UNet1 : The DDT output is fed to a three-class (i.e., foreground,
background or border pixels) predictor using the Border Transform Ground-
truth (BTGT).
2. DDT + UNet2 : Considering that DDT + UNet1 could ignore relevant texture
information, an extra pipeline with two information channels is included.
The first and second channels are the DDT output and the original image,
respectively (a minimal sketch of this input construction is shown below).
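The following fragment sketches how the two-channel input of DDT + UNet2 could be assembled, assuming channels-last tensors and 256 × 256 tiles scaled to [0, 1]; the channel order and shapes are assumptions, not taken from the paper.

```python
import numpy as np

def build_two_channel_input(image, ddt_output):
    """Stack the DDT prediction and the original image as the two input
    channels of the DDT + UNet2 top-model (assumed 256 x 256 tiles in [0, 1])."""
    x = np.stack([ddt_output, image], axis=-1)   # shape: (256, 256, 2)
    return x[np.newaxis, ...]                    # add a batch dimension for the network
```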
Table 1 shows the parameters used to train our proposed U-Net based
pipelines with an average log of the training and testing times in minutes per
epoch.
An advantage of our approach over [12] is the stability and speed provided
during the training stage. In our experiments, the loss functions proposed by
[12] hinder the model convergence. Our methodology is similar to [10], as
morphological pre-processing is also incorporated. Differently, the presented model
learns enriched image textures using pixel-wise labeled data. Instead of performing
morphological operations directly on the input data [10], we train a U-Net
architecture with the distance transform of the ground-truth, thus allowing the
model to estimate the same transform for all unlabeled samples.

4 Datasets
The Broad Bioimage Benchmark Collection BBBC006v1 dataset [14] is used in
this study. It consists of Human U2OS cells marked with Hoechst 33342. The
Hoechst 33342 is a cellular marker widely employed to stain genomic DNA in
fluorescence microscopy images. Due to its staining prowess, the Hoechst 33342

Fig. 2. BBBC006 original image with enhanced contrast (left) and its corresponding
ground-truth (right).

Fig. 3. Original ground-truth image (left), DTGT image (middle) and BTGT image
(right).

is commonly used to visualize nuclei and mitochondria. It is also widely used


in diagnostics, apoptosis, single nucleotide polymorphism, nuclei acid quantifi-
cation, epilepsy and cancer behavior analysis [25].
This dataset is composed of 768 images with a resolution of 696×520 pixels in
16-bit TIF format. Each sample includes the corresponding ground-truth image,
encoding background pixels with 0. Foreground pixels use different values for
labeling each cell instance. To train our models, all samples were tiled to 256×256
pixels (6 tiles per sample = 4608 images). Figure 2 shows an original image of
the dataset, with its corresponding ground-truth and a copy of it with enhanced
contrast to visualize the image content.
Two transformations were also applied to the ground-truth. The first trans-
formation referred to the distance transform (Distance Transform Ground-truth
(DTGT)), while the second generated information corresponding to the border
between instances (BTGT). Specifically, BTGT marked pixels as borders if their
3 × 3 neighborhood contained elements from the outline of another instance.
Figure 3 displays the original ground-truth, DTGT and BTGT images.
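The following sketch illustrates one way to derive the DTGT and BTGT from the instance-labeled ground truth; it marks as border any foreground pixel whose 3 × 3 neighborhood touches a different instance, which is our reading of the criterion above rather than the authors' exact implementation.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt, grey_dilation, grey_erosion

def make_dtgt_btgt(instance_gt):
    """Derive the distance-transform ground truth (DTGT) and the border ground
    truth (BTGT) from an instance mask (0 = background, k > 0 = cell instance k)."""
    foreground = instance_gt > 0
    dtgt = distance_transform_edt(foreground)              # DTGT: distance to the background

    # BTGT: a foreground pixel becomes a border pixel when its 3x3 neighborhood
    # also contains a pixel of a different instance (simplified criterion).
    sentinel = int(instance_gt.max()) + 1
    labels_or_sentinel = np.where(foreground, instance_gt, sentinel)
    max_label = grey_dilation(instance_gt, size=(3, 3))          # largest label in the neighborhood
    min_label = grey_erosion(labels_or_sentinel, size=(3, 3))    # smallest positive label nearby
    border = foreground & (max_label != min_label)

    btgt = np.zeros(instance_gt.shape, dtype=np.uint8)     # 0 = background
    btgt[foreground] = 1                                   # 1 = foreground
    btgt[border] = 2                                       # 2 = border
    return dtgt, btgt
```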

5 Experiments and Results


We used a vanilla three-class U-net architecture to assess the DDT improvement
over the baseline classification. This model was trained with both the original
dataset and the BTGT.

Table 2. Segmentation accuracy in terms of the boundary displacement error (BDE), WDMC and F1-score.

Pipeline          | BDE           | WDMC          | F1
UNet1 (baseline)  | 0.628 ± 0.199 | 0.930 ± 0.015 | 0.946 ± 0.006
DDT + UNet1       | 0.821 ± 0.149 | 0.924 ± 0.006 | 0.922 ± 0.010
DDT + UNet2       | 0.614 ± 0.215 | 0.938 ± 0.004 | 0.938 ± 0.004

The DDT was trained for 50 epochs through the Mean Absolute Error (MAE)
loss function. Empirical tests demonstrated that the MAE provides higher accu-
racy than the Mean Squared Error (MSE). The U-net1 (baseline), DDT + UNet1
and DDT + UNet2 models were trained end-to-end using the same training
observations, and the cross-entropy loss. As aforementioned, the BBBC006v1
dataset was utilized to train both U-net1 (baseline) and DDT models. The DDT
output was automatically fed into the DDT + UNet1 /UNet2 top-models. The
DDT + UNet2 approach used both feature extractors to enforce texture and
morphological information. A repository with the code can be found on its website1.
Inverse distance transformed images were normalized within the range [0, 1].
All models were trained using 5-fold cross validation. The number of training
and testing images for each fold was 3687 and 921, respectively.
Edge estimation is crucial for cell instance segmentation, thus specific border
accuracy estimation metrics were selected. The Boundary Displacement Error
(BDE) [27] averages the displacement error of boundary pixels between two
images, in which lower values indicate better matches. The Weighted Dice Multi-class
Coefficient (WDMC) relies on the Sørensen–Dice similitude index [4] (overlapping),
so values closer to 1 denote greater similitude. The WDMC was computed by
calculating the Dice coefficient of each class separately and performing a weighted
sum. In this work, more relevance was given to the correct prediction of border
pixels by using the following weights, found experimentally: Foreground: 0.3,
Background: 0.3 and Border: 0.4.
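A sketch of the weighted Dice computation as we read it from the description above (per-class Dice followed by the weighted sum); the handling of classes absent from an image is an assumption.

```python
import numpy as np

def wdmc(y_true, y_pred, weights=(0.3, 0.3, 0.4)):
    """Weighted Dice Multi-class Coefficient: per-class Sørensen–Dice index
    combined with the class weights used in this work; classes are assumed to
    be encoded as 0 = background, 1 = foreground, 2 = border."""
    score = 0.0
    for cls, w in enumerate(weights):
        t = (y_true == cls)
        p = (y_pred == cls)
        denom = t.sum() + p.sum()
        dice = 2.0 * np.logical_and(t, p).sum() / denom if denom > 0 else 1.0
        score += w * dice
    return score
```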
Table 2 shows a comparison of the tested pipelines. The average and standard
deviation are reported for each metric. Results show that the DDT + UNet2 model
yields a slightly higher accuracy than the UNet1 (baseline) model for both the BDE
and WDMC measures. In our empirical tests, these metrics proved to be more
sensitive to edge estimation performance than more common metrics such as the
F1-score and intersection over union. Table 3 shows the average precision, recall
and F1 score of each of the proposed pipelines.

1 https://github.com/wizaca23/BBBC006-Instance-Segmentation.

Table 3. Precision, recall and F1 scores for each U-Net pipeline.

Metric    | UNet1         | DDT + UNet1   | DDT + UNet2
Precision | 0.944 ± 0.008 | 0.923 ± 0.013 | 0.931 ± 0.006
Recall    | 0.950 ± 0.007 | 0.928 ± 0.013 | 0.948 ± 0.002
F1        | 0.946 ± 0.006 | 0.922 ± 0.010 | 0.938 ± 0.004

Table 4. p-values of the Wilcoxon test.

        | BDE     | WDMC          | F1
P-value | 0.02621 | 1.3261 × 10−5 | 2.3498 × 10−5

The DDT + UNet1 model performs slightly better than the UNet1 (base-
line) according to the WDMC, but it has a lower BDE accuracy. The DDT +
UNet2 model outperforms the others, which suggests that the combination of
texture information and inverse distance transform improves the overall model
performance.
To assess the statistical significance of the results, we performed a Wilcoxon
matched-pairs test. It helps to corroborate the similarity or difference between
the UNet1 (baseline) model and the DDT + UNet1/UNet2 model performance.
Specifically, we generated 26 random samples from a set of 921 testing images
(each with 300 images). Testing images were extracted from the same validation
fold, so they were not used for training. Note that we selected the fold
where both models performed better. The Wilcoxon test was performed for the
BDE, WDMC and F1 metrics. The null hypothesis was that both models perform
equally well. Table 4 shows the p-values computed for each metric with a
significance level of 0.05. They indicate that there is a significant
difference between both models with p < 0.05. The BDE result is close to not being
statistically conclusive; however, the WDMC presents a much stronger statistical
difference.
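For reference, the matched-pairs comparison can be reproduced with scipy.stats.wilcoxon as in the sketch below; the metric values used here are random placeholders, not the paper's data.

```python
import numpy as np
from scipy.stats import wilcoxon

# Paired per-sample metric values (e.g., WDMC) of the two models over the
# 26 random samples of 300 test images each; placeholder values only.
rng = np.random.default_rng(1)
wdmc_baseline = rng.normal(0.930, 0.005, size=26)
wdmc_ddt_unet2 = rng.normal(0.938, 0.004, size=26)

# Two-sided Wilcoxon signed-rank (matched-pairs) test on the paired differences.
stat, p_value = wilcoxon(wdmc_baseline, wdmc_ddt_unet2)
print(f"Wilcoxon statistic = {stat:.3f}, p-value = {p_value:.5f}")
```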

6 Discussion and Conclusions


The proposed method outperforms the UNet1 (baseline), particularly when border
prediction metrics (e.g., BDE and WDMC) are computed. Both metrics are more
relevant and appropriate to measure segmentation accuracy, as tracking methods
usually calculate the centroid based on the object edges, and post-processing
can correct any holes in the object of interest. Using the Dice coefficient or
the F1-score gives the same weight to all the pixels, which can lead to misleading
interpretations of the image segmentation performance. However, the DDT output
seems to be very noise sensitive (see Fig. 4). The DDT often amplifies noise when
few cells are present, thus hindering the top-model predictions (see Fig. 5).
These faulty predictions due to the amplified noise might degrade the benefit of
including morphological information.

Fig. 4. Original image with enhanced contrast (left), prediction output (middle), and
current DTGT (right).

Fig. 5. Segmentations obtained from the U-Net1 (baseline) model (first), DDT + U-
Net1 (second ), DDT + U-Net2 (third ) and BTGT (fourth). The DDT + U-Net2 out-
performs the others due to the reduced number of false positives compared with the
U-Net1 (baseline) and DDT + U-Net1 models, respectively.

Adding the original image to the top-model along with the inverse distance
transform seems to improve the DDT output accuracy. Moreover, it also
boosts the overall model performance thanks to the texture information provided.
Our results also suggest the possibility of using the DDT output in a
post-processing step in order to remove artifacts. The Wilcoxon statistical test
was critical to properly analyse the significance of the obtained results.
The injection of morphological knowledge can benefit deep CNN architectures
to enhance instance segmentation accuracy [10]. Differently from [12], in which
morphological information was used to weight the loss function, our model still
remains simple and easy to train. The proposed back-bone model learns the
inverse distance transform by using pixel-wise labeling and adding instance-wise
morphological information. The work in [10] utilized simple and traditional pre-
processing methods.
The behavior of other CNNs [13] after injecting morphological information
will be addressed in the future. Indeed, additional experiments with different
datasets and CNNs are required to provide further evidence regarding the advan-
tages of enriching the image with instance-wise morphological information. To
reduce the labeling dependency, semi-supervised learning [22], data augmen-
tation techniques and generative models [26] will be employed to enlarge our
dataset and explore the advantages of adding morphological information.

References
1. Alfaro, E., Fonseca, X.B., Albornoz, E.M., Martı́nez, C.E., Calderon-Ramirez, S.:
A brief analysis of u-net and mask R-CNN for skin lesion segmentation. In: 2019
IEEE International Work Conference on Bioinspired Intelligence (IWOBI), pp.
000123–000126. IEEE (2019)
2. Bai, M., Urtasun, R.: Deep watershed transform for instance segmentation. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 5221–5229 (2017)
3. Bermudez, A., et al.: A first glance to the quality assessment of dental photostim-
ulable phosphor plates with deep learning. In: 2020 International Joint Conference
on Neural Networks (IJCNN), pp. 1–6. IEEE (2020)
4. Bernard, O., Friboulet, D., Thevenaz, P., Unser, M.: Variational B-spline level-set:
a linear filtering approach for fast deformable model evolution. IEEE Trans. Image
Process. 18, 1179–1191 (2009)
5. Calderon-Ramirez, S., Saenz, A., Mora, R., Siles, F., Orozco, I., Buemi, M.:
DeWAFF: a novel image abstraction approach to improve the performance of a
cell tracking system. In: 2015 4th International Work Conference on IEEE Bioin-
spired Intelligence (IWOBI), pp. 81–88 (2015)
6. Calderon-Ramirez, S., Moya, D., Cruz, J.C., Valverde, J.M.: A first glance on the
enhancement of digital cell activity videos from glioblastoma cells with nuclear
staining. In: 2016 IEEE 36th Central American and Panama Convention (CON-
CAPAN XXXVI), pp. 1–6. IEEE (2016)
7. Calderon-Ramirez, S., Barrantes, J., Schuster, J., Mendez, M., Begera, J.: Auto-
matic calibration of the deceived non local means filter for improving the segmen-
tation of cells in fluorescence based microscopy. In: 2018 International Conference
on Biomedical Engineering and Applications (ICBEA), pp. 1–6. IEEE (2018)
8. Calderon-Ramirez, S., et al.: Improving uncertainty estimations for mammogram
classification using semi-supervised learning. In: International Joint Conference on
Neural Networks (IJCNN) (2021)
9. Calderon-Ramirez, S., et al.: Improving uncertainty estimation with semi-
supervised deep learning for covid-19 detection using chest x-ray images. IEEE
Access 9, 85442–85454 (2021)
10. Decencière, E., et al.: Dealing with topological information within a fully convolu-
tional neural network. In: Blanc-Talon, J., Helbert, D., Philips, W., Popescu, D.,
Scheunders, P. (eds.) ACIVS 2018. LNCS, vol. 11182, pp. 462–471. Springer, Cham
(2018). https://doi.org/10.1007/978-3-030-01449-0 39
11. Grevera, G.J.: Distance transform algorithms and their implementation and eval-
uation. In: Farag, A.A., Suri, J.S. (eds.) Deformable Models, pp. 33–60. Springer,
New York (2007). https://doi.org/10.1007/978-0-387-68413-0 2
12. Guerrero-Pena, F.A., Fernandez, P.D.M., Ren, T.I., Yui, M., Rothenberg, E.,
Cunha, A.: Multiclass weighted loss for instance segmentation of cluttered cells.
arXiv:1802.07465 (2018)
13. He, K., Gkioxari, G., Dollar, P., Girshick, R.: Mask R-CNN. In: IEEE International
Conference on Computer Vision (ICCV), pp. 2980–2988 (2017)
14. Ljosa, V., Sokolnicki, K.L., Carpenter, A.E.: Annotated high-throughput
microscopy image sets for validation. Nat. Methods 9, 637 (2012)
15. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic
segmentation. In: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pp. 3413–3440 (2015)

16. Mahesh, M.: Fundamentals of medical imaging. Med. Phys. 38, 1735 (2011)
17. McGranahan, N., Swanton, C.: Clonal heterogeneity and tumor evolution: past,
present, and the future. Cell 168, 613–628 (2017)
18. Meijering, E.: Cell segmentation: 50 years down the road. IEEE Signal Process.
Mag. 29, 140–145 (2012)
19. Molina-Cabello, M.A., López-Rubio, E., Luque-Baena, R.M., Rodrı́guez-Espinosa,
M.J., Thurnhofer-Hemsi, K.: Blood cell classification using the hough transform
and convolutional neural networks. In: Rocha, Á., Adeli, H., Reis, L.P., Costanzo,
S. (eds.) WorldCIST’18 2018. AISC, vol. 746, pp. 669–678. Springer, Cham (2018).
https://doi.org/10.1007/978-3-319-77712-2 62
20. Molina-Cabello, M.A., Accino, C., López-Rubio, E., Thurnhofer-Hemsi, K.: Opti-
mization of convolutional neural network ensemble classifiers by genetic algorithms.
In: Rojas, I., Joya, G., Catala, A. (eds.) IWANN 2019. LNCS, vol. 11507, pp. 163–
173. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20518-8 14
21. Morgan, S., Watson, J., Twentyman, P., Smith, P.: Flow cytometric analysis of
Hoechst 33342 uptake as an indicator of multi-drug resistance in human lung can-
cer. Br. J. Cancer 60, 282 (1989)
22. Papandreou, G., Chen, L.C., Murphy, K.P., Yuille, A.L.: Weakly-and semi-
supervised learning of a deep convolutional network for semantic image segmenta-
tion. In: Proceedings of the IEEE International Conference on Computer Vision,
pp. 1742–1750 (2015)
23. Phansalkar, N., More, S., Sabale, A., Joshi, M.: Adaptive local thresholding for
detection of nuclei in diversity stained cytology images. In: 2011 International
Conference on IEEE Communications and Signal Processing (ICCSP), pp. 218–
220 (2011)
24. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomed-
ical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F.
(eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015).
https://doi.org/10.1007/978-3-319-24574-4 28
25. Sabnis, R.W.: Handbook of Biological Dyes and Stains: Synthesis and Industrial
Applications. Wiley, Hoboken (2010)
26. Sixt, L., Wild, B., Landgraf, T.: Rendergan: generating realistic labeled data. Front.
Robot. AI 5, 66 (2018)
27. Unnikrishnan, R., Pantofaru, C., Hebert, M.: Toward objective evaluation of image
segmentation algorithms. IEEE Trans. Pattern Anal. Mach. Intell. 6, 929–944
(2007)
28. Xie, W., Noble, J.A., Zisserman, A.: Microscopy cell counting and detection with
fully convolutional regression networks. Comput. Methods Biomech. Biomed. Eng.:
Imaging Vis. 6, 283–292 (2018)
29. Zamani, F., Safabakhsh, R.: An unsupervised gvf snake approach for white blood
cell segmentation based on nucleus. In: 2006 8th International Conference on IEEE
Signal Processing, vol. 2 (2006)
30. Calvo, I., Calderon, S., Torrents-Barrena, J., Muñoz, E., Puig, D.: Assessing the
impact of a preprocessing stage on deep learning architectures for breast tumor
multi-class classification with histopathological images. In: Crespo-Mariño, J.L.,
Meneses-Rojas, E. (eds.) CARLA 2019. CCIS, vol. 1087, pp. 262–275. Springer,
Cham (2020). https://doi.org/10.1007/978-3-030-41005-6 18

31. Sáenz, A., Calderón, S., Castro, J., Mora, R., Siles, F.: Deceived bilateral filter for
improving the automatic cell segmentation and tracking in the NF-kB pathway
without nuclear staining. In: Braidot, A., Hadad, A. (eds.) VI Latin American
Congress on Biomedical Engineering CLAIB 2014, Paraná, Argentina 29, 30 & 31
October 2014. IP, vol. 49, pp. 345–348. Springer, Cham (2015). https://doi.org/
10.1007/978-3-319-13117-7 89
32. Oala, L., et al.: Ml4h auditing: from paper to practice. In: Machine Learning for
Health, pp. 280–317. PMLR (2020)
Intelligent Computing Solutions
for SARS-CoV-2 Covid-19 (INClutions
COVID-19)
A Bayesian Classifier Combination Methodology
for Early Detection of Endotracheal Obstruction
of COVID-19 Patients in ICU

Francisco J. Suárez-Díaz1 , Juan L. Navarro-Mesa2(B) , Antonio G. Ravelo-García2 ,


Pablo Fernández-López3 , Carmen Paz Suárez-Araujo3 , Guillermo Pérez-Acosta4 ,
and Luciano Santana-Cabrera4
1 Hospital Universitario de Gran Canaria Doctor Negrín, Las Palmas de Gran Canaria, Spain
[email protected]
2 University Institute for Technological Development and Innovation in Communication
(IDeTIC), University of Las Palmas de Gran Canaria, Las Palmas de Gran Canaria, Spain
[email protected]
3 University Institute for Cybernetic Science and Technologies (IUCTC), University of Las
Palmas de Gran Canaria, Las Palmas de Gran Canaria, Spain
4 Complejo Hospitalario Universitario Insular Materno Infantil, Gran Canaria, Spain

Abstract. Due to COVID-19 related complications, many of the diagnosed


patients end up needing intensive care. Complications are often severe, to such
an extent that mortality rates in these patients may be high. Among the wide
variety of complications, we find necrotizing tracheobronchitis, which appears
suddenly with the obstruction of the endotracheal tube. This complication can
cause severe damage to the patient or even death. In order to help clinicians with
the management of this situation, we propose a Machine Learning-based method-
ology for detecting and anticipating the obstruction phenomenon. Through the
use of Bayesian classifiers, classifier combination, morphological filtering and a
track-while-scan detection mode we are able to establish an indicator function that
serves as a reference to clinicians. Our experiments show promising results and
lay the foundations of an intelligent system for early detection of endotracheal
obstruction.

Keywords: COVID-19 · Endotracheal obstruction · Machine learning · Bayes


classifier · Classifier combination · Track-while-scan detection · Missing data

1 Introduction

In December 2019, a series of acute atypical respiratory disease cases occurred in Wuhan,
China. This disease rapidly spread from Wuhan to other areas. It was soon discovered that a
novel coronavirus was responsible. The novel coronavirus was named the severe
acute respiratory syndrome coronavirus-2 (SARS-CoV-2, 2019-nCoV) due to its high
homology (~80%) to SARS-CoV, which caused acute respiratory distress syndrome
(ARDS) and high mortality rates during 2002–2003.


COVID-19 infection can cause severe or fatal complications and the need for invasive
ventilation in high-risk patients, and a large proportion of COVID-19 patients evolve in such a
way that they require ICU admission. In this situation, a mortality rate close to 30% has
been reported due to the complications presented by this kind of patients during the ICU
stay. In particular, clinicians have observed several cases of necrotizing tracheobronchitis,
resulting in airway obstruction, with necrotic and hemorrhagic debris that obstruct the
trachea and bronchi. This complication, which we believe may influence a patient's
prognosis, entails an increased risk of contagion for healthcare personnel. Its management
is one of the procedures classified as of greatest risk for viral transmission from patients
to healthcare personnel, along with bronchoscopy, aerosol therapy, nebulization, and
aspiration of secretions.
This research work begins with the continuous observation of this complication by
physicians and their hypothesis that several observed parameters are correlated with it,
and has motivated an interdisciplinary investigation between engineers and physicians.
We have identified two groups of parameters, physiological and ventilatory mechanics
[1, 2]. Both are the subject of work in this research.
In this article we present an approach to the problem of early detection of endotracheal
tube obstruction. To the best of our knowledge, this is the first time that this problem
has been addressed in a methodological way, thus constituting a baseline approach. In
particular, we apply machine learning techniques through the combined use of a Bayesian
classifier preceded by a discretization stage using an optimal quantizer, and
a classifier combination strategy based on majority voting. A post-treatment based on
morphological filtering is applied to the detections to mitigate the effect of detection
errors. Finally, a track-while-scan technique is applied using a sliding window. The
result is the probability that the patient is in pre-obstruction. The results obtained are very
promising, such as detection with high hit probabilities, high detection reliability and
good prediction of time-to-obstruction intervals.

2 The Basics of Invasive Mechanical Ventilation

Mechanical ventilation works by applying a positive pressure breath and is dependent
on the compliance and resistance of the airway system, which determine how much
pressure must be generated by the ventilator to provide a given tidal volume (TV).
The TV is the volume of air entering the lung during inhalation [3]. Compliance and
resistance are dynamic and can be affected by the disease states that lead to the
intubation. Understanding the changes in respiratory parameters will help us in the task of
analyzing, and further detecting and predicting, the different patient states.
In invasive mechanical ventilation (IMV), the clinician may be presented with mul-
tiple different options on how to set up the ventilator. There are many kinds of modes of
ventilation, such as assist-control (AC), synchronized intermittent mechanical ventila-
tion (SIMV), and pressure support ventilation (PSV) [4]. In Fig. 1 we show an example
where there is a typical pressure waveform evolution over time during the SIMV ventila-
tor mode. In the following paragraphs we explain the relation among respiratory modes,
waveform, and the main parameters for the purposes of our research.

Depending on the chosen respiratory mode a set of parameters must be set in the res-
pirator. The most common are respiratory rate, inspiratory flow rate, fraction of inspired
oxygen and positive end expiratory pressure.

Fig. 1. An example of SIMV ventilator mode

Once the inputs Respiratory Rate, Inspiratory flow Rate, Fraction of Inspired Oxygen
and Positive End Expiratory Pressure have been adjusted, the information obtained from
the mechanical ventilator (the ventilator-derived parameters) are peak (Ppeak), plateau
(Pplat) and mean pressures, resistance, and compliance (see Fig. 1).
Peak pressure is the maximum pressure measured at end inspiration. Ppeak includes
the elastic and resistive components (airway, lung tissue, and equipment, e.g., endo-
tracheal tube). Plateau pressure can be measured during an inspiratory pause when the
respiratory muscles are relaxed, and it is equal to the alveolar pressure when airflow is
zero. It is not affected by airflow and airway resistance changes. The difference between
Ppeak and Pplat pressures divided by the airflow is the airway resistance. In normal
patients, airway resistance values do not exceed 15–20 cmH2 O/L/s under controlled
mechanical ventilation. Several factors can modify Ppeak, such as endotracheal tube
diameter [5], airflow intensity, plugging, or bronchospasm.
The compliance of a system is defined as the change in volume that occurs per unit
change in the pressure of the system. Compliance is basically a measurement of the
elastic resistance of a system. Pulmonary compliance (C) is the total compliance of both
lungs. It is one of the most important concepts underpinning mechanical ventilation used
to manage patients in the intensive care unit (ICU) environment.
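As a small illustration of the ventilator-derived quantities discussed above, the snippet below computes airway resistance from the Ppeak–Pplat difference and the standard static compliance formula; the numeric values are illustrative only, and the compliance formula (tidal volume over Pplat minus PEEP) is the usual textbook definition rather than something taken from this paper.

```python
def airway_resistance(p_peak, p_plat, flow):
    """Airway resistance in cmH2O/L/s: (Ppeak - Pplat) divided by the
    inspiratory airflow (L/s), as described in the text."""
    return (p_peak - p_plat) / flow

def static_compliance(tidal_volume, p_plat, peep):
    """Static compliance in mL/cmH2O: tidal volume divided by the pressure
    change above PEEP at zero flow (standard definition)."""
    return tidal_volume / (p_plat - peep)

# Example with typical ventilator readings (illustrative values only)
r_aw = airway_resistance(p_peak=30.0, p_plat=22.0, flow=1.0)            # 8 cmH2O/L/s
c_stat = static_compliance(tidal_volume=450.0, p_plat=22.0, peep=5.0)   # ~26.5 mL/cmH2O
```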
During the entire ICU stay parameters are sampled irregularly, with average sampling
intervals of 15 min, maximum concentration intervals below 20 min, and a maximum
of 90 min (very rarely).

3 Data Source and Description

3.1 Data Sources and Data Population

The data used in this study include health data referring to COVID-19 ICU admission in
Complejo Hospitalario Universitario Insular Materno – Infantil, Canary Islands recorded
between 8th March 2020 and 1st June 2020. We have combined two data sources. The

first comes from Hospital Information System (HIS) and includes a set of values such
as gender, age, BMI, admission data, etc. The second consists of the data obtained by
querying the electronic health record in PICIS system [6], and includes a data matrix
in which there is a row per timestamp, and a column for each parameter measured. The
combination of the data from both sources forms our RAW data.
Our study is performed over 22 patients with COVID-19 derived Pneumonia, where
16 of them presented at least one episode of total or partial endotracheal obstruction
during their admission to the ICU. In Table 1 we show some demographic data and
information regarding the stay. Gender is given in percentage. Age, weight, height (in
cm.) and Body Mass Index (BMI) are given as mean and standard deviation (x ± σ ).
We also give information (x ± σ ) regarding days in ICU, under invasive mechanical
ventilation (IMV) and with orotracheal intubation (OTI).

Table 1. Demographic data and information regarding the stay

Gender                  | Age        | Weight    | Height    | BMI       | ICU        | IMV        | OTI
Male (13) / Female (9)  | 65.2 ± 9.5 | 85 ± 11.5 | 169 ± 0.1 | 30 ± 4.46 | 24.5 ± 4.5 | 23 ± 10.13 | 14 ± 4.32

3.2 Data Set Description


We must bear in mind that patients admitted to the ICU may be unconscious or have poor
communication skills with the attending physicians. This means that, from a scientific
point of view, we need to identify which parameters are most relevant in order to make
an introspection of the patient's condition. We have identified two groups of parameters:
physiological and ventilatory mechanics. The group of physiological parameters (PH) gives an
idea of the patient's general condition, while the group of ventilatory mechanics parameters (VM)
shows the interaction of the assisted ventilation machine with the patient and how the patient
reacts to its action.
The following parameters have been taken for each patient throughout the ICU
stay and divided into two separate groups. Group 1 is formed by the PH parameters: Heart
Rate, Systolic Pressure, Diastolic Pressure, Oxygen saturation (SatO2) and Temperature.
Group 2 is formed by the VM parameters: Mean, Peak and Plateau Pressures, Compliance
and Resistance.
At the time of analyzing all the selected parameters, we found partial or total missing
data in some time intervals. For our study we must indicate that this data loss has
been marked in our raw data as absent and it is not used for our purposes. In Fig. 2
we graphically represent the amount of valid and missing data per patient for the PH
parameters. The data depicted are heart rate (1), systolic (2) and diastolic (3) pressures,
SatO2 (4) and temperature (5). We highlight the following
aspects: 1) The presence/absence of valid/missing data is irregular and without a defined
pattern. In addition, we have found that the intervals in which missings occur do not
follow a known pattern either; 2) Heart rate and SatO2 have (almost) no missing data;
3) There is only one patient (22) for whom most of the data is available.

Fig. 2. Valid (black) and missing data per patient with endotracheal obstruction.

4 Methodology
4.1 Problem Description
The data obtained by querying the electronic health record in the PICIS system [6] is
presented as a time series of data vectors x_t = [x_t^1, ..., x_t^j, ..., x_t^N]^T, where 't' is the
sampling instant and x_t^j is the j-th component at 't'. For instance, in Groups 1 and 2,
x_t^1 is the mean pressure and the heart rate, respectively. As we are studying two sets of
parameters in separate groups, each data vector has dimension x ∈ R^N with N = 5.
From the point of view of data sampling, there are two main issues to be considered
because they influence the design and methodology of this work. Firstly, one would
expect that the sampling of the parameters should be done with a constant period. How-
ever, this is not the case. Sampling rate is irregular, alternating intervals where samples
are taken every minute with intervals where they are taken every hour. In general, the
most common periods are between 10 and 20 min, with a 15-min predominance. Sec-
ondly, we could also expect that all the data are present at each sampling instant. This is
not the case either, as there are missing data in many patients, with no clear pattern as
to which data are missing and in which intervals it happens.
Physicians and engineers reached a consensus for the careful manual selection of the
non-obstructive reference intervals and pre-obstruction intervals, as well as the obstruction
instants. Subsequently, for each type of parameter, two groups of samples were created,
and the temporal ordering was respected. We thus formed two classes, a 'normal' class (c
= 0) where the patient shows no signs of obstruction, and a pre-obstruction class (c =
1). The temporal ordering is such that each patient starts with a normal period, at the end
of which pre-obstruction starts until the obstruction onsets.
The result of the selection described above is a time-ordered set of parameter vectors
x_c^{i,t} = [x_{c,1}^{i,t}, ..., x_{c,j}^{i,t}, ..., x_{c,N}^{i,t}]^T and their corresponding labels s_c^{i,t}, where i ∈ {1, 2}
and c ∈ {0, 1} represent the group and class to which the vector belongs, respectively. The
time stamp is represented by 't'. In general, the sampling times are such that for any triplet
[..., t_{k−1}, t_k, t_{k+1}, ...] with t_{k−1} < t_k < t_{k+1} we can find that t_k − t_{k−1} ≠ t_{k+1} − t_k. For
the sake of simplicity, in the following section the vectors of parameters are represented as
x = [x_1, ..., x_j, ..., x_N]^T, thus indicating that the treatment given to them has no dependence
on the group or class.
Finally, as we are interested in studying the discriminant capacity of every parameter,
for each group every xj is treated independently and a single classifier is obtained. Thus,
we have N = 5 single classifiers per group. This issue is presented in the following
section.

4.2 Bayesian Classifier as Base Learner


One of the first questions we can ask ourselves when facing a new classification problem
is what is theoretically the best classifier, assuming that the distribution of random
information (e.g., vectors) is given. This problem is statistical hypothesis testing, and
the Bayes classifier is the best classifier as it minimizes the probability of classification
error [7]. The probability of error is a key parameter in pattern recognition. The error
attributable to the Bayes classifier (the Bayes error), gives the smallest error we can
achieve from given distributions. Up to our knowledge, this work represents the first
attempt to address the problem of endotracheal tube obstruction detection, and, without
loss of generality, we make an approach starting from a Bayesian classifier. In turn, we
assume that the parameters under study are independent or, at least, can be used in a
differentiated manner. This allows us to pose our classifier based on the Naive Bayes
(NB) as base learner, which is a simple and promising technique in terms of classification
error.
The Naive Bayes applies the Bayes’ theorem with the “naive” assumption that any
pair of features are independent for a given class. The classification decision is made
based upon the maximum-a-posteriori (MAP) rule. Consider an object represented by
a feature vector x = [x1 , . . . , xj . . . , xN ]T that is to be assigned to one of ‘c’ predefined
classes ω1 , …, ωc . Minimum classification error is guaranteed if the class with the largest
posterior probability, P(ωi /x), is chosen. To estimate subsequent probabilities, the Bayes
formula is used with estimates of the prior probabilities, P(ωi ), and the class-conditional
pdf, p(x/ωi ):


$$P(\omega_i/\mathbf{x}) = \frac{P(\omega_i)\,p(\mathbf{x}/\omega_i)}{p(\mathbf{x})} = \frac{P(\omega_i)\,p(\mathbf{x}/\omega_i)}{\sum_{j=1}^{c} P(\omega_j)\,p(\mathbf{x}/\omega_j)} \qquad (1)$$

Under the 'naivety' assumption, the joint pdf for a given class is the product of the
marginal pdfs:

$$p(\mathbf{x}/\omega_i) = \prod_{j=1}^{N} p(x_j/\omega_i) \qquad (2)$$

One of the main advantages of constructing a Bayes classifier this way is that accurate
estimates of the marginal pdfs can be obtained from both small and large amounts of data.
This makes the Naive Bayes classifier attractively simple. The assumption of conditional
independence among features may look too restrictive. Nonetheless, NB has
demonstrated robust and accurate performance across various domains, even when this
assumption is false or has not been demonstrated yet, which is our case.

The features' pdfs can be estimated in different ways. For continuous-valued features,
we can use a parametric or a nonparametric estimate of p(x/ωi). For the nonparametric
approach, each feature is first discretized, and the associated probability mass function
(pmf) is estimated for all features. This can be done, for example, by a probability
histogram with K (e.g., equally spaced) bins. It is the option adopted in this paper. Using
the Maximum Likelihood Estimation method, the estimate of the class-conditional pmf
is p(x/ωi) = l_jc / l_c, where l_jc is the number of times the j-th term appears among feature
vectors in class c, and l_c is the total number of terms in class c.
In general, once the pdfs have been learned and given a new feature vector x_T to be
classified, we apply the following decision rule to make a classification:

$$c^{*} = \arg\max_{\{\omega_1,\ldots,\omega_c\}} \left[ \log p(\mathbf{x}/\omega_i) + \log P(\omega_i) \right] \qquad (3)$$

Lloyd's Quantization [8, 9]: As stated before, each element x_j (j = 1, ..., N) of the
feature vector is treated independently and discretized, and its associated pdf is estimated.
Discretization is performed by applying Lloyd's algorithm, which computes the
quantization levels and decision intervals of an optimum finite quantization scheme.
The optimization criterion used is that the average quantization noise power
should be minimum. In the Lloyd quantizer there are K−1 decision thresholds, placed
exactly half-way between representative levels, and K representative levels, each at the
centroid of the pdf between two successive decision thresholds. Thus, histograms have K bins.

Model Generation: Model training starts by estimating an optimal quantizer Q_c^j per
class 'c' and parameter 'j'. That is, a class-specific Q_c^j is trained prior to feature
estimation and is then used for probability histogram estimation. These features are obtained
by applying the corresponding parameter value x_j to the quantizer, thus obtaining a level q_j.
This information is used to estimate the class-conditional pmfs. Finally, our models for the
binary Bayesian classifier D_j are composed of two main elements: 1) a quantizer Q_c^j per
class 'c' and parameter 'j', and 2) two class-conditional pmfs p(x_j/ω_c) per parameter.
Thus, we have N model tuples (Q_c^j, p(x_j/ω_c)), j = 1, ..., N.
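The following sketch outlines, under our own assumptions (number of levels K, initialization, Laplace smoothing), how a per-class, per-parameter Lloyd quantizer and the corresponding class-conditional pmf could be estimated and then used in the log-MAP rule of Eq. (3); it is an illustration, not the authors' implementation.

```python
import numpy as np

def lloyd_quantizer(x, K=8, n_iter=50):
    """1-D Lloyd quantizer: alternate nearest-level assignment and centroid
    update; return the K representative levels and the K-1 decision thresholds
    placed half-way between consecutive levels."""
    levels = np.linspace(x.min(), x.max(), K)            # initial representative levels
    for _ in range(n_iter):
        thresholds = (levels[:-1] + levels[1:]) / 2.0
        bins = np.digitize(x, thresholds)                # assign each sample to a level
        for k in range(K):
            if np.any(bins == k):
                levels[k] = x[bins == k].mean()          # centroid update
    levels = np.sort(levels)
    return levels, (levels[:-1] + levels[1:]) / 2.0

def fit_pmf(x, thresholds, K=8):
    """Class-conditional pmf of the discretized parameter (K-bin histogram)."""
    counts = np.bincount(np.digitize(x, thresholds), minlength=K).astype(float)
    return (counts + 1.0) / (counts.sum() + K)           # Laplace smoothing (an assumption)

def nb_log_posterior(xj, model, prior):
    """Unnormalized log-posterior of one class for a single parameter value xj."""
    levels, thresholds, pmf = model
    q = np.digitize(xj, thresholds)
    return np.log(pmf[q]) + np.log(prior)

# For a new value xj of parameter j, Eq. (3) becomes (with estimated priors P0, P1):
#   c* = argmax_c  nb_log_posterior(xj, model_of_class_c_param_j, P_c)
```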

4.3 Classifier Combination Strategy


In classifier ensembles, several classifiers are employed to make a classification deci-
sion about the feature vector submitted at the input, and the individual decisions are
subsequently aggregated [10]. The output of the ensemble is a class label for the object.
In this work we build ensembles by means of classifier combination. By combining
classifiers, we aim at a classification decision that is more accurate than what is achievable
using single trainable classifiers. We look for the best set of base learners (e.g., NB)
and then for the best combination method. The general understanding is that classifier
ensembles work well thanks to the following general reasons [11]: 1) statistical, 2)
computational, and 3) representational. Classifier ensembles enable the approximation
of complex classification boundaries with a desired precision.
From a taxonomic point of view [10], the combiner we use in our research is trainable:
the classifiers are trained independently, we are able to introduce diversity by partitioning

the training set (e.g., with a leave-one-out strategy), the ensemble size is fixed in advance,
and it is universal as any base classifier model can be used. Apart from the previous
taxonomy, we can group classifier ensemble methods regarding the overall strategy that
governs a specific design. Classifier fusion and classifier selection are two such strategies.
In the present work we have chosen classifier fusion, where each ensemble member is
supposed to have knowledge of the entire feature space.
At the time of combining the single classifiers' outputs in the ensemble there are
several options: class labels, ranked class labels, numerical support, and oracle. We
have adopted class labels, where each classifier D_j produces a class label s_j, j = 1, ..., N.
Thus, for any object x = [x_1, ..., x_N]^T to be classified, the N classifier outputs define a
vector s = [s_1, ..., s_N]^T.
In this work, we have adopted majority voting as the classifier combination. Let us
assume that the label outputs of the classifiers are given as c-dimensional binary vectors
[d_{j,1}, ..., d_{j,c}] ∈ {0, 1}^c, j = 1, ..., N, where d_{j,k} = 1 if D_j labels x in ω_k and 0
otherwise. The majority voting rule will return ω_k if:

$$\sum_{j=1}^{N} d_{j,k} = \max_{i \in \{1,\ldots,c\}} \sum_{j=1}^{N} d_{j,i} \qquad (4)$$

As we have a binary (c = 2) classification scheme, it will coincide with the simple


majority (50% of the votes + 1), and according to Eq. 4, the majority vote gives an
accurate class label if at least (N/2 + 1) classifiers give correct answers.

Missing Data Treatment: At a given sampling instant there are 0 ≤ M ≤ N valid data
in each feature vector. Equation 4 is performed over the M base classifiers for which
data is valid. That is, we do not perform any restoration at instants of missing data and
the base classifiers associated to them are excluded from the ‘vote’.
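A minimal sketch of the majority vote over the M valid base-classifier outputs, with missing outputs encoded as NaN and simply excluded from the vote (the tie-handling rule is an assumption, not taken from the paper).

```python
import numpy as np

def majority_vote(labels):
    """Combine the base classifiers' binary labels by majority voting,
    ignoring missing outputs (encoded as np.nan)."""
    labels = np.asarray(labels, dtype=float)
    valid = labels[~np.isnan(labels)]          # keep the M <= N valid votes
    if valid.size == 0:
        return None                            # no decision possible at this instant
    ones = int(np.sum(valid == 1))
    return 1 if ones > valid.size / 2 else 0   # simple majority over the valid votes

# Example: 5 base classifiers, one missing output at this sampling instant
print(majority_vote([1, 0, np.nan, 1, 1]))     # -> 1
```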

4.4 Event Tracking and Expected Time to Obstruction

This is the final module of our detection system, and it is located at the output of the
classifier combiner. This module is performed in three sub-modules acting sequentially:
1) Event detection, 2) Probability of pre-obstruction, and 3) Expected time to obstruction.
Let’s look at each part separately.

Event Detection. Pre-obstruction is the 'event of interest'. If the specificity of the detections is sufficiently high, we can expect a very low probability of false alarm (e.g., p_fa < 0.1). In turn, if the sensitivity is also sufficiently high, we can expect a very low probability of omission (e.g., p_om < 0.1). We have experimentally found that, in addition to low p_fa and p_om, false alarms and omissions appear in very short bursts at the output of the combiner. In any case, the effect of these detection errors must be minimized. In order to achieve this minimization, we have applied a two-step sequential morphological filtering with structuring element width L_s. In the first step, a closing is applied with L_s/2 samples to suppress bursts of false alarms and omissions. In the second step, an opening is applied with L_s samples to restore the time width of the detections that exceed the closing.
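The following is a minimal sketch of such a two-step filtering applied to a binary detection sequence, using SciPy's morphological operators; the value of L_s and the example data are purely illustrative.

import numpy as np
from scipy.ndimage import binary_closing, binary_opening

def filter_detections(detections, Ls):
    """Two-step morphological filtering of a binary detection sequence.

    Step 1: closing with a structuring element of Ls/2 samples fills short
            gaps (omissions) inside a detection run.
    Step 2: opening with Ls samples removes short isolated runs (false
            alarms) while restoring the width of detections that survive.
    """
    d = np.asarray(detections, dtype=bool)
    d = binary_closing(d, structure=np.ones(max(1, Ls // 2)))
    d = binary_opening(d, structure=np.ones(Ls))
    return d.astype(int)

raw = [0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0]   # bursty raw detections
print(filter_detections(raw, Ls=4))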

Probability of Pre-obstruction. From a practical point of view, one of the outcomes that can facilitate the work of physicians is to have an indicator of pre-obstruction. In this work we give this indicator as the probability that a sufficiently high number of detections took place in a given time window. This window is sliding and shifts each time a new parameter vector enters the system. We operate in a track-while-scan mode, wherein a given event is either detected or not on each search scan (see, e.g., [12]). As the reception of scans proceeds, we apply a sliding window of size M over the detections from the 'Event detection' sub-module, which moves forward at each sampling time, dropping the oldest detection and adding the newest as the search process progresses. Under these conditions, we wish to understand the behavior of the system acquiring or losing track.
Let p_d be the probability of detection. Then the cumulative detection probability P_{d,M'} after M consecutive sampling times is given by

P_{d,M'} = 1 − \prod_{k=1}^{M'} (1 − p_d)

where M' (0 ≤ M' ≤ M) is the number of events detected in the window. Cumulative probabilities are monotonically increasing as M' → M in the sliding window and decreasing as M' → 0. Thus, P_{d,M'} is our indicator function. The p_d is part of the final model and it is estimated during training.
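A minimal sketch of this sliding-window indicator is given below; p_d and the detection sequence are illustrative values, since p_d is estimated during training.

import numpy as np

def cumulative_detection_probability(detections, M, p_d):
    """Sliding-window indicator P_{d,M'} = 1 - (1 - p_d)**M',
    where M' is the number of detections inside the last M samples."""
    detections = np.asarray(detections, dtype=int)
    probs = []
    for i in range(len(detections)):
        window = detections[max(0, i - M + 1): i + 1]
        m_prime = int(window.sum())
        probs.append(1.0 - (1.0 - p_d) ** m_prime)
    return np.array(probs)

dets = [0, 0, 1, 1, 1, 0, 1, 1, 1, 1]
print(cumulative_detection_probability(dets, M=5, p_d=0.6).round(3))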

Expected Time to Obstruction. Once the system detects an event and estimates the probability of being in pre-obstruction, we proceed to give an estimate of the expected time to obstruction. Let t_{o,p} be the time, given in days, from the start of pre-obstruction to obstruction onset in patient 'p'. This time varies in the range t_{o,p} ∈ [1.22, 21.49] with average and standard deviation 8.83 ± 4.48 (t̄ ± σ_t). If we adopt a very conservative strategy, we can take as a reference the minimum time t_{o,min} = 1.22 days it took for a patient to present an obstruction. We can, therefore, expect that in the worst-case scenario a given patient will reach obstruction 1.22 days after starting pre-obstruction. In this way, we obtain a countdown prediction to the instant of a potential obstruction.
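A small illustrative sketch of this countdown, under the conservative reference t_{o,min} = 1.22 days; the countdown formulation and the example times are our own reading of the text.

def time_to_obstruction(t_now, t_pre_start, t_ref=1.22):
    """Countdown (in days) to a potential obstruction, using as reference the
    minimum observed pre-obstruction-to-obstruction time (1.22 days in the
    worst case reported for this cohort)."""
    remaining = t_ref - (t_now - t_pre_start)
    return max(0.0, remaining)

# Pre-obstruction detected at day 4.0; at day 4.5 the conservative estimate is:
print(time_to_obstruction(t_now=4.5, t_pre_start=4.0))   # about 0.72 days remaining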

5 Experiments and Results


The main objective of our experiments is to evaluate the proposed methodology in a realistic situation that physicians meet every day in the ICU. We do that through four secondary objectives: estimating the quality of the combination strategy and studying the performance of each of the three final sub-modules.
The experiments involve the 16 obstructive patients that have been described in Sect. 3. We have performed separate experiments for parameter Groups 1 (PH) and 2 (MV). No pre-processing is performed on the valid data before entering the base learners, and missing data and temporal sampling are left as they are. In particular, the presence of
missing data is not dealt with until the end of the classifier combination, as explained in
Sect. 4.3. Data is provided to the detection system as described in Sect. 4.1.
Quality of the Combination Strategy. In these experiments we take the valid data
from each class without distinguishing which patient they come from. In this way we can
estimate the quality of the base classifiers and how they affect the combination strategy.

The experiments are repeated in 1000 trials, where 70% of the data is randomly selected for training in each trial. Quality is assessed in terms of accuracy, specificity, sensitivity and F1-score, and the metrics are shown as box plots. In Fig. 3, upper graphs, we show the performance for a base classifier from Heart Rate (left) and for the combiner considering all 5 PH parameters (right). In the lower graphs we show the performance for a base classifier from Resistance (left) and for the combiner considering all 5 MV parameters (right). In both cases the base learners are representative of all those in each group. As we can see, there is a great improvement in quality when using the combiners, no matter what metric is used. This is because majority voting is a good combination rule when the base classifiers are good enough and fail randomly for different inputs.

Fig. 3. Quality measures obtained with PH (upper) and MV (lower) parameters

The experiments for the three final sub-modules have been carried out following a
leave-one-out cross-validation strategy. The process is repeated 16 times so that each
time 15 patients are used to apply the methodology while the remaining patient is used
for testing.

Event Detection. In Fig. 4 we show two examples of detection for different patients.
For patient 11 the base classifier is learned from PH parameters and for patient 15 it is
learned from MV parameters. Time is given in the horizontal axis, and it is expressed
in days in ICU. The straight line marks the pre-obstruction periods and the circle dots
mark detections. We can see that combiners provide continuous detections in which
short bursts of errors may occur (false alarms and omissions).

Probability of Pre-obstruction and Expected Time to Obstruction. In Fig. 5 we show the same patients as in Fig. 4 after estimating the (cumulative) detection probability

P_{d,M'} (upper parts), the indicator function. We can see that it rapidly increases to 1 and remains constant at high levels during the pre-obstruction periods. In the lower parts of the graphs, we show two regression lines which indicate the (decreasing) expected time to obstruction since the detection of pre-obstruction started. The lower one is very conservative, that is, it takes as reference the patient with t_ref = t_{o,min} = 1.22 days.


Fig. 4. Examples of event detection with base classifiers learned from PH and MV

The upper line has been estimated taking t_ref = t̄ − 1.5σ_t, which is less restrictive, letting the physician take a confidence interval into consideration. For instance, for patient 11, pre-obstruction starts at the beginning of day 4 and we can expect with high probability that he/she will be obstructed between days 5.22 and 6.25. We must highlight that, for all patients, as soon as the cumulative probability reaches a high level (e.g., 0.8), the physician is able to ascertain the patient's pre-obstruction status over a prolonged period of time.

Fig. 5. Examples of indicator functions and regression lines of expected time to obstruction

6 Conclusions
Our work represents the first approach to the problem of early detection of endotracheal
tube obstruction from a technological point of view. As such, the field is open and we have

proposed a Bayesian classifier combination methodology. We believe that we have laid


the foundation for an intelligent system that provides clinicians with reliable indicators
of risk of endotracheal tube obstruction and that allows prediction of the instant of
obstruction. Further research can take our methodology as a reference approach. New
lines of work will open, ranging from the use of other base classifiers to new ensemble
strategies, and the development of improved indicators. In addition, we will place special
emphasis on expanding the patient database to test the credibility and reproducibility of
our results.

Acknowledgement. This research work has been funded by the University of Las Palmas
de Gran Canaria through the call “Proyectos de Investigación COVID-19”. Our project title is
“Aplicación de técnicas de machine learning para la detección temprana de obstrucción del tubo
endotraqueal en pacientes COVID-19 en UCI”, ref. COVID 19–11.

References
1. Pérez Acosta, G., Navarro Mesa, J.L., Blanco López, J., Santana Cabrera, L., Suárez Araujo,
C.P., Martín González, J.C.: Prediction model of endotracheal obstruction, in patients with
severe pneumonia due to COVID 19, analyzing clinical parameters using intelligent comput-
ing. In: 33rd Annual Congress of the European Society of Intensive Care Medicine, vol. 8, 2
(2020)
2. Pérez Acosta, G., Navarro Mesa, J.L., Blanco López, J., Santana Cabrera, L., Suárez Araujo,
C.P., Martín González, J.C.: Prediction model of endotracheal obstruction, in patients with
severe pneumonia due to COVID-19, analyzing ventilatory parameters using intelligent com-
putation. In: 33rd Annual Congress of the European Society of Intensive Care Medicine
(ESICM) (2020)
3. Pham, T., Brochard, L.J., Slutsky, A.S.: Mechanical ventilation: state of the art. Mayo Clin.
Proc. 92(9), 1382–1400 (2017). https://doi.org/10.1016/j.mayocp.2017.05.004
4. Silva, P.L., Pelosi, P., Rocco, P.R.M.: Optimal mechanical ventilation strategies to minimize
ventilator-induced lung injury in non-injured and injured lungs. Expert Rev. Resp. Med.
10(12), 1243–1245 (2016). https://doi.org/10.1080/17476348.2016.1251842
5. Bock, K.R., Silver, P., Rom, M., Sagy, M.: Reduction in tracheal lumen due to endotracheal
intubation and its calculated clinical significance. Chest 118(2), 468–472 (2000). https://doi.
org/10.1378/chest.118.2.468
6. PICIS: ICU Care Manager Software|Patient Care Unit System|Picis (2021). https://www.
picis.com/en/solution/clinical-information-system-suite/critical-care-manager/. Accessed 25
Apr 2021
7. Fukunaga, K.: Introduction to Statistical Pattern Recognition, 2nd edn. Academic Press (1990)
8. Max, J.: Quantizing for minimum distortion. IRE Trans. Inf. Theory 6(1), 7–12 (1960). https://
doi.org/10.1109/TIT.1960.1057548
9. Lloyd, S.P.: Least squares quantization in PCM. IEEE Trans. Inf. Theory 28(2), 129–137
(1982). https://doi.org/10.1109/TIT.1982.1056489
10. Kuncheva, L.I.: Combining Pattern Classifiers: Methods and Algorithms. 2nd edn. John Wiley
& Sons, Inc. (2014)
11. Polikar, R.: Ensemble based systems in decision making. IEEE Circuits Syst. Mag. 6(3),
21–44 (2006). https://doi.org/10.1109/MCAS.2006.1688199
12. Worsham, R.: The probabilities of track initiation and loss using a sliding window for track
acquisition. In: IEEE National Radar Conference - Proceedings, pp. 1270–1275 (2010).
https://doi.org/10.1109/RADAR.2010.5494424
Toward an Intelligent Computing Solution
for Endotracheal Obstruction Prediction
in COVID-19 Patients in ICU

Pablo Fernández-López1 , Carmen Paz Suárez-Araujo1(B) , Patricio García-Báez2 ,


Francisco Suárez-Díaz3 , Juan L. Navarro-Mesa4 , Guillermo Pérez-Acosta5 ,
and José Blanco-López5
1 University Institute for Cybernetic Science and Technologies (IUCTC), University of Las
Palmas de Gran Canaria, Las Palmas de Gran Canaria, Spain
{pablo.fernandezlopez,carmenpaz.suarez}@ulpgc.es
2 Department of Computer Engineering and Systems, University of La Laguna,
San Cristóbal de La Laguna, Spain
[email protected]
3 Hospital Universitario de Gran Canaria Doctor Negrín, Gran Canaria, Spain
[email protected]
4 University Institute for Technological Development and Innovation in Communication
(IDeTIC), University of Las Palmas de Gran Canaria, Las Palmas de Gran Canaria, Spain
[email protected]
5 Complejo Hospitalario Universitario Insular Materno Infantil, Gran Canaria, Spain

{gperaco,jblalop}@gobiernodecanarias.org

Abstract. Nowadays there is a world pandemic of a challenging respiratory illness, COVID-19. A large proportion of COVID-19 patients evolve to severe or fatal complications and require ICU admission. The COVID-19 mortality rate approaches 30% due to complications such as obstruction of the trachea and bronchi during the ICU stay.
An endotracheal obstruction occurring at any moment during a COVID-19 patient's ICU stay is one of the most complicated situations that clinicians must face and solve. Therefore, it is very important to know in advance when a COVID-19 patient could enter the pre-obstruction zone.
In this work we present an intelligent computing solution to predict endotracheal obstruction for COVID-19 patients in ICU. It is called the Binomial Gate LSTM (BigLSTM), a new and innovative deep modular neural architecture based on the recurrent neural network LSTM. Its main feature is its ability to handle missing data and to deal with time series with no regular sampling frequency, which are the main characteristics of the BigLSTM information environment. This ability is implemented in BigLSTM through an information redundancy injection mechanism and through the way it copes with time control.
We applied BigLSTM to first-wave COVID-19 patients in the ICU of the Complejo Hospitalario Universitario Insular Materno Infantil. The encouraging results, even while working with a very small data set, indicate that our computing solution is moving towards an efficient intelligent prediction system that is very appropriate for this kind of problem.


Keywords: COVID-19 · Endotracheal obstruction · Forecasting · LSTM ·


Recurrent neural networks · Deep learning · Missing data

1 Introduction
The novel Coronavirus which has been designated SARS-CoV-2 appeared in December
2019 and initiated a pandemic of respiratory illness known as COVID-19. COVID-19
has proved itself to be a tricky illness that can emerge in various forms and levels of
severity with the risk of organ failure and death. It ranges from a mild, self-limiting
respiratory tract illness to severe progressive pneumonia, multiorgan failure, and death
[1].
COVID-19 infection can cause severe or fatal complications in high-risk patients and
the need for invasive ventilation for a large part of COVID-19 patients results in required
ICU admission. Several cases of necrotizing tracheobronchitis have been observed by
clinicians, which causes airway obstruction, with necrotic and haemorrhagic debris that
obstruct the trachea and bronchi. The mortality rate in this situation is close to 30% [1]
due to the complications encountered during the ICU stay.
In this work we face a high-incidence problem in ICU patients, the endotracheal obstruction. A very important and necessary aid to deal with this tricky complication would be the prediction of such an obstruction. It is crucial not just for the prognosis of the patient, but also to avoid the patient's death and to decrease the risk of viral transmission from patients to healthcare personnel. Using Invasive Mechanical Ventilation (IMV) variables we propose a smart computing solution. It has been developed using an innovative modular neural recurrent architecture, tolerant of missing data and of the non-existence of a sampling period in the time series, and is named the Binomial Gate LSTM (BigLSTM).
The BigLSTM is a deep modular neural architecture, based on the recurrent neural
network LSTM [2]. Its high tolerance of missing data is possible because of an approach
that is different than those used by other architectures [3–5], such as the incorporation
of information redundancy injection.
The BigLSTM performance has attained very promising results according to clinicians, despite working with a data set that is very small in terms of the number of patients.

2 Concepts on Invasive Mechanical Ventilation


Mechanical ventilation works by applying a positive-pressure breath and depends on the compliance and resistance of the airway system, which affect how much pressure must be generated by the ventilator to provide a given tidal volume (TV) [7]. Compliance and resistance are dynamic and can be affected by the disease states that could lead to the intubation. Understanding the changes in respiratory parameters will help us to analyze, detect and predict different patient states, such as endotracheal tube obstruction.
Invasive mechanical ventilation (IMV) includes many kinds of modes of ventilation,
such as assist-control (AC), synchronized intermittent mechanical ventilation (SIMV),
and pressure support ventilation (PSV) [8].

Depending on the chosen respiratory mode, a set of parameters has to be set in the respirator. Once all these input parameters of the device have been established, the information obtained from the mechanical ventilator is: Peak Pressure, Average Pressure, Plateau Pressure, Compliance and Resistance.
Peak pressure (Ppeak) is the maximum pressure measured at end inspiration. Ppeak
includes the elastic and resistive components (airway, lung tissue, and equipment, e.g.,
endotracheal tube). Several factors can modify Ppeak, such as endotracheal tube diameter
[9], airflow intensity, plugging, or bronchospasm. The mean value over time of the Ppeak
is the Average Pressure (Pave).
Plateau pressure (Pplat) can be measured during an inspiratory pause when the
respiratory muscles are relaxed and is equal to alveolar pressure when airflow is zero.
The airway Resistance (R) is the difference between Ppeak and Pplat divided by the
airflow. In normal subjects, airway resistance values do not exceed 15–20 cmH2 O/L/s
under controlled mechanical ventilation [8].
The Compliance (C) of a system is defined as the change in volume that occurs per
unit change in the pressure of the system. It is, basically, a measurement of the elastic
resistance of a system. It is one of the most important concepts underpinning the mechanical ventilation used to manage patients in the intensive care unit setting.
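As a worked illustration of these definitions, the following sketch computes R and C from ventilator readings. The static-compliance form TV/(Pplat − PEEP) is a common clinical convention that is not stated explicitly in the text, so the PEEP term is an assumption, and all numbers are illustrative.

def airway_resistance(p_peak, p_plat, flow):
    """Airway resistance R = (Ppeak - Pplat) / flow, in cmH2O/L/s."""
    return (p_peak - p_plat) / flow

def static_compliance(tidal_volume, p_plat, peep):
    """Static compliance C = TV / (Pplat - PEEP), in mL/cmH2O.
    The PEEP term is a common clinical convention, not stated in the text."""
    return tidal_volume / (p_plat - peep)

# Example: Ppeak 30 cmH2O, Pplat 22 cmH2O, flow 1 L/s, TV 450 mL, PEEP 5 cmH2O
print(airway_resistance(30, 22, 1.0))    # 8 cmH2O/L/s (below the 15-20 limit)
print(static_compliance(450, 22, 5))     # about 26.5 mL/cmH2O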
The time dynamics of these parameters constitute the signals of our problem space; they are irregularly sampled in all ICU patients, and there is a high rate of missing data.

3 Data Set and Methods

3.1 Data Set

The data set used in this study is made up of signals from the mechanical ventilation of
22 hospitalized COVID-19 patients in the ICU of the Complejo Hospitalario Universi-
tario Insular Materno Infantil (C.H.U.I.M.I.) between March 8, 2020 and June 1, 2020.
Signals are characterized by the outputs, or derived parameters, of an IMV device.
The data set is a time series register of these signals, obtained by querying the
electronic health record in PICIS system [10]. These registers do not include a sample
period, and a high number of the registers are missing data. Distribution of the data is
shown in Fig. 1. It is noteworthy that the Pplat, C and R attain the highest index. As seen
in Fig. 1a, these signals reach up to 100% of the values that make up the missing data
in several of the patients. It is also noteworthy that some patients also have practically
100% of the missing data in all of the signals with the exception of Ppeak.
The features of this data set lead to a highly complex prediction problem. This study presents the results obtained from the intelligent estimator, based on the proposed recurrent neural network BigLSTM, of when an intubated COVID-19 patient experiences a pre-obstruction state in the endotracheal tube.
A selection procedure is carried out on the patients that ultimately make up the data set of our study, where the inclusion/exclusion of a registered obstruction is a parameter. An analysis of the signals from the mechanical ventilator (with its clinical equipment) is performed, and its result allows areas of pre-obstruction to be identified as well as

carrying out a labelling procedure into normal and pre-obstruction areas. These areas are only found in 16 of the 22 patients; the absence of this phenomenon was the parameter used for the exclusion of patients. Thus, our information setting is made up of the mechanical ventilator signals from 16 COVID-19 patients in the ICU, catalogued into the normal zone and the pre-obstruction zone of the endotracheal tube.

Fig. 1. Characterization of the ventilation mechanics data. a) Level of missing data; white color indicates 100% missing data. b) Characterization of the Ppeak data by median (dotted line), interquartile range (box height) and data dispersion for all patients.

The next stage in the signal pre-processing leading to the data set to be used as
input for the BigLSTM network is to determine the time magnitude of the convergence
dynamic towards the pre-obstruction area. This convergence is seen in the observations
of the normal area from the variable signals from the mechanical ventilator of the patients
included in the study. This magnitude quantifies the time in which a patient will enter
into the aforementioned pre-obstruction area.
The irregular sampling and the high dispersion of the associated signal values are another consideration warranting attention. For example, in Fig. 1b we show the data associated with the Ppeak signal for all patients included in our data set. This Ppeak signal has a median that varies between 15 and 35: three patients have values lower than 20, three other patients have values between 20 and 25, nine patients have values in the 25 to 30 range, and only one patient has a value greater than 30.
Lastly, the variability of the outliers is also evident in Fig. 1b. The number of patients
with no or few outliers as well as those with a high number of outliers can be observed.

3.2 Methods
In this section we present our smart computing solution for endotracheal obstruction
prediction, the BigLSTM. It is a modular neural recurrent architecture, which proposes
a different approach for tolerance of missing data and the non-existence of a sampling
period in time series (Fig. 2). BigLSTM incorporates information redundancy injection
and an effective mechanism to deal with time control.

At present, the high significance of redundancy in the brain is recognised [11], essentially in its plasticity and its capacity for input pattern reconstruction, especially in a missing-information environment [12]. Therefore, BigLSTM is a deep neural architecture with some biological plausibility. BigLSTM not only captures the long-term temporal dependencies in time series, but also achieves better prediction results.

Fig. 2. Functional and modular diagram of BigLSTM.

The modular structure of BigLSTM is made up of five interconnected modules, each with a specific objective, and it works following an explicit decomposition scheme [13] of its functions.
Four of the BigLSTM modules are information processing modules and the remain-
ing one is dedicated to some control tasks. The function and responsibility of each
module and its information processing and transmission style will be explained in the
following paragraphs.
We define the concepts Observation and Prediction in the context of BigLSTM. They will help to deal with the description of the modular architecture BigLSTM.
Definition 1.- We define an Observation (O) as the set of values taken by the input signals to the system at an instant of time t_i. Therefore, O_i = {S_0, S_1, S_2, …, S_j, …, S_{n-1}; E_0, E_1, E_2, …, E_k, …, E_{m-1}; t_i}, where S_j is a real value or a missing value (indicated by *), the E_k are also values belonging to R, corresponding to the signal labels for the BigLSTM supervised learning scheme, and t_i is the temporal moment of the observation.
Definition 2.- We define a Prediction (P) as the set of values that are outputs of the system for an input O_i, which is only composed of problem signals {S_j}. Then, P_i = {p_0, p_1, p_2, …, p_k, …, p_{m-1}; t_i}, where the p_k are also values given by real numbers.
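A minimal sketch of these two definitions as Python data structures; the field names are ours, and missing values are encoded as None.

from dataclasses import dataclass
from typing import List, Optional

MISSING = None   # a missing signal value, noted '*' in the text

@dataclass
class Observation:
    """O_i = {S_0..S_{n-1}; E_0..E_{m-1}; t_i}: input signals (possibly
    missing), supervision labels, and the time stamp of the observation."""
    signals: List[Optional[float]]   # S_j, None where the value is missing
    labels: List[float]              # E_k, targets for supervised learning
    t: float                         # t_i, time of the observation

@dataclass
class Prediction:
    """P_i = {p_0..p_{m-1}; t_i}: predicted values and their time stamp."""
    values: List[float]
    t: float

o = Observation(signals=[30.0, MISSING, 18.0, MISSING, 0.05], labels=[1.0], t=3.75)
print(o)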

Information Distribution Module


The Information Distribution Module (IDM) consists of a binomial configuration of
gates and it is responsible for receiving the observations Oi at each time instant, ti .
The dynamics of the IDM gates sends the Oi , organized by Indexed Observations
Tracks, to the next phase of computation, the Central Computation Module (CCM).
Definition 3.- We define an IOT as a set of O_i observations that are grouped by a gate when all of them share the same pattern of non-missing data, without losing, in any case, their temporal reference, Eq. (1):

IOT_w = {(O_0, t_0); (O_1, t_1); …; (O_i, t_i); …; (O_h, t_h)}    (1)

The different IOTs are generated following the BigLSTM Information Redundancy Scheme, which determines the O_i that compose each IOT.
Definition 4.- The Information Redundancy Scheme (IRS) determines the different
IOTs, and therefore the way in which the information is delivered to the different LSTM
computation cells of the BigLSTM. It is also responsible for the adequate redundancy of
information in the system so that it can be a prediction system tolerant of missing data.
The IRS makes it possible to determine the redundancy index with which the
BigLSTM system will work. It does so by means of the Redundancy Factor (RF) of
the observations.
Definition 5.- We define the Redundancy Factor (ORF) associated to each observation as a parameter that measures the information redundancy index produced by the applied redundancy scheme. Thus, for an IRS of maximum redundancy and an O_i composed of n possible values of basic signals, where r of them correspond to missing data, the ORF has the following expression:

ORF(O_i) = 2^{n−r} (2^r − 1)    (2)

By virtue of the above definition, the System Redundancy Factor (SRF), when working with h observations, is defined by the following expression:

SRF({O_i}) = \sum_{i=0}^{h} 2^{n−r_i} (2^{r_i} − 1)    (3)

where r_i corresponds to the number of missing data values in O_i.


We define the IRS with minimum information redundancy, when IOTs are configured
as follows: two observations belong to the same IOT if and only if both share the same
missing data pattern.
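The following minimal sketch computes the ORF and SRF of Eqs. (2)-(3) and groups observations into IOTs under the minimum-redundancy IRS; missing values are assumed to be encoded as None, and the function names are ours.

from collections import defaultdict

def orf(signals):
    """Redundancy Factor of one observation under the maximum-redundancy IRS:
    ORF = 2**(n - r) * (2**r - 1), with n signals and r missing values."""
    n = len(signals)
    r = sum(1 for s in signals if s is None)
    return 2 ** (n - r) * (2 ** r - 1)

def srf(signal_lists):
    """System Redundancy Factor: sum of the per-observation ORFs."""
    return sum(orf(s) for s in signal_lists)

def iots_min_redundancy(observations):
    """Minimum-redundancy IRS: observations sharing the same missing-data
    pattern are grouped into the same Indexed Observations Track (IOT)."""
    tracks = defaultdict(list)
    for t, signals in observations:
        pattern = tuple(s is None for s in signals)
        tracks[pattern].append((t, signals))
    return list(tracks.values())

obs = [(0.0, [30.0, None, 18.0]), (1.5, [29.0, None, 17.0]), (2.0, [31.0, 12.0, None])]
print(orf(obs[0][1]), srf([s for _, s in obs]), len(iots_min_redundancy(obs)))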

Central Computing Module


The Central Computation Module (CCM) is formed by an array of LSTM cells. The
LSTM is considered a gating network, since it incorporates gates that modify the data
flows in it. It is used in the treatment of sequences of data [14]. It is able to deal with the
vanishing gradient problem [15], by using Constant Error Carousel (CEC) units.
The inputs to the LSTM cells are obtained through a temporal windowing process over the observations belonging to their IOTs.
The objective of each LSTM cell is the association of the basic signals {S 0 , S 1 ,
S 2 , …, S j , …, S n }, corresponding to the observations of the IOT, with the Prediction

Signals {E 0 , E 1 , E 2 , …, E k , …, E m }, for different time instants t j . It is the number of


observations in the IOT, together with the window size, that determines the number of
predictions per LSTM cell.
The number of LSTM cells needed in this module is a function of the IRS used and the rate of missing data in the observations. As we will also see in a later section, the BigLSTM has a control intelligence for turning the cells on and off, depending on their contribution to the predictions.
The learning process of this module is performed by the LSTM Algorithm [2].
The output of this module is an essential part of the output system, namely of the
prediction sought, which it computes through the set of partial predictions made by the
network.
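As an illustration of the temporal-window process, the following sketch builds the windowed inputs of one LSTM cell from the observations of its IOT; the window size and data are illustrative (the paper's final configuration uses a window of 15).

import numpy as np

def windowed_sequences(iot, window):
    """Builds the temporal-window inputs of one LSTM cell from the
    observations of its IOT (each observation is a vector of signal values).
    Returns an array of shape (num_windows, window, num_signals)."""
    iot = np.asarray(iot, dtype=float)
    if len(iot) < window:
        return np.empty((0, window, iot.shape[1]))
    return np.stack([iot[i:i + window] for i in range(len(iot) - window + 1)])

# Example: 6 observations of 3 signals, window size 3 -> 4 windows
iot = np.arange(18).reshape(6, 3)
print(windowed_sequences(iot, window=3).shape)   # (4, 3, 3)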

Predictive Module
The Predictive Module (PM) takes the output of the CCM and computes the final prediction of the system {p̂_i}. This module also sends information to the LSTM Cell On/Off Control Module regarding the contribution that each cell is making to the predictions.
The operating dynamics of this module is given by the Predictive Convergence Strategy selected and will be conditioned by the IRS.
Definition 6.- We define the Predictive Convergence Strategy (PCS) as the mechanism by which all the partial predictions generated in the LSTM cells are processed to convert them into the final prediction; it is given by the function F, Eq. (4).
Thus, working with an IRS that injects maximum redundancy, the predictions sent by the CCM are processed so as to contribute fairly to the final predictions of the system.

F(…, {ρ_k}_h, …) = {p̂_i}    (4)

where {ρ_k}_h corresponds to the set of partial predictions ρ_k performed by the LSTM cell h, and {p̂_i} is the set of final predictions, computed by this expression:

p̂_i = \sum_{k=0}^{ORF(O_i)} ρ_k / ORF(O_i)    (5)

where O_i is the observation i for which we are computing its prediction and ORF(O_i) is the Redundancy Factor of that Observation.
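A minimal sketch of this convergence step for a single observation, following Eq. (5); the partial-prediction values are illustrative.

def converge_prediction(partial_predictions, orf_value):
    """Predictive Convergence Strategy for one observation under maximum
    redundancy: the final prediction is the sum of the partial predictions
    divided by the observation's Redundancy Factor (Eq. 5)."""
    return sum(partial_predictions) / orf_value

# Example: 4 LSTM cells contribute partial predictions for an observation
# whose ORF is 4, so each contributes with equal weight.
print(converge_prediction([0.8, 0.75, 0.9, 0.85], orf_value=4))   # 0.825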

LSTM Cells On/Off Control Module


The On/Off Control Module (on/off CM) of LSTM cells verifies the contribution to the
final prediction of each LSTM cell. It works in coordination with the PM, and receives
from the latter the measure of the difference between the partial predictions made by each LSTM cell and the final predictions.

ψ(…, ({ρ_k}_h, {p̂_k}_h), …) = (…, δ_h, …)    (6)

where {ρ_k}_h corresponds to the set of partial predictions ρ_k performed by the LSTM cell h, {p̂_k}_h refers to the subset of final predictions to which LSTM cell h is contributing, and ψ is the function that receives the previous set of tuples from all the LSTM cells

of CCM and calculates the difference vector (…, δ_h, …) according to the following expression:

δ_h = \sum_k (ρ_k − p̂_k)^2    (7)

Finally, the on/off CM makes the decision to turn off, or keep on, each of the LSTM
cells of the CCM, depending on the value taken by the (…, δ h , …) relative to the threshold
vector (…, ηh , …), following Eq. (8).

χ(δ_0, …, δ_h, …, δ_{2^n−1}; η_0, …, η_h, …, η_{2^n−1}) = { cell h off, if δ_h < η_h;  cell h on, if δ_h ≥ η_h }    (8)
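The following sketch reproduces the on/off decision of Eqs. (7)-(8) for one cell, using the thresholding rule as printed; the threshold and prediction values are illustrative.

import numpy as np

def cell_switch(partial, final, eta):
    """On/off decision for one LSTM cell: delta_h is the squared difference
    between its partial predictions and the final predictions it contributes
    to (Eq. 7); the cell is switched off when delta_h < eta_h (Eq. 8)."""
    delta = float(np.sum((np.asarray(partial) - np.asarray(final)) ** 2))
    return ('off' if delta < eta else 'on'), delta

print(cell_switch([0.8, 0.7], [0.82, 0.69], eta=0.01))   # ('off', about 0.0005)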

Time Axis Processing Module


The Time Axis Processing Module (TAPM) allows the BigLSTM to have the capability
of working with sampled observations without a constant sampling rate.
The TAPM performs the prediction of the time moments that are associated with
the predictions. Therefore, the output values of this module index the predictions on the
time axis, forming part of the final output of the BigLSTM, Eq. (9):

OUTPUT_BigLSTM = {(p̂_i, t̂_i)}    (9)

4 Results and Discussions


The analysis of the behavior of the BigLSTM architecture within this context to predict
the time when a patient will enter into a pre-obstruction area is carried out by means
of several simulations (see Tables 1 and 2) leading to the optimal configuration for its
functioning with 15 process units in each LSTM cell, a processing window size of 15 in
each LSTM cell, and an IRS using maximum redundancy. The on/off CM kept all of the
LSTM cells activated in all of the performed simulations.
A drawback of the data set is its size; it is very small, which has an important impact on obtaining an optimal configuration for our system in terms of mean and bounded error terms. An efficient and reliable performance is sought. Hence, a cross-validation procedure is used, specifically Leave-One-Out Cross-Validation (LOOCV) [6].
Use of this procedure and a number of epochs equal to 40 leads to a good Mean
Squared Error (mse) for the training and validation system, Fig. 3a and standard devia-
tion, Fig. 3b, which is rather promising. The mse curves shown in Fig. 3a reveal values
less than 0.1 in the first five epochs and less than 0.05 starting with the 15 epochs, arriv-
ing at the final stage of training to values of 0.02 for the training mse and 0.03 for the
validation mse.
A look at Fig. 3b shows the evolution of the standard deviation of the training mse and the validation mse throughout the training epochs in our system. This evolution shows the way in which these metrics vary among all of the training intervals, allowing us to conclude that, in this case, the training mse shows lower variability in each training instance, since starting with epoch 5 its value is lower than 0.01, reaching a value approximately equal to zero at the end of the training. This is not the case for the validation mse.

Fig. 3. a) Train and validation mse in the leave-one-out cross-validation process, b) train and validation standard deviation of mse in the leave-one-out cross-validation process
The mse values calculated in the normalized system functionality space (with values between 0 and 1) must also be accompanied by error values stated in time units, so that their magnitude can be measured and compared in a clinical environment. These magnitudes are provided by means of √mse, in an hh:mm:ss format, where hh, mm and ss correspond to hours, minutes and seconds, respectively.

Table 1. Values of the error metrics for different system configurations.

Configuration | mse | √mse (*) | √mse_lagged (*) | √mse_anticipated (*)
Configuration 1 – inputWidth: 5, number of LSTM units: 5, and an IRS with maximum redundancy | 0.03106 | 01:14:50 | 00:26:52 | 00:51:29
Configuration 2 – inputWidth: 10, number of LSTM units: 10, and an IRS with maximum redundancy | 0.02427 | 01:13:12 | 00:19:34 | 00:44:40
Configuration 3 – inputWidth: 15, number of LSTM units: 15, and an IRS with maximum redundancy | 0.02273 | 01:02:24 | 00:18:54 | 00:38:42
(*) hh:mm:ss format, where hh corresponds to hours, mm to minutes and ss to seconds.

Just as our aim was to quantify the performance of our BigLSTM in predicting the time remaining for a patient to enter the pre-obstruction area, we have also defined two metrics derived from √mse, which express the error produced by BigLSTM between the predicted entry into the pre-obstruction zone and the exact moment when the patient actually goes into this pre-obstruction area. These error metrics are: the anticipated mean prediction error √mse_anticipated, which quantifies the error produced when the BigLSTM architecture predicts the patient's entry into the pre-obstruction area just before it actually occurs, and the lagged mean prediction error √mse_lagged, for the situation when the opposite occurs. Bearing in mind these error results and the objective of the proposed prediction system, the aim is not only to obtain a low mse, and hence a low √mse, but also to minimize the lagged prediction error.
Table 1 gathers the values of different error metrics in their respective units, from the
proposed prediction system for obstruction of the endotracheal tube for several studied
configurations. As previously commented, the best values for these metrics are obtained
using Configuration 3 (window size 15, 15 LTSM internal cell units, maximal redundancy
divergence scheme).
Another performance measure of the proposed system is the Confidence Index (CI). This index captures the importance of, and the relation between, the lagged prediction error and the anticipated prediction error, and it is obtained from the following expression:
Confidence Index[ε] = √mse_lagged [ γ_ε − √mse_anticipated ]
where γ_ε is a constant defined using ε, which is the length of the new term Prediction Interval.
The Prediction Interval (PI) is defined as a time interval during which the prediction must be made. For example, a value of γ_24 for a 24 h CI establishes that there is an acceptable margin of 24 h, as long as the actual moment of entry into the pre-obstruction area exceeds the moment established by the prediction by less than 24 h.
Lastly, configuration 3 of the BigLSTM-based system is shown to be the optimal configuration as a predictor, and we carried out an analysis of its behavior when producing predictions for different patients, from entry into the ICU until the first pre-obstruction area.
Table 2 summarizes the CI values for different success intervals and configurations.

Table 2. Percentages obtained for the CI

Confidence Index (%)
Configuration | 48 h | 24 h | 12 h | 6 h | 3 h
Configuration 1 – inputWidth: 5, number of LSTM units: 5, and an IRS with maximum redundancy | 84.74 | 56.88 | 0.0 | 0.0 | 0.0
Configuration 2 – inputWidth: 10, number of LSTM units: 10, and an IRS with maximum redundancy | 92.59 | 82.17 | 39.87 | 0.0 | 0.0
Configuration 3 – inputWidth: 15, number of LSTM units: 15, and an IRS with maximum redundancy | 92.51 | 82.31 | 44.57 | 0.0 | 0.0

In Fig. 4a we see the error profile of Patient 12, whose CI value at 24 h is 99.88%, compared to the error profile of Patient 08, Fig. 4b, whose CI value at 24 h is 78.98%.
A comparison of the estimated √mse metric generated by our system for both patients shows that they are not that far apart (00:33:02 for Patient 12 compared to 00:39:50 for Patient 08). However, the same cannot be said when comparing the mean lagged prediction error √mse_lagged for both patients (00:00:03 for Patient 12 compared to 00:37:27 for Patient 08).
Bearing in mind these results we come to the conclusion that our system found
it much more difficult to carry out estimates for Patient 08 than for Patient 12. This
conclusion can be justified when we observe the missing data in both patients (Fig. 1a),
since we have already noticed that Patient 12 has lower percentages of missing data in
all of the signals, while Patient 08 is missing almost 100% of the data in all signal types
except for Ppeak.

Fig. 4. Error profile in a) patient_12 and b) patient_08 predictions.

5 Conclusions
In this work we have faced the challenging medical problem of endotracheal obstruction in COVID-19 patients in the ICU. We have developed BigLSTM, a new modular deep neural computing architecture, to predict such an obstruction in advance, that is, exactly when the patient enters a pre-obstruction zone.
The information environment consists of time series of values from mechanical ventilation signals, which are recorded irregularly and where there is a high rate of missing data.
The main ability of BigLSTM is the incorporation of information redundancy injection and a mechanism to deal with time control.
In order to assess the performance of our proposed system, we have defined some metrics, such as the anticipated prediction error √mse_anticipated, the lagged prediction error √mse_lagged, and a first approximation to a Confidence Index (CI). The results given by our system are, according to ICU clinicians, pretty good: mse = 0.02273 when BigLSTM works in a normalized space, a CI [48 h] of 92.51% and a CI [24 h] of 82.31%. Regarding the individual predictions for each patient, we have obtained prediction error profiles, observing very good CI values. We have been able to verify that the low CI value of one patient is correlated with the total absence of observations in 4 of the 5 signals that we are using.

The efficiency of this new architecture, working even with a very small data set,
allows us to believe that our system can be a very appropriate smart computing solution
to perform more effective and reliable healthcare. These outcomes encourage future
work to improve this new recurrent neural architecture. We will use bigger data sets
from new COVID-19 outbreaks.
Finally, the feasibility of solving the missing data problem by means of injecting
information redundancy is observed, as well as the necessity of processing the time axis
when we work with an irregular sampling of observations.

Acknowledgement. This research work has been funded by the University of Las Palmas de Gran
Canaria, through the project COVID 19–11, “Aplicación de técnicas de machine learning para la
detección temprana de obstrucción del tubo endotraqueal en pacientes COVID-19 en UCI”.

References
1. Ferrando, C., et al.: Patient characteristics, clinical course and factors associated to ICU
mortality in critically ill patients infected with SARS-CoV-2 in Spain: a prospective, cohort,
multicentre study. Rev. Esp. Anestesiol. Reanim. 67(8), 425–437 (2020)
2. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9, 1735–1780
(1997). https://doi.org/10.1162/neco.1997.9.8.1735
3. Suykens, J.A.K., Van Gestel, T., De Brabanter, J., De Moor, B., Vandewalle, J.: Least Squares
Support Vector Machines. World Scientific Pub. Co., Singapore (2002)
4. Cho, K., et al.: Learning Phrase Representations using RNN Encoder-Decoder for Statistical
Machine Translation (2014). eprint arXiv:1406.1078
5. Che, Z., Purushotham, S., Cho, K., et al.: Recurrent neural networks for multivariate time
series with missing values. Sci. Rep. 8, 6085 (2018)
6. Sammut, C., Webb, G.I.: Leave-one-out cross-validation. In: Sammut, C., Webb G.I. (eds)
Encyclopedia of Machine Learning. Springer, Boston, MA (2011). https://doi.org/10.1007/
978-0-387-30164-8_469
7. Pham, T., Brochard, L.J., Slutsky, A.S.: Mechanical Ventilation: State of the Art. Mayo Clinic
Proc. 92(9), 1382–1400 (2017). https://doi.org/10.1016/j.mayocp.2017.05.004
8. Silva, P.L., Pelosi, P., Rocco, P.R.M.: Optimal mechanical ventilation strategies to minimize
ventilator-induced lung injury in non-injured and injured lungs. Expert Rev. Resp. Med.
10(12), 1243–1245 (2016)
9. MacIntyre, N.R.: Evidence-based guidelines for weaning and discontinuing ventilatory sup-
port: a collective task force facilitated by the American college of chest physicians the Amer-
ican association for respiratory care and the American college of critical medicine. Chest.
120(6), 375–395 (2001). https://doi.org/10.1378/chest.120.6_suppl.375s
10. ICU Care Manager Software | Patient Care Unit System | Picis. https://www.picis.com/en/solution/clinical-information-system-suite/critical-care-manager/. Accessed 25 Apr 2021
11. Kandel, E.: In search of memory: the emergence of a new science of mind. FASEB J. 20,
1043–1044 (2006). https://doi.org/10.1096/fj.06-0604ufm
12. Chariker, L., Shapley, R., Young, L.-S.: Rhythm and synchrony in a cortical network model.
J. Neurosci. 38(40), 8621–8634 (2018)
13. Sharkey, A.J.C. (ed.): Combining Artificial Neural Nets: Ensemble and Modular Multi-Net
Systems. Springer Science & Business Media, Heidelberg (2012)

14. Hochreiter, S., Schmidhuber, J.: LSTM can solve hard long time lag problems. In: Mozer,
M.C., Jordan, M.I., Petsche, T. (eds.) Advances in Neural Information Processing Systems 9
(NIPS 9), pp. 473–479. MIT Press, Cambridge (1997)
15. Hochreiter, S., Bengio, Y., Frasconi, P., Schmidhuber, J.: Gradient flow in recurrent nets: the
difficulty of learning long-term dependencies. In: Kremer, S.C., Kolen, J.F. (eds.) A Field
Guide to Dynamical Recurrent Neural Networks. IEEE Press (2001)
Advanced Topics in Computational
Intelligence
Features Spaces with Reduced Variables
Based on Nearest Neighbor Relations
and Their Inheritances

Naohiro Ishii1(B) , Kazunori Iwata2 , Naoto Mukai3 , Kazuya Odagiri3 ,


and Tokuro Matsuo1
1
Advanced Institute of Industrial Technology, Tokyo 140-0011, Japan
[email protected], [email protected]
2
Department of Business Administration, Aichi University, Nagoya 453-8777, Japan
[email protected]
3
Sugiyama Jyogakuen University, Nagoya 464-8662, Japan
{naoto,kodagiri}@sugiyama-u.ac.jp

Abstract. The generation of useful variables in feature spaces is an important issue throughout neural networks, machine learning and artificial intelligence for their efficient and discriminative computations. In this paper, nearest neighbor relations are proposed for the minimal generation of reduced variables for the feature spaces. First, the nearest neighbor relations are shown to be minimal, independent and inherited for the construction of the feature space. For the analysis, convex cones are made of the nearest neighbor relations, which are independent vectors for the generation of the reduced variables. Then, edges of the convex cones are compared for the discrimination of variables. Finally, feature spaces with the reduced variables based on the nearest neighbor relations are shown to be useful for the classification of real documents.

Keywords: Nearest neighbor relation · Independent vectors · Convex


cones · Inheritance of the relation

1 Introduction

In recent years, neural networks play an important role in intelligent and useful functions for AI and machine learning. Convolutional networks with orthogonal weights have been proposed for the improvement of network stability [1,11]. In the feature space, dimensionality reduction is a topic in machine learning [2,10,12]. Further, feature vectors are considered to be as orthogonal as possible between different classes [9]. These studies show some expectations to enhance the network ability for intelligent processing. For the construction of intelligent feature spaces, the independence of the feature input vectors will be useful for the stable and efficient processing followed by the networks for generation and


classification of the objects. First, to draw out the discriminative ability of the feature input vectors, we developed fundamental schemes of nearest neighbor relations and their analysis for application to classification in threshold networks. Then, the feature spaces with reduced variables are generated, followed by the classification in the networks. The nearest neighbor relations are shown to be related to independent feature vectors. Next, the convex cones of the nearest neighbor relations are developed for the reduction of variables [5]. Then, operations on the degenerate convex cones using nearest neighbor relations are developed for the discrimination of variables using their dependence properties, called the inheritance. The dependence relations and algebraic operations of the edges on the degenerate convex cones in the subspaces are derived for the reduced variables. Finally, experiments are performed to verify the effectiveness of the reduced variables for the classification of the Reuters collection [8]. It is shown that the reduced variables based on the nearest neighbor relations and their properties are effective for the classification of real data.

2 Nearest Neighbor Relations


As an application stage of the threshold function, the nearest neighbor relation with minimal distance is introduced here for general data in Euclidean space. The relation with minimal distance plays an important role in the reduction of variables for general data.
Definition 1. A nearest neighbor relation with minimal distance is a set of pairs of instances, described as

{(x_i, x_j) : d(x_i) ≠ d(x_j) ∧ |x_i − x_j| ≤ δ}    (1)

where |x_i − x_j| shows the distance between x_i and x_j. Further, d(x_i) is a decision function and δ is the minimal distance. Then, x_i and x_j in Eq. (1) are said to be in the nearest neighbor relation with minimal distance δ.
We assume here a linear subspace to characterize nearest neighbor relations. Instances of the nearest neighbor relations defined by (1) are separated by a hyperplane in the linear subspace. Assume the hyperplane f(u) = w_1 u_1 + w_2 u_2 + … + w_n u_n − θ = 0 with the weight parameters W (= {w_i, i = 1 ∼ n}) and a constant θ. In this linear subspace, we assume two classes, the class f^+(u) > 0 and the class f^−(u) < 0. Then, the boundary vertex v ∈ Ω (Ω is the set of instances) is defined as follows:
Definition 2. The boundary vertex v is defined to be the vertex which satisfies

|Wv − θ| ≤ |Wv' − θ| for v, v' ∈ Ω    (2)

in the different classes, i.e., d(v) ≠ d(v').


Theorem 1. The boundary vertex v becomes an element of a nearest neighbor relation.
This is shown by adjusting δ in Eq. (1) so that the other instance v' paired with v is obtained. If no such v' exists, only one class is assumed, which contradicts the problem here. In the linear subspace, the characteristic properties of the inequalities are shown in Ky Fan's theory [4,6], as stated in Theorem 2.
Theorem 2. A necessary and sufficient condition of the linear separability for
the n weights and a threshold {(n + 1) variables} is the existence of the (n + 1)
inequalities, which are independent inequality equations.
Since these (n + 1) independent inequalities can be replaced by equality equations having the value + or −, the (n + 1) vertices corresponding to these equations become boundary vertices. Applying Theorem 2 to the nearest neighbor relations in the linear feature space, the following theorems are derived.
Theorem 3. There exist at least n nearest neighbor relations in the linear fea-
ture space with n input variables. Further, at least one boundary vertex is included
in the respective nearest neighbor relation. Thus, the nearest neighbor relations
compose at least n independent vectors.
Since the nearest neighbor relations contain independent vectors from the data, they will be useful for the classification and the learning using the independent characteristics.

2.1 Reduced Variables in the Feature Spaces


An example of a decision table is shown in Table 1. The left-hand column of Table 1, {x_1, x_2, x_3, …, x_7}, is the set of instances, while the data {a, b, c, d} on the upper row shows the set of attributes of the instances. In Table 1, the Boolean variables of the nearest neighbor relations are shown in the gray elements, which are derived from Table 1. In Fig. 1 (A), linearly separated classifications are shown for the data in Table 1. As the first step, one hyperplane A1 divides the instances {x_3, x_5, x_7, x_4}, shown with ×, from those of {x_2, x_6}, shown with •. But it misclassifies the instance {x_1} with •, placing it with {x_3, x_5, x_7, x_4} on the × side. As the second step, another hyperplane A2 divides the instance {x_1} with • from {x_3, x_5, x_7, x_4} with ×. The hyperplane A1 in Fig. 1 (A) is computed from independent instances of the nearest neighbor relations. The hyperplane A1 separates the instances {x_3, x_5, x_7, x_4} and {x_2, x_6}. Among these instances, the nearest neighbor relations {(x_3, x_6), (x_5, x_6), (x_7, x_6)} hold. The hyperplane A2 separates the instances {x_3, x_5, x_7, x_4} and {x_1}, as shown in Fig. 1 (A). When instances {x_i} are described as geometrical objects, the capital letters {X_i} are used, while {X'_i} are those with reduced variables.
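As an illustration, the following sketch enumerates the nearest neighbor relations of Definition 1 for the instances of Table 1, reading Eq. (1) as pairing instances of different classes (as stated in Sect. 2.2); the value δ = 1.5 is our own choice for illustration.

import numpy as np

def nearest_neighbor_relations(X, d, delta):
    """Pairs (x_i, x_j) from different classes (d(x_i) != d(x_j)) whose
    Euclidean distance does not exceed the minimal distance delta (Eq. 1)."""
    X = np.asarray(X, dtype=float)
    pairs = []
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            if d[i] != d[j] and np.linalg.norm(X[i] - X[j]) <= delta:
                pairs.append((i, j))
    return pairs

# Instances x1..x7 of Table 1 with attributes (a, b, c, d) and their classes
X = [[1,0,2,1], [1,0,2,0], [2,2,0,0], [1,2,2,1], [2,1,0,1], [2,1,1,0], [2,1,2,1]]
cls = [+1, +1, -1, -1, -1, +1, -1]
print(nearest_neighbor_relations(X, cls, delta=1.5))
# -> [(0, 6), (2, 5), (4, 5), (5, 6)], i.e. (x1,x7), (x3,x6), (x5,x6), (x6,x7)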

2.2 Boolean Representation of Nearest Neighbor Relations on the


Convex Cone
Since the nearest neighbor relations carry minimal discrimination information, the Boolean product of their edges in the convex cone in Fig. 1 (B) shows the minimal discrimination. Thus, the Boolean product becomes (b + c) · (c + d) = c + bd [5]. This is generalized as follows:

Table 1. Decision table of data example (instances)

Attribute a b c d class
x1 1 0 2 1 +1
x2 1 0 2 0 +1
x3 2 2 0 0 −1
x4 1 2 2 1 −1
x5 2 1 0 1 −1
x6 2 1 1 0 +1
x7 2 1 2 1 −1


Fig. 1. Instances in the space and convex cone generated by nearest neighbor relations

Theorem 4. The Boolean product of the edges of nearest neighbor relations on the convex cone shows minimal discrimination information.
A nearest neighbor relation is a relation between the nearest data belonging to different classes. Data other than those of the nearest neighbor relations are then described by the following data chaining, which we call here the inheritance of nearest neighbor relations.

2.3 Inheritances of Nearest Neighbor Relations


The instances of the nearest neighbor relations are characterized by the chaining of the differences between data X_i and X_j in different classes, which we call here the pathway from X_i to X_j.

(X_i − X_j) = (X_i − X_{k_1}) + (X_{k_1} − X_{k_2}) + … + (X_{k_m} − X_j)    (3)

When (X_{k_m} − X_j) shows the difference of a nearest neighbor relation with variables u, …, w, (3) is represented using the instances {X_i} with the reduced variables u, …, w.

       
(X_i − X_j)_{u…w} = (X_i − X_{k_1})_{u…w} + (X_{k_1} − X_{k_2})_{u…w} + … + (X_{k_m} − X_j)_{u…w}    (4)

The left side of (4), (X_i − X_j)_{u…w}, shows the difference value, in the components u, …, w, between X_i and X_j, which is non-zero in those components. Then, the difference (X_i − X_j)_{u…w} is inherited from the nearest neighbor relation (X_{k_m} − X_j)_{u…w}. Thus, the following lemma holds.
 
Lemma 1. The difference (X_i − X_j)_{u…w} is inherited from the nearest neighbor relation (X_{k_m} − X_j)_{u…w} if the following equation holds:

(X_i − X_{k_1})_{u…w} + \sum_{i=1}^{m−1} (X_{k_i} − X_{k_{i+1}})_{u…w} ≠ −(X_{k_m} − X_j)_{u…w}    (5)

This is proved as follows. If (5) does not hold, the equality holds in (5). Thus, the right side in (4) becomes zero, and therefore the left side (X_i − X_j)_{u…w} becomes zero. This contradicts the assumption that (X_i − X_j)_{u…w} is nonzero in the components u, …, w.
When the nearest neighbor relation (X_{k_m} − X_j)_{u…w} is given in advance, the existence of the pathway for (X_i − X_j)_{u…w} is shown in the next theorem.
 
Theorem 5. For the nearest neighbor relation (X_{k_m} − X_j)_{u…w}, the pathway of the difference (X_i − X_j)_{u…w} exists if the following equation holds:

(X_i − X_{k_m})_{u…w} ≠ −(X_{k_m} − X_j)_{u…w}    (6)

Since X_i ≠ X_j, Eq. (6) is always satisfied. This shows that (X_i − X_j)_{u…w} is inherited from the nearest neighbor relation (X_{k_m} − X_j)_{u…w}.
 
Corollary 1. The Boolean expression of the difference (X_i − X_j)_{u…w} is absorbed by the nearest neighbor relation (X_{k_m} − X_j)_{u…w}.

An example of the inheritance of the nearest neighbor relation is described as follows. We consider the data difference X_2 − X_5 in Table 1. Since X_2 and X_5 are different in the components c and d, the difference between X_2 and X_5 with the degenerate variables {c, d} is considered as (X_2 − X_5)_{c,d}. From Theorem 5,

(X_2 − X_5)_{c,d} ≠ (X_6 − X_5)_{c,d}    (7)

holds, where {X_6, X_5} is a nearest neighbor relation in Fig. 1 (B). Then, the pathway exists as follows:

(X_2 − X_5)_{c,d} = (X_2 − X_7)_{c,d} + (X_7 − X_6)_{c,d} + underline[(X_6 − X_5)]_{c,d}    (8)

where the underline indicates the nearest neighbor relation. From Corollary 1, (X_2 − X_5)_{c,d} is removed, being absorbed by the nearest neighbor relation (X_6 − X_5)_{c,d}, in the Boolean representation.
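The following sketch reproduces the restricted differences of this worked example from the attribute values of Table 1; the comparison of non-zero component patterns is our reading of the inheritance condition, and the helper names are ours.

import numpy as np

# Instances from Table 1, attributes ordered (a, b, c, d)
X = {2: [1, 0, 2, 0], 5: [2, 1, 0, 1], 6: [2, 1, 1, 0], 7: [2, 1, 2, 1]}
idx = {'a': 0, 'b': 1, 'c': 2, 'd': 3}

def restricted_diff(i, j, comps):
    """Difference (X_i - X_j) restricted to the components in comps."""
    return np.array([X[i][idx[c]] - X[j][idx[c]] for c in comps])

comps = ('c', 'd')
lhs  = restricted_diff(2, 5, comps)                         # (X2 - X5)_{c,d}
path = (restricted_diff(2, 7, comps) + restricted_diff(7, 6, comps)
        + restricted_diff(6, 5, comps))                     # chaining pathway, Eq. (8)
nn   = restricted_diff(6, 5, comps)                         # nearest neighbor relation (X6 - X5)
print(lhs, path)                                   # the pathway sum reproduces (X2 - X5)_{c,d}
print((lhs != 0).tolist(), (nn != 0).tolist())     # both are non-zero in c and d, so the
                                                   # difference is inherited from {X6, X5}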

3 Inheritances of Nearest Neighbor Relations


for Multiple Clusters
Multi-classification with more than 3 classes is important. The nearest neighbor analysis for multiple clusters thus becomes the basis of multi-classification analysis. In Table 2, we add data of a different cluster to that in Table 1.

Table 2. Decision table of additional data (2nd cluster)

Attribute a b c d class
z1 1 1 1 1 −1
z2 1 0 1 2 −1
z3 0 1 2 1 +1
z4 2 2 1 1 +1
z5 2 2 2 2 +1

3.1 Generation of Pathways Through Nearest Neighbor Relations


From any data z_i in the second cluster to the data x_i in the first cluster set, a path connecting both data is made through nearest neighbor relations in several steps (Fig. 2).


Fig. 2. Convex cones made of two clustered data




Fig. 3. Convex cones made of crossed data

As an example, the difference between z_4 in the second cluster and x_3 in the first cluster is stated as (z_4 − x_3), which is rewritten as the right-hand side of (9):

(Z_4 − X_3)_{c,d} = (Z_4 − X_7)_{c,d} + (X_7 − X_6)_{c,d} + (X_6 − X_3)_{c,d}    (9)

where (Z_4 − X_3)_{c,d} shows that the difference is defined in the attributes c and d of Z_4 and X_3. For the second terms in (9), if the following inequality holds in both attributes c and d,

(Z_4 − X_7)_{c,d} + (X_6 − X_3)_{c,d} ≠ (X_6 − X_7)_{c,d}    (10)

then the first term in (9) satisfies

(Z_4 − X_3)_{c,d} ≠ 0_{c,d}    (11)

Equation (11) shows the nearest neighbor relation is inherited from


(X6 − X7 ){c,d} in the first cluster, since (X6 − X7 ){c,d} = 0 holds as a near-
est neighbor relation. Since only one X6 is inserted In (9), we call here one step
propagation of the nearest neighbor relation {X6 , X7 }. By this propagation of
nearest neighbor relation, the pair of the difference information of {z4 , x3 } is
removed, since the nearest neighbor relation, {x6 , x7 } has its difference informa-
tion (c, d). Generally, we extend the chaining of paths as fol-lows. Between the
different clusters {Zi } and {Xi }, assume here the start data Zs ∈ {Zi } and the
target data Xt ∈ {Xi } exist between clusters. First, we consider the difference
of the two data, (Zs − Xt ), which is described as the chaining pathway, which
connects Zs and Xt .
 
$$(Z_s - X_t) = (Z_s - X_l) + (X_l - X_m) + \cdots + \underline{(X_n - X_t)} \qquad (12)$$

where we assume that $\{X_n, X_t\}$, the underlined term on the right, is a nearest neighbor relation and that the nonzero components of $(X_n - X_t)$ are $\{k\}$. Then,

$$(Z_s - X_t)_k = (Z_s - X_l)_k + (X_l - X_m)_k + \cdots + (X_n - X_t)_k \qquad (13)$$

Second, if the following equation holds on the right-hand side of (13),

$$(Z_s - X_l)_k + \sum_{i=l,\,j=m}^{n} (X_i - X_j)_k = -(X_n - X_t)_k \qquad (14)$$

then the left side of (13) becomes

$$(Z_s - X_t)_k = 0 \qquad (15)$$

Equation (15) shows that the difference of the two data points, $(Z_s - X_t)$, inherits that of the nearest neighbor relation. Conversely, if (15) is satisfied, the same difference information as that of the nearest neighbor relation with the components $k$ exists through the chaining pathway. This is generalized in the following theorem.

Theorem 6. Assume that the components $(l_1, l_2, \ldots, l_m)$ differ in value between the starting data $S$ and the target data $T$, which belongs to a different class than $S$. The chaining pathway between $S$ and $T$ has the difference information of the nearest neighbor relation with the nonzero components $(l_i, \ldots, l_k)$, which are included in $(l_1, l_2, \ldots, l_m)$.

Then, the difference information between data in different clusters is removed in the Boolean expression as being the same difference information as that of the nearest neighbor relations. Theorem 6 is applied iteratively to the differences of data between clusters for the processing of the cross terms. After the iterative processing, the remaining cross-data terms are utilized to generate a convex cone for the cross terms of the two clustered data. In Fig. 3, a convex cone for the cross terms generated from the two clustered data is shown. Edges of the convex cone show the variables $\{a, d\}$ and $\{b, d\}$. Further, in a processing step, a new variable $\{a\}$ is generated. Thus, the Boolean expression of the cross terms between $\{z_i\}$ and $\{x_i\}$ becomes

$$a \cdot (a + d) \cdot (b + d) = a \cdot b + a \cdot d \qquad (16)$$

By combining the three convex cones in Fig. 1 and Fig. 3, the Boolean expression of the total cluster set becomes

$$(a \cdot b + a \cdot d) \cdot (b \cdot c + a \cdot b \cdot d) = a \cdot b \cdot c + a \cdot b \cdot d \qquad (17)$$

Thus, as data clusters are added, the number of variables used for the discrimination of the data increases.
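The Boolean simplifications in (16) and (17) are easy to check mechanically. The following short sketch verifies them with SymPy's Boolean algebra; the variable names a, b, c, d mirror the attributes of Tables 1 and 2, and the snippet is only a consistency check, not part of the reduction procedure itself.

```python
from sympy import symbols
from sympy.logic.boolalg import Or, And, Equivalent, simplify_logic
from sympy.logic.inference import satisfiable

a, b, c, d = symbols('a b c d')

# Eq. (16): a*(a+d)*(b+d) should reduce to a*b + a*d
lhs16 = And(a, Or(a, d), Or(b, d))
rhs16 = Or(And(a, b), And(a, d))

# Eq. (17): (a*b + a*d)*(b*c + a*b*d) should reduce to a*b*c + a*b*d
lhs17 = And(Or(And(a, b), And(a, d)), Or(And(b, c), And(a, b, d)))
rhs17 = Or(And(a, b, c), And(a, b, d))

# Two Boolean expressions are equal iff their 'Equivalent' is a tautology,
# i.e. the negation of the equivalence is unsatisfiable.
for name, lhs, rhs in [("Eq. (16)", lhs16, rhs16), ("Eq. (17)", lhs17, rhs17)]:
    equal = satisfiable(~Equivalent(lhs, rhs)) is False
    print(name, "holds:", equal, "| simplified form:", simplify_logic(lhs))
```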

4 Multi-classification Using Reduced Variables for Document Categorization

To evaluate the reduced variables of the previous Sect. 3, the classification of the Reuters collection (Reuters-21578), which is publicly available [8], is computed in our experiments. Features of the data are obtained by removing the SGML tags, suffix stripping [7], and entropy-weighting for the documents in the Reuters collection. Features consist of characteristic words of the documents. Six different corpora of the Reuters collection were used in our experiment: Alum, Cocoa, Copper, Cpi, Gnp, and Rubber [8]. Thus, these data consist of six classes.

4.1 Classification Using Reduced Variables

In this paper, the k-Nearest Neighbor method (kNN) [3] is applied for the classification of the data into classes. The nearest neighbor classification is based on feature similarity computed as the inner product of vectors. To make use of the reduced variables developed in Sect. 3, a system of multiple sets of reduced variables followed by kNN classification is proposed in Fig. 4, in which $R_1 = \{r_1, r_2, \ldots, r_{n_1}\}$ shows the first set of reduced variables $r_1, r_2, \ldots, r_{n_1}$. In total, $m$ sets of reduced variables are assumed here. Since the Reuters collection used in this paper has six classes, the computation of the reduced variables requires the extension to multiple clusters described in Sect. 3. To improve the classification accuracy, multiple sets of reduced variables $\{R_1, R_2, \ldots, R_m\}$, each followed by its respective kNN classification system, are proposed.

Fig. 4. Classification system with reduced variables – kNN

Features consist of 87 words for $m = 5$, such as gross (241), cocoa (3), growth (246), cost (129), icco (1122), march (67), ton (68), estim (37), effect (102), ..., which are selected among 2049 keywords. Further, to estimate the $k$ value for the Reuters collection, a document classification experiment was carried out using kNN classification. The evaluation measures recall, precision, and accuracy are well known in classification; these measures are computed for changing $k$ values. Experimental results of the kNN classification for changing $k$ values are shown in Fig. 5. The three measures are stable for $k \geq 5$, so the experiments are carried out using $k = 10$. Among the neighbors with $k = 10$, the similarity computations are carried out at every evaluated data point. The set $\{R_1, R_2, \ldots, R_m\}$ of multiple reduced variables in Fig. 4 thus becomes important. To compare the reduced variables of the Reuters collection, the classification accuracy is computed for these data sets as shown in Table 3.
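The classification stage described above can be sketched with scikit-learn as follows. This is only an illustrative reimplementation, not the authors' code: the feature matrix X_reduced (documents restricted to one set of reduced variables) and the labels y are assumed to be given, the 70/30 train/test split is an arbitrary choice for illustration, and cosine distance stands in for the inner-product similarity mentioned in the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

def knn_scores(X_reduced, y, k=10, seed=0):
    """kNN classification on one set of reduced variables (k = 10 as in the paper)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X_reduced, y, test_size=0.3, random_state=seed, stratify=y)
    clf = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    return (accuracy_score(y_test, y_pred),
            precision_score(y_test, y_pred, average="macro"),
            recall_score(y_test, y_pred, average="macro"))

# Hypothetical usage: one score triple per reduced-variable set R1, ..., Rm,
# inspected per class or averaged as in Table 3.
# scores = [knn_scores(X_r, y) for X_r in reduced_variable_sets]
```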

Fig. 5. Classification accuracy for the k value

Since differences in the classification accuracy among the sets of reduced variables are observed for the Reuters collection, the numerical data are also shown graphically. The values in Table 3 are plotted as a line graph in Fig. 6. In Fig. 6, a thin solid line shows the document class called alum, a small dotted line shows cocoa, a medium dotted line shows copper, a large dotted line shows cpi, a broken line shows gnp, a medium solid line shows rubber, and a thick solid line shows the average over these data.

Figure 6 shows that the classification accuracy for the combinations of sets of reduced variables is flat from Red.V.3 to Red.V.5 (three to five sets of reduced variables) for each data set. So, three or four sets of reduced variables are taken, since the classification accuracy is flat there and the combination uses only a small number of reduced variables. Further, the average classification accuracy over the six classes is 0.90–0.95, shown as the thick line in Fig. 6. Since all the reduced variables together use only about 90 words among the total of over 2000 words, the multiple reduced-variables kNN system proposed in Fig. 4 is evaluated to be effective for the classification of documents.

Table 3. Classification accuracy using reduced variables

alum coco. copp. cpi gnp rubb. aver.


Red. V. 1 0.83 0.98 0.99 0.76 0.84 0.97 0.89
Red. V. 2 0.88 0.97 0.98 0.81 0.87 0.98 0.91
Red. V. 3 0.88 0.92 0.99 0.84 0.90 0.99 0.92
Red. V. 4 0.86 0.94 0.99 0.84 0.90 0.98 0.92
Red. V. 5 0.90 0.92 0.98 0.86 0.88 0.99 0.92

Fig. 6. Classification accuracy using set of reduced variables

5 Conclusion

We developed fundamental schemes for the application of nearest neighbor relations and their inheritances to feature spaces. The reduction of data variables and their classification are expected to provide more intelligent functions in neural networks for their efficiency. First, by using geometrical reasoning with convex cones, it is shown that the nearest neighbor relations and their inheritances are useful for the generation of feature spaces with reduced variables. Next, the extended application of the nearest neighbor relations is analyzed for multi-cluster classification. It is shown in the experiments that feature spaces with reduced variables effectively perform the categorization of multiclass documents.

References
1. Bansal, N., Chen, X., Wang, Z.: Can We gain more from orthogonality regular-
izations in training deep networks? In: Bengio, S., Wallach, H.M., Larochelle,
H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural
Information Processing Systems 31: Annual Conference on Neural Informa-
tion Processing Systems 2018, NeurIPS 2018, 3–8 December 2018, Montréal,
Canada, pp. 4266–4276 (2018). https://proceedings.neurips.cc/paper/2018/hash/
bf424cb7b0dea050a42b9739eb261a3a-Abstract.html
2. Bishop, C.M.: Pattern Recognition and Machine Learning (Information Science
and Statistics). Springer, Heidelberg (2006). https://www.springer.com/jp/book/
9780387310732
3. Cover, T., Hart, P.: Nearest neighbor pattern classification. IEEE Trans. Inf. The-
ory 13(1), 21–27 (1967)
4. Hu, S.T.: Threshold Logic. University of California Press, Berkeley (1965)
5. Ishii, N., Torii, I., Iwata, K., Odagiri, K., Nakashima, T.: Generation of reducts
and threshold functions using discernibility and indiscerniblity matrices. In: 2017
IEEE 15th International Conference on Software Engineering Research, Manage-
ment and Applications (SERA), pp. 55–61 (2017). https://ieeexplore.ieee.org/
document/7965707
6. Kuhn, H.W., Tucker, A.W.: On systems of linear inequalities. In: Linear Inequali-
ties and Related Systems (AM-38), vol. 38, pp. 99–156. Princeton University Press
(1966)
7. Porter, M.F.: An algorithm for suffix stripping. Program Electron. Libr. Inf. Syst.
40(3), 130–137 (1980). https://www.emerald.com/insight/content/doi/10.1108/
eb046814/full/html
8. Reuters-21578 Text Categorization Collection. https://kdd.ics.uci.edu/databases/
reuters21578/reuters21578.html
9. Shi, W., Gong, Y., Cheng, D., Tao, X., Zheng, N.: Entropy and orthogonality based
deep discriminative feature learning for object recognition. Pattern Recogn. 81, 71–
80 (2018). https://www.sciencedirect.com/science/article/pii/S0031320318301262
10. Skowron, A., Polkowski, L.: Decision algorithms: a survey of rough set - theoretic
methods. Fundam. Informaticae 30, 345–358 (1997)
11. Wang, J., Chen, Y., Chakraborty, R., Yu, S.X.: Orthogonal convolutional neural
networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), June 2020
12. Zhang, S., Jiang, H., Dai, L.: Hybrid orthogonal projection and estimation
(HOPE): a new framework to learn neural networks. J. Mach. Learn. Res. 17(37),
1–33 (2016). http://jmlr.org/papers/v17/15-335.html
High-Dimensional Data Clustering with Fuzzy
C-Means: Problem, Reason, and Solution

Yinghua Shen1(B) , Hanyu E2 , Tianhua Chen3 , Zhi Xiao1 , Bingsheng Liu4 ,


and Yuan Chen5
1 School of Economics and Business Administration, Chongqing University, Chongqing, China
{yinghua,xiaozhi}@cqu.edu.cn
2 Department of Electrical and Computer Engineering, University of Alberta, Edmonton,
Canada
[email protected]
3 Department of Computer Science, University of Huddersfield, Huddersfield, UK
[email protected]
4 School of Public Affairs, Chongqing University, Chongqing, China
5 College of Management and Economics, Tianjin University, Tianjin, China

[email protected]

Abstract. The Fuzzy C-Means (FCM) clustering algorithm is a popular unsupervised learning approach that has been extensively utilized in various domains. However, in this study, we point out a major problem faced by FCM when it is applied to high-dimensional data: quite often the obtained prototypes (cluster centers) cannot be distinguished from each other. Many studies have claimed that the concentration of distance (CoD) could be a major reason for this phenomenon. This paper therefore revisits this factor and highlights that the CoD could not only lead to decreased performance, but sometimes also positively contribute to enhanced performance of the clustering algorithm. Instead, this paper points out the significance of features that are noisy and correlated, which can have a negative effect on FCM performance. Hence, to tackle the mentioned problem, we resort to a neural network model, the autoencoder, to reduce the dimensionality of the feature space while extracting the features that are most informative. We conduct several experiments to show the validity of the proposed strategy for the FCM algorithm.

Keywords: Fuzzy C-means · High-dimensional data · Autoencoder

1 Introductory Note

Clustering is one of the most important techniques used to explore the structure of data. It intends to gather data points that are close (in terms of distance, similarity, functionality, etc.) to each other into a group and to distribute those far apart from each other into different groups. Many different kinds of clustering concepts and algorithms have been proposed so far, which can be roughly classified into partition-based methods [1–3], graph-based methods [4, 5], hierarchy-based methods [6, 7], and density-based methods [8, 9]. Among these methods, the fuzzy partition-based methods, e.g., Fuzzy C-Means (FCM) [2, 3, 10–12], which bring the concept of fuzzy sets [13] into clustering, have seen rapid development in both theory and real-world applications. By assigning a cluster membership degree, which is a value in the interval [0, 1], to a certain data point, the structure of the data can be described by overlapping clusters, which are more suitable for representing and handling complex phenomena in the real world. We briefly review the concept and algorithm of FCM as follows.

This work was supported in part by the National Natural Science Foundation of China under Grant 72001032, Grant 72071021, Grant 72002152; in part by the Natural Science Foundation of Chongqing under Grant cstc2020jcyj-bshX0013.
Suppose that we have a data set $X = (x_1, x_2, \ldots, x_N)^T$, where $x_k$ is the $k$-th data point in the $n$-dimensional feature space $\mathbb{R}^n$. The generic version of the FCM algorithm [3] minimizes the following objective function with the (weighted) Euclidean distance:

$$Q = \sum_{i=1}^{c} \sum_{k=1}^{N} u_{ik}^{m} \, \|x_k - v_i\|^2 \qquad (1)$$

with the distance expressed as

$$\|x_k - v_i\|^2 = \sum_{j=1}^{n} \frac{(x_{kj} - v_{ij})^2}{\sigma_j^2} \qquad (2)$$

where $\sigma_j$ is the standard deviation of the $j$-th variable of the data, and the fuzzification coefficient $m$ is usually greater than 1. The data is partitioned into $c$ clusters coming in the form of the partition matrix $U = [u_{ik}]_{c \times N}$, $i = 1, 2, \ldots, c$; $k = 1, 2, \ldots, N$, with a collection of prototypes represented as $V = (v_1, v_2, \ldots, v_c)^T$. The $k$-th data point is described in terms of the membership grades in the $k$-th column of the partition matrix. By the alternating optimization (AO) algorithm in [14], each element of the partition matrix is calculated as

$$u_{ik} = \frac{1}{\sum_{j=1}^{c} \left( \frac{\|x_k - v_i\|}{\|x_k - v_j\|} \right)^{2/(m-1)}} \qquad (3)$$

and each entry of the prototype is obtained as

$$v_{it} = \frac{\sum_{k=1}^{N} u_{ik}^{m} x_{kt}}{\sum_{k=1}^{N} u_{ik}^{m}} \qquad (4)$$

where $t = 1, 2, \ldots, n$.
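A compact NumPy sketch of the alternating optimization loop defined by Eqs. (3) and (4) is given below. It is a minimal illustration rather than a reference implementation: the plain Euclidean distance is used (i.e., all $\sigma_j = 1$), the initialization and stopping rule are arbitrary, and the toy data at the end is only a hypothetical example.

```python
import numpy as np

def fcm(X, c=2, m=2.0, n_iter=100, eps=1e-8, seed=0):
    """Fuzzy C-Means via alternating optimization of Eqs. (3) and (4)."""
    rng = np.random.default_rng(seed)
    N, n = X.shape
    # random initial partition matrix U (c x N) whose columns sum to 1
    U = rng.random((c, N))
    U /= U.sum(axis=0, keepdims=True)
    for _ in range(n_iter):
        Um = U ** m
        # Eq. (4): prototypes as membership-weighted means
        V = (Um @ X) / (Um.sum(axis=1, keepdims=True) + eps)
        # Eq. (3): membership update from the distances to the prototypes
        d = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2) + eps   # shape (c, N)
        U = 1.0 / np.sum((d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1.0)), axis=1)
    return U, V

# Hypothetical example: two well-separated 5-dimensional Gaussian blobs
X = np.vstack([np.random.randn(100, 5), np.random.randn(100, 5) + 3.0])
U, V = fcm(X, c=2)
print(V)  # the two prototypes should be clearly distinct here
```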
However, one problem with FCM is that it may not work well when the dimensionality $n$ of the feature space is high, because the prototypes found by the algorithm can be quite similar to each other. To illustrate this phenomenon, we use two high-dimensional data sets from the UCI machine learning repository, i.e., Isolet (with 1560 samples and 617 features) and Hand (with 1800 samples and 3000 features). We apply FCM to these two data sets with both the cluster number and the fuzzification coefficient set to 2. We observe that for each data set, the obtained two prototypes are exactly the same. Due to space limits, we only show the clustering results of the first 10 features, as documented in Table 1. Clearly, these are not the desired results, which motivates this research to resort to the autoencoder to mitigate this issue when applying FCM in high-dimensional feature spaces.
In the following, we first analyze why FCM sometimes does not work well in high-dimensional feature spaces in Sect. 2. We use the autoencoder, a neural network model, to reduce the feature space in Sect. 3. In Sect. 4, we conduct several experimental studies to demonstrate the validity of the proposed strategy to make FCM work in the high-dimensional feature space. Finally, we conclude the paper and point out some future studies in Sect. 5.

Table 1. Results of the first 10 features of the obtained prototypes.

Datasets Prototypes Features


f1 f2 f3 f4 f5 f6 f7 f8 f9 f 10
Isolet v1 0.35 0.55 0.67 0.72 0.68 0.63 0.60 0.54 0.49 0.46
v2 0.35 0.55 0.67 0.72 0.68 0.63 0.60 0.54 0.49 0.46
Hand v1 0.30 0.40 0.53 0.71 0.28 0.50 0.49 0.70 0.45 0.44
v2 0.30 0.40 0.53 0.71 0.28 0.50 0.49 0.70 0.45 0.44

2 Reasons for Failure of FCM

Concentration of distance (CoD) has been identified as one of the major aspects of the curse of dimensionality [15–17]. The general statement of this phenomenon is that, under certain assumptions, e.g., that data points are drawn independently from identical distributions, data points become close to each other, making them indistinguishable. Hence, a follow-up question is whether this concentration seriously affects algorithms that are based on distance measures. In spite of evidence that CoD may make the K-nearest neighbor (KNN) method unstable [15, 18, 19], research has been ongoing to identify its comprehensive effects on classification and clustering algorithms. Specifically, [20] found that the CoD can be used to improve the classification accuracy of the algorithm. For clustering algorithms, [21] observed that CoD does not always have a negative effect on clustering. In case each feature of the data set contributes to the clusters contained therein, CoD is helpful in distinguishing the clusters; however, when the generated clusters mainly result from a small number of features, with the remaining ones being noisy features (e.g., those following a normal distribution), CoD can make the clusters merge together.
It seems that the performance of a clustering or classification algorithm does not depend entirely on the CoD. It is determined by the relationship between the embedding dimension and the intrinsic dimension. In fact, when we have real-world high-dimensional data, the concentration degree is not necessarily high, with Table 1 in [16] as an example. It has been pointed out that high-dimensional (in terms of the embedding dimension) real-world data usually has a much lower intrinsic dimensionality, which is a consensus in the high-dimensional data analysis community [22]. In fact, many approaches have been proposed to estimate this intrinsic dimension; from the columns d and d_mle in Table 1 of [23] we can catch a glimpse of the relationship between the embedding and intrinsic dimensions. As will be shown in our experiments, the size of the intrinsic dimension does not directly determine the occurrence of the CoD. Also, we will see that CoD can be beneficial or detrimental to clustering, but a high intrinsic dimension is beneficial for clustering, while a high embedding dimension with a low intrinsic dimension is not good for clustering.
To illustrate this, in the following we design and implement 3 experiments (with generated synthetic data) corresponding to 3 scenarios. We want to check the separability of the clusters in a data set in relation to increasing dimensionality. Scenario 1: each feature has a multimodal distribution (a mixture of two Gaussian distributions), and the features are independent of each other. Scenario 2: the first feature has a multimodal distribution, while the other features have the same Gaussian distribution. Scenario 3: all the features are linearly related, and each feature has a multimodal distribution. The design of the experiments is inspired by that in [21]; Scenarios 1 and 2 (the histogram parts) repeat two examples from [21].

Scenario 1: Suppose we have a data set X with N = 2000 observations and n features. X consists of two equal-sized clusters following a multivariate Gaussian distribution, with the cluster centers $center_1 = [0, 0, \ldots, 0]_{1\times n}$ and $center_2 = [1, 1, \ldots, 1]_{1\times n}$, and with the covariance matrix (modeling the spread of each cluster) derived by multiplying an $n$-dimensional identity matrix by the constant variance = 1. An example of such a data set when n = 2 is given in Fig. 1. This setting makes the formed data set X have the same intrinsic and embedding dimensions because each newly formed feature provides information about the two clusters. We measure the Euclidean distance between each point in X and the origin of the n-dimensional space (i.e., the Euclidean norm), and then give the histogram of these distances. We illustrate the results in Fig. 2 when n is set to 1, 100, and 1000, respectively.

Fig. 1. Data set X when n = 2.



Fig. 2. Histogram of the distances to the origin when (a) n = 1; (b) n = 100; and (c) n = 1000.

Obviously, when n is small, the distances from these two clusters to the origin have similar distributions, and a large portion of the distributions overlap with each other. However, this overlap becomes increasingly smaller as the dimensionality increases. When n is 1000, the two distributions are completely separated from each other, which means that the two clusters are well separated. In fact, we can also consider the distances within each cluster (intra-cluster distances) and those between the clusters (inter-cluster distances). We show the histogram of these distances with increasing feature dimensionality in Fig. 3 when n is 1, 100, and 1000, respectively. Obviously, with increasing dimensionality, both intra-cluster and inter-cluster distances increase, but the latter increase more rapidly than the former.


Fig. 3. Histogram of the inter cluster and intra cluster distances when (a) n = 1; (b) n = 100; and
(c) n = 1000.

Fig. 4. Concentration degree for the whole data set and two clusters.

In the following, we intend to check the CoD phenomenon in this data set. The sufficient and necessary condition for the CoD was derived in [18] and [24]: the relative variance has to converge to 0 (or the relative contrast has to converge to 0 with probability 1) when the feature dimensionality increases to infinity. Here, the relative variance is obtained by dividing the standard deviation of the distances among data points by the expectation of these distances. Experimentally we can show that in this scenario, when we range the feature dimensionality n from 1 to 5000 with a step size of 10, the relative variance converges to a positive constant around 0.1, see Fig. 4(a). However, for the two clusters, this relative variance reaches a value around 0.01 in Figs. 4(b) and (c), and it clearly still has a decreasing trend in both cases. This result suggests that the CoD within each cluster is much more serious than that for the entire data set, which potentially makes the clusters easier to separate in high-dimensional feature space.
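The relative variance used as the CoD index above can be computed in a few lines of NumPy. The sketch below generates data as in Scenario 1 and is meant only to illustrate the quantity; the sample size is reduced compared to the paper's N = 2000 purely to keep the pairwise-distance computation cheap.

```python
import numpy as np
from scipy.spatial.distance import pdist

def relative_variance(X):
    """Std. of the pairwise distances divided by their mean (CoD index)."""
    d = pdist(X)  # all pairwise Euclidean distances
    return d.std() / d.mean()

def scenario1(n, N=500, seed=0):
    """Two equal-sized Gaussian clusters with centers 0 and 1 in every feature."""
    rng = np.random.default_rng(seed)
    c1 = rng.normal(0.0, 1.0, size=(N // 2, n))
    c2 = rng.normal(1.0, 1.0, size=(N // 2, n))
    return np.vstack([c1, c2])

for n in (1, 100, 1000):
    X = scenario1(n)
    half = len(X) // 2
    # index for the whole data set and for each cluster separately
    print(n, relative_variance(X), relative_variance(X[:half]), relative_variance(X[half:]))
```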

Scenario 2: Suppose that we have a data set X with N = 2000 observations and n features. In the first feature of X, the observations are composed of two equal-sized clusters with Gaussian distributions, with the cluster centers $center_1 = 0$ and $center_2 = 8$ and the variance of each cluster equal to 1. For the remaining n−1 features, the observations follow a Gaussian distribution with center $[0, 0, \ldots, 0]_{1\times (n-1)}$ and covariance matrix given by the (n−1)-dimensional identity matrix times the constant variance = 1. In this case, only the first feature contributes to the clusters in the data; the other features serve as noisy information. We set n to 1, 100, and 1000, respectively, and show the histogram of the distances between the points and the origin in Fig. 5. This scenario suggests that, as the feature dimensionality rises, the clusters tend to merge together. Distributions of the intra-cluster and inter-cluster distances are given in Fig. 6, where we see the distribution of the inter-cluster distances approaching that of the intra-cluster distances.

Fig. 5. Histogram of the distances to the origin when (a) n = 1; (b) n = 100; and (c) n = 1000.

The indexes of the CoD for the whole data set and for each cluster are also given in Fig. 7. In this scenario, both the clusters and the whole data set reach a high concentration degree because the values of their relative variances are quite close to 0 with increasing feature dimensionality.

Fig. 6. Histogram of the inter cluster and intra cluster distances when (a) n = 1; (b) n = 100; and (c) n = 1000.

Fig. 7. Concentration degree for the whole data set and two clusters.

Scenario 3: Now let us consider a data set with highly correlated features. The construction of the data set is similar to that in Scenario 1: X is composed of two equal-sized clusters following a multivariate Gaussian distribution, with the centers $center_1 = [0, 0, \ldots, 0]_{1\times n}$ and $center_2 = [8, 8, \ldots, 8]_{1\times n}$; the covariance matrix is the same as the one in Scenario 1, except that all non-diagonal entries have a value of 0.9. This setting means that the features of the data are highly linearly correlated. Similarly, we show the 3 groups of results in Figs. 8, 9, and 10.

Fig. 8. Histogram of the distances to the origin when (a) n = 1; (b) n = 100; and (c) n = 1000.

In this scenario, we see that with increasing dimensionality the clusters remain separable. In fact, since the relative variance in Fig. 10 is very large (> 0.5) and does not show a trend of decreasing to 0, we can say that the CoD phenomenon does not occur in this case.

Fig. 9. Histogram of the inter cluster and intra cluster distances when (a) n = 1; (b) n = 100; and (c) n = 1000.

Fig. 10. Concentration degree for the whole data set and two clusters.

From the experimental results of the 3 scenarios, it is fair to conclude that a high intrinsic dimension is beneficial for distinguishing clusters, because the CoD that occurs in this case is helpful. However, keep in mind that a high intrinsic dimension barely exists in real-world data sets. Features in real-world data sets tend to be noisy and correlated with each other. This makes the clusters contained in the intrinsic dimensions merge together with increasing feature dimensionality (results of Scenario 2), or be repeatedly represented in the growing feature space (results of Scenario 3), which greatly increases the computing burden for clustering and is entirely unnecessary. These findings motivate us to choose a small number of (yet informative) features when performing a clustering task in the real world. Note that it is not the occurrence of the CoD that prompts us to reduce the feature dimensionality, but the existence of noisy and correlated features (which makes the clustering ineffective and inefficient).

3 Clustering Based on Autoencoder

An autoencoder [25] is a type of feedforward neural network which is trained to replicate its input at its output. Its structure is illustrated in Fig. 11 and is composed of an encoder part and a decoder part. In the encoder part, the values of the inputs are first linearly combined and sent to a hidden neuron in the hidden layer; then a nonlinear function is used to further encode the combination. In the decoder part, the outputs of these hidden neurons are linearly combined and sent to the output neurons. The weights of the connections between the neurons are adjusted such that the errors between the inputs and outputs are minimized. The crux of the autoencoder is that the inputs are finally represented by a small number of hidden neurons, which contain the most important and relevant information of the data set. Since the inception of the autoencoder, many different versions of the network have been proposed to make it more robust. The one we use in this study is enhanced by $L_2$ and sparsity regularization, and the cost function of the network is represented as

$$E = \frac{1}{n} \sum_{j=1}^{n} \sum_{k=1}^{N} (x_{kj} - \hat{x}_{kj})^2 + \lambda \cdot \Omega_{weights} + \beta \cdot \Omega_{sparsity} \qquad (5)$$

where $\lambda$ is the coefficient of the $L_2$ regularization term and $\beta$ is the coefficient of the sparsity regularization term. The detailed formulas used for the two regularization terms can be found in [26].

Given the dimensionality reduction an autoencoder is able to provide, it is therefore utilized in this paper to reduce the high-dimensional feature space and mitigate the issue associated with FCM as discussed in the preceding section.
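The paper's experiments use MATLAB's trainAutoencoder; purely as an illustration, the sketch below builds a comparable single-hidden-layer autoencoder in Keras. An L2 weight penalty and an L1 activity penalty are used as simple stand-ins for the weight and sparsity regularization terms of Eq. (5) (MATLAB uses a KL-divergence sparsity term), and all hyperparameter values, layer choices, and the pipeline function name are assumptions rather than the authors' setup. The encoded features can then be fed to FCM, e.g. to the NumPy sketch shown in Sect. 1.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, regularizers

def encode_with_autoencoder(X, n_hidden=6, lam=1e-3, beta=1e-4, epochs=50):
    """Train a small regularized autoencoder and return the encoded features."""
    n = X.shape[1]
    inp = keras.Input(shape=(n,))
    code = layers.Dense(n_hidden, activation="sigmoid",
                        kernel_regularizer=regularizers.l2(lam),
                        activity_regularizer=regularizers.l1(beta))(inp)
    out = layers.Dense(n, activation="linear",
                       kernel_regularizer=regularizers.l2(lam))(code)
    autoencoder = keras.Model(inp, out)
    autoencoder.compile(optimizer="adam", loss="mse")
    autoencoder.fit(X, X, epochs=epochs, batch_size=32, verbose=0)
    encoder = keras.Model(inp, code)
    return encoder.predict(X, verbose=0)

# Hypothetical usage with a standardized data matrix X (e.g. Isolet or Hand):
# Z = encode_with_autoencoder(X, n_hidden=6)
# U, V = fcm(Z, c=2)   # FCM on the 6 new features; the prototypes should now differ
```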

Fig. 11. Structure of the autoencoder.

4 Experimental Studies

In this section, we apply the proposed strategy to the two UCI data sets mentioned in the introduction. As for the experimental settings, 60% of the original data points are used as training data, while the remaining ones are used as testing data. We use the trainAutoencoder function provided in MATLAB to train this network, and the parameters of this function are kept at their default settings, except that the number of hidden neurons is varied from 2 to 20 with a step size of 2. We record the changing trend of the reconstruction error E with respect to this number. Then we focus on a selected number of hidden neurons, which serve as the inputs to FCM for further analysis.

First, we show the performance of the autoencoder with respect to different numbers of hidden neurons in Fig. 12. Generally, with a larger number of hidden neurons, a smaller reconstruction error is observed. However, the decreasing trend tends to slow down as the number of hidden neurons increases, and we can even see that the reconstruction error tends to rebound when this number is large for the data set Hand. Hence, for both data sets we focus on the scenario where only 6 hidden neurons are considered, that is, only 6 new features of each data set are used. We show only the clustering results related to the prototypes in Table 2. Clearly, now for both data sets the obtained prototypes are distinguishable from each other.

Fig. 12. Trends of the reconstruction error of data sets (a) Isolet and (b) Hand.

Table 2. Results of the new features of the obtained prototypes.

Datasets Prototypes Features


f1 f2 f3 f4 f5 f6
Isolet v1 0.17 0.24 0.29 0.16 0.19 0.13
v2 0.36 0.16 0.14 0.24 0.51 0.28
Hand v1 0.22 0.24 0.23 0.28 0.20 0.25
v2 0.05 0.04 0.07 0.05 0.03 0.12

5 Conclusion

In this paper, we first pointed out the problem faced by FCM when this clustering algorithm is performed in a high-dimensional feature space, which potentially results in indistinguishable prototypes. A detailed analysis of the reasons for this failure was carefully discussed under several different scenarios. We highlighted that the well-known concentration of distance (CoD) does not necessarily lead to a bad performance of the clustering algorithm; rather, it is the noisy and redundant (correlated) features that can lead to the poor performance. Hence, we applied the autoencoder, a powerful dimensionality reduction technique, to seek the most relevant (newly formed) features contributing to the structure of the data. The experimental results demonstrate the effectiveness of the autoencoder in supporting FCM in generating well-separated prototypes.

Note that many state-of-the-art methods [27] have been proposed to cluster high-dimensional data with FCM, e.g., those based on sparse regularization [28] and unsupervised feature selection [29]. As a future study, it is interesting to compare the method proposed in this study with those in the current literature. Besides, since FCM is a popular building block for other system modeling techniques, e.g., the fuzzy rule-based model [30–32], in future studies we intend to research how the autoencoder can help to improve the performance (e.g., accuracy) of the prediction model.

References
1. Jain, A.K.: Data clustering: 50 years beyond K-means. Pattern Recognit. Lett. 31(8), 651–666
(2010)
2. Dunn, J.C.: Well-separated clusters and optimal fuzzy partitions. J. Cybern. 4(1), 95–104
(1974)
3. Bezdek, J.C., Ehrlich, R., Full, W.: FCM: the fuzzy c-means clustering algorithm. Comput.
Geosci. 10(2–3), 191–203 (1984)
4. Päivinen, N.: Clustering with a minimum spanning tree of scale-free-like structure. Pattern
Recogn. Lett. 26(7), 921–930 (2005)
5. Wu, Z., Leahy, R.: An optimal graph theoretic approach to data clustering: Theory and its
application to image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 11, 1101–1113
(1993)
6. Murtagh, F.: A survey of recent advances in hierarchical clustering algorithms. Comput. J.
26(4), 354–359 (1983)
7. Karypis, G., Han, E.-H.S., Kumar, V.: Chameleon: Hierarchical clustering using dynamic
modeling. Comput. (Long. Beach. Calif.) 8, 68–75 (1999)
8. Kriegel, H., Kröger, P., Sander, J., Zimek, A.: Density-based clustering. Wiley Interdiscip.
Rev. Data Min. Knowl. Discov. 1(3), 231–240 (2011)
9. Ester, M., Kriegel, H.-P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters
in large spatial databases with noise. Kdd 96(34), 226–231 (1996)
10. Shen, Y., Pedrycz, W.: Collaborative fuzzy clustering algorithm: Some refinements. Int. J.
Approx. Reason. 86, 41–61 (2017)
11. Shen, Y., Pedrycz, W., Wang, X.: Clustering homogeneous granular data: formation and
evaluation. IEEE Trans. Cybern. 49(4), 1391–1402 (2019)
12. Shen, Y., Pedrycz, W., Chen, Y., Wang, X., Gacek, A.: Hyperplane division in fuzzy c-means:
clustering big data. IEEE Trans. Fuzzy Syst. 28(11), 3032–3046 (2020)
13. Zadeh, L.A.: Fuzzy sets-information and control-1965. Inf. Control. (1965)
14. Bezdek, J.C.: Pattern Recognition with Fuzzy Objective Function Algorithms. Springer
Science & Business Media, Berlin (2013)
15. Beyer, K., Goldstein, J., Ramakrishnan, R., Shaft, U.: When is “nearest neighbor” meaningful?
In: Beeri, C., Buneman, P. (eds.) ICDT 1999. LNCS, vol. 1540, pp. 217–235. Springer,
Heidelberg (1999). https://doi.org/10.1007/3-540-49257-7_15

16. François, D., Wertz, V., Verleysen, M.: The concentration of fractional distances. IEEE Trans.
Knowl. Data Eng. 19(7), 873–886 (2007)
17. Kumari, S., Jayaram, B.: Measuring concentration of distances—an effective and efficient
empirical index. IEEE Trans. Knowl. Data Eng. 29(2), 373–386 (2016)
18. Hsu, C.-M., Chen, M.-S.: On the design and applicability of distance functions in high-
dimensional data space. IEEE Trans. Knowl. Data Eng. 21(4), 523–536 (2008)
19. Pestov, V.: Is the k-NN classifier in high dimensions affected by the curse of dimensionality?
Comput. Math. with Appl. 65(10), 1427–1437 (2013)
20. Pal, A.K., Mondal, P.K., Ghosh, A.K.: High dimensional nearest neighbor classification based
on mean absolute differences of inter-point distances. Pattern Recognit. Lett. 74, 1–8 (2016)
21. Klawonn, F., Höppner, F., Jayaram, B.: What are clusters in high dimensions and are they
difficult to find? In: Masulli, F., Petrosino, A., Rovetta, S. (eds.) CHDD 2012. LNCS, vol.
7627, pp. 14–33. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-662-48577-4_2
22. Levina, E., Bickel, P.J.: Maximum likelihood estimation of intrinsic dimension. In: Advances
in Neural Information Processing Systems, pp. 777–784 (2005)
23. Radovanovic, M., Nanopoulos, A., Ivanovic, M.: Hubs in space: Popular nearest neighbors
in high-dimensional data. J. Mach. Learn. Res. 11(Sept), 2487–2531 (2010)
24. Durrant, R.J., Kabán, A.: When is ‘nearest neighbour’meaningful: a converse theorem and
implications. J. Complex. 25(4), 385–397 (2009)
25. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks.
Science (80-). 313(5786), 504–507 (2006)
26. Olshausen, B.A., Field, D.J.: Sparse coding with an overcomplete basis set: a strategy
employed by V1? Vision Res. 37(23), 3311–3325 (1997)
27. Deng, Z., Choi, K.-S., Jiang, Y., Wang, J., Wang, S.: A survey on soft subspace clustering.
Inf. Sci. (Ny) 348, 84–106 (2016)
28. Chang, X., Wang, Q., Liu, Y., Wang, Y.: Sparse regularization in fuzzy c-means for high-
dimensional data clustering. IEEE Trans. Cybern. 47(9), 2616–2627 (2016)
29. Mitra, P., Murthy, C.A., Pal, S.K.: Unsupervised feature selection using feature similarity.
IEEE Trans. Pattern Anal. Mach. Intell. 24(3), 301–312 (2002)
30. Shen, Y., Pedrycz, W., Jing, X., Gacek, A., Wang, X., Liu, B.: Identification of fuzzy rule-based
models with output space knowledge guidance. IEEE Trans. Fuzzy Syst. 99, 1–1 (2020)
31. Hu, X., Shen, Y., Pedrycz, W., Li, Y., Wu, G.: Granular Fuzzy Rule-Based Modeling With
Incomplete Data Representation. IEEE Trans. Cybern. 99, 1–1 (2021)
32. Chen, T., Shang, C., Yang, J., Li, F., Shen, Q.: A new approach for transformation-based fuzzy
rule interpolation. IEEE Trans. Fuzzy Syst. 28(12), 3330–3344 (2019)
Contrastive Explanations for Explaining
Model Adaptations

André Artelt(B) , Fabian Hinder(B) , Valerie Vaquet(B) , Robert Feldhans(B) ,


and Barbara Hammer(B)

CITEC - Cognitive Interaction Technology, Bielefeld University,


33619 Bielefeld, Germany
{aartelt,fhinder,vvaquet,rfeldhans,bhammer}@techfak.uni-bielefeld.de

Abstract. Many decision making systems deployed in the real world are not static - a phenomenon known as model adaptation takes place over time. The need for transparency and interpretability of AI-based decision models is widely accepted and has thus been worked on extensively. Usually, explanation methods assume a static system that has to be explained. Explaining non-static systems is still an open research question, which poses the challenge of how to explain model adaptations. In this contribution, we propose and (empirically) evaluate a framework for explaining model adaptations by contrastive explanations. We also propose a method for automatically finding regions in data space that are affected by a given model adaptation and thus should be explained.

Keywords: XAI · Contrastive explanations · Model adaptation

1 Introduction

Machine learning (ML) and artificial intelligence (AI) based decision making systems are increasingly affecting our daily life - e.g. predictive policing [22] and loan approval [11,26]. Given the impact of many ML and AI based decision making systems, there is an increasing demand for transparency and interpretability [13] - the importance of these aspects was also emphasized by legal regulations like the EU's GDPR [17]. In the context of transparency and interpretability, fairness and other ethical aspects become relevant [6,14].

As a consequence, the research community has worked extensively on these topics and came up with methods for explaining ML and AI based decision making systems, thus meeting the demands for transparency and interpretability [9,21,23]. Popular explanation methods [15,23] are feature relevance/importance methods [7] and example based methods [1]. Instances of example based methods are counterfactual explanations [24,25] and prototypes & criticisms [12] - these methods use a set of examples or a single example for explaining the behavior of the system. Counterfactual and contrastive explanations in general [15,24,25] are popular instances of example based methods for locally explaining decision making systems [15] - their popularity comes from the fact that there exists strong evidence that explanations by humans (which they try to mimic) are often counterfactual in nature [5]. While some of these methods are global methods - i.e. explaining the system globally - most of the example based methods are local methods that try to explain the behavior of the decision making system at a particular instance or in a "small" region in data space [15,18,20].

We gratefully acknowledge funding from the German Federal Ministry of Education and Research (BMBF) through the projects EML4U (01IS19080 A) and TiM (05M20PBA), and the VW-Foundation for the project IMPACT funded in the frame of the funding line AI and its Implications for Future Society.
The majority of the explanation methods proposed in the literature assume fixed models - i.e. they explain the decisions of a fixed decision making system. However, in practice decision making systems are usually not fixed but evolve over time - e.g. a system is adapted or fine-tuned on new data [16]. In this context, it is relevant to explain the changes of the decision making system - in particular in the context of Human-Centered AI (HCAI) which, besides explainability, is another important building block of ethical AI [27]. HCAI allows the human to "rule" AI systems instead of being "discriminated" or "cheated" by AI. Given the complexity of many modern ML or AI systems (e.g. Deep Neural Networks), it is usually difficult for a human to understand the decision making system or the impact of adaptations applied to a given system. Yet, understanding the impact of such changes is crucial for mediating changes that violate the user's expectations, some (ethical) guidelines or (legal) constraints.
For example, consider the scenario of a complex loan approval system that automatically accepts or rejects loan applications, and assume that the decision making process of this system is difficult to inspect from the outside: Consider the rejection of a loan application with the argument of a low income and a bad payback rate in the past - this explanation perfectly meets the bank-internal guidelines for accepting or rejecting a loan. However, after adapting the loan approval system on new data, it could happen that the same application that was rejected under the "old" system (before the adaptation) is now accepted by the new system. In such a case we would like to reject the adaptation because the resulting system violates the bank policy by exposing it to a higher risk of losing money. We need a mechanism that makes the impact of model adaptations transparent so that it can be "approved" by a human.
Although there exists work that is aware of the challenge of explaining changing ML systems [24], how to explain model adaptations is a widely open research question which we aim to address in this contribution. In this work, we propose a framework that uses contrastive explanations for explaining model adaptations, in particular in settings where the function itself changes only slightly on the given data, but the underlying mechanisms might have changed. We argue that inspecting the differences of contrastive explanations which are caused by the adaptation is a reasonable proxy for explaining model adaptations. More precisely, our contributions are twofold:

1. We propose to compare contrastive explanations for locally explaining model adaptations.
2. We propose a method for finding relevant regions in data space which are affected by a given model adaptation and should be explained to the user.

The remainder of this work is structured as follows: After briefly reviewing the literature and the foundations, in particular contrastive explanations (Sect. 2.2) and model adaptations (Sect. 2.1), we introduce our proposal of comparing contrastive explanations for locally explaining model adaptations in Sect. 3. We then study counterfactual explanations as a specific instance of contrastive explanations in Sect. 4 - their changes (Sect. 4.1) - and propose a method for finding relevant regions in data space (Sect. 4.2). Finally, we empirically evaluate our proposed methods in Sect. 5 and close our work with a summary and discussion in Sect. 6.

Due to space constraints and for the purpose of better readability, we put all proofs and derivations in an extended version on arXiv.

Related Work. While incremental model adaptation as well as transparency (i.e. methods for explaining decision making systems) have been studied extensively in separation, the combination of both has received much less attention so far. Counterfactual explanations are a popular instance of example based explanation methods. Yet existing methods rely on the assumption of a static underlying model - a strategy for counterfactual explanations of model adaptations is still missing [24]. A method called "counterfactual metrics" [4] can be used for explaining drifting feature relevances of metric based models. In contrast to a counterfactual explanation, it focuses on feature relevance rather than on the change of counterfactual examples. In [10], contrastive explanations (in particular counterfactuals) are used for explaining concept drift. Unlike our approach, its focus lies on an explanation of the different temporal characteristics rather than on the change of the underlying principles as indicated by explanations.

2 Foundations

2.1 Model Adaptations

We assume a model adaptation scenario in which we are given a prediction function (also called model) h(·) and a set of (new) labeled data points D. Adapting the model h(·) to the data D means that we want to find a model h′(·) which is as similar as possible to the original model h(·) while performing well on labeling the (new) samples in D [16]. Model adaptation can be formalized as an optimization problem [16] as stated in Definition 1.

Definition 1 (Model adaptation). Let $h : \mathcal{X} \to \mathcal{Y}$, $h \in \mathcal{H}$ be a prediction function (also called model) and $D = \{(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}\}$ a set of (new) labeled data points. Adapting the model h(·) to the data D is formalized as the following optimization problem:

$$\underset{h' \in \mathcal{H}}{\arg\min}\; \theta(h, h') \quad \text{s.t.} \quad h'(x_i) \approx y_i \;\; \forall\, (x_i, y_i) \in D \qquad (1)$$

where θ : H × H → R+ denotes a regularization that measures the similarity


between two given models 1 and ≈ refers to a suitable prediction error (e.g.
zero-one loss or squared error) which is minimized by h (·).
Note that for large D, e.g. caused by abrupt concept drift [8], one could com-
pletely retrain h(·) and abandon the requirement of closeness. In such situations
it is still interesting to explain the difference of h (·) and h(·).

2.2 Contrastive Explanations

Counterfactual explanations are a popular instance of contrastive explanations. A counterfactual explanation - often just called a counterfactual - states a change to some features of a given input such that the resulting data point (called the counterfactual) has a different (specified) prediction than the original input. The rationale is considered to be intuitive, human-friendly and useful because it tells the user which minimum changes can lead to a desired outcome [15,25]. Formally, a (closest) counterfactual can be defined as follows:

Definition 2 ((Closest) Counterfactual Explanation [25]). Assume a prediction function $h : \mathbb{R}^d \to \mathcal{Y}$ is given. Computing a counterfactual $x_{cf} \in \mathbb{R}^d$ for a given input $x \in \mathbb{R}^d$ is phrased as an optimization problem:

$$\underset{x_{cf} \in \mathbb{R}^d}{\arg\min}\; \ell\big(h(x_{cf}), y'\big) + C \cdot \theta(x_{cf}, x) \qquad (2)$$

where $\ell(\cdot)$ denotes a loss function, $y'$ the target prediction, $\theta(\cdot)$ a penalty for the dissimilarity of $x_{cf}$ and $x$, and $C > 0$ denotes the regularization strength.

Remark 1. In the following we assume a binary classification problem. In this case we denote a (closest) counterfactual $x_{cf}$ according to Definition 2 of a given sample $x$ under a prediction function h(·) as $x_{cf} = CF(x, h)$, as the desired target is uniquely determined.

3 Explaining Model Adaptations

An obvious approach for explaining a model adaptation would be to explain and communicate the regions in data space where the predictions of the new and the old model differ - i.e. $\{x \in \mathcal{X} \mid h(x) \neq h'(x)\}$. However, in particular for incremental model adaptations, this set might be small and its characterization not meaningful. Hence, instead of finding these samples, we aim for an explanation of possible differences of the learned generalization rules underlying the models. Note that this objective is even meaningful if the models completely coincide on the given data set.

In the following, we propose that a contrastive explanation can serve as a proxy for the model generalization at a given point; hence a comparison of the possibly different contrastive explanations of two models at a given point can be considered as an explanation of the different underlying principles based on which the models propose a decision. As is common practice, we thereby look at local differences, since a global comparison might be too complex and not easily understood by a human [5,15,24]. The computation of such differences of explanations is tractable as long as the computation of the contrastive explanation under a single model itself is tractable - i.e. no additional computational complexity is introduced when using this approach for explaining model adaptations.

3.1 Modeling

We define this type of explanation as follows:

Definition 3 (Explaining Model Differences). We assume that we are given a set of labeled data points $D^* = \{(x_i^*, y_i^*) \in \mathcal{X} \times \mathcal{Y}\}$ whose labels are correctly predicted by both models $h : \mathcal{X} \to \mathcal{Y}$ and $h' : \mathcal{X} \to \mathcal{Y}$.

For every $(x_i^*, y_i^*) \in D^*$, let $\delta_i^h \in \mathcal{X}$ be a contrastive explanation under h(·) and $\delta_i^{h'} \in \mathcal{X}$ one under the new model h′(·). The explanation of the model differences at the point $(x_i^*, y_i^*)$ and its magnitude are then given by the comparison of both explanations:

$$\psi\big(\delta_i^h, \delta_i^{h'}\big) \quad \text{and} \quad \Psi\big(\delta_i^h, \delta_i^{h'}\big) \qquad (3)$$

where ψ(·) denotes a suitable operator which can compare the information contained in two given explanations and Ψ(·) denotes a real-valued distance measure judging the difference of the explanations.

Remark 2. Note that the explanation as defined in Definition 3 can be applied more generally to compare two classifiers h′(·) and h(·) w.r.t. given input locations, even if h′(·) does not constitute an adaptation of h(·). For simplicity, we assume uniqueness of the contrastive explanations in the definition, either by design, as given for linear models, or by suitable algorithmic tie breaks.

The concrete form of the explanation heavily depends on the comparison functions ψ(·) and Ψ(·) - this allows us to take specific use-cases and target groups into account. In this work we assume $\mathcal{X} = \mathbb{R}^d$ and ψ(·) given by the component-wise absolute value $\big(|(\delta_i^h)_l - (\delta_i^{h'})_l|\big)_l$, and we consider two possible realizations of Ψ(·):

Euclidean similarity
$$\Psi(\delta^h, \delta^{h'}) = \|\delta^h - \delta^{h'}\|_2 \qquad (4)$$

Cosine similarity
$$\Psi(\delta^h, \delta^{h'}) = \cos\big(\angle(\delta^h, \delta^{h'})\big) \qquad (5)$$

4 Counterfactual Explanations for Explaining Model Adaptations

Counterfactual explanations are a popular instance of contrastive explanations (Sect. 2.2). In the following, we study counterfactual explanations for explaining model adaptations as proposed in Definition 3. We first relate the difference between two linear classifiers to their counterfactuals and, vice versa, the change of counterfactuals to the model change. Then we propose a method for finding relevant regions and samples for comparing counterfactual explanations.

4.1 Counterfactuals for Linear Models

First, we highlight the possibility of relating the similarity of two linear models at a given point to their counterfactuals. We consider a linear binary classifier $h : \mathbb{R}^d \to \{-1, 1\}$:

$$h(x) = \mathrm{sign}(w^\top x) \qquad (6)$$

and assume w.l.o.g. that the weight vector $w \in \mathbb{R}^d$ has unit length, $\|w\|_2 = 1$. For the adaptation we assume $h'(x) = \mathrm{sign}(w_*^\top x)$ with a unit weight vector $w_*$.

Theorem 1 (Cosine Similarity of Linear Models). Let h(·) and h′(·) be two linear models, and x a data point. Let $x_{cf} = CF(x, h)$ and $x_{cf*} = CF(x, h')$ be the closest counterfactuals of the data point $x \in \mathbb{R}^d$ under the original model h(·) resp. the adapted model h′(·). Then

$$\cos(\angle(w, w_*)) = \frac{x_{cf}^\top x_{cf*} + x^\top x - x_{cf}^\top x - x_{cf*}^\top x}{\sqrt{\big(x_{cf}^\top x_{cf} + x^\top x - 2\, x_{cf}^\top x\big)\big(x_{cf*}^\top x_{cf*} + x^\top x - 2\, x_{cf*}^\top x\big)}} \qquad (7)$$

Since every (possibly nonlinear) model can locally be approximated linearly, this result also indicates the relevance of counterfactuals for characterizing local differences of two models.
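For a homogeneous linear classifier with unit weight vector, the closest counterfactual of a point is (up to an arbitrarily small step across the boundary) its orthogonal projection onto the decision hyperplane, so Eq. (7) can be checked numerically. The following sketch assumes exactly this closed-form counterfactual and random unit weight vectors; it is only a sanity check, not the authors' experimental code.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

def closest_cf_linear(x, w):
    """Closest counterfactual of x for h(x) = sign(w^T x) with ||w|| = 1:
    the orthogonal projection of x onto the hyperplane w^T x = 0."""
    return x - (w @ x) * w

w  = rng.normal(size=d);  w  /= np.linalg.norm(w)
ws = rng.normal(size=d);  ws /= np.linalg.norm(ws)
x  = rng.normal(size=d)

x_cf, x_cfs = closest_cf_linear(x, w), closest_cf_linear(x, ws)

# Right-hand side of Eq. (7), written with inner products only
num = x_cf @ x_cfs + x @ x - x_cf @ x - x_cfs @ x
den = np.sqrt((x_cf @ x_cf + x @ x - 2 * x_cf @ x) *
              (x_cfs @ x_cfs + x @ x - 2 * x_cfs @ x))

# The two values agree up to the sign of h(x) * h'(x) and numerical rounding.
print(num / den, w @ ws)
```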
Conversely, it is possible to bound the difference of the counterfactual explanations by the difference of the classifiers as follows:

Theorem 2 (Change of a Closest Counterfactual Explanation). Let h(·) be a binary linear classifier as in Eq. (6) and h′(·) be its adaptation. Then, the difference between the two closest counterfactuals of an arbitrary sample $(x, y) \in \mathbb{R}^d \times \{-1, 1\}$ can be bounded as:

$$\|CF(x, h) - CF(x, h')\|_2 \leq \sqrt{8}\, \|x\|_2 \, \big(1 - \cos(\angle(w, w_*))\big)^{1/2} \qquad (8)$$

4.2 Finding Relevant Regions in Data Space

Since we are interested in local explanations, the question arises which position x is most informative to highlight different principles of the two models. In the following, we formalize the notion of characteristic samples in the context of model change to enable an algorithmic selection of such interesting points.

The idea is to provide an interest function, i.e. a function that marks the regions of the data space that are of interest for our consideration. This function i(·) should have the following properties:

1. For every pair of fixed models $h, h' \in \mathcal{H}$ it maps every point $x \in \mathcal{X}$ in the data space to a non-negative number indicating the interest - i.e. $i : \mathcal{X} \times (\mathcal{H} \times \mathcal{H}) \to \mathbb{R}_+$.
2. It should be continuous with respect to the classifiers and in particular $i(x, h, h') = 0$ for all x if and only if $h = \pm h'$.
3. Points that are "more interesting" as measured by a difference of local explanations should take on higher values.
4. Regions where the classifiers "coincide structurally", i.e. their local explanations coincide, are not of interest.
An obvious choice for i(·) is to directly use the explanation from Definition 3 itself together with a difference measure Ψ(·) as stated in Eq. (4) and Eq. (5):

$$i(x, h, h') = \Psi\big(CF(x, h), CF(x, h')\big) \qquad (9)$$

It is easy to see that the four properties are fulfilled if we assume that Ψ(·) is chosen suitably. Besides the Euclidean distance Eq. (4), the cosine similarity Eq. (5) is a potential choice for comparing two counterfactuals in Ψ(·). Since the cosine always takes values between −1 and 1, we scale it to a positive codomain:

$$\Psi\big(CF(x, h), CF(x, h')\big) = 2 - \cos\big(\angle(CF(x, h), CF(x, h'))\big) \qquad (10)$$

However, the measure Ψ(·) as suggested in Eq. (10) is discontinuous if we approach the decision boundary of one of the classifiers. This problem can be resolved by using a relaxed version of Eq. (10):

$$\Psi(\cdot) = 2 - \frac{\langle CF(x, h) - x,\; CF(x, h') - x\rangle}{\|CF(x, h) - x\|_2 \, \|CF(x, h') - x\|_2 + \varepsilon} \qquad (11)$$

for some small ε > 0.

Approximation for Gradient Based Models. While the definition of the interest
function Eq. (9) captures our goal of identifying interesting samples, the compu-
tation of (closest) counterfactuals can be computationally challenging for many
models [3] hence finding local maxima of i(·) is infeasible. In these cases, an
efficient approximation is possible, provided the classifier h(·) is induced by a
differentiable function f (·) in the form h(x) = sign (f (x)). Then the gradient of
f (·) enables an approximation of the counterfactual in the form

CF(x, h) = x − ηh(x)∇x f (x) (12)


108 A. Artelt et al.

for a suitable η > 0. In this case (Eq. (12)), the cosine similarity approach Eq.
(10) works particularly well because it is invariant with respect to the choice
of η - i.e. η can be ignored. Under some smoothness assumptions, this modeling
admits simple geometric interpretations, since the gradient points towards
the closest point on the decision boundary in this case. This way it (locally)
reduces the interpretation to linear classifiers, for which counterfactuals are well
understood [3].
In the remainder of this work, we use the gradient approximation together
with the cosine similarity for computing the interest of given samples.
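The sketch below illustrates how the gradient approximation Eq. (12) and the interest function Eq. (9) with the relaxed cosine measure Eq. (11) can be combined; here f and grad_f stand for the differentiable score function of a model and its gradient, and all names as well as the concrete step size eta are illustrative assumptions rather than the original implementation:

import numpy as np

def approx_cf(x, f, grad_f, eta=1.0):
    # Gradient-based approximation of the closest counterfactual, cf. Eq. (12)
    return x - eta * np.sign(f(x)) * grad_f(x)

def interest(x, f_old, grad_old, f_new, grad_new, eps=1e-8):
    # Interest i(x, h, h') via the relaxed cosine measure of Eq. (11)
    a = approx_cf(x, f_old, grad_old) - x
    b = approx_cf(x, f_new, grad_new) - x
    return 2.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)

Because the cosine only depends on the directions of a and b, the concrete value of eta cancels out, matching the remark above.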

5 Experiments
We empirically evaluate each of our proposed methods separately. First, we
demonstrate the usefulness of comparing contrastive explanations for explaining
model adaptations. Then we evaluate our method for finding relevant regions
in data space that are affected by the model adaptations and thus are inter-
esting candidates for illustrating the corresponding difference in counterfactual
explanations.
All experiments are implemented2 in Python. We use CEML [2] for comput-
ing counterfactuals and use MOSEK as a solver for all mathematical programs.

5.1 Data Sets

We use the following data sets in our experiments:

Gaussian Blobs Data Set. This artificial toy data set consists of a binary classification
problem. It is generated by sampling from two different two-dimensional
Gaussian distributions - each class has its own Gaussian distribution. The drift
is introduced by changing the Gaussian distributions between the two batches.
In the first batch the two classes can be separated with a threshold on the first
feature, whereas in the second batch the second feature must also be considered.

Coffee Data Set. The data set consists of hyperspectral measurements of three
types of coffee beans measured at three distinct times within three months of
2020. Samples of Arabica, Robusta and immature Arabica beans were measured
by a SWIR 384 hyperspectral camera. The sensor measures the reflectance of
the samples for 288 wavelengths in the range between 900 and 2500 nm. For our
experiments, we standardize and subsample the data by a factor of 5. It is known
that the data distribution is drifting between the measurement times.

2
https://github.com/andreArtelt/ContrastiveExplanationsForModelAdaptations.

Human Activity Recognition Data Set. The human activity recognition (HAR)
data set by [19] contains data from 30 volunteers performing activities like walking,
walking downstairs and walking upstairs. Volunteers wear a smartphone
recording the three-dimensional linear and angular acceleration sensors. We use
a time window of length 64 to aggregate the data stream and compute the
median per sensor axis and time window. We create drift by putting half of all
samples with label walking or walking upstairs into the first batch - i.e. walking
vs. walking upstairs - and the other half of walking together with samples labeled as
walking downstairs into the other batch - i.e. walking vs. walking downstairs.
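A minimal sketch of the window aggregation described above (the function name and the exact non-overlapping windowing are our own simplification):

import numpy as np

def median_windows(stream, window_len=64):
    # Median per sensor axis over non-overlapping time windows; stream has shape (T, n_axes)
    n_win = len(stream) // window_len
    return np.array([np.median(stream[i * window_len:(i + 1) * window_len], axis=0)
                     for i in range(n_win)])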

5.2 Comparing Counterfactuals for Explaining Model Adaptation

Gaussian Blobs. We fit a Gaussian Naive Bayes classifier to the first batch
and then adapt the model to the second batch of the Gaussian blobs data set.
Besides both batches, we also generate 200 samples (located between the
two Gaussians) for explaining the model changes. We compute counterfactuals for
all test samples under the old and the adapted model. The differences of the
counterfactuals are shown in Fig. 1. We observe a significant change in the second
feature of the adapted model - which makes sense since we know that, in contrast
to the first batch, the second feature is necessary for discriminating the
data in the second batch.

Fig. 1. Changes in counterfactuals for the Gaussian blob data set

Coffee. We consider the model drift between a model trained with the data
collected on 26.06 and another model based on the data from 14.08. As we
know that the drift in the data set is abrupt, we train a logistic regression classifier
on the training data collected at the first measurement time (model 1),
and another on the second measurement time (model 2). We compute counterfactual
explanations for all the samples in the test set of the first measurement time
that are classified correctly by model 1 but misclassified by model 2. The target
label of the explanation is the original label. This way, we analyze how the
model changes for the different measurement times. The differences are shown in
Fig. 2. We observe that, surprisingly, there are only a few frequencies which are
consistently treated differently by both models.

Fig. 2. Mean difference in counterfactual explanations for two updated models (t2 = 14th August and t2 = 28th August).

Loan Approval. We fit a decision tree classifier to the first batch
and completely refit the model to the first and second batch of
the credit data set. The test data from both batches is used for computing
counterfactual explanations for explaining the model changes.
The changes in the counterfactual explanations for switching from
“reject” to “accept” are shown in Fig. 3. We observe that after adapting
the model to the second batch, there are a couple of cases where
increasing the credit amount would turn a rejection into an acceptance
- we consider this as inappropriate and unwanted behaviour.

Fig. 3. Changes in counterfactuals when turning a “reject” into an “accept”.

5.3 Finding Relevant Regions in Data Space


Human Activity Recognition. We fit a Gaussian Naive Bayes classifier to
the first batch and then adapt the model to the second batch of the HAR data
set. In the first setting, we use the test data from both batches for explaining
the model changes using counterfactual explanations, and in the second setting
we only use approx. 1/3 of all samples, sorted by relevance as determined by our
method proposed in Sect. 4.2. In Fig. 4 we plot the changes of the counterfactual
explanations for both cases when switching from walking up-/downstairs
to walking straight. We observe the same effects in both cases but with much
less noise in the case of using only a few relevant samples - this suggests that we
successfully identify relevant samples for explaining the model changes.
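A minimal sketch of this relevance-based subsampling, assuming the interest scores have already been computed as in Sect. 4.2 (the function name and the fraction handling are ours):

import numpy as np

def most_relevant(X, scores, fraction=1 / 3):
    # Keep the fraction of samples with the highest interest scores
    k = max(1, int(len(X) * fraction))
    return X[np.argsort(scores)[::-1][:k]]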

Fig. 4. Left: Changes in counterfactuals considering all test samples. Right: Changes
in counterfactual explanations considering the most relevant test samples.

6 Discussion and Conclusion

In this work we proposed to compare contrastive explanations as a proxy for


explaining and understanding model adaptations - i.e. highlighting differences
in the underlying decision making rules of the models. In this context, we also
proposed a method for finding samples where the explanation changed signifi-
cantly and thus might be illustrative for understanding the model adaptation.
We empirically demonstrated the functionality of all our proposed methods.
In future research, we would like to study the benefits of comparing con-
trastive explanations for explaining model adaptations from a psychological per-
spective - i.e. studying how people perceive model adaptations and how useful
they find these explanations for understanding and assessing model adaptations.

References
1. Aamodt, A., Plaza, E.: Case-based reasoning: foundational issues, methodological
variations, and system approaches. AI Commun. (1994)
2. Artelt, A.: CEML: counterfactuals for explaining machine learning models - a
python toolbox (2019–2021). https://www.github.com/andreArtelt/ceml
3. Artelt, A., Hammer, B.: On the computation of counterfactual explanations - a
survey. CoRR abs/1911.07749 (2019)
4. Artelt, A., Hammer, B.: Efficient computation of counterfactual explanations and
counterfactual metrics of prototype-based classifiers (2021)
5. Byrne, R.M.J.: Counterfactuals in explainable artificial intelligence (XAI): Evi-
dence from human reasoning. In: IJCAI-19 (2019)
6. Caton, S., Haas, C.: Fairness in machine learning: a survey. CoRR abs/2010.04053
(2020)
7. Fisher, A., Rudin, C., Dominici, F.: All models are wrong but many are useful:
variable importance for black-box, proprietary, or misspecified prediction models,
using model class reliance. arXiv e-prints arXiv:1801.01489, January 2018

8. Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., Bouchachia, A.: A survey on
concept drift adaptation. ACM Comput. Surv. (CSUR) 46(4) (2014)
9. Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., Giannotti, F., Pedreschi, D.:
A survey of methods for explaining black box models. ACM Comput. Surv. 51(5)
(2018)
10. Hinder, F., Hammer, B.: Counterfactual explanations of concept drift (2020)
11. Khandani, A.E., Kim, A.J., Lo, A.: Consumer credit-risk models via machine-
learning algorithms. J. Banking Finan. 34(11) (2010)
12. Kim, B., Koyejo, O., Khanna, R.: Examples are not enough, learn to criticize! Crit-
icism for interpretability. In: Advances in Neural Information Processing Systems,
vol. 29 (2016)
13. Leslie, D.: Understanding artificial intelligence ethics and safety. CoRR
abs/1906.05684 (2019)
14. Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., Galstyan, A.: A survey on
bias and fairness in machine learning. CoRR abs/1908.09635 (2019)
15. Molnar, C.: Interpretable Machine Learning (2019)
16. Parisi, G.I., Kemker, R., Part, J.L., Kanan, C., Wermter, S.: Continual lifelong
learning with neural networks: a review. Neural Netw. 113 (2019)
17. European Parliament and of the Council: General data protection regulation: Reg-
ulation (EU) 2016/679 of the European parliament (2016)
18. Pedapati, T., Balakrishnan, A., Shanmugam, K., Dhurandhar, A.: Learning global
transparent models consistent with local contrastive explanations. In: Larochelle,
H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H. (eds.) Advances in Neural
Information Processing Systems, vol. 33. Curran Associates, Inc. (2020)
19. Reyes-Ortiz, J., Oneto, L., Samà, A., Parra, X., Anguita, D.: Transition-aware
human activity recognition using smartphones. Neurocomputing 171 (2016)
20. Ribeiro, M.T., Singh, S., Guestrin, C.: “Why should i trust you?”: explaining the
predictions of any classifier. In: KDD 2016. ACM, New York (2016)
21. Samek, W., Wiegand, T., Müller, K.: Explainable artificial intelligence: under-
standing, visualizing and interpreting deep learning models. CoRR abs/1708.08296
(2017)
22. Stalidis, P., Semertzidis, T., Daras, P.: Examining deep learning architectures for
crime classification and prediction. CoRR abs/1812.00602 (2018)
23. Tjoa, E., Guan, C.: A survey on explainable artificial intelligence (XAI): towards
medical XAI. CoRR abs/1907.07374 (2019)
24. Verma, S., Dickerson, J., Hines, K.: Counterfactual explanations for machine learn-
ing: a review (2020)
25. Wachter, S., Mittelstadt, B.D., Russell, C.: Counterfactual explanations without
opening the black box: automated decisions and the GDPR. CoRR abs/1711.00399
(2017)
26. Waddell, K.: How algorithms can bring down minorities’ credit scores. The Atlantic
(2016)
27. Wortman Vaughan, J., Wallach, H.: A human-centered agenda for intelligible
machine learning, February 2021
Dimensionality Reduction: Is Feature Selection
More Effective Than Random Selection?

Laura Morán-Fernández(B) and Verónica Bolón-Canedo

CITIC, Universidade da Coruña, A Coruña, Spain


{laura.moranf,veronica.bolon}@udc.es

Abstract. The advent of Big Data has brought with it an unprecedented and
overwhelming increase in data volume, not only in samples but also in available
features. Feature selection, the process of selecting the relevant features and dis-
carding the irrelevant ones, has been successfully applied over the last decades
to reduce the dimensionality of the datasets. However, there is a great number of
feature selection methods available in the literature, and choosing the right one
for a given problem is not a trivial decision. In this paper we will try to determine
which of the multiple methods in the literature are the best suited for a particular
type of problem, and study their effectiveness when comparing them with a ran-
dom selection. In our experiments we will use an extensive number of datasets
that allow us to work on a wide variety of problems from the real world that need
to be dealt with in this field. Seven popular feature selection methods were used,
as well as five different classifiers to evaluate their performance. The experimental
results suggest that feature selection is, in general, a powerful tool in machine
learning, with correlation-based feature selection being the best option independently
of the scenario. Also, we found that the choice of an inappropriate
threshold when using ranker methods leads to results as poor as when randomly
selecting a subset of features.

Keywords: Dimensionality reduction · Feature selection · Filters ·


Classification

1 Introduction
Driven by recent advances in algorithms, computing power, and big data, artificial intelligence
has made substantial breakthroughs in the last years. In particular, machine
learning has had great success because of its impressive ability to automatically analyze
large amounts of data. One of the most important tasks in machine learning is classification,
which makes it possible to predict events in a plethora of applications; from medicine to
This research has been financially supported in part by the Spanish Ministerio de Economı́a y
Competitividad (research project PID2019-109238GB-C2), and by the Xunta de Galicia (Grants
ED431C 2018/34 and ED431G 2019/01) with the European Union ERDF funds. CITIC, as
Research Center accredited by Galician University System, is funded by “Consellerı́a de Cultura,
Educación e Universidades from Xunta de Galicia”, supported in an 80% through ERDF Funds,
ERDF Operational Programme Galicia 2014–2020, and the remaining 20% by “Secretarı́a Xeral
de Universidades” (Grant ED431G 2019/01).


finances. However, some of the most popular classification algorithms can degrade their
performance when facing a large number of irrelevant and/or redundant features. This
phenomenon is known as the curse of dimensionality and is the reason why dimensionality
reduction methods play an important role in preprocessing the data.
One such dimensionality reduction technique is feature selection, which can be
defined as the process of selecting the relevant features and discarding the irrelevant
or redundant ones. Considerable numbers of noisy and useless features are often collected
or generated by different sensors and methods, which also occupy a lot of computational
resources. Therefore, feature selection plays a crucial role in the machine learning
framework by removing meaningless features and preserving a small subset of
features to reduce the computational complexity.
There are several applications in which it is necessary to find the relevant features: in
bioinformatics (e.g. to identify a few key biomolecules that explain most of an observed
phenotype [5]), in respect to the fairness of decision making (e.g. to find the input fea-
tures used in the decision process, instead of focusing on the fairness of the decision
outcomes [9]), or in nanotechnology (e.g. to determine the most relevant experimental
conditions and physicochemical features to be considered when making a nanotoxicol-
ogy risk assessment [8]). A shared aspect of these applications is that they are not pure
classification tasks. In fact, an understanding of which features are relevant is as impor-
tant as accurate classification, as these features may provide us with new insights into
the underlying system.
However, there is a large amount of feature selection methods available, and most
researchers agree that the best feature selection method simply does not exist [3]. On
top of this, new feature selection methods are appearing every year, which makes us
ask the questions: do we really need so many feature selection methods? Which ones
are the best to use for each type of data? In light of these issues, the aim of this paper
is to perform an analysis of the most popular feature selection methods using the ran-
dom selection as baseline in two scenarios: real datasets and DNA microarray datasets
(characterized by having a much larger number of features than of samples). Our goal
is to determine if there are some methods that do not obtain significantly better results
than those achieved when randomly selecting a subset of features.
The remainder of the paper is organized as follows: Sect. 2 presents the different
feature selection methods employed in the study. Section 3 provides a brief description
of the 55 datasets used to reduce data dimensionality. Section 4 details the experimental
results carried out. Finally, Sect. 5 contains our concluding remarks and proposals for
future research.

2 Feature Selection
Feature selection methods have received a great deal of attention in the classification lit-
erature, which can be described according to their relationship with the induction algo-
rithm in three categories [10]: filters, wrappers and embedded methods. Since wrapper
and embedded methods interact with the classifier, we opted for filter methods. Besides,
filter methods are a common choice in the new Big Data scenario, mainly due to their
low computational cost compared to the wrapper or embedded methods. Below we
describe the seven filters used in the experimental study.

– Correlation-based Feature Selection (CFS) is a simple multivariate filter algo-


rithm that ranks feature subsets according to a correlation-based heuristic evaluation
function [12].
– The INTERACT (INT) algorithm is based on symmetrical uncertainty and it also
includes the consistency contribution [23].
– Information Gain (IG) filter evaluates the features according to their information
gain and considers a single feature at a time [11].
– ReliefF algorithm (RelF) [13] estimates features according to how well their values
distinguish among the instances that are near to each other.
– Mutual Information Maximisation (MIM) [15] ranks the features by their mutual
information score, and selects the top k features, where k is decided by some prede-
fined need for a certain number of features or some other stopping criterion.
– The minimum Redundancy Maximum Relevance (mRMR) [20] feature selection
method selects features that have the highest relevance with the target class and
are also minimally redundant. Both maximum-relevance and minimum-redundancy
criteria are based on mutual information.
– Joint Mutual Information (JMI) [22] is another feature selection method based on
mutual information, which quantifies the relevancy, the redundancy and the comple-
mentarity.

3 Datasets
In order to evaluate empirically the effect of feature selection, we employed 55 real
datasets. 38 datasets were downloaded from the UCI repository [1], with the restriction
of having at least 9 features. Additionally, 17 microarray datasets were used due to
their high dimensionality [17]. Tables 1 and 2 profile the main characteristics of the real
datasets used in this research in terms of the number of samples, features and classes.
Continuous features were discretized using an equal-width strategy into 5 bins, while
features already with a categorical range were left untouched.
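A minimal sketch of this preprocessing step with scikit-learn, assuming cont_cols holds the indices of the continuous columns of a NumPy feature matrix (helper name and column handling are ours):

from sklearn.preprocessing import KBinsDiscretizer

def discretize(X, cont_cols):
    # Equal-width ("uniform") discretization of the continuous columns into 5 bins
    X = X.astype(float).copy()
    disc = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
    X[:, cont_cols] = disc.fit_transform(X[:, cont_cols])
    return X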

4 Experimental Results
The different experiments carried out consist of making comparisons between the appli-
cation of the seven feature selection methods individually, as well as the random selec-
tion (represented as “Ran” in the tables/figures), which will be the baseline for our
comparisons. While two of the feature selection methods return a feature subset (CFS
and INTERACT), the other five (IG, ReliefF, MIM, JMI and mRMR) are ranker meth-
ods, so a threshold is mandatory in order to obtain a subset of features. In this work we
have opted for retaining the top 10%, 20% and log2 (n) of the most relevant features of
the ordered ranking, where n is the number of features in a given dataset. In the case of
microarray datasets, due to the mismatch between dimensionality and sample size, the
thresholds selected the top 5%, 10% and log2 (n) features, respectively. We computed
3 × 5-fold cross validation to estimate the error rate.
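As an illustration of the ranker-plus-threshold scheme, the sketch below ranks features with a mutual-information score (a stand-in for the MIM filter of Sect. 2, using scikit-learn's mutual_info_classif) and keeps the top 10%, 20% or log2(n) features; the helper name and rule encoding are ours:

import numpy as np
from sklearn.feature_selection import mutual_info_classif

def mim_select(X, y, rule='20%'):
    # Rank features by mutual information and keep the top 10%, 20% or log2(n) of them
    n = X.shape[1]
    k = max(1, {'10%': n // 10, '20%': n // 5, 'log': int(np.log2(n))}[rule])
    scores = mutual_info_classif(X, y)
    return np.argsort(scores)[::-1][:k]   # indices of the selected features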
According to the No-Free-Lunch theorem, the best classifier will not be the same
for all the datasets [21]. For this reason, the behavior of the feature selection methods

Table 1. Characteristics of the 38 real datasets. It shows the number of samples (#sam.), features
(#feat.) and classes (#cl.).

Dataset #sam. #feat. #cl. Dataset #sam. #feat. #cl.


arrhythmia 452 279 13 molec-biol-promoter 106 57 2
bc-wisc-diag 569 30 2 molec-biol-splice 3190 60 3
bc-wisc-prog 198 33 2 musk-2 6598 166 2
breast 569 30 2 optdigits 5620 64 10
coil20 1440 1024 20 ozone 2536 72 2
congress 435 16 2 page-blocks 5473 10 5
conn-bench-sonar 208 60 2 parkinsons 195 22 2
connect-4 67557 42 2 pendigits 10992 16 10
dermatology 366 34 6 satimage 6435 36 6
gisette 7000 5000 2 segmentation 2310 19 7
glass 214 9 6 semeion 1593 256 10
heart 270 13 2 sonar 208 60 2
hill-valley 606 100 2 soybeansmall 47 36 4
ionosphere 351 35 2 spect 267 23 2
isolet 7797 617 2 splice 3175 60 3
krvskp 3196 36 2 USPS 9298 256 10
landstat 5435 36 6 waveform 5000 40 3
libras 360 90 15 wine 178 13 3
low-res-spect 531 100 9 zoo 101 17 7

will be tested according to the classification error obtained by five different classifiers,
each belonging to a different family. The classifiers employed were: two linear (naive
Bayes and Support Vector Machine using a linear kernel) and three nonlinear (C4.5,
k-Nearest Neighbor with k = 3 and Random Forest). All five classifiers were executed
using the Weka tool, with default values for the parameters.

4.1 Real Datasets


This section reports the experimental results achieved by the different feature selection
methods over the 38 real datasets, depending on the classifier. In order to explore the
statistical significance of our classification results, we analyzed the classification error
by using a Friedman test with the Nemenyi post-hoc test. The following figures present
the critical difference (CD) diagrams, introduced by Demšar [6], where groups of meth-
ods that are not significantly different (at α = 0.10) are connected. The top line in the
critical difference diagram is the axis on which we plot the average ranks of methods.
The axis is turned so that the lowest (best) ranks are to the right since we perceive the
methods on the right side as better. As can be seen in Fig. 1, regardless of the classifier
used, it seems that the most suitable feature selection methods for this type of datasets
are CFS and INTERACT, which have the additional advantage that there is no threshold

Table 2. Characteristics of the 17 microarary datasets. It shows the number of samples (#sam.),
features (#feat.) and classes (#cl.).

Dataset #sam. #feat. #cl. Dataset #sam. #feat. #cl.


9-tumors 60 5726 9 gli85 85 22283 2
11-tumors 174 12533 11 leukemia-1 72 5327 3
brain 21 12625 2 leukemia-2 72 11225 3
brain-tumor-1 90 5920 5 lung-cancer 203 12600 5
brain-tumor-2 50 10367 4 ovarian 253 15154 2
CLL-SUB-111 111 11340 3 smk 187 19993 2
CNS 60 7129 2 SRBCT 83 2308 4
colon 62 2000 2 TOX-171 171 5748 4
DLBCL 47 4026 2

for the number of features to select. In the case of ranker methods, which do need to set
a threshold, in general it seems that the percentage of 20% is the best option.
We now compare the classification error achieved by the feature selection meth-
ods and our baseline, the random selection. It can be seen that for all the classification
algorithms, the random selection, both with the logarithmic and 10% thresholds, is the
one that obtains the worst results. However, we can also see that random selection,
with the 20% threshold, achieves highly competitive results compared to certain fea-
ture selection methods. Due to the drawbacks of the traditional tests of contrast of the
null hypothesis pointed up by [2], we have chosen to apply the Bayesian hypothesis
test [14], in order to analyze the classification results achieved by “Ran-20” and the
ranker methods. In this type of analysis, a previous step is needed, which consists in the
definition of the Region of practical equivalence (Rope). Two methods are considered
practically equivalent in practice if their mean differences given a certain metric are less
than a predefined threshold. In our case, we will consider two methods as equivalent if
the difference in error is less than 1%.
For the whole benchmark and each pair of methods, we calculate the probability of
the three possibilities: (i) random selection (Ran) wins over filter method with a differ-
ence larger than rope, (ii) filter method wins over random selection with a difference
larger than rope, and (iii) the difference between the results is within the rope area.
If one of these probabilities is higher than 95%, we consider that there is a signifi-
cant difference. Thus, Fig. 2 shows the distribution of the differences between each pair
of methods using simplex graphs. It can be seen that, although the difference between
random selection with the 20% threshold and several ranker methods is not statistically
significant, random selection consistently outperforms them. This means that applying
the ranker methods (ReliefF, InfoGain and MIM) with an incorrect threshold produces
results comparable to those obtained when randomly selecting 20% of the features.
These results highlight the importance of choosing a good threshold, which is not a
trivial task, especially because it usually depends on the particular problem to be solved
(and sometimes, even the classifier that is subsequently used).
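The sketch below is not the Bayesian hierarchical test of [14]; it is only a much-simplified bootstrap illustration of the three rope outcomes described above, assuming paired per-dataset error rates in percent and a rope of 1%:

import numpy as np

def rope_probabilities(err_a, err_b, rope=1.0, n_boot=10000, seed=0):
    # Bootstrap estimate of P(a wins), P(practically equivalent) and P(b wins)
    rng = np.random.default_rng(seed)
    diffs = np.asarray(err_a) - np.asarray(err_b)
    means = np.array([rng.choice(diffs, size=len(diffs), replace=True).mean()
                      for _ in range(n_boot)])
    return (np.mean(means < -rope), np.mean(np.abs(means) <= rope), np.mean(means > rope))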

(a) C4.5 classifier (b) NB classifier (c) 3NN classifier (d) SVM classifier (e) Random Forest classifier

Fig. 1. Critical difference diagrams showing the ranks after applying feature selection over the
38 real datasets. For feature selection methods that require a threshold, the option to keep 10%
is indicated by ‘−10’, the option to stay with 20% is indicated by ‘−20’, and the option ‘−log’
refers to use log2 .

(a) C4.5 classifier (b) SVM classifier (c) SVM classifier

(d) SVM classifier (e) Random Forest classifier (f) Random Forest classifier

Fig. 2. Simplex graphs for pair comparison of each feature selection method and the baseline ran-
dom selection (Ran) over the 38 real datasets using Bayesian hierarchical tests: random selection
(left) and filter method (right).

Regarding the five different classifiers used, Table 3 shows the classification error
obtained by the five classifiers and the eight feature selection methods—the seven fil-
ters and the random selection—over the 38 real datasets (lower error rates highlighted
in bold). As can be seen, although the classification results obtained were not consider-
ably different between the different feature selection methods used, it is notable that the
results obtained with Random Forest outperformed those achieved by the other classi-
fiers.

4.2 Microarray Datasets

The classification of DNA microarray has been viewed as a particular challenge for
machine learning researchers, mainly due to the mismatch between dimensionality and
sample size. Several studies have demonstrated that most of the genes measured in
a microarray experiment do not actually contribute to efficient sample classification [4].
To avoid this curse of dimensionality, feature selection is advisable so as to identify the
specific genes that enhance classification accuracy.
Following the same study as for the previous datasets, and in order to analyze the
ranks of the feature selection methods over the 17 microarray datasets, Fig. 3 presents
the critical difference diagrams for each classification algorithm. As can be seen, the fea-
ture selection method that performs best varies depending on the classifier. However,
we can say that, in general, CFS is the best option. With regard to the different thresh-
olds used by the ranker methods, the percentage that retains 5% of the features seems
to be the most appropriate for these high dimensional datasets.

Table 3. Classification errors obtained by the five classifiers for the real datasets tested.

C4.5 NB 3NN SVM RF


CFS 15.17 18.05 14.83 14.85 13.06
INT 15.01 18.87 14.99 14.98 12.80
IG-10 22.05 26.51 21.96 24.93 21.12
IG-20 18.17 23.52 18.20 19.88 16.88
IG-log 21.65 27.30 21.96 25.84 20.92
RelF-10 23.66 27.67 23.88 25.13 22.87
RelF-20 19.86 24.39 19.84 20.33 18.11
RelF-log 23.57 28.12 23.40 26.27 22.67
MIM-10 22.08 26.64 22.24 25.08 21.23
MIM-20 18.13 23.55 17.92 19.88 16.69
MIM-log 21.88 27.37 22.23 26.04 20.98
mRMR-10 20.79 24.15 20.64 23.19 19.56
mRMR-20 18.10 23.35 17.88 19.66 16.57
mRMR-log 19.48 23.79 19.31 22.93 18.39
JMI-10 20.34 23.29 19.95 22.44 19.02
JMI-20 16.84 20.70 16.40 17.95 15.05
JMI-log 18.89 22.43 18.55 21.98 17.64
Ran-10 30.34 34.87 30.87 32.08 29.45
Ran-20 23.66 29.15 24.12 24.96 22.13
Ran-log 29.16 34.66 29.69 32.66 28.57

If we observe in depth the results provided by the statistical tests, we can also see
that the random selection, both for the thresholds that retain 5 and 10% and for the log-
arithm, obtains the poorest classification accuracy in the C4.5, NB, 3NN and Random
Forest classifiers. The SVM results show a particularly interesting behavior. It seems
that this classification algorithm does not work too well when the number of features is
low (compared to the original size of the dataset) [16]. Remember that, in the case that
the threshold used by the ranker methods select the top log2 (n) features, the number
of features used to train the model will be a maximum of 15 for these datasets (not
even 1% of the number of features in the original microarray dataset). Analogously as
with the real datasets, Fig. 4 shows the distribution of the differences between random
selection—with 5% and 10% thresholds—and the ranker methods with the logarithm
threshold using simplex graphs. As can be seen, the random selection performs better
on average and with statistical significance over the ranker methods which retain the top
log2 (n) features. Again, these results demonstrate, and in this case more prominently,
that an incorrect choice of threshold when using ranker methods might lead to perfor-
mance as poor as with a random selection of features. This problem is difficult to solve,
as the only way to ensure that we are using the correct threshold is to try a significant
number of them and compute the classification performance for that subset of features,
which would result in inadmissible computation times.

(a) C4.5 classifier (b) NB classifier (c) 3NN classifier (d) SVM classifier (e) Random Forest classifier

Fig. 3. Critical difference diagram showing the ranks after applying feature selection over the 17
microarray datasets. For feature selection methods that require a threshold, the option to keep 5%
is indicated by ‘−5’, the option to stay with 10% is indicated by ‘−10’, and the option ‘−log’
refers to use log2 .

Fig. 4. Simplex graphs for pair comparison of each feature selection method and the baseline
random selection (Ran) over the 17 microarray datasets for SVM classifier using Bayesian hier-
archical tests: random selection (left) and filter method (right).

Table 4 shows the classification error obtained by the five classifiers and the eight
feature selection methods over the 17 microarray datasets (the lowest error rates high-
lighted in bold). These results show the superiority in performance of SVM over other
classifiers in this domain, as it is also stated in González-Navarro [19].

Table 4. Classification errors obtained by the five classifiers for the microarray datasets tested.

C4.5 NB 3NN SVM RF


CFS 30.15 19.77 19.49 17.53 22.52
INT 30.40 20.26 19.56 18.46 22.56
IG-5 27.10 21.98 20.08 15.88 23.73
IG-10 27.52 22.05 20.55 15.73 23.52
IG-log 30.54 23.37 24.73 25.60 23.98
RelF-5 27.46 22.99 19.00 16.90 23.16
RelF-10 27.10 23.01 19.04 16.81 24.81
RelF-log 31.76 27.24 25.73 27.30 26.91
MIM-5 29.08 23.73 20.37 16.70 24.40
MIM-10 28.83 22.94 21.15 15.82 25.28
MIM-log 31.90 24.95 25.78 24.86 27.00
mRMR-5 30.07 21.67 18.92 16.74 24.63
mRMR-10 29.45 22.94 21.15 15.82 25.97
mRMR-log 30.33 23.56 23.71 24.31 24.84
JMI-5 32.72 24.17 23.19 17.89 27.77
JMI-10 32.06 25.19 23.68 16.72 29.36
JMI-log 32.51 25.91 27.21 26.28 27.16
Ran-5 33.00 28.08 28.22 19.62 32.08
Ran-10 32.69 26.66 28.11 17.83 32.96
Ran-log 43.70 43.00 41.62 41.47 41.35

5 Conclusions

The objective of this work is to study in an exhaustive way the most popular methods in
the field of feature selection, making the corresponding comparisons between them, as
well as to determine whether some methods are unable to outperform the results
obtained by random selection. We performed experiments with 55 datasets
(including the challenging family of DNA microarray datasets) and demonstrated that,
in general, feature selection is effective and, in most of the cases, the feature selection
methods are better than the random selection, as expected.
In particular, our experiments showed that CFS is a very good choice for any type
of dataset. Therefore, in complete ignorance of the particularities of the problem to
be solved, we suggest the use of the CFS method, which has the added advantage of
not having to establish a threshold. Regarding the use of different thresholds, it seems
that 20% is more appropriate for the normal datasets (although worse than the subset
methods, which are the winning option for this type of dataset) and the 5% threshold
for microarray datasets. Indeed, our experiments confirmed that the choice of threshold
when using ranker feature selection methods is critical. In particular, for some thresh-
olds, the results obtained were as poor as when just randomly selecting some features.

Besides, although the classification results obtained were not considerably different
between the feature selection methods used (as discussed in Morán-Fernández et al.
[18]), we can conclude that Random Forest in the case of the real datasets and SVM
in the case of the microarrays were the classifiers that obtained, in general over all the
datasets used, the best results in terms of classification accuracy, as Fernández-Delgado
et al. [7] concluded in their study.
As mentioned before, the study of an adequate threshold for ranker-type methods
is a major problem in the field of feature selection that has yet to be resolved. Thus,
as future research, we plan to test a larger number of thresholds, as well as develop an
automatic threshold for each type of dataset.

References
1. Bache, K., Lichman, M.: UCI machine learning repository. University of California, Irvine,
School of Information and Computer Sciences. http://archive.ics.uci.edu/ml/. Accessed Dec
2020
2. Benavoli, A., Corani, G., Demšar, J., Zaffalon, M.: Time for a change: a tutorial for compar-
ing multiple classifiers through Bayesian analysis. J. Mach. Learn. Res. 18(1), 2653–2688
(2017)
3. Bolón-Canedo, V., Sánchez-Maroño, N., Alonso-Betanzos, A.: A review of feature selection
methods on synthetic data. Knowl. Inf. Syst. 34(3), 483–519 (2013)
4. Bolón-Canedo, V., Sánchez-Marono, N., Alonso-Betanzos, A., Benı́tez, J.M., Herrera, F.: A
review of microarray datasets and applied feature selection methods. Inf. Sci. 282, 111–135
(2014)
5. Climente-González, H., Azencott, C.A., Kaski, S., Yamada, M.: Block HSIC lasso: model-
free biomarker detection for ultra-high dimensional data. Bioinformatics 35(14), i427–i435
(2019)
6. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res.
7, 1–30 (2006)
7. Fernández-Delgado, M., Cernadas, E., Barro, S., Amorim, D.: Do we need hundreds of clas-
sifiers to solve real world classification problems? J. Mach. Learn. Res. 15(1), 3133–3181
(2014)
8. Furxhi, I., Murphy, F., Mullins, M., Arvanitis, A., Poland, C.A.: Nanotoxicology data for in
silico tools: a literature review. Nanotoxicology 1–26 (2020)
9. Grgic-Hlaca, N., Zafar, M.B., Gummadi, K.P., Weller, A.: Beyond distributive fairness in
algorithmic decision making: feature selection for procedurally fair learning. AAAI 18, 51–
60 (2018)
10. Guyon, I., Gunn, S., Nikravesh, M., Zadeh, L.A.: Feature Extraction: Foundations and Appli-
cations, vol. 207. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-35488-8
11. Hall, M.A., Smith, L.A.: Practical feature subset selection for machine learning (1998)
12. Hall, M.A.: Correlation-based feature selection for machine learning (1999)
13. Kononenko, I.: Estimating attributes: analysis and extensions of RELIEF. In: Bergadano, F.,
De Raedt, L. (eds.) ECML 1994. LNCS, vol. 784, pp. 171–182. Springer, Heidelberg (1994).
https://doi.org/10.1007/3-540-57868-4 57
14. Kuncheva, L.I.: Bayesian-analysis-for-comparing-classifiers (2020). https://github.com/
LucyKuncheva/Bayesian-Analysis-for-Comparing-Classifiers
15. Lewis, D.D.: Feature selection and feature extraction for text categorization. In: Proceedings
of the workshop on Speech and Natural Language, pp. 212–217. Association for Computa-
tional Linguistics (1992)

16. Miller, A.: Subset Selection in Regression. CRC Press, Cambridge (2002)
17. Morán-Fernández, L., Bolón-Canedo, V., Alonso-Betanzos, A.: Can classification perfor-
mance be predicted by complexity measures? a study using microarray data. Knowl. Inf.
Syst. 51(3), 1067–1090 (2017)
18. Morán-Fernández, L., Bolón-Canedo, V., Alonso-Betanzos, A.: Do we need hundreds of
classifiers or a good feature selection? In: European Symposium on Artificial Neural Net-
works, Computational Intelligence and Machine Learning, pp. 399–404 (2020)
19. Navarro, F.F.G.: Feature selection in cancer research: microarray gene expression and in vivo
1h-mrs domains. Ph.D. thesis, Universitat Politècnica de Catalunya (UPC) (2011)
20. Peng, H., Long, F., Ding, C.: Feature selection based on mutual information criteria of max-
dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell.
27(8), 1226–1238 (2005)
21. Wolpert, D.H.: The lack of a priori distinctions between learning algorithms. Neural Comput.
8(7), 1341–1390 (1996)
22. Yang, H.H., Moody, J.: Data visualization and feature selection: new algorithms for non-
gaussian data. In: Advances in Neural Information Processing Systems, pp. 687–693 (2000)
23. Zhao, Z., Liu, H.: Searching for interacting features in subset selection. Intell. Data Anal.
13(2), 207–228 (2009)
Classification in Non-stationary
Environments Using Coresets over Sliding
Windows

Moritz Heusinger(B) and Frank-Michael Schleif

Department of Computer Science, University of Applied Sciences


Würzburg-Schweinfurt, Sanderheinrichsleitenweg 20, Würzburg, Germany
{moritz.heusinger,frank-michael.schleif}@fhws.de

Abstract. In non-stationary environments, several constraints require


algorithms to be fast, memory-efficient, and highly adaptable. While
there are several classifiers of the family of lazy learners and tree classi-
fiers in the streaming context, the application of prototype-based classi-
fiers has not found much attention. Prototype-based classifiers however
have some interesting characteristics, which are also useful in streaming
environments, in particular being highly interpretable. Hence, we propose
a new prototype-based classifier, which is based on Minimum Enclosing
Balls over sliding windows. We propose both a linear and a kernelized version
of this algorithm. Our experiments show that this technique can be
useful and is comparable in performance to another popular prototype-based
streaming classifier, the Adaptive Robust Soft Learning Vector
Quantization, but with the additional benefit of a configurable
window size to catch rapidly changing drift and the ability to use the
internal mechanics for drift detection.

1 Introduction
The amount of data that is generated in real-time has increased drastically
in recent years [1]. This is caused by the growing popularity of the Internet of
Things (IOT) in companies as well as in private households. Domains, where IOT
devices are frequently used, are smart homes, wearables, smartphones, as well
as self-driving cars [2]. In most domains, an IOT device consists of one or more
sensors, which are continuously measuring data. Thus, these devices produce a
potentially infinite flow of data. Hence, the data has to be streamed to a model
where it can make predictions in near-real-time. Due to several reasons (e.g.
increasing dirt on a sensor), the data distribution of the stream can change over
time, which makes it even harder for supervised algorithms to preserve reliable
prediction models. These changes, also called Concept Drift (CD), can appear in
different styles, which differ in speed, frequency and intensity [3], as illustrated
in Fig. 1.
M. Heusinger—Supported by StMWi, project OBerA, grant number IUK-1709-0011//
IUK530/010.

There are already several classification algorithms that can handle the characteristics
of non-stationary environments. However, the only prototype-based
classifiers which have been applied in such environments are variations
of Learning Vector Quantization (LVQ) [4,5]. Kernelized prototype-based
techniques, like the kernelized Robust Soft Learning Vector Quantization
(RSLVQ), which has so far only been considered in an offline fashion, would also be useful,
but have not been applied in non-stationary environments yet. To fill this gap
we provide a new algorithm, which is prototype-based, allows for different
kernel measurements and can be kernelized.
In this work we make the following novel contributions:

– we adopt the concept of the Minimum Enclosing Ball (MEB) to stream clas-
sification tasks
– we provide two new classification algorithms which are based on the Sliding
Window Minimum Enclosing Ball+ (SWMEB+) algorithm, proposed in [6]
– we provide experiments to compare our proposed methods against a state-of-
the-art streaming classifier w.r.t. different metrics

In Sect. 2 we discuss state-of-the-art classification models as well as the applica-


tion of MEB in the streaming context. Section 3 contains the fundamentals of our
proposed method and of stream classification. In Sect. 4 we present our MEB-
based classification model. We validate the performance of our classifier by com-
paring it to the Adaptive Robust Soft Learning Vector Quantization (ARSLVQ)
[4] in Sect. 5. Finally, we give a conclusion and outlook in Sect. 6.

2 Related Work

In the field of streaming data, different algorithms are applied to tackle classifi-
cation tasks considering the characteristics of non-stationary data.
In [7] a modern approach of a K-nearest neighbour (KNN) classifier which
uses a self-adjusting memory (SAM) mechanism called SAM-KNN has been
proposed. SAM-KNN can handle various types of concept drift, using biologically
inspired memory models and their coordination. The basic idea of it is to have
dedicated models for current and former concepts used according to the demands
of the given situation [7]. The disadvantage of SAM-KNN is, that it uses very
time-consuming mechanisms, like storing and updating multiple models, which
can be very slow in higher dimensional setups [8,9].
Often stream classifiers are combined with a Concept Drift Detector (CDD),
to adapt the model whenever a CD is detected. The Adaptive Windowing
(ADWIN) algorithm [10] is a drift detector and works by keeping updated statis-
tics of a variable-sized window, hence it can detect changes and perform cuts
in its window to better adapt the learning algorithms. Kolmogorov-Smirnov
Windowing (KSWIN) [11,12] uses a similar technique but uses a Kolmogorov-
Smirnov test between every dimension of the stream and a slightly different
window technique.

For evolving data stream classification often tree-based algorithms, e.g.


Hoeffding Tree (HT) [13], are used due to their good performance [14]. To address
the problem of concept drift, an adaptive HT with ADWIN as drift detector was
published in [15] and showed better prediction performance on evolving data
streams as the classical HT.
Also, ensemble models are used to combine multiple classifiers [16] in the
streaming domain. The Oza Bagging (OB) [17] algorithm is an online ensemble
model that uses different base classifiers. For the use of concept drift detection,
this ensemble model can be again combined with the ADWIN algorithm to the
OB ADWIN algorithm.
Further, prototype-based algorithms have been used too. Especially the fam-
ily of LVQ algorithms first introduced by [18] has received attention as potential
stream classification algorithms [4,5]. In the prototype context of streaming also
MEB based approaches – as we propose them in this paper – have been applied
recently. In [6] different methods have been proposed to maintain a MEB over
a sliding window. The reduction of the window is done by combining multiple
instances of an append-only MEB. This technique has further been used to suc-
cessfully detect CD in [19] via the proposed Sliding Window Minimum Enclosing
Ball Windowing (SWMEBWIN) algorithm.

3 Preliminaries

3.1 Concept Drift

Concept drift is the change of joint distributions of a set of samples X and


corresponding labels y between two points in time:

∃X : p(X, y)t ≠ p(X, y)t−1 (1)

The term virtual drift refers to a change in distribution p(X) for two points in
time, without affecting p(y|X). In a 2-class example, a hyperplane separating
the classes would not change due to the effects of virtual drift. Note, that we
can rewrite Eq. (1), using prior and posterior distributions, to

∃X : p(X)t p(y|X)t ≠ p(X)t−1 p(y|X)t−1 , (2)

Note, that virtual concept drift may appear in conjunction with real concept
drift. Figure 1 shows the two drift types, which are used in this paper. For a
comprehensive study of these concept drift types we refer to [3,11]. The stability-
plasticity dilemma [3] defines the trade-off between incorporating new knowledge
into models (plasticity) and preserving prior knowledge (stability). This prevents
stable performance over time because, at the edge of a drift, significant effort
goes into learning and testing against new distributions.
Thus, to achieve desirable results in streaming environments, it is necessary
to effectively detect and adapt to concept drift. This is originally done using

(a) Abrupt Drift (b) Gradual Drift

Fig. 1. Different types of drifts, one per sub-figure and illustrated as data mean. The
colors mark the dominate concept at given time step. The vertical axis shows the data
mean and the transition from one to another concept. Given the time axis the speed
of the transition is given. The figures are inspired by [3]. (Color figure online)

statistical tests over sliding windows [10,11] or measuring the performance of


a classifier [20,21]. In this work, our algorithm only uses a passive adaption
strategy, by maintaining a sliding window over a fixed number of datapoints, which
leads to forgetting older concepts.
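A minimal sketch of such a passive, fixed-size forgetting mechanism with one window per class (class name and parameters are our own choices):

from collections import deque

class ClassWindows:
    # One fixed-size sliding window per class; appending beyond max_size drops the oldest point
    def __init__(self, max_size=500):
        self.max_size = max_size
        self.windows = {}

    def add(self, x, y):
        self.windows.setdefault(y, deque(maxlen=self.max_size)).append(x)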

3.2 Minimum Enclosing Balls


We denote the Euclidean distance in Rm between two points p = (p1, ..., pm)
and q = (q1, ..., qm) as d(p, q) = √(Σ_{i=1}^{m} (pi − qi)²). A ball in Rm with center c
and radius r is defined as B(c, r) = {p ∈ Rm : d(c, p) ≤ r}. In this work c(B)
and r(B) are used to denote the center and radius of a ball. The μ-extension of
B(c, r) is denoted as μ · B, which represents a ball centered at c with a radius
of μ · r, i.e. μ · B = B(c, μ · r) [6].
Consider a set of n points P = {p0 , ..., pn } ⊂ Rm . The MEB of P , denoted
as M EB(P ) is the smallest ball, which contains all points in P . Center and
radius of M EB(P ) are represented by c∗ (P ) and r∗ (P ). For a parameter μ > 1,
a ball B is a μ-approximate MEB of P , if P ⊂ B and r(B) ≤ μ · r∗ (P ). A subset
S ⊂ P is a μ-coreset for M EB(P ), or μ − Coreset(P ), if P ⊂ μ · M EB(S).
Since S ⊆ P and r∗ (S) ≤ r∗ (P ), μ · M EB(S) is always a μ-approximate MEB
of P [6].
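For illustration, the classic core-set iteration of Bădoiu and Clarkson yields a simple batch approximation of MEB(P); this is only a didactic sketch and not the sliding-window SWMEB+ algorithm of [6] used later in this paper:

import numpy as np

def approx_meb(P, n_iter=100):
    # Move the center towards the currently farthest point with shrinking step 1/(i+1)
    c = P[0].astype(float).copy()
    for i in range(1, n_iter + 1):
        far = P[np.argmax(np.linalg.norm(P - c, axis=1))]
        c += (far - c) / (i + 1)
    r = np.linalg.norm(P - c, axis=1).max()
    return c, r   # center and radius of an approximate MEB of P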

4 Non-stationary Classification via Minimum Enclosing


Balls
We propose a technique that stores all incoming data points in a fixed-size
sliding window W = {x0 , ..., xn }. The size of the window n can be seen as a

hyperparameter. In a streaming environment a new datapoint x will be made


available to our classification model h at every timestep t. The newly arrived
datapoint will be added to the window W and if the size of the window is
exceeded the oldest datapoints will be removed until |W | ≤ n. The goal of the
classifier is to keep a MEB of W denoted as M EB(W ). To achieve this, we will
need to update the MEB whenever we notice a change in W, which happens at
every t. However, it can be that xt is already part of the MEB and adding it to
the coreset would be redundant. Hence, prior to adding x to MEB, we check if
the point lies outside the ball, using the following equation:

d(xt , c∗ (W )) > r∗ (W ) (3)


If Eq. (3) evaluates to true, xt is outside of M EB(W ). Hence, if xt is inside of
M EB(W ) we do not need to adapt our coreset, because xt is already represented
by M EB(W ).
To represent all classes which are present in the data stream, we maintain
the above-mentioned W for every unique class y of a stream S. The window of
a class is then denoted as Wy and the related M EB is denoted as M EB(Wy ).
To calculate the label of an incoming unlabeled datapoints, we first calculate
the distance δ between the center c of each M EB(Wy ) and the datapoint xt :

δy = ||xt − c(M EBy )||2 (4)


where || · ||2 denotes the Euclidean distance.
Then we select the label of the MEB that xt lies inside of, or whose center c(MEBy) it is closest to:

ŷ = argmin_{y∈θ} (δy − r(MEBy)) (5)


In Eq. (5) θ represents the set of unique labels (Fig. 2).
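A minimal sketch of this prediction rule, assuming the per-class centers and radii are available as dictionaries keyed by the labels (names are ours):

import numpy as np

def predict(x, centers, radii):
    # Eq. (4)/(5): pick the label whose ball boundary is closest to (or contains) x
    return min(centers, key=lambda y: np.linalg.norm(x - centers[y]) - radii[y])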
Additionally, we provide some mechanics to prevent the ball from growing
too fast. This is done by including only misclassified datapoints at the training
stage. Thus, at first the datapoint will be classified by Eq. (4) and Eq. (5). If the
predicted label ŷ equals the ground truth label y, the training step ends here. If
it differs, we proceed with training by updating the underlying SWMEB+. This
step can be seen as optional, hence our implementation provides this technique
as a hyperparameter.1
To incorporate old knowledge in our model we propose the following tech-
nique. Whenever a new instance of append-only Minimum Enclosing Ball
(AOMEB) is created, we put an additional datapoint xμ into our AOMEB,
where xμ is the mean of all points in our coreset S. It is also possible to use a
more advanced technique, like maintaining a cluster of all points of the current
window, e.g. via DBSCAN [22] and then incorporating different cluster points
into the model. For a description of the AOMEB algorithm we refer to [6].
Due to the fact that recently proposed streaming MEB algorithms are
append-only, like the 1.5-approximate MEB of [23] or the (1.22 + ε)-approximate
1
Implementation can be found on https://github.com/foxriver76/meb-classifier.

Fig. 2. Prediction step of the MEB classifier. Whenever a new datapoint is learned,
the underlying MEB is updated.

MEB of [24], in this setup we would have to calculate a new MEB whenever a
point is removed from our MEB(Wt). This is the case at every t as soon as W
has reached a length of n once, and is thus inappropriate because of the runtime
for real-time analysis. Another problem that arises is that many MEB algorithms
are only suitable for low dimensions, i.e. m < 10, because of a time complexity
of O(1/ε^O(m)) [25,26], while real-time streams often have higher dimensions [7,27].
Regarding space complexity, our model needs to store the points in our core-
set S as well as their corresponding radii, leading to a space complexity of O(k),
where k is the number of coreset points and is mostly below 10.

4.1 Linear Model


We are using the SWMEB+ [6] to match the rapidly changing characteristics
and time constraints of non-stationary environments. SWMEB+ can maintain a
coreset for MEB over a fixed-size sliding window. It is able to return a (9.66 + ε)-coreset
of Wt at any t; however, in almost all cases, the approximation ratio of
SWMEB+ can be improved to 3.36 + ε [6]. Furthermore, we will show in our
experiments, that despite the approximation ratio, the approximated ball is able
to represent the classes in the tested scenarios.
The SWMEB+ algorithm maintains a single sequence of s indices Xt =
{x1 , ..., xs } over the sliding window Wt at timestep t. Each index xi corresponds
to an AOMEB instance A(xi ) that processes a substream of points from pxi to
pt . S[xi , t] represents the coreset returned by A(xi ) at timestep t and B[xi , t]
centered at c[xi , t] with radius r[xi , t] for M EB(S[xi , t]).
132 M. Heusinger and F.-M. Schleif

SWMEB+ maintains the indices based on the radii of the MEBs. To also
allow shrinkage of the MEB, the following technique is applied. For any 2 > 0,
for three neighbouring indices xi , xi+1 , xi+2 , if r[xi , t] ≤ (1 + 2 )r[xi+2 , t] then
xi+2 is considered a good approximation of xi and xi+1 can be deleted. So the
radii of MEBs gradually decrease from x1 to xs , with the ratios of any two
neighbouring indices close to (1 + 2 ). Any window staring between xi and xi+1
is approximated by A(xi+1 ). SWMEB+ keeps at most one expired index, which
must be x1 in Xt to track the upper bound for the radius r∗ (Wt ) of M EB(Wt ).
The AOMEB instance corresponding to the first non-expired index (x1 or x2
provides the coreset for M EB(Wt ) [6]. The number of indices in Xt is O( log θ ).
2
Thus, the time complexity of SWMEB+ to update each point is O( m log
4
θ
) while
2
the number of points stored by SWMEB+ is O( log3 θ ),
both are independent
of n [6]. Our proposed classification algorithm at learning time has the same
complexity as SWMEB+. At prediction time, we need to perform Eq. (4) and
Eq. (5) as additional steps. In the often used test-then-train setup, we could of
course save the distance calculation from the testing step for the training step.

4.2 Kernelized Model

The kernelized model follows the same idea as our linear proposal, however, it
can handle non-linear problems by maintaining a MEB in reproducing kernel
Hilbert space (RKHS). The learning and prediction technique stays the same,
but instead of using Eq. (4) to calculate the distance from center to xi , the
following equation is applied:
n
 n


d(c , φ(q)) =2
αi αj k(c∗i , c∗j ) + k(q, q) − 2 αi k(c∗i , q) (6)
i,j=1 i=1

where k is a symmetric positive definite kernel k(·, ·) : Rm × Rm → R and


φ(·) is its associated feature mapping, where k(p, q) = φ(p), φ(q)
for any
p, q ∈ Rm . In case of symmetric non-positive definite kernel techniques from
[28] can be used. The n-dimensional Lagrange multiplier vector is denoted as
α = [α1 , ..., αn ] .
To use Eq. (5) in the kernelized version, we need to calculate δ for an arrived
xt using Eq. (6).

δ = d(c∗ , φ(xt ))2 (7)


To maintain the MEB in the RKHS, we use the kernelized generalization of
SWMEB+, for details we again refer to [6]. Theoretically, the generalized algo-
rithm has the same approximation ratio and coreset sizes as our linear version.
Only the time complexity increases by a factor of 1 , due to fact, that the time
to compute the distance between ct−1 and φ(pt ) via Eq. (6) is O( m  ) instead of
O(m) [6].
Classification in Non-stationary Environments 133

Note, that we replace the Radial Basis Function (RBF) kernel used in the
implementation of [6] by a parameter-free version, the Extreme Learning Machine
(ELM) kernel as proposed in [29], to measure the similarity between two points:

2 1 + x, y

k(x, y) = arcsin    (8)


π 1
2σ 2 + 1 + x, x
2σ1 2 + 1 + y, y

It is shown in [29], that even when changing σ of Eq. (8) the scale of the
kernel remains constant. Thus, σ is no longer a necessary hyperparameter as in
the RBF kernel. This is desirable because the iterative characteristics of data
streams, do not allow tuning hyperparameters in an offline fashion.

5 Experiments
In our experiments, we compare our proposed algorithms against the ARSLVQ
as a state-of-the-art prototype-based stream classifier. We are focusing on the
accuracy of both approaches on different stream generators as well as real-world
streams. This allows to compare the performance on streams containing different
characteristics, like different drift types and drift speed.2
We tuned our algorithm over the following parameters via a grid search. For
the grid search, we only used partial data of the stream, to match real-world
scenarios, where access to the full data is not possible at the deployment step of
a model.

Table 1. Grid parameters for MEB-based classification

Kernel function linear cosine laplacian rbf sigmoid elm


Window size 5 50 100 300 – –
Only misclassified True False – – – –
Kernelized True False – – – –
 0.01 0.05 – – – –

Table 1 shows the parameters used for the grid search. All kernel functions
are tested with default parameters of scikit-learn. The mechanism to only learn
misclassified datapoints, denoted as Only missclasified in the table, is described
in Sect. 4.  is the approximation quality of SWMEB+.
We are testing the classifiers on 4 different synthetic streams: MIXED [21],
SEA [30], LED [31] and AGRAWAL [32]. The MIXED and the SEA stream
are modified once with gradual drift and once with abrupt drift. The synthetic
streams are tested with a size of 300,000 samples, drifts are introduced at sam-
ple 150,000 and gradual drift is present for 15,000 samples. Besides the synthetic
2
Experiments can be found on https://github.com/foxriver76/meb-classifier.
134 M. Heusinger and F.-M. Schleif

streams we also test our classifiers on multiple real-world streams: Weather [33],
Electricity [21], GMSC3 , POKER [34] and Moving Squares [7]. The tests are per-
formed using 5-fold cross-validation. The ARSLVQ is used with default param-
eters of the scikit-multiflow [35] framework, except that we set the gradient
descent to Adadelta.

Table 2. Accuracy of ARSLVQ and MEB-based classification

Streams ARSLVQ MEB-classification


MIXEDa 87.98 ± 0.09 90.20 ± 0.06
SEAa 98.19 ± 0.30 85.69 ± 0.05
MIXEDg 86.36 ± 0.06 87.42 ± 0.05
SEAg 98.16 ± 0.16 85.91 ± 0.12
LED 75.13 ± 0.12 61.13 ± 0.14
AGRAWAL 51.09 ± 2.40 49.91 ± 0.99
Weather 66.21 ± 0.00 65.66 ± 0.00
Electricity 87.26 ± 0.00 79.74 ± 0.00
GMSC 85.12 ± 0.00 83.57 ± 0.00
POKER 66.73 ± 0.00 71.00 ± 0.00
Moving Squares 13.34 ± 0.00 99.84 ± 0.00
Mean 74.14 ± 0.28 78.19 ± 0.13

Table 2 shows the accuracy scores on the different streams. As we can see
there are only small differences on most of the streams. On the synthetic streams,
we can see, that ARSLVQ is performing better at LED and SEA, while our app-
roach is slightly better at MIXED and AGRAWAL. At the real-world streams,
there are again only small differences regarding the accuracy, except for the
Moving Squares dataset, where the ARSLVQ is not able to adapt fast enough
to the contained drift, while the MEB-based approach can handle it very well
with small window sizes.
In Table 3 the runtime of both algorithms on the tested streams is shown. We
can see, that ARSLVQ is always faster. Over all streams, the ARSLVQ is around
30% faster, than the MEB-based classification. Note, that we have not taken the
runtime into account for our grid search parameters. The higher runtime is due
to the tight  and the additional overhead caused by the kernelization.
Overall, our approach is very competitive to the ARSLVQ, with the advan-
tage of handling very rapidly changing distributions using a small window size.
Another advantage is, that the MEB-based approach can easily be combined
with the Minimum Enclosing Ball Window Detection (MEBWIND) algorithm
proposed in [19]. This can be done by sharing a MEB for both algorithms, and
whenever a new drift is detected via MEBWIND, a new MEB will be created.
3
https://www.kaggle.com/c/GiveMeSomeCredit.
Classification in Non-stationary Environments 135

Table 3. Runtime of ARSLVQ and MEB-based classification in seconds

Streams ARSLVQ MEB-classification


MIXEDa 129.36 ± 0.86 445.29 ± 11.32
SEAa 135.06 ± 4.03 291.39 ± 0.79
MIXEDg 134.58 ± 3.75 442.31 ± 12.02
SEAg 124.92 ± 0.83 275.74 ± 4.04
LED 395.64 ± 12.19 183.49 ± 1.36
AGRAWAL 160.65 ± 1.10 174.14 ± 1.57
Weather 7.39 ± 0.06 7.77 ± 0.04
Electricity 17.23 ± 0.21 44.98 ± 0.54
GMSC 46.55 ± 0.10 53.45 ± 0.39
POKER 398.74 ± 10.81 403.27 ± 6.22
Moving Squares 144.26 ± 0.85 150.48 ± 1.33
Mean 154.03 ± 3.16 224.76 ± 3.60

This will eliminate the need for a fixed window size n and thus reduce the number
of hyperparameters of our approach.

6 Outlook and Conclusion


In future work, we want to combine our MEB-based classifiers with the SWMEB-
WIN to get rid of the fixed-size sliding window mechanism for the classification
task. This also allows to better adapt to drifting environments by having an
ensemble model. Another future direction is to replace the preserving mechanic
when creating a new ball with an algorithm that can maintain more information
than just the mean of the previous coreset. This could be done by maintaining
a cluster in an online fashion, e.g. via DBSCAN [22].
Overall, we have proposed two MEB-based classifiers, one is a linear model
while the other is a kernelized version, which allows replaceable kernels and
thus can be used with a variety of non-linear problems. Our results have shown
that our algorithms achieve comparable results on most of the datasets as the
ARSLVQ and thus are a good alternative for the task of prototype-based stream
classification. Especially, on datasets having sliding window characteristics like
Moving Squares, our proposed method is superior to ARSLVQ. Additionally,
the ball-related characteristics can be more meaningful than classic prototypes
in scenarios, where we want to describe the borders of a class too.

References
1. Bifet, A., Gavaldà,R., Holmes, G., Pfahringer, B.: Machine Learning for Data
Streams with Practical Examples in MOA. MIT Press (2018). https://moa.cms.
waikato.ac.nz/book/
136 M. Heusinger and F.-M. Schleif

2. Atzori, L., Iera, A., Morabito, G.: The internet of things: a survey. Comput. Netw.
54(15), 2787–2805 (2010)
3. Gama, J., Zliobaite, I., Bifet, A., Pechenizkiy, M., Bouchachia, A.: A survey on
concept drift adaptation. ACM Comput. Surv. 46(4), 1–37 (2014)
4. Heusinger, M., Raab, C., Schleif, F.-M.: Passive concept drift handling via momen-
tum based robust soft learning vector quantization. In: Vellido, A., Gibert, K.,
Angulo, C., Martı́n Guerrero, J.D. (eds.) WSOM 2019. AISC, vol. 976, pp. 200–
209. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-19642-4 20
5. Straat, M., Abadi, F., Göpfert, C., Hammer, B., Biehl, M.: Statistical mechanics
of on-line learning under concept drift. Entropy 20(10) (2018)
6. Wang, Y., Li, Y., Tan, K.-L.: Coresets for minimum enclosing balls over sliding
windows. In: Proceedings of the 25th ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining, ser. KDD 2019, New York, NY, USA, pp.
314–323. Association for Computing Machinery (2019)
7. Losing, V., Hammer, B., Wersing, H.: KNN classifier with self adjusting memory
for heterogeneous concept drift. In: Proceedings - IEEE, ICDM, pp. 291–300 (2017)
8. Heusinger, M., Schleif, F.: Random projection in supervised non-stationary envi-
ronments. In: 28th European Symposium on Artificial Neural Networks, Compu-
tational Intelligence and Machine Learning, ESANN 2020, Bruges, Belgium, 2–
4 October 2020, pp. 405–410 (2020). https://www.esann.org/sites/default/files/
proceedings/2020/ES2020-13.pdf
9. Heusinger, M., Raab, C., Schleif, F.: Analyzing dynamic social media data via
random projection - a new challenge for stream classifiers. In: IEEE Conference on
Evolving and Adaptive Intelligent Systems (EAIS) 2020, pp. 1–8 (2020)
10. Bifet, A., Gavaldà, R.: Learning from time-changing data with adaptive windowing.
In: Proceedings of the Seventh SIAM International Conference on Data Mining,
Minneapolis, Minnesota, USA, 26–28 April 2007, pp. 443–448 (2007)
11. Raab, C., Heusinger, M., Schleif, F.-M.: Reactive soft prototype computing for
frequent reoccurring concept drift. In: Proceedings of the 27. ESANN 2019, pp.
437–442 (2019)
12. Raab, C., Heusinger, M., Schleif, F.-M.: Reactive soft prototype computing for
concept drift streams. Neurocomputing (2020)
13. Domingos, P.M., Hulten, G.: Mining high-speed data streams. In: Proceedings of
the Sixth ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining, Boston, MA, USA, 20–23 August 2000, pp. 71–80 (2000)
14. Bifet, A., et al.: Extremely fast decision tree mining for evolving data streams. In:
Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017, pp. 1733–
1742. ACM (2017)
15. Bifet, A., Gavaldà, R.: Adaptive learning from evolving data streams. In: Adams,
N.M., Robardet, C., Siebes, A., Boulicaut, J.-F. (eds.) IDA 2009. LNCS, vol.
5772, pp. 249–260. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-
642-03915-7 22
16. Gomes, H.M., Barddal, J.P., Enembreck, F., Bifet, A.: A survey on ensemble learn-
ing for data stream classification. ACM Comput. Surv. 50(2), 23:1-23:36 (2017)
17. Oza, N.C.: Online bagging and boosting. In: 2005 IEEE International Conference
on Systems, Man and Cybernetics, vol. 3, pp. 2340–2345 (2005)
18. Kohonen, T.: Learning vector quantization. In: Self-Organizing Maps. Springer
Series in Information Sciences, vol. 30, pp. 175–189. Springer, Heidelberg (1995).
https://doi.org/10.1007/978-3-642-97610-0 6
Classification in Non-stationary Environments 137

19. Heusinger, M., Schleif, F.: Reactive concept drift detection using coresets over
sliding windows. In: 2020 IEEE Symposium Series on Computational Intelligence,
SSCI 2020, Canberra, Australia, 1–4 December 2020, pp. 1350–1355. IEEE (2020).
https://doi.org/10.1109/SSCI47803.2020.9308521
20. Baena-Garcıa, M., del Campo-Ávila, J., Fidalgo, R., Bifet, A., Gavalda, R.,
Morales-Bueno, R.: Early drift detection method. In: Fourth International Work-
shop on Knowledge Discovery from Data Streams, vol. 6, pp. 77–86 (2006)
21. Gama, J., Medas, P., Castillo, G., Rodrigues, P.: Learning with drift detection. In:
Bazzan, A.L.C., Labidi, S. (eds.) SBIA 2004. LNCS (LNAI), vol. 3171, pp. 286–295.
Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-28645-5 29
22. Ren, J., Ma, R., Ren, J.: Density-based data streams clustering over sliding win-
dows. In: Proceedings of the 6th International Conference on Fuzzy Systems and
Knowledge Discovery - Volume 5, ser. FSKD 2009, pp. 248–252. IEEE Press (2009)
23. Zarrabi-Zadeh, H., Chan, T.M.: A simple streaming algorithm for minimum enclos-
ing balls. In: CCCG (2006)
24. Chan, T.M., Pathak, V.: Streaming and dynamic algorithms for minimum enclosing
balls in high dimensions. In: Dehne, F., Iacono, J., Sack, J.-R. (eds.) WADS 2011.
LNCS, vol. 6844, pp. 195–206. Springer, Heidelberg (2011). https://doi.org/10.
1007/978-3-642-22300-6 17
25. Agarwal, P.K., Har-Peled, S., Varadarajan, K.R.: Approximating extent mea-
sures of points. J. ACM 51(4), 606–635 (2004). https://doi.org/10.1145/1008731.
1008736
26. Chan, T.M.: Faster core-set constructions and data stream algorithms in fixed
dimensions. In: Proceedings of the Twentieth Annual Symposium on Computa-
tional Geometry, ser. SCG 2004. New York, NY, USA, pp. 152–159. Association
for Computing Machinery (2004). https://doi.org/10.1145/997817.997843
27. Gomes, H.M., et al.: Adaptive random forests for evolving data stream classifica-
tion. Mach. Learn. 106(9–10), 1469–1495 (2017)
28. Schleif, F.-M., Tino, P.: Indefinite proximity learning: a review. Neural Comput.
27(10), 2039–2096 (2015). https://doi.org/10.1162/NECO a 00770
29. Frénay, B., Verleysen, M.: Parameter-insensitive kernel in extreme learning for
non-linear support vector regression. Neurocomputing 74(16), 2526–2531 (2011).
https://doi.org/10.1016/j.neucom.2010.11.037
30. Street, W.N., Kim, Y.: A streaming ensemble algorithm (sea) for large-scale clas-
sification. In: Proceedings of the Seventh ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, ser. KDD 2001. New York, NY, USA,
pp. 377–382. ACM (2001). http://doi.acm.org/10.1145/502512.502568
31. Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A.: Classification and Regression
Trees. CRC Press, Boca Raton (1984)
32. Agrawal, R., Imielinski, T., Swami, A.: Database mining: a performance perspec-
tive. IEEE Trans. Knowl. Data Eng. 5(6), 914–925 (1993)
33. Elwell, D., Klink, J., Holman, J., Sciarini, M.: Ongoing experience with ohios
automatic weather station network. Appl. Eng. Agricult. 9, 437–441 (1993)
34. Bifet, A., Pfahringer, B., Read, J., Holmes, G.: Efficient data stream classification
via probabilistic adaptive windows. In: Proceedings of the 28th Annual ACM Sym-
posium on Applied Computing, ser. SAC 2013. New York, NY, USA, pp. 801–806.
ACM (2013). http://doi.acm.org/10.1145/2480362.2480516
35. Montiel, J., Read, J., Bifet, A., Abdessalem, T.: Scikit-multiflow: a multi-output
streaming framework. J. Mach. Learn. Res. 19(72), 1–5 (2018)
Deep Reinforcement Learning
in VizDoom via DQN and Actor-Critic
Agents

Maria Bakhanova1 and Ilya Makarov1,2(B)


1
HSE University, Moscow, Russia
[email protected]
2
Artificial Intelligence Research Institute, Moscow, Russia

Abstract. In this work, we study the problem of learning reinforcement


learning-based agents in a first-person shooter environment VizDoom.
We compare several well-known architectures, such as DQN, DDQN,
A3C, and Curiosity-driven model, while highlighting the main differences
in learned policies of agents trained via these models.

Keywords: Deep reinforcement learning · DQN · A3C · A2C ·


VizDoom

1 Introduction
Reinforcement learning is a popular machine learning technique for develop-
ing agents that can make a sequence of decisions preserving pre-defined goals.
Reinforcement learning has been very close to video game development. On the
one hand, games are the perfect environment for training agents and testing
new methods. On the other hand, reinforcement learning bots are useful for
automatic game testing. In this research, we focus on reinforcement learning in
first-person (FPS) shooter games.
FPS is a type of shooter game where the player controls the game from the
first-person point of view and tries to achieve some goal, such as fight enemies,
collect items or navigate. For reinforcement learning experts, 3D FPS video
games are a great environment where they can solve relevant task-oriented prob-
lems.
There are many difficulties in modeling an agent’s behavior. One of them is
a limitation of knowledge of the current state of the environment: the agent does
not know the exact position of an enemy and the state is partially observable
because of the limited angle of view. However, the most challenging part is highly
delayed and thinly dispersed rewards: the agent may be rewarded for the action
it performed several minutes ago. Modern reinforcement learning methods aim
to solve all these problems, and they have already made great strides in this.
To conduct reinforcement learning experiments on games, it is necessary to
use a special technical environment that would draw out data for modeling, such
c Springer Nature Switzerland AG 2021
I. Rojas et al. (Eds.): IWANN 2021, LNCS 12861, pp. 138–150, 2021.
https://doi.org/10.1007/978-3-030-85030-2_12
Deep Reinforcement Learning in VizDoom via DQN and Actor-Critic Agents 139

as screenshots, player’s health and resources. Many previous studies developed


reinforcement learning agents in a video game console Atari 2600 [7].
However, Atari games take place in 2D environments that are not fully observ-
able to agents. An alternative 3D environment is ViZDoom [14]. It is a Doom-
based AI Research Platform that allows developing agents that play Doom using
only the visual information (the screen buffer). ViZDoom is considered to be the
most popular software because of the variety of maps and its technical parame-
ters.
The goal of this research consists of two parts. The first part is to make a
complete overview of existing reinforcement learning methods applied to Doom.
The second part includes training of reasonable and balanced agents with the
most famous techniques in ViZDoom: classical Q-learning-based methods: Deep
Q-Network (DQN) [4] and Double Dueling Deep Q-Network [8], and Actor-Critic
methods: A3C and its extension with Curiosity. The agents are expected to be
adaptable to the dynamic changes inside the environment on ’defend the center’
scenario. We also provide a comparison of methods by evaluating agents’ play
over 100 episodes.
The paper is structured as follows: Sect. 2 (Literature review) includes the
overview of the methods applied in ViZDoom; in Sect. 3 (Methods) we provide
an essential theoretical basis for experiments; Sect. 4 (Experiments) is devoted
to experiments description and obtained results; the last Sect. 5 (Conclusion)
provides a summary of the work.

2 Literature Review

Initial reinforcement learning studies have almost exclusively focused on Q-


learning-based algorithms [27]. Q-learning method itself is popular because of its
simplicity and effectiveness on several tasks. However, it has a significant limita-
tion: convergence can be reached only if the state space of an agent is finite.
A huge contribution to reinforcement learning has been made when Deep-
mind developed Deep Q-Network (DQN) [4]. It is an extension of Q-Learning
with deep neural networks and Experience Replay technique [25] that helps to
store observations in a replay buffer for further use.
DQN has become a starting point for many other algorithms. The most
famous modifications are Deep Recurrent Q-Network (DRQN) [9] and Double
Deep Q-Network (DDQN) [8]. They have shown great scores in Atari envi-
ronment, however, they usually result in long training time and instability
[12,13,19].
One of the crucial improvements of DQN is Prioritized Experience Replay
[25] that stabilizes the training process and reduces training time. The concept
is to sample examples from non-uniform distribution taking into account the
current loss of each example.
Other stabilizing techniques were introduced in Rainbow paper [10]. These
improvements helped to achieve state-of-art performance on the Atari 2600
benchmark in terms of the final score and memory efficiency.
140 M. Bakhanova and I. Makarov

While Atari environment uses a 2D world, ViZDoom [14] allows to simulate


first-person shooter similar to our previous studies in the field [2,3,21,22]. This
3D environment uses screenshots of the first-person perspective in a 3D world.
According to the original paper [14], “ViZDoom is lightweight, fast, and highly
customizable via a convenient mechanism of user scenarios”.
Many DQN-based algorithms were tested in ViZDoom deathmatch scenario
during organized ViZDoom AI competitions [28]. The task was to develop an
agent who’s aim is to maximize the number of killed opponents and to mini-
mize the number of gained damages. Despite the fact that all agents were quite
competent, no bot achieved human-like performance. Bots mainly focused on
finding targets and shooting but they did not pay much attention to defending
themselves and navigation.
Recent studies proposed DQN architectures that significantly outperformed
previous methods. For example, in [1] the authors applied deep reinforcement
learning algorithms for partially observable Markov decision processes combined
with DQN. They proposed several agents that appeared to show a triple increase
in average reward compared to DRQN with stabilizing tricks.
Another branch of algorithms known along with DQN is Actor-Critic meth-
ods, such as Actor Advantage Critic (A2C) and Asynchronous Advantage Actor-
Critic (A3C). These methods use two neural networks (actor and critic) that
interact with each other. One of the successful extensions of such methods is
Actor-Critic with Kronecker-factored trust region (ACKTR) [26]. ACKTR opti-
mizes both the actor and the critic using Kronecker-factored approximated cur-
vature and seems to outperform the basic A2C agent in ViZDoom in terms of
average reward, kill counts, and stability.
In this study, we conduct experiments with the following methods: DQN,
Double Q-learning with Dueling Architecture, A3C and A3C Curiosity. For DQN
and Dueling DQN, we use Prioritized Experience Replay buffer for sampling
batches for training.

3 Methods
3.1 Background

The main assumption of standard reinforcement learning is that we have an


agent that interacts with an environment using a discrete number of steps.
Suppose at time t the agent receives its current state st from the environment.
The agent can perform an action at according to its policy π(a|s) = P (a|s) and
get a reward rt . When the action at happens, the environment updates the state
to st+1 and sends a reward rt to the agent that indicates the value of transition
from st to st+1 . This process is called Markov Decision Process (MDP) denoted
as a group (S, A, R, T ), where S is a finite state space, A is a finite action space,
R is a reward function, which maps pair (s, a) ∈ (S, A) into stochastic reward r,
and T is a transition kernel: T (s, a, s ) = P (St+1 = s |St = s, At = a).
Deep Reinforcement Learning in VizDoom via DQN and Actor-Critic Agents 141

The discounted return from state st is defined as follows:




Rt = γ i rt+i , (1)
i=0

where γ ∈ (0, 1) is a discount factor that determines the importance of future


rewards. Usually, γ is chosen manually. For example, if the goal is to stimulate
short-term rewards then γ should be close to 0.
The main goal of the agent is to maximize its rewards during the game
by developing policy π. The policy π ∗ is optimal when it maximizes the value
function. The value function can be considered as a prediction of future rewards.
The value function of a particular state is the total reward that an agent can
obtain over the future, starting from this state. There are two types of value
functions: state-value function and action-value function.
We call V π (s) the state-value function for policy π, and we set Qπ (s, a) to
be the action-value function for policy π.

V π (s) = E[Rt |st = s], Qπ (s, a) = Eπ [Rt |st = s, at = a]. (2)

The relationship between V π (s) and Qπ (s, a) is the following:



V π (s) = π(a|s)Qπ (s, a). (3)
a∈A

In MDP, there are many different value functions corresponding to differ-


ent policies. The optimal value function yields maximum value among all other
value functions. The optimal state-value and action-value functions are defined
as follows:
V ∗ (s) = maxV π (s), Q∗ (s, a) = maxQπ (s, a). (4)
π π

3.2 Q-learning
Q-learning [27] is a standard model-free method that can be used to obtain an
approximation of optimal policy.

πopt = argmaxEπ (Rt ). (5)


π

We have already defined Qπ as a function for a given state-action pair with


policy π:
Qπ (s, a) = Eπ [Ri |si = s, ai = a]. (6)
In other words, Q is an expected reward that can be obtained by an agent
started from state s, with action a and policy π. We obtain the optimal policy
using Bellman equation:

Q∗ (s, a) = max(Eπ [Ri |si = s, ai = a]) = E(r(s, a) + γ × max Q∗ (s , a )). (7)
π  a
142 M. Bakhanova and I. Makarov

The authors of [27] proved that this method converges from arbitrary Q
to optimal Q∗ if state and action spaces are finite and each pair state-action
presents more than one time. In the case of FPS game, we have large state and
action spaces, therefore direct and even dynamic programming approaches are
impractical because it is hard for an agent to explore the whole environment. To
experience the environment, we usually use the epsilon-greedy strategy: we take
an action with probability  or choose the best action with probability 1 − . It
is suggested to set initial  close to 1 and decrease it while training.

3.3 Deep Q-Network (DQN)

If the state space is infinite then deep neural networks can be used to approxi-
mate Q-function. DQN was introduced and tested on classic Atari 2600 games
in [4]. This neural network takes an image as an input and outputs Q-values per
action. The authors proposed to extract features from an image using a convolu-
tional neural network (CNN) and stack two fully connected layers additionally
to approximate Q-function.
In this method, instead of (7) we compute Temporal Difference error (TD):

T D = Q(si , ai ) − (ri + γ × max



Q(si , a )). (8)
a

The task is to minimize the square of TD. The term Q(si , ai ) is estimated
using an online network with parameters θ which is trained via backpropagation,
while max

Q(si , a ) is estimated with target network with parameters φ. The
a
weights of the target network are fixed as a constant during the online network
training and are updated periodically. Each pair (si , ai ) is a sample that is be
propagated through the network. The loss is defined as follows:

L= (Q(si , ai ; θ) − (ri + γ × max

Q(si , a ; φ)))2 . (9)
a
i

DQN has shown great scores in Atari environment, however, it has some dis-
advantages, such as long training time and overestimation of the value function.
In order to handle these issues, we use a combination of two extensions from
[10]: Dueling Architecture and Double Learning

3.4 Double Deep Q-Network with Dueling Architecture (Dueling


DDQN)

Double DQN [8] was proposed to overcome the problem of DQN which is the
overestimation of action values under certain conditions. This problem comes
from the max operator in (9) [1]. In Double DQN, we replace max

Q(si , a ) with
a
Q(si , argmaxQ(si , a ); θ). Double Q-learning occurred to reduce the observed
a
overestimations and to perform much better in several games than DQN. Nev-
ertheless, there was a further improvement—the Dueling Architecture.
Deep Reinforcement Learning in VizDoom via DQN and Actor-Critic Agents 143

Dueling network [6] consists of two streams: the first stream estimates state-
value function V (s), the second one estimates the action-advantage function
A(s, a). The outputs of these streams are aggregated to output Q(s, a). The
advantage function can be defined through the following formula:

Q(s, a) = A(s, a) + V (s). (10)


The value stream aims to estimate state values, which is important because
states are different. At the same time, the advantage stream estimates the advan-
tage of taking a particular action in a specific state.
Let A be an action space. The aggregation of the outputs of two streams is
defined as
1 
Q(s, a) = V (s) + (A(s, a) − A(s, a )). (11)
|A| 
a ∈A
Both techniques Double and Dueling can be combined into one single network
- Double Deep Q-Network with Dueling Architecture (Dueling DDQN) [11].

3.5 Prioritized Experience Replay


Usually experiences are described as transitions (st , at , st+1 , rt ). These transi-
tions are memorized during the game in a replay buffer. When we have enough
experiences (usually the number that is higher than the batch size), we sample
batch from the buffer and perform training of neural network via backpropaga-
tion. We can sample batches uniformly, however it seems to be more efficient to
sample transitions with respect to their “importance” because some transitions
may help the network learn better than others. One of the approaches to sample
batches with some priority rule is Prioritized Experience Replay (PER) [25]. In
PER, we use the value of TD error yielded by the transition to construct the
probability of sampling this particular transition. There are two ways to do this.
The first method is called Proportional prioritization. Let i be the index
of transition and  is a small number that ensures non-zero probability even if
T Di = 0. Then the probability of sampling is pi = T Di + .
The second approach is called Rank-based prioritization. Let α be a prioriti-
zation strength coefficient, rank(i) be the rank of transition i when the replay
buffer is sorted according to T D. Then the probability pi and final probabilities
Pi are calculated as

1 pα
pi = , Pi =  i α . (12)
rank(i) k pk

However, this approach may result in a bias towards the frequently sampled
experiences. Therefore, to avoid this bias we use importance sampling weights
wi = ( N1 P1i )β , which means that for a transition i the network will make a
backward pass on wi T Di instead of T Di with β coefficient annealed from 0 to 1
over the episodes. It makes frequently used experiences influence a lot more the
network weights updates at the beginning of training.
144 M. Bakhanova and I. Makarov

3.6 Actor-Critic Methods

The idea of Actor-Critic methods [15] is to combine both policy-based and value-
based methods. Actor-Critic algorithm is based on two models. The first model
(actor) chooses an action based on state, so it learns optimal policy. The sec-
ond model (critic) estimates Q-values for actions by computing value function.
Both models interact with each other during training. Actor-Critic methods have
proven to be effective in complex environments of 2D and 3D games, such as
Doom and Atari [5,26].

Advantage Actor-Critic (A2C). In A2C, the actor learns the policy and
conducts actions in the environment. The critic learns advantage function instead
of Q-values, estimates how good the action is and sends this feedback to the actor.
The advantage function seems to reduce the high variance of policy networks and
stabilize the model.
The state-value function V (st ) is produced by the network, and we can
approximate action-value function Q(st , at ) and derive the equation for advan-
tage as follows:

A(st , at ) = Q(st , at ) − V (st ) = rt+1 + γV (st+1 ) − V (st ). (13)


Critic loss is simply an MSE between TD target and current state value, so
it is the second power of the advantage. To calculate actor (categorical) loss,
we calculate probabilities for each possible action and compute negative log-
likelihood loss.

Asynchronous Advantage Actor-Critic (A3C). A3C was proposed by


DeepMind [5]. It is considered to be a fast and robust method. A3C differs from
A2C only in the Asynchronous part. A3C is a set of independent Actor-Critic
neural networks (workers) that work in different instances of the environment,
so the training process can be easily parallelized.
A3C also includes a global network whose weights are periodically updated
by workers. The Asynchronous part implies that the updates are not happening
at the same time but in an asynchronous manner. The workers also reset their
weights to those of the global network and continue their training.

Curiosity-Driven Exploration. Curiosity-driven approach was proposed in


[24]. The idea is to develop a bot that can build its own intrinsic reward function
and learn by itself. An agent can explore the environment by discovering different
unfamiliar states. In this method, we aim to develop a reward function that is
generated by the agent itself. According to the original paper [24], curiosity is
an “error in an agent’s ability to predict the consequence of its own actions
in a visual feature space learned by a self-supervised inverse dynamics model”.
Shortly, it is an intrinsic reward that is equal to the error prediction of the next
state st+1 , given the current state st and action at .
Deep Reinforcement Learning in VizDoom via DQN and Actor-Critic Agents 145

In order to calculate this error, we use images to learn an embedding func-


tion φ of the states. We also assume that unfamiliar next states st+1 that can
be achieved from some state st will have the same representation. In [24], the
modeling is done with co-training of the inverse dynamics and forward dynam-
ics models. In the inverse dynamics model, we predict action at using φ(st ) and
φ(st+1 ). In the forward dynamics model, we predict φ(st+1 ) using at and φ(st ).

4 Experiments
We applied the above methods on the one scenario of Doom environment: Defend
the center. All experiments were conducted in Python 3.7 and PyTorch library.
Code is available at https://github.com/MariBax/deep rl vizdoom.

4.1 Defend the Center

The description is partially taken from ViZDoom GitHub page. The purpose of
this scenario is to teach the agent that killing the monsters is good and when
monsters kill you is bad. In addition, ammunition is limited. By default, the
agent is rewarded (+1) for killing monsters. In our implementation, we also
added penalties: the agent is penalized for missing (−0.2) and health decrease
(-0.2).
The map is a large circle. The player is always located in the center. There are
3 available buttons: turn left, turn right, shoot (attack). 5 melee-only, monsters
are spawned along the wall. Monsters are killed after a single shot. After dying
each monster is respawned after some time. The episode ends with player death.

4.2 Screen Processing

The input image of the screen is processed as follows: we resize it from (360, 480)
to (60, 80). In order to create the notion of motion, we stack these processed
images (frames) to get the tensor of size (stack size = 4, 60, 80). We did not
use depth estimation [16–18,23] for modeling states or graph embeddings [20]
for state transitions to be consistent with other studies, leaving this for future
work.

4.3 Model Parameters

DQN and Dueling DDQN. The architecture for DQN and Dueling DDQN
is similar. The model consists of a convolutional block followed by a linear
block. DQN and Dueling DDQN have the same convolutional block that can
be described as follows:

– Conv 32 filters, kernel size = 8, stride = 4.


– Conv 64 filters, kernel size = 4, stride = 2.
– Conv 64 filters, kernel size = 3, stride = 3.
146 M. Bakhanova and I. Makarov

Fig. 1. DQN and DDDQN training losses, number of episodes = 100.

The description of fully connected layer for DQN:

– Linear, output size = 512.


– Linear, output size = number of possible actions.

Linear block for Dueling DDQN:

– Linear for advantage, output size = 512.


– Linear for advantage, output size = number of possible actions.
– Linear for value, output size = 512.
– Linear for value, output size = 1.

Fig. 2. A3C (red) and A3C Curiosity (pink) training loss. (Color figure online)
Deep Reinforcement Learning in VizDoom via DQN and Actor-Critic Agents 147

The models were trained with RMSprop optimizer with a learning rate of
0.0025. For each model, we used Prioritized Experience Replay buffer with a
capacity of 1000. The batch size is set to 64. Each model was trained on 100
episodes. Each episode consists of a different number of iterations since iterations
depend on episode finishing time. The approximate number of iterations per
episode is 100. Loss plots are shown on Fig. 1.

A3C and A3C Curiosity. The architecture of Actor-Critic worker is two


convolutional layers followed by fully connected layer, LSTM network and two
(for advantage and action) fully connected layers:
– Conv 16 filters, kernel size = 8, stride = 4.
– Conv 32 filters, kernel size = 4, stride = 2.
– Linear, output size = 256.
– LSTM, output size = 256.
– Linear for action, output size = number of possible actions.
– Linear for advantage, output size = 1.
Adam optimizer was shared between workers with a learning rate of 0.0001.
The Curiosity model was trained with Adam optimizer and learning rate 0.001.
A3C model was trained during 1000 episodes, A3C Curiosity was trained during
600 episodes. Loss plots are shown on Fig. 2. The moving average of kill count
is depicted in Fig. 3.

Fig. 3. A3C (red) and A3C Curiosity (pink) moving average of kill counts. (Color figure
online)

4.4 Results
The loss plots prove the fact that training all networks is a very unstable pro-
cedure: the loss graph oscillates. However, there is a convergence trend in DQN
and Dueling DQN. We believe that this is because of PER that stabilizes train-
ing. Moreover, there is an interesting situation with A3C methods: policy loss
decreases while value loss increases. This indicates a convergence problem in the
148 M. Bakhanova and I. Makarov

method, and we believe that a more detailed selection of parameters may fix this
issue. Even though there was a problem with loss in A3C, this method outper-
formed all other methods: after 700 episodes the average kill count drastically
grew up to 16–20. Unfortunately, A3C with Curiosity did not perform well: even
Dueling DDQN performed better. The issue may be in parameter choice and the
difficulty of tuning.
We tested our agents to measure their kill counts per episode. Each agent
played 100 game episodes. The results are presented in Table 1. We also tested
Random agent taking random action disregarding any policy.

Table 1. Kill count statistics after 100 episodes of playing.

Model Mean Std Max


Random 0.14 0.82 3.0
DQN 1.80 1.08 5.00
Dueling DDQN 6.23 1.60 10.00
A3C 16.09 2.40 21.00
A3C curiosity 3.54 1.12 5.00

5 Conclusion and Future Plans

In this study, we presented an overview of reinforcement learning methods


applied to Doom FPS video game and trained the most popular methods - DQN
with PER, Dueling DDQN with PER, A3C, and A3C Curiosity. The results show
that the training of models is a very unstable procedure that requires stabilizing
tricks and a laborious selection of parameters. Nevertheless, we were able to train
the agent that achieves 16 average kills per episode. We believe that the results
may be improved with fine-tuning of parameters and using other stabilization
techniques. The future plans are to train and test agents in different scenarios:
deadly corridor (navigation) and health gathering; add more stabilizing tools:
Multi-step learning, Noisy Networks and C51 described in Rainbow paper [10].

References
1. Akimov, D., Makarov, I.: Deep reinforcement learning with vizdoom first-person
shooter. In: CEUR Workshop Proceedings, vol. 2479, pp. 3–17 (2019)
2. Makarov et al., I.: First-person shooter game for virtual reality headset with
advanced multi-agent intelligent system. In: Proceedings of the 24th ACM Inter-
national Conference on Multimedia, pp. 735–736 (2016)
3. Makarov et al., I.: Modelling human-like behavior through reward-based approach
in a first-person shooter game. In: EEML Proceedings (2016)
4. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature
518(7540), 529–533 (2015). https://doi.org/10.1038/nature14236
Deep Reinforcement Learning in VizDoom via DQN and Actor-Critic Agents 149

5. Mnih, V., et al.: Asynchronous methods for deep reinforcement learning. CoRR
abs/1602.01783 (2016)
6. Wang et al., Z.: Dueling network architectures for deep reinforcement learning.
In: Balcan, M.F., Weinberger, K.Q. (eds.) Proceedings of The 33rd International
Conference on Machine Learning, New York, New York, USA (2016). Proceedings
of Machine Learning Research, vol. 48, pp. 1995–2003. PMLR. 20–22 Jun 2016
7. Bellemare, M.G., Naddaf, Y., Veness, J., Bowling, M.: The arcade learning envi-
ronment: an evaluation platform for general agents. CoRR abs/1207.4708 (2012)
8. van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double
q-learning. CoRR abs/1509.06461 (2015). http://arxiv.org/abs/1509.06461
9. Hausknecht, M.J., Stone, P.: Deep recurrent q-learning for partially observable
MDPS. CoRR abs/1507.06527 (2015)
10. Hessel, M., et al.: Rainbow: combining improvements in deep reinforcement learn-
ing. CoRR abs/1710.02298 (2017)
11. Huang, Y., Wei, G., Wang, Y.: V-d d3qn: the variant of double deep q-learning net-
work with dueling architecture. In: 2018 37th Chinese Control Conference (CCC),
pp. 9130–9135 (2018)
12. Kamaldinov, I., Makarov, I.: Deep reinforcement learning in match-3 game. In:
2019 IEEE Conference on Games (CoG), pp. 1–4 (2019)
13. Kamaldinov, I., Makarov, I.: Deep reinforcement learning methods in match-3
game. In: Van der Aalst, W.M.P., et al. (eds.) AIST 2019. LNCS, vol. 11832, pp.
51–62. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-37334-4 5
14. Kempka, M., Wydmuch, M., Runc, G., Toczek, J., Jaskowski, W.: ViZDoom:
a doom-based AI research platform for visual reinforcement learning. CoRR
abs/1605.02097 (2016)
15. Konda, V., Tsitsiklis, J.: Actor-critic algorithms. Soc. Ind. Appl. Math. 42 (2001)
16. Korinevskaya, A., Makarov, I.: Fast depth map super-resolution using deep neu-
ral network. In: 2018 IEEE International Symposium on Mixed and Augmented
Reality Adjunct (ISMAR-Adjunct), pp. 117–122. IEEE (2018)
17. Makarov, I., Aliev, V., Gerasimova, O.: Semi-dense depth interpolation using deep
convolutional neural networks. In: Proceedings of the 25th ACM International
Conference on Multimedia, pp. 1407–1415 (2017)
18. Makarov, I., Aliev, V., Gerasimova, O., Polyakov, P.: Depth map interpolation
using perceptual loss. In: 2017 IEEE International Symposium on Mixed and Aug-
mented Reality (ISMAR-Adjunct), pp. 93–94. IEEE (2017)
19. Makarov, I., Kashin, A., Korinevskaya, A.: Learning to play pong video game via
deep reinforcement learning. In: AIST (Supplement), pp. 236–241 (2017)
20. Makarov, I., Kiselev, D., Nikitinsky, N., Subelj, L.: Survey on graph embeddings
and their applications to machine learning problems on graphs. PeerJ Comput. Sci.
7, e439 (2021)
21. Makarov, I., Polyakov, P.: Smoothing Voronoi-based path with minimized length
and visibility using composite Bezier curves. In: AIST (Supplement), pp. 191–202
(2016)
22. Makarov, I., Tokmakov, M., Tokmakova, L.: Imitation of human behavior in 3d-
shooter game. In: AIST 2015 Analysis of Images, Social Networks and Texts, p. 64
(2015)
23. Maslov, D., Makarov, I.: Online supervised attention-based recurrent depth esti-
mation from monocular video. PeerJ Comput. Sci. 6, e317 (2020)
24. Pathak, D., Agrawal, P., Efros, A.A., Darrell, T.: Curiosity-driven exploration by
self-supervised prediction. CoRR abs/1705.05363 (2017)
150 M. Bakhanova and I. Makarov

25. Schaul, T., Quan, J., Antonoglou, I., Silver, D.: Prioritized experience replay (2015)
26. Shao, K., Zhao, D., Li, N., Zhu, Y.: Learning battles in VIZDoom via deep rein-
forcement learning. In: 2018 IEEE Conference on Computational Intelligence and
Games (CIG), pp. 1–4 (2018)
27. Watkins, C.J.C.H., Dayan, P.: Technical note: Q-learning. Mach. Learn. 8(3), 279–
292 (1992)
28. Wydmuch, M., Kempka, M., Jaskowski, W.: Vizdoom competitions: playing doom
from pixels. CoRR abs/1809.03470 (2018), http://arxiv.org/abs/1809.03470
Adaptive Ant Colony Optimization for
Service Function Chaining in a Dynamic
5G Network

Segundo Moreno and Antonio M. Mora(B)

Departamento de Teorı́a de la Señal, Telemática y Comunicaciones, ETSIIT-CITIC,


Universidad de Granada, Granada, Spain
[email protected], [email protected]

Abstract. 5G Networks are strongly dependent on software-based man-


agement and processing. Services offered inside this environment are
composed of several Virtual Network Functions (VNFs) that must be
executed in a (normally) strict order. This is known as Service Function
Chaining (SFC) and, given that those VNFs could be placed in differ-
ent nodes along the network together with the expected low latency in
the processing of 5G services, makes SFC a tough optimization problem.
In a previous work, the authors presented an Ant Colony Optimiza-
tion (ACO) algorithm for the minimization of the routing cost of service
chain composition, but it was a preliminary approach able to solve sim-
ple and ‘static’ instances (i.e. network topology is invariable). Thus, in
this work we describe an evolution of our previous proposal, which con-
sider a dynamic model of the problem, closer to the real scenario. So,
in the instances nodes and links can be removed suddenly or, on the
contrary, they could arise. The ACO algorithm will be able to adapt to
these changes and still yield optimal solutions. The Adaptive Ant-SFC
method has been tested in three dynamic instances with different sizes,
obtaining very promising results.

1 Introduction

Current network technologies are focused on the incoming gigantic demand,


both, regarding the number of devices connected to them, and also to the exigent
low latency and high bandwidth requirements. 5G networks have been designed
to cope with these new features being able to flexibly serve the needs of users.
Thus 5G networks will be deployed onto novel enabler technologies provid-
ing the realization of a virtualized, programmable and flexible network. Two of
these technologies are Software Defined Networks (SDNs) and Network Func-
tion Virtualization (NFV). SDN aims to separate traffic forwarding and pro-
cessing, based on the automatising of some network management operations.
NFV is based on virtualization technology to run software-implemented net-
work functions. These two technologies are combined in the so-called Service
Function Chaining (SFC), which aims to dynamically establish new network
c Springer Nature Switzerland AG 2021
I. Rojas et al. (Eds.): IWANN 2021, LNCS 12861, pp. 151–164, 2021.
https://doi.org/10.1007/978-3-030-85030-2_13
152 S. Moreno and A. M. Mora

services through a set of virtual functions, that must be run (normally) in an


specific order [1].
This work addresses the optimal composition of these chains of services. So,
given a network graph in which each node can run different virtual functions; and
considering an amount of resources per node, the resources a function requires,
the bandwidth on the links, and the desired composition order; the objective is
to construct the optimal and valid path corresponding to a network service, that
minimizes the routing cost (i.e. the number of hops between nodes).
This is an NP-hard problem, which the authors solved with a first approach
based in an Ant Colony Optimization algorithm (ACO) [2]. ACO algorithms
[3] are inspired in the behaviour of natural ants when they search for food to
solve combinatorial optimization problems, so they use a colony of artificial
ants, which are computational agents that communicate with each other using
a pheromone matrix. These agents work on problems formulated on a graph
with weights in its arcs. Every iteration, each ant will build a complete path
(solution) moving through it. Once the path is built (or during its construction),
the ant will deposit a trail of pheromone that, generally, will be related with the
goodness of the solution. Therefore, this trail will be a measure (informative for
others) of how desirable it is to follow the same route as the aforementioned ant.
Thus, we developed in [2] a variation of ACO baptised as Ant-SFC, inspired in
the simpler model, the Ant System. However, the scenarios where the algorithm
was tested, even if they were pretty similar to real instances in several aspects,
had a big flaw: an absence of dynamism. Given that networks are systems in
continuous change, i.e. nodes (and also links) arise or disappear constantly, this
behaviour must be incorporated into the instances to solve SFC.
To this end, the present work enhances both the problem definition and
modeling - making it closer to reality -, as well as the algorithm to address
it. Thus, dynamic instances have been considered, in which some events might
happen, so nodes or links could suddenly be activated or deactivated. Then, the
ACO approach has been transformed into a Dynamic Ant-SFC (DAnt-SFC),
able to adapt its behaviour to find optimal solutions in these changing scenarios.
New problem instances have been defined to test the proposed algorithm,
including all of them the aforementioned events; and a more complex scenario -
with 52 nodes - has been also considered. In the experiments, we have analysed
the adaptation capability of the algorithm and its performance in re-building
optimal solutions, as well as the study on the possibility of self-adaptation or
recovery that is offered to the network in those specific events.

2 The Problem: Dynamic Minimum Routing Cost


for SFC

This work is focused in the composition of a Service Function Chain (SFC) in a


network. SFC composition process is one of the major challenges in NFV because
both the path computation and the traffic steering are involved in this task.
Adaptive Ant Colony Optimization 153

Moreover, due to the SFC properties, the flow routes are defined as a chain-
ordered set of service functions (SF) that handles the traffic of the delivery,
control, and monitoring of a specific service/application [4].
In this environment and, in order to improve the performance and save
resources in the network, an optimal strategy is required. Thus, we have baptised
this problem as Optimization of Routing for SFC (OR-SFC). In this process, it
will be necessary to determine the path that the data should follow between the
adjacent virtual network functions for each of the services requested. Figure 1
shows an example instance of this problem.

Fig. 1. Example of instance for OR-SFC problem

In it, the user’s petition is named connection, and is defined by a tuple,


C=(origin, destination, value of the demand, [functions to execute]), in which
the connection has as origin node ‘1’ and as destination the node ‘6’, with a traffic
demand of 2 Mbps. If we look at the graph, the value close to the links refers to
the corresponding bandwidth available in each link. In every node, there have
been indicated the network functions that it can execute (cubes). Besides that,
each of the nodes will have associated some computing resources (CPU, Memory,
disk space) summarised into a single number for simplicity. In the same way, in
the problem definition there will be a list associating a cost in resources to every
function, as its requirements to be executed. The path must meet additional
restrictions: links must have enough bandwidth to cover the demand, and the
nodes must have enough resources available to run the functions.
In addition, a dynamic component is added to the problem. This dynamic
component will try to simulate a more realistic behavior of the network, by
giving it the chance of modification during problem resolution. It is related with
the possibility of real-time changes in the network topology, i.e. nodes or links
could be suddenly activated or deactivated without an expected reason (as it
could happen in real networks).

3 State of the Art

Service Function Chaining is one of the main challenges in NFV, that has to
be addressed as an NP-Hard optimization problem. Thus, it has attracted the
154 S. Moreno and A. M. Mora

attention of the academia, proposing different solutions to solve it, but mainly
focused on exact or heuristic solutions, and just a few of the proposals use
advanced computational intelligence methods, such as metaheuristics.
Regarding the exact approaches to solve the OR-SFC problem, most of them
are focused on optimization models based on Linear Programming techniques.
For example, the authors in [1] presented a model that solves the SFC routing
and the virtual function allocation for the peak hour intervals.
There are also heuristic algorithms for solving this kind of problems. These
heuristics are useful to find accurate feasible solutions for larger instances of the
problem within a reduced computation time. Greedy algorithms are widely used
for this purpose, as it is made in [5]. In general, heuristics are shown to produce
approximate solutions close to optimal and are appropriate when large instances
need to be solved. In these cases, heuristics provide an optimal tradeoff between
the validity of the solution and the computation cost [6].
Metaheuristics on the contrary have not been applied too much in this scope.
Even if OR-SFC problem fits well to the Ant Colony Optimization technique,
to our knowledge, only our previous work [2] has been focused on solving this
problem. Other approaches in the literature face instead the so-called Resource
Allocation in NFV, such as [7] where the authors apply Tabu Search; or [8] in
which the authors address the placement optimization of VNF in network nodes
considering the energy consumption in servers, applying a Genetic Algorithm.
Within this scope, some works try to solve this problem by applying an
approach from a dynamic point of view, where the network topology can change.
This is the case in [9] where different ant colonies are adopted simultaneously
to favor exploration in dynamic networks, largely avoiding stagnation. Another
case is given in [10], where it is used an ACO-based algorithm to solve dynamic
anycast routing and wavelength assignment in optical networks, offering blocking
probability reductions and improvements over other methods. However, again,
none of the proposals are focused on the resolution of SFC problem.

4 Dynamic Ant-SFC Algorithm


The algorithm implemented to address the dynamic version of the problem is an
‘evolution’ of our previous Ant-SFC [2], i.e. an adaptation of the classical AS [3]
to solve the standard SFC problem. Thus, it will cover the resolution of optimal
paths in a telecommunication network, complying with the following restrictions
(see Sect. 2):

– A path should be defined in the graph that models the network for each
connection resolution (service request). It must pass through available nodes
that can serve each one of the required network functions, in the given order
(indicated by each service).
– Each link must have enough capacity (available bandwidth) to be able to
satisfy the traffic demand of each connection. This capacity decreases every time
a path passes through the link.

– Nodes, like links, must have enough resources available to execute the required
functions. Node resources decrease according to each function's requirements.
These restrictions must be met even considering that nodes or
links could suddenly be activated or deactivated, since the problem instances are
dynamic. This algorithm has been named Dynamic Ant-SFC, or just DAnt-SFC.
The implemented dynamism is intended to add an even more realistic com-
ponent to the simulated networks. It is achieved through the removal or
introduction of nodes and/or links at certain moments of the execution, making
the algorithm capable of recovering from events such as the collapse of main nodes.
In this way, the algorithm receives some tuples inside an events file, as a
'controlled' and simplified modelling of that dynamism.
Each time one of these tuples is processed, the dynamism part of
the algorithm is activated automatically to remove or add the corresponding
node, thus ensuring that the next solutions to be calculated work with the network
updated with the modifications made. In addition, at the end of each process, a
check is made to see whether there is a potentially better solution, taking into account
the time and capacity of the tested nodes.
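A minimal sketch of how such event tuples could be processed is shown below; the (iteration, action, element) tuple format and the network.activate/deactivate helpers are assumptions made only for illustration, since the text specifies just that the events file adds or removes nodes and links.

def apply_dynamism(network, events, iteration):
    """Apply the activation/deactivation events scheduled for this iteration."""
    for when, action, element in events:
        if when != iteration:
            continue
        if action == 'remove':
            network.deactivate(element)   # drop the node (or link) and its incident links
        elif action == 'add':
            network.activate(element)     # bring the node (or link) back online
    # After a topology change, the best-so-far solution must be re-checked,
    # since it may traverse an element that is no longer available.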
The resolution of each connection (service request) has been posed as an
individual optimal path search, even if the connections depend on each other because
of the consumption of link bandwidth and node resources. The DAnt-SFC main
body is presented in Algorithm 1.

Algorithm 1. DAnt-SFC()
Main DAnt-SFC algorithm
  Parameters initialization()
  Read network configuration()
  Read connections()
  Read dynamism configuration()
  for each connection c do                      /* a solution for each connection is searched */
    while termination criteria is not met do
      for each ant h do
        s[h] = Build Solution(c, h)
      end for
      Pheromone Evaporation()                   /* in all graph links */
      s* = Choose Best Solution(s[h])
      Global Pheromone Update(s*)               /* on the links used by the best ant */
      Update dynamism status()                  /* check if the best solution is still valid */
    end while
    Network Update(c, s*)                       /* links bandwidth and node resources are updated */
  end for

The algorithm adaptation has been focused on the following aspects (see [2]):
– Heuristic: links with higher available bandwidth are assigned a greater
probability of being chosen, trying to minimize the risk of

Algorithm 2. Build Solution(connection, ant id)
DAnt-SFC solution construction algorithm
  Ant initialization(ant id)
  Network initialization()                      /* set current network values */
  Dynamism application()                        /* check nodes or links activated/deactivated */
  current node = connection.initial node
  current function = connection.functions[start]
  L = save(current node)                        /* visited states list */
  F = save(current function)                    /* served functions list */
  while (current node ≠ connection.final node) AND (current function ≠ connection.functions[end]) do
    /* A: feasible nodes list, P: probability of moving to each feasible node, Ω: problem restrictions */
    P = calculate transition probabilities(current node, A, F, L, Ω)
    next node = probability roulette(P, Ω)
    Link Update(next node)                      /* link bandwidth is updated */
    L = save(next node)
    current node = next node
    if current function in current node.functions[] then
      /* if the function is available it is served, and node resources are updated */
      Update Node(current function)
      F = save(current function)
      current function = connection.following function
    end if
  end while

link exhaustion. In addition, nodes serving the following function in the path
have double probability of being selected.
– Feasible nodes: the list A includes reachable nodes with enough available
resources to execute the following network function in the chain.
– Probability roulette: a probability roulette is used as the decision policy
for the next state, once the probability of moving from the current node to each
feasible node has been assigned (a minimal sketch of this selection rule is given after this list).
– Link bandwidth restriction: it is based on the feasible nodes list construction.
– Link and node updates (construction of a solution): the link bandwidth and
the node resources (in case the node serves a network function) are updated
each time an ant moves towards a node while a solution is being constructed
for resolving a connection. Both values are updated with the connection traffic
demand and the function's resource cost, respectively, avoiding the
generation of infinite loops.
– Network updates (connections): the whole network is also updated considering
the solution path each time a solution for a certain chain is found.
– Complete path restriction: a solution is only considered valid if it starts
and ends exactly at the nodes given by the connection and passes through all
the functions in the given order.
– Path cost (connection): it is the number of hops required in the graph
for the function chain composition necessary to complete a connection.
– Global solution cost: a complete solution is made of a set of minimum-cost
paths solving each of the requested connections. The global solution cost
is the total number of hops over all its constituent connections.

– Pheromone contribution: besides the pheromone evaporation conducted on all
the links after the construction of all the solutions, pheromone contribution is
only made on the links belonging to the best solution.
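The sketch below illustrates the kind of transition rule and roulette selection described in the Heuristic and Probability roulette items above, following the classical AS rule that DAnt-SFC adapts; the exact bandwidth-based heuristic, the doubling for nodes serving the next required function, and the parameter values (taken from Table 1) are written here as illustrative assumptions rather than the authors' code.

import random

def choose_next_node(current, feasible, pheromone, bandwidth, serves_next_function,
                     alpha=1.2, beta=2.0):
    """AS-style next-node selection: tau^alpha * eta^beta followed by a probability roulette."""
    weights = []
    for node in feasible:
        eta = bandwidth[(current, node)]          # heuristic: favour links with more bandwidth
        if serves_next_function(node):
            eta *= 2.0                            # double probability for nodes serving the next function
        tau = pheromone[(current, node)]
        weights.append((tau ** alpha) * (eta ** beta))
    total = sum(weights)
    probabilities = [w / total for w in weights]
    return random.choices(feasible, weights=probabilities, k=1)[0]   # probability roulette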

Once the Ant-SFC algorithm has been described, in the following section we
will evaluate and analyze its usefulness and value in two experiments.

5 Experiments and Results

Three different instances have been considered to test the proposed algorithm,
namely:
6 nodes instance: used as a conceptual approach, where the algorithm has been
evaluated and validated in a more intuitive way. Characteristics of the graph can
be seen in: https://doi.org/10.6084/m9.figshare.14572200.v1 including network
functions available in each node as well as bandwidth associated to each link.
This instance is considering 3 different connections to solve, specifically (see
format in Sect. 2): Connection 1: (A, F, 2, [3,5,6]), Connection 2: (A, E, 8, [1,2,4]),
and Connection 3: (A, D, 5, [2,4,5]).
19 nodes instance: a 19-node graph that models a more realistic case,
closer to those which will have to be solved in reality. The topology of this instance can
be consulted from https://doi.org/10.6084/m9.figshare.14572224.v1. The links
properties in this instance can be seen in https://doi.org/10.6084/m9.figshare.
14572080.v1. This instance is considering these 5 following connections to solve:
Connection 1: (H, J, 8, [5,1,2]), Connection 2: (B, D, 8, [4,3,1]), Connection 3:
(Q, B, 1, [2,3,1]), Connection 4: (R, J, 3, [5,2,3]), Connection 5: (J, S, 8, [4,1,3]).
52 nodes instance: a 52-node graph that models a close-to-reality case. It
is the biggest instance used in this work. Its topology is shown in https://doi.
org/10.6084/m9.figshare.14572230.v1, whereas the tables corresponding to the
links in this instance can be consulted in https://doi.org/10.6084/m9.figshare.
14483592.v2. 10 connections have to be solved: Connection 1: (AP, K, 16, [3,1,2]),
Connection 2: (P, O, 6, [2,3,1]), Connection 3: (R, AU, 11, [4,1,2]), Connection 4:
(AT, E, 5, [3,2,1]), Connection 5: (AF, AE, 5, [1,3,2]), Connection 6: (AA, AD,
14, [4,3,1]), Connection 7: (S, I, 11, [4,2,1]), Connection 8: (X, AD, 10, [3,2,1]),
Connection 9: (K, R, 7, [1,3,2]), Connection 10: (AG, AX, 18, [3,1,2]).

5.1 Obtained Results

For the algorithm execution, a personal computer with a 4-core, 8-thread
Intel Core i5-1135G7 processor at 2.40 GHz, 8 GB of DDR4 RAM and a 64-bit
Windows 10 O.S. has been used.
Considered algorithm settings in each instance are shown in Table 1.

Table 1. Considered parameters in the experiments.

Parameter 6N Instance 19N Instance 52N Instance


Iterations 6 19 52
Ants 12 38 104
α (pheromone weight) 1.2 1.2 1.2
β (heuristic weight) 2.0 2.0 2.0
ρ (evaporation factor) 0.3 0.3 0.3

The given values have been fixed based on recommendations from the ACO
literature, such as the pheromone and heuristic weights used in the next-node
selection probability, although they have been subsequently adjusted and modified
according to systematic experimentation. The numbers of iterations and connections
have been fixed in order to obtain good solutions in an acceptable time; further
research could be carried out to determine these initial values optimally, but that
is not this work's objective.
Given that it is a non-deterministic algorithm, to obtain reliable results, 10
independent executions have been carried out solving the same problem instance
(with the same number of connections) for each possible scenario, for each of the
three instances to be tested.
As described in Sect. 4, a dynamism function has been developed for this
algorithm. It allows the best possible solutions to be selected while certain
nodes in the network go down or come back online during execution (simulating a
real telecommunications network scenario), without modifying the basic behaviour
of the executed code. This function is used to reinforce the already good performance
of the algorithm in a greater number of situations and different scenarios.
For this purpose, three scenarios with different characteristics have been developed
to test its performance. They have different numbers of nodes and links, and the
aim remains to obtain the best possible cost in terms of number of hops,
i.e. the main objective remains unchanged.
These three cited tests will be based on the previous description and the
following characteristics:

– VERSION 1: A critical node is removed from the best solution so far: with
this variant, the algorithm is forced to recalculate the best route obtained so far,
given that a critical node is dropped from it; it is thus no longer possible to use
that node, and the utilities of the dynamism function have to be applied to select
the best possible option as a so-called "best alternative".
– VERSION 2: Two critical nodes are removed from the best solution so far:
this is the same as the previous case, but with added difficulty (a larger part of
the node matrix becomes unavailable for selection), making the new
route selection more challenging and "realistic".
– VERSION 3: A critical node is removed from the best solution so far and then
restored : for this situation, a selected critical node shall be removed and, after
a certain number of iterations, restored, observing whether the algorithm re-selects
the previously saved best solution or not (if it has been forced to find a better one).

In Table 2, the results obtained from the simulations can be seen, in terms of
the number of hops of the best solution and its execution time, together with the
mean value and standard deviation over the executions.

Table 2. DAnt-SFC algorithm results for the 6, 19 and 52 nodes instances. It is


specified: best execution, cost, time, mean value and standard deviation, obtained
from 10 executions in each dynamic version.

6 Nodes Instance
Test Best exec. COST TIME(s) Mean Std. Deviation
Version 1 7 13 0.073 13.8 0.707
Version 2 – – – – –
Version 3 2 12 0.070 12.2 1.41
19 Nodes Instance
Test Best exec. COST TIME(s) Mean Std. Deviation
Version 1 2 17 0.283 17.4 0.707
Version 2 8 16 0.258 16.2 0.707
Version 3 3 17 0.303 17.8 1.41
52 Nodes Instance
Test Best exec. COST TIME(s) Mean Std. Deviation
Version 1 5 35 4.906 35.5 1.414
Version 2 7 35 4.796 35.4 1.414
Version 3 2 37 4.953 38.6 2.121

It is worth mentioning that, for the simulation of the smallest instance (6


nodes), since it is used as a reference to replicate the operation of the others and
due to its reduced size, Version 2 of the test (where two fundamental nodes are
eliminated) cannot be performed as such, since the calculation of certain routes
would be impossible. However, its correct operation for this case can be verified
in the larger instances. For this reason, Version 1 and Version 3 have been used,
the latter being equally valid as a working example, since a node is deleted and
then reactivated, forcing the algorithm to recalculate the routes (as it does).

Fig. 2. Best solution found for version 1 (one node removed (C) in iteration number
2) and version 3 (one node removed (C) in iteration number 2, then reactivated
in iteration number 6) of the 6 nodes instance with 3 connections. Cost expressed in
number of hops beside each connection.

The numerical results presented in Table 2 can be complemented with visual


interpretation in the following figures. It is possible to clearly observe the dif-
ference between different versions executed. For the 6 nodes instance, both ver-
sion 1 and version 3 results are available at https://doi.org/10.6084/m9.figshare.
14854344.v1. In the first version (Fig. 2(bottom)), a fundamental node (C) will
go down in iteration number 2. When this node goes down, along with its links,
it cannot be used in the network for future computations. In this version, this
happens almost from the beginning (iteration number 2), so that the algorithm
could first use this node. However, after the corresponding application of the
dynamism function, this node will be deactivated and automatically, the best
solution so far processed, in case it contained it, will be eliminated and another
feasible one will be selected.
Nevertheless, it is observed that for version 3 (Fig. 2(bottom)), where this
same node goes down but is reactivated again in iteration number 6, the algorithm
detects that this is the most efficient route and that it can use this node again, also
achieving a lower cost to serve, for example, the first required connection. In this
way, the best possible solution is always offered, considering the capabilities
offered by the network.
As mentioned above, with the 19-node instance, the changes made on the net-
work can be graphically observed. Representative figures of the results obtained
in the three versions executed are available at https://doi.org/10.6084/m9.
figshare.14854350.v2. In the first version, shortly after starting the execution,
a node falls off the network (node N) in iteration number 4, which causes the
algorithm to recalculate the routes according to the resulting topology.
After all the executions, the result is the one shown in Fig. 3 (top). This could be a
problem a priori, since this node was used in the optimal paths of Version
1. However, the heuristic eventually finds another optimal path (in fact, in
this case even better than the previous one), so that the connection is routed as
required and all requirements are met.

Fig. 3. Best solution found for Version 1 (one node removed (N) in iteration number
4) and Version 2 (two nodes removed (M and N) in iterations number 4 and 10 respec-
tively) of the 19 nodes instance with 5 connections. Cost expressed in number of hops
beside each connection.

With respect to Version 2, nodes N and M are no longer active in
the network (disconnected in iterations 4 and 10, respectively), which is why they
and their links appear marked in red. Again, the algorithm is able to
recalculate new routes that do not include that part of the network without
major difficulty, as can be seen in Fig. 3 (bottom).
Finally, Fig. 4 shows what happens in Version 3 for this 19-node instance.
After dropping nodes N and M in successive iterations (iteration 3 and 9 respec-
tively), the calculated solutions now cannot include them in the obtained paths.
However, in iteration number 16 (out of the total of 19 existing ones), node M
becomes active again, along with its corresponding links. Thus, the algorithm
includes it again in its tables of active nodes and takes it into account for new
solutions; so much so that the final solution for connection 4 uses it in its results.

Fig. 4. Best solution found for Version 3 (two nodes removed (M and N) in iterations 4
and 10, respectively, then one reactivated (M) in iteration number 16) of the 19 nodes
instance with 5 connections. Cost expressed in number of hops beside each connection.

In the 52-node instance, it has been decided to graphically display Version 3 of
those executed (see Figs. 5 and 6). In this case, it is shown how nodes V and X are
deactivated (in iterations 4 and 8, respectively), so the system stops using
them for the calculation of solutions. However, node V becomes active again from
iteration 13 onwards, so that it can be used again; in fact, in connection 9 it is
part of the best option. It is possible to observe the great complexity of
the network and how the behavior of the algorithm is flexible and resilient to
changes, adapting to them in every situation. All the results (from Versions 1, 2
and 3, respectively), together with the graphical representations, are available at
https://doi.org/10.6084/m9.figshare.14854356.v2.

Fig. 5. Results of the 52 nodes instance with 10 connections, Version 1 and 2. Cost
expressed in number of hops beside each connection.

Fig. 6. Best solution found for Version 3 (two nodes removed (V and X) in iterations
4 and 8, respectively, then one reactivated (V) in iteration 13) of the 52 nodes instance
with 10 connections.

6 Conclusions and Future Work


This work presents an adaptation of an Ant Colony Optimization (ACO) algo-
rithm for solving the routing problem in Service Function Chaining (SFC) inside
a software defined network (SDN). The problem has been defined also consider-
ing dynamism in the network topology, since nodes and links can be activated
or deactivated at any moment.
The proposed algorithm has been applied in three different instances with
different sizes, and several predefined activation/deactivation events on some of
the existing nodes and links.
Given the obtained results, we can conclude that DAnt-SFC is able to build
optimal solutions, even coping with dramatic network topology changes, such
as the loss of critical nodes in the paths. Besides, the computing times, of at most
a few seconds, are acceptable for real-time instances.
In any case, one of the advantages of ACO algorithms is their ability to provide
valid solutions from the first iteration, as well as their capability of automatically
adapting their behaviour to changes in the problem definition, such as the topology
changes in these instances.
As future work, we will test better heuristic functions to guide the solution
building in the ACO algorithm. In addition, some hybrid approaches could be
implemented, such as Local Search methods. More sophisticated ACO models
could also be tested.

Acknowledgements. This work has been partially funded by projects RTI2018-


102002-A-I00 (Ministerio de Ciencia, Innovación y Universidades), TIN2017-85727-
C4-2-P (Ministerio de Economı́a y Competitividad), B-TIC-402-UGR18 (FEDER and
Junta de Andalucı́a), and project P18-RT-4830 (Junta de Andalucı́a).

References
1. Eramo, V., Miucci, E., Ammar, M., Lavacca, F.G.: An approach for service function
chain routing and virtual function network instance migration in network function
virtualization architectures. ACM Trans. Networking 25(4), 2008–2025 (2017)
2. Moreno, S., Mora, A.M., Padilla, P., Carmona-Murillo, J., Castillo, P.A.: Applying
ant colony optimization for service function chaining in a 5g network. In: Alsmirat,
M.A., Jararweh, Y. (eds.) Sixth International Conference on Internet of Things:
Systems, Management and Security, IOTSMS 2019, Granada, Spain, 22–25 Octo-
ber 2019, pp. 567–574. IEEE (2019)
3. Dorigo, M., Stützle, T.: The ant colony optimization metaheuristic: algorithms,
applications, and advances. In: Glover, F., Kochenberger , G.K. (ed.) Handbook
of Metaheuristics, pp. 251–285. Springer, Boston (2002). https://doi.org/10.1007/
0-306-48056-5 9
4. Medhat, A.M., Taleb, T., Elmangoush, A., Carella, G.A., Covaci, S., Magedanz,
T.: Service function chaining in next generation networks: state of the art and
research challenges. IEEE Commun. Mag. 55(2), 216–223 (2017)
5. Allybokus, Z., Perrot, N., Leguay, J., Maggi, L., Gourdin, E.: Virtual function
placement for service chaining with partial orders and anti-affinity rules. Networks
71(2), 97–106 (2018)

6. Nguyen, T.M., Minoux, M., Fdida, S.: Optimizing resource utilization in NFV
dynamic systems: new exact and heuristic approaches. Comput. Netw. 148, 129–
141 (2019)
7. Gil-Herrera, J., Botero, J.F.: A scalable metaheuristic for service function chain
composition. In: 2017 IEEE 9th Latin-American Conference on Communications,
LATINCOM 2017. Volume 2017-Janua., pp. 1–6. Institute of Electrical and Elec-
tronics Engineers Inc. (2017)
8. Laaziz, L., Kara, N., Rabipour, R., Edstrom, C., Lemieux, Y.: FASTSCALE: a
fast and scalable evolutionary algorithm for the joint placement and chaining of
virtualized services. J. Network Comput. Appl. 148, 102429 (2019)
9. Sim, K.M., Sun, W.H.: Multiple ant-colony optimization for network routing. In:
First International Symposium on Cyber Worlds, 2002. Proceedings, pp. 277–281
(2002)
10. Bhaskaran, K., Triay, J., Vokkarane, V.M.: Dynamic anycast routing and wave-
length assignment in WDM networks using ant colony optimization (ACO). In:
2011 IEEE International Conference on Communications (ICC), pp. 1–6 (2011)
On the Use of Fuzzy Metrics for Robust
Model Estimation: A RANSAC-Based
Approach

Alberto Ortiz(B) , Esaú Ortiz, Juan José Miñana , and Óscar Valero

Department of Mathematics and Computer Science,


University of the Balearic Islands, and IDISBA (Institut d’Investigacio Sanitaria de
les Illes Balears), Palma de Mallorca, Spain
{alberto.ortiz,esau.ortiz,jj.minana,o.valero}@uib.es

Abstract. Application domains, such as robotics and computer vision


(actually, any sensor data processing field) often require robust
model estimation techniques because of the imprecise nature of sensor
data. In this regard, this paper describes a robust model estimator which
is actually a modified version of RANSAC that takes inspiration from
the notion of fuzzy metric, as a suitable tool for measuring similarities
in the presence of the uncertainty inherent to noisy data. More precisely,
it makes use of a fuzzy metric within the main RANSAC loop to encode
as a similarity the compatibility of each sample to the current hypoth-
esis/model. Further, once a number of hypotheses have been explored
and the winning model has been selected, we make use of the same fuzzy
metric to obtain a refined version of the model. In this work, we consider
two fuzzy metrics that permit us to express the distance between the
sample and the model under consideration as a kind of degree of sim-
ilarity measured relative to a parameter. By way of illustration of the
performance of the approach, we report on the accuracy achieved by the
proposed estimator and other RANSAC variants for a benchmark com-
prising two kinds of perception problems typically encountered in vision
applications, and a large number of datasets with varying proportion of
outliers and different levels of noise. The proposed estimator is shown
to outperform the classical counterparts considered.

Keywords: Robust model estimation · RANSAC · Fuzzy metric

1 Introduction
The Random Sample Consensus algorithm (RANSAC) [6] is a robust estimation
technique whose most distinctive feature is the use of random sampling and a
This work is partially supported by projects PGC2018-095709-B-C21 (MCIU/AEI/
FEDER, UE), EU-H2020 BUGWRIGHT2 (GA 871260) and ROBINS (GA 779776),
and PROCOE/4/2017 (Govern Balear, 50% P.O. FEDER 2014–2020 Illes Balears).
This publication reflects only the authors views and the European Union is not liable
for any use that may be made of the information contained therein.

voting scheme to find the optimal set of model parameters to fit/explain a given
dataset comprising both inliers and outliers. RANSAC is widely used nowadays,
so much so that it has become commonly used in robotics-related algorithms, since
in this application domain it is often necessary to solve model estimation problems
whenever a perception task is addressed.
Nowadays, facing this kind of situation requires coping with new challenges
due to the increased use of potentially poor, low-cost sensors and the ever-growing
deployment of robotic devices that may operate in potentially unknown
environments. In general terms, the underlying algorithms need to be
robust against, in particular, strong levels of uncertainty. In this regard, a
being robust against, in particular, strong uncertainty levels. In this regard, a
robust estimator is able to correctly find the original model that supposedly the
input data fits to, even when the data is noisy and contains outliers, i.e. data
items which are not consistent with the original model due to an arbitrary bias
affecting them. (See [8] for the details on the concepts, techniques and technical
issues surrounding robust estimation.)
Fuzzy methodologies have been shown to be useful to deal with imprecise
data, targeting the design of systems that are able to cope with uncertainty
one way or another and even degrade gracefully if needed [9]. In this work, we
propose a variant of RANSAC which avoids discriminating between inliers and
outliers but makes use of a fuzzy metric, in the sense of I. Kramosil and J.
Michalek [11], to associate to every sample a degree of compatibility with regard
to the current model. The aforesaid fuzzy metric is besides used in a final model
refinement step that runs after the main hypothesis selection loop.
The rest of the paper is organized as follows: Sect. 2 overviews RANSAC;
Sect. 3 introduces two fuzzy metrics of relatively distinct nature though ori-
ented to be embedded within the main RANSAC loop, while Sect. 4 details the
RANSAC variation that incorporates these fuzzy metrics; Sect. 5 reports on a
number of experiments to illustrate the performance achieved; Sect. 6 concludes
the paper.

2 Brief Review of the RANSAC Approach for Robust


Model Estimation

Regarding model estimation, a common measure of estimation robustness is the


breakdown point (BDP), defined as a percentage threshold on the outlier rate
beyond which the technique under consideration is no longer robust to outliers.
RANSAC is one of those robust estimators with BDP higher than fifty percent.
Fifty percent is the limit of the Least Median of Squares (LMedS) [20], another
robust estimator that has also enjoyed high popularity as a high BDP tech-
nique. Least Trimmed Squares (LTS) and Minimum Probability of Randomness
(MINPRAN) are other high-BDP algorithms [17], although less popular than
RANSAC and LMedS. The BDP for others, such as the M-estimators family [8],
is below 50%. Applications in statistics typically require less than fifty percent
BDP, since outliers in this context are anomalies or exceptions in the data. How-
ever, the case is often different in robotics and computer vision applications,
where outliers are defined with respect to the best among competing models,
each describing well a fraction of the input data.
By randomly generating hypotheses on the model parameters, RANSAC tries
to achieve a maximum consensus in the input dataset in order to deduce the
inliers. Once the inliers are discriminated, they are used to estimate the parame-
ters of the underlying model by regression. In more detail, instead of using every
sample in the dataset to perform the estimation as in traditional regression tech-
niques, RANSAC tests in turn many random sets of samples. Since picking an
extra point decreases exponentially the probability of selecting an outlier-free
sample [5], RANSAC takes the Minimum Sample Set size (MSS) to determine a
unique candidate model, thus increasing its chances of finding an all-inlier sam-
ple set. This model is assigned a score based on the cardinality of its consensus
set. Finally, RANSAC returns the hypothesis that has achieved the highest con-
sensus, and the corresponding model is refined through a last minimization step
that only involves the inliers found.
Searching for an all-inlier sample, RANSAC typically runs for N iterations:

N = log(1 − ρ) / log(1 − (1 − ω)^s)        (1)

where ρ is the desired probability of success, i.e. at least one of the considered
random sets is outlier-free, s is the size of the MSS for the problem at hand and
ω is the ratio of outliers. (See [6] for the details on Eq. (1).)
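For reference, the following snippet evaluates Eq. (1) directly, rounding up to an integer number of iterations; the function name is ours.

import math

def ransac_iterations(rho, omega, s):
    """Number of RANSAC iterations N from Eq. (1)."""
    return math.ceil(math.log(1.0 - rho) / math.log(1.0 - (1.0 - omega) ** s))

# e.g. rho = 0.99, omega = 0.4, s = 2 (straight-line fitting) gives 11 iterations.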
There have been a number of efforts aiming at enhancing the standard
RANSAC algorithm, e.g. MSAC, MLESAC, MAPSAC, PROSAC, R-RANSAC,
LO-RANSAC and U-RANSAC [4], since it, while robust, has its drawbacks
regarding accuracy, efficiency, stability and response time [17,19]. Among these
variants, there is a very reduced set adopting fuzzy methodologies [12,23]. In
both cases, the authors address a homography fitting problem, which, in [12],
is solved by discriminating data samples into the good, bad and vague fuzzy
sets using a fuzzy classifier, while [23] defines a triangle-type membership func-
tion for the set of inliers and combines this with a Monte Carlo method for
sample selection. It must be pointed out that the two aforementioned variants
of RANSAC differ significantly from the one described in this paper, which is
based on distance fuzzification.

3 Fuzzy Metrics for Robust Model Estimation

Two mathematical tools can be found in the related literature with regard to
the measurement of the degree of nearness between two points with respect to a
parameter. On the one hand, we have the so-called modular metrics [2]. In this
regard, we recall that a function w :]0, ∞[×X × X → [0, ∞] is a modular metric
on a non-empty set X if, for each x, y, z ∈ X and each θ, μ > 0, the following is
satisfied:
(MM1) w(θ, x, y) = 0 for all θ ⇔ x = y;
(MM2) w(θ, x, y) = w(θ, y, x);
(MM3) w(θ + μ, x, z) ≤ w(θ, x, y) + w(μ, y, z).
This kind of generalized metrics has been typically used in modeling problems
that arise in classical Newtonian mechanics where the numerical value w(θ, x, y)
is interpreted as the velocity of a body traveling from location x to location y in
time θ. However, in general terms, w(θ, x, y) can be thought of as a dissimilarity
measurement between objects x and y relative to the value θ of a parameter.
Hence, the smaller the value, the closer the points x and y are, with respect to
θ. (We refer the reader to [3], and references therein, for a recent account of the
theory.) From now on, the value w(θ, x, y) will be denoted by wθ (x, y).
On the other hand, we have the notion of fuzzy metric. This type of metric
tool arises with the aim of extending to the fuzzy framework the notion of
statistical metric due to K. Menger. In the sequel, we assume that the reader
is familiar with the basics of fuzzy sets and t-norms. (An outstanding general
reference on these topics is [10].)
According to [11], a fuzzy metric space is a triplet (X, M, ∗) where X is a
non-empty set, ∗ is a continuous t-norm and M is a fuzzy set on X × X×]0, ∞[
satisfying, for each x, y, z ∈ X and θ, μ ∈]0, ∞[, the following:
(KM1) M (x, y, θ) = 1 for all θ if and only if x = y.
(KM2) M (x, y, θ) = M (y, x, θ).
(KM3) M (x, z, θ + μ) ≥ M (x, y, θ) ∗ M (y, z, μ).
(KM4) The assignment Mx,y :]0, ∞[→ [0, 1] is a left-continuous function, where
Mx,y (θ) = M (x, y, θ) for each θ ∈]0, ∞[.
The value M (x, y, θ) can be understood as a degree of similarity between two
points x, y ∈ X relative to the value θ ∈]0, ∞[ of a parameter. Thus, the larger
the value of M (x, y, θ), the closer the points x and y are with respect to θ.
At this point, it is worth noting that fuzzy metrics have been shown to be a
very appropriate similarity measure when working with data affected by vague-
ness or imprecision, like noisy data; e.g. see [1,7,14–16] for successful applications
to image filtering and to the study of perceptual colour differences. Despite the
applicability of fuzzy metrics, it must be pointed out that examples are scarce in
the literature and that this becomes a handicap when it comes to expanding the
number of fields in which new applications can be generated.
At a glance, the exposed axiomatics of both notions of metric are in essence
dual. Motivated by this fact and by the aforementioned lack of examples, the
intuitive duality relationship was formally proved with the aim, among others,
of introducing new methods for generating fuzzy metrics and, thus, overcoming
the aforesaid handicap [13]. Specifically, the next result was proved.
Theorem 1. Let ∗ be a continuous t-norm with additive generator f∗ : [0, 1] → [0, ∞].
If w is a modular metric on X, then the triplet (X, M^{w,f∗}, ∗) is a fuzzy metric on X,
where the fuzzy set M^{w,f∗} on X × X × ]0, ∞[ is defined, for all x, y ∈ X, by
M^{w,f∗}(x, y, θ) = (f∗)^(−1)(w̃_θ(x, y)), where w̃_θ(x, y) = inf_{0<λ<θ} w_λ(x, y) and
(f∗)^(−1) is the pseudo-inverse of the additive generator f∗ of ∗.
Within the framework of the aforementioned metrics, we are now concerned
with obtaining a suitable metric tool for RANSAC; that is to say, a metric that
is suitable as a measurement in the presence of noise and, in addition, is able to
encode the compatibility of each sample to the current model/hypothesis. In this
regard, next we introduce, applying Theorem 1, two fuzzy metrics induced from
modular metrics and the use of, on the one hand, the Łukasiewicz t-norm and,
on the other hand, the Aczél-Alsina t-norms. To this end, we recall first a few
pertinent facts that will play a central role in our subsequent discussion.
On the one hand, the Łukasiewicz t-norm ∗L and the family of Aczél-Alsina
t-norms (∗AA^α)_{α ∈ ]0,∞[} are given, for all x, y ∈ [0, 1], as follows [10]:
x ∗L y = max{x + y − 1, 0} and x ∗AA^α y = e^(−((−log(x))^α + (−log(y))^α)^(1/α)).
Moreover, additive generators f∗L, f∗AA^α : [0, 1] → [0, ∞] of ∗L and ∗AA^α,
respectively, are given for all x ∈ [0, 1] by f∗L(x) = 1 − x and f∗AA^α(x) = (−log(x))^α.
Hence the pseudo-inverses (f∗L)^(−1), (f∗AA^α)^(−1) : [0, ∞] → [0, 1] are given for all
x ∈ [0, ∞] by (f∗L)^(−1)(x) = max{1 − x, 0} and (f∗AA^α)^(−1)(x) = e^(−(x^α)).
On the other hand, given a metric space (X, d), the function w^d(θ, x, y) :
]0, ∞[ × X × X → [0, ∞] is a modular metric, where w^d(θ, x, y) is defined by
w^d(θ, x, y) = d(x, y)/θ for all x, y ∈ X and for all θ ∈ ]0, ∞[ [3].
In view of the exposed facts, we construct two new fuzzy metrics,
(M^d_{1,n}, X, ∗L) and (M^d_{2,n}, X, ∗AA^n), aiming at, among others, encoding the
compatibility of each sample to the current hypothesis within the framework of a
RANSAC-based model estimator. Notice that n ∈ N and that N denotes the set
of positive integer numbers. To this end, given a metric space (X, d), consider
the modular metric w^d(θ, x, y) on X and notice that w̃^d_θ(x, y) = w^d_θ(x, y) for all
x, y ∈ X and for all θ ∈ ]0, ∞[.
Next, we induce the fuzzy metric based on the t-norm ∗L. Define the fuzzy
set M^{w^d_θ, f∗L} on X × X × ]0, ∞[ by M^{w^d_θ, f∗L}(x, y, θ) = (f∗L)^(−1)(w^d_θ(x, y)) =
max{1 − d(x, y)/θ, 0}. By Theorem 1, we deduce that (M^{w^d_θ, f∗L}, X, ∗L) is a fuzzy
metric. On account of [18, Theorem 4.15] and [21], (F∗P(M^{w^d_θ, f∗L}, ..., M^{w^d_θ, f∗L}), X, ∗L)
is a fuzzy metric, where ∗P is the product t-norm and the function F∗P :
[0, 1]^n → [0, 1] is defined by F∗P(a1, ..., an) = a1 ∗P a2 ∗P ... ∗P an (n ∈ N).
It follows that (M^d, X, ∗L) is a fuzzy metric such that M^d(x, y, θ) =
F∗P(M^{w^d_θ, f∗L}, ..., M^{w^d_θ, f∗L})(x, y, θ) = (1 − d(x, y)/θ)^n if d(x, y) ≤ θ, and 0
otherwise. Notice that the same arguments can be used to show that (M^d_{1,n}, X, ∗L)
is a fuzzy metric, where M^d_{1,n}(x, y, θ) = (1 − d(x, y)/(nθ))^n if d(x, y) ≤ nθ, and
0 otherwise. (Note that d(x, y)/(nθ) is again a modular metric.)
Finally, we induce the fuzzy metric based on the t-norms ∗AA^α. Define the
fuzzy set M^{w^d_θ, f∗AA^α} on X × X × ]0, ∞[ by M^{w^d_θ, f∗AA^α}(x, y, θ) =
(f∗AA^α)^(−1)(w^d_θ(x, y)) = e^(−(d(x, y)/θ)^α). Theorem 1 guarantees that
(M^{w^d_θ, f∗AA^α}, X, ∗AA^α) is a fuzzy metric. Now set α = n. Then (M^d_{2,n}, X, ∗AA^n)
(n ∈ N) is a fuzzy metric with M^d_{2,n}(x, y, θ) = e^(−(d(x, y)/θ)^n).
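To make the two families concrete, the following minimal sketch evaluates them as functions of the distance d for a given θ and n, following the expressions above (in Sect. 4 the fitting error plays the role of d); the function names and signatures are illustrative only.

import math

def m1(d, theta, n):
    # M_{1,n}(x, y, theta) = (1 - d/(n*theta))^n if d <= n*theta, 0 otherwise
    return (1.0 - d / (n * theta)) ** n if d <= n * theta else 0.0

def m2(d, theta, n):
    # M_{2,n}(x, y, theta) = exp(-(d/theta)^n)
    return math.exp(-((d / theta) ** n))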

4 Fuzzy Metric-Based Model Scoring and Refinement


for RANSAC

As already described, RANSAC adopts a hypothesize-and-verify approach to


fit a model to data contaminated by random noise and outliers: i.e. for every
hypothesis/model considered, data samples are classified into inliers and outliers
by comparing the fitting error with a threshold τI related to data noise, and that
model accumulating the largest number of inliers is the one finally chosen as
solution of the estimation problem. This simple approach has been systematically
used for robust estimation of model parameters in the presence of arbitrary noise,
although, along the years, alternative implementations have been proposed to
counteract the misbehaviours and shortcomings that have been detected.
In this work, we focus on three facets of RANSAC: (1) samples classification
into inliers and outliers, in which we prevent the estimator from explicitly, and
prematurely, deciding which samples are relevant; (2) model scoring, for which we
replace the pure cardinality of the inlier set of plain RANSAC by an expression
involving the individual fitting errors, similarly to what MSAC and MLESAC
do [22]; and (3) model refinement once the main hypothesis-checking loop has
finished, for which we adopt an iterative re-weighting scheme that makes use of
all the available data samples without any distinction between inliers and out-
liers, contrarily to plain RANSAC, and other variants, that adopt least squares
regression only for the set of inliers (notice that the distinction between inliers
and outliers depends on the current model under consideration, and thus changes
with every model).
Algorithm 1 describes formally the RANSAC variant that is proposed in this
work. The details regarding points (1)–(3) above can be found next:

1. Samples classification. As already mentioned, no distinction is made
between inliers and outliers, but we make use of a fixed fuzzy metric M^{w,f∗}
generated by the technique in Theorem 1 to obtain a compatibility value
φ ∈ [0, 1] between each sample xj and the current model M_Θ̂k, given the
fitting error ε(xj; M_Θ̂k). Observe that the compatibility value obtained from
the fuzzy metric depends on the set of parameters (d, Φ) with Φ = (n, θ) when
either M^d_{1,n} or M^d_{2,n} are under consideration. From now on, such a value will
be denoted by φi(ε; Φ) with the aim of making clear that such a value refers
to the fitting error ε and that it comes from the fuzzy metric M^d_{i,n}
(i ∈ {1, 2}). Since we contemplate the use of a single, specific distance d, i.e.
the one related to the fitting error, we will denote both fuzzy metrics as M_{i,n}
(i ∈ {1, 2}), eliminating the allusion to the metric d.
2. Model scoring. The individual compatibility values φi(ε; Φ) are aggregated
by simple summation to obtain the model score (step 6 in Algorithm 1) and
hence the so-far-the-best model is given by the maximum score found up to
the current iteration (steps 7–9 of Algorithm 1).
3. Model refinement. Once a sufficient number of hypotheses/models have
been considered, we re-estimate the winning model using iterative weighted
least squares, where the compatibility values φi(ε; Φ), calculated for the fitting
Algorithm 1. FM-based RANSAC
Input: D - dataset comprising samples {xj}
       φi(ε; Φ) - FM-based compatibility function for fitting error ε and parameters Φ
       kmax - maximum number of iterations of the main loop, as given by Eq. (1)
       tmax - maximum number of iterations of the refinement stage
Output: M_Θ̂ - estimated model, whose parameters are compactly represented by Θ̂
1:  k := 0, ϕmax := −∞
2:  for k := 1 to kmax do                        /* find maximum consensus model M_Θ̂ */
3:    select randomly a minimal sample set Sk of size s from D
4:    estimate model M_Θ̂k from Sk
5:    calculate fitting errors ε(xj; M_Θ̂k), ∀xj ∈ D
6:    find model score ϕk := Σ_{xj ∈ D} φi(ε(xj; M_Θ̂k); Φ)
7:    if ϕk > ϕmax then
8:      ϕmax := ϕk, M^0_Θ̂ := M_Θ̂k
9:    end if
10: end for
11: t := 0
12: repeat                                       /* refine model M_Θ̂ */
13:   calculate fitting errors ε(xj; M^t_Θ̂), ∀xj ∈ D
14:   estimate model M^(t+1)_Θ̂ using weights φi(ε(xj; M^t_Θ̂); Φ)
15:   t := t + 1
16: until convergence or t ≥ tmax
17: return M^t_Θ̂

errors resulting from the current model, are used as weights for the new,
refined model (steps 12–16 of Algorithm 1). The loop iterates until changes in
the estimated parameters Θ̂ of the model are negligible (or after a maximum
number of iterations); a minimal sketch of this stage is given below.
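The following sketch illustrates this refinement stage for the straight-line case of Sect. 5, using the compatibility values as weights in an iteratively re-weighted orthogonal least-squares fit; the closed-form weighted fit via the covariance eigenvector and the fixed iteration count are assumptions made for illustration rather than the authors' exact implementation (phi can be any of the compatibility functions sketched earlier).

import numpy as np

def fit_line_weighted(points, weights):
    """Return (a, b, c) of a*x + b*y + c = 0 minimising weighted orthogonal errors."""
    w = weights / (weights.sum() + 1e-12)
    centroid = (w[:, None] * points).sum(axis=0)
    centered = points - centroid
    cov = (w[:, None] * centered).T @ centered
    eigvals, eigvecs = np.linalg.eigh(cov)
    a, b = eigvecs[:, 0]                       # normal = eigenvector of the smallest eigenvalue
    c = -(a * centroid[0] + b * centroid[1])
    return a, b, c

def refine(points, model, phi, theta, n, t_max=20):
    """Iteratively re-weighted refinement of an initial line model (a, b, c)."""
    a, b, c = model
    for _ in range(t_max):                     # convergence test omitted for brevity
        nrm = np.hypot(a, b)
        a, b, c = a / nrm, b / nrm, c / nrm    # keep the normal at unit norm
        errors = np.abs(points @ np.array([a, b]) + c)     # point-to-line distances
        weights = np.array([phi(e, theta, n) for e in errors])
        a, b, c = fit_line_weighted(points, weights)
    return a, b, c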

5 Experimental Results

In this section, we illustrate the performance of the RANSAC variant proposed


in Sect. 4, using either M1,n or M2,n , for a number of experiments that:

– Consider two model fitting problems, namely straight line fitting and ellipse
fitting. The former is for 2D lines described by parameters Θ = (a, b, c),
corresponding to a straight line in general form ax + by + c = 0. The latter
case is for ellipses expressed as ax² + by² + 2cxy + 2dx + 2ey + f = 0 and hence
Θ = (a, b, c, d, e, f ). The respective dimensionalities are clearly different.
– Compare with plain RANSAC and MSAC (their computational requirements
are similar to ours).

5.1 Experimental Setup

For testing purposes, we generate 500 synthetic datasets for the straight line
estimation problem and 200 synthetic datasets for the ellipse estimation prob-
lem. Each dataset contains a total of 300 points which comprise both inliers
and outliers, the latter in a proportion equal to ω. The respective samples stem
from either 2D lines in random orientations and positions or ellipses with ran-
dom axes lengths and orientations. Given a random point p = (x, y) over the
respective curve and the normal vector n at p, an inlier pI of the dataset is
generated by shifting p along n using a zero-mean Gaussian distribution with
standard deviation σ, i.e. pI = p + N (0, σ) · n. In both cases, outliers pO are
uniformly generated within a rectangular area containing the ellipse or a part of
the straight line, ensuring that outliers lie out of a ±3σ stripe along the curve.
Every combination (σ, ω) gives rise to a different dataset.
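A minimal sketch of this dataset generation for the straight-line case is given below; the particular line, bounding box and rejection test are illustrative assumptions (and the vertical-line case b = 0 is ignored for brevity).

import numpy as np

def make_line_dataset(a, b, c, n_points=300, omega=0.4, sigma=1.0, box=50.0, rng=None):
    """Inliers shifted along the line normal by N(0, sigma); outliers uniform outside a 3*sigma stripe."""
    rng = rng or np.random.default_rng()
    normal = np.array([a, b]) / np.hypot(a, b)
    n_out = int(round(omega * n_points))
    n_in = n_points - n_out
    x = rng.uniform(-box, box, n_in)
    y = -(a * x + c) / b                         # assumes b != 0
    inliers = np.stack([x, y], axis=1) + rng.normal(0.0, sigma, (n_in, 1)) * normal
    outliers = []
    while len(outliers) < n_out:
        p = rng.uniform(-box, box, 2)
        if abs(a * p[0] + b * p[1] + c) / np.hypot(a, b) > 3.0 * sigma:
            outliers.append(p)
    return np.vstack([inliers, np.array(outliers)])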
Regarding hypothesis generation within the main loop, in all experiments, the
size of the MSS is always set to the minimum, i.e. s = 2 for straight lines and s =
5 for ellipses (Θ is normalized to unit norm). Besides, the number of iterations
kmax is calculated according to Eq. (1), with ρ = 99%. The parameters of φi(ε; Φ),
Φ = (θ, n), are set as follows: θ = κ · σ (as is τI for RANSAC/MSAC),
considering different values for κ; n = 1 or 2, as indicated for each experiment.
Finally, to compare properly RANSAC, MSAC and our estimator, we make use
of the same sequence of MSS’s to avoid the effect of randomness.

5.2 Results and Discussion

In the following, to measure the estimation accuracy:

– For the straight line fitting problem, we make use of the average μ[ε] of the
angle ε between the true and the estimated normal vector for straight lines.
– For the ellipse fitting problem, we make use of the average μ[ε] of the maximum
relative error ε between the true p* and the estimated p̂ vectors of coefficients
(a, b, c, d, e, f), calculated as ε = max_{pi ∈ {a,b,c,d,e,f}} (|p̂i − p*i| / |p*i|) × 100
(a small illustrative computation of this metric follows this list).

– For both cases, we also report on the average number of iterations spent
during model refinement μ[t].
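An illustrative computation of the ellipse-fitting error metric defined above could look as follows (the function name is ours):

import numpy as np

def max_relative_error(p_est, p_true):
    """Maximum relative coefficient error (in %) between estimated and true models."""
    p_est, p_true = np.asarray(p_est, float), np.asarray(p_true, float)
    return float(np.max(np.abs(p_est - p_true) / np.abs(p_true)) * 100.0)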

Table 1 shows performance results for the straight lines case, for the two
fuzzy metrics M1,n and M2,n and several outlier ratios ω and Gaussian noise
magnitudes σ. In light of these results, it is worth noting that: (1) the estimation
accuracy for M1,n is above that of plain RANSAC and MSAC in all cases, while
for M2,n the accuracy is in general better than the classical counterparts except
for n = 1 and ω = 0.5 and 0.6, although the difference with MSAC is very small;
(2) M1,n behaves in general better than M2,n ; (3) the value of θ in Mi,n does
not seem to be critical, since very similar errors result for κ = 1 – 3, maybe
Table 1. Straight line fitting case: estimation accuracy and number of iterations of
the refinement stage for (a) different outlier ratios ω, (b) different noise magnitudes σ
and (c) different settings for τI , θ = κ · σ. When they do not change, σ = 1, ω = 0.4
and κ = 3. Lighter background means higher performance.

μ[ε] (◦ ) μ[t]
ω   RANSAC   MSAC   ours M1,1   ours M1,2   ours M2,1   ours M2,2   |   ω   ours M1,1   ours M1,2   ours M2,1   ours M2,2

(a) 0.60 4.43 3.14 1.55 2.38 4.14 2.68 0.60 11.62 12.16 12.16 11.33
0.50 3.03 2.33 1.02 1.52 2.35 1.68 0.50 9.63 9.48 9.14 8.56
0.40 2.13 1.81 0.86 1.13 1.60 1.21 0.40 8.55 8.09 7.71 7.09
0.20 1.58 1.53 0.67 0.74 0.88 0.76 0.20 7.64 6.91 6.49 5.79

σ   RANSAC   MSAC   ours M1,1   ours M1,2   ours M2,1   ours M2,2   |   σ   ours M1,1   ours M1,2   ours M2,1   ours M2,2
2.00 9.82 6.92 4.03 5.15 4.50 5.38 2.00 12.81 11.89 9.38 10.73
(b) 1.00 2.13 1.81 0.86 1.13 1.60 1.21 1.00 8.55 8.09 7.71 7.09
0.50 0.74 0.71 0.31 0.34 0.50 0.34 0.50 7.24 6.64 6.57 5.53
0.25 0.37 0.36 0.22 0.14 0.16 0.14 0.25 6.78 6.05 5.83 4.87

κ   RANSAC   MSAC   ours M1,1   ours M1,2   ours M2,1   ours M2,2   |   κ   ours M1,1   ours M1,2   ours M2,1   ours M2,2
4.00 2.85 2.09 1.01 1.49 1.79 1.64 4.00 7.75 7.65 7.12 6.87
3.00 2.13 1.81 0.86 1.13 1.60 1.21 3.00 8.55 8.09 7.71 7.09
(c) 2.50 2.03 1.88 0.82 0.99 1.46 1.03 2.50 9.56 8.59 8.13 7.45
2.00 2.18 2.18 0.85 0.88 1.29 0.91 2.00 12.21 9.52 8.71 8.18
1.00 3.60 3.58 1.91 1.17 0.94 1.06 1.00 23.09 18.35 12.20 15.90
0.50 4.51 4.62 4.06 2.69 1.06 2.40 0.50 22.15 24.12 21.23 23.59

more variation in performance is observed for M2,1 ; (4) the estimation accuracy
does not differ significantly for M1,1 and M1,2 , while, for M2,n , M2,2 seems to be
better; (5) as for the number of iterations of the refinement stage t, in general,
μ[t] is very similar for n = 1 and n = 2 and for both fuzzy metrics, (6) it grows
with the amount of noise in the data, as expected, and (7) higher values of κ
reduce t, indicating that outliers are nullified within the main loop and therefore
fewer refinement iterations are required.
Table 2 reports on the accuracy which has resulted for the ellipse fitting case.
On this occasion: (1) again the behaviour for M1,n is better than that of plain
RANSAC and MSAC in general, with higher accuracy for M1,1 , except for some
very particular cases, i.e. ω = 0.60 or σ = 2; (2) M2,n clearly behaves better
for n = 2, also outperforming RANSAC and MSAC; (3) as for the number of
refinement iterations, it is above what is necessary for straight lines, as expected
because of the higher number of model parameters to estimate; (4) it seems the
dependency of μ[t] on a correct selection of κ is also higher for this estimation
problem.

Table 2. Ellipse fitting case: estimation accuracy and number of iterations of the
refinement stage for (a) different outlier ratios ω, (b) different noise magnitudes σ and
(c) different settings for τI , θ = κ · σ. When they do not change, σ = 1, ω = 0.4 and
κ = 1.3 † . Lighter background means higher performance.

μ[ε] (%) μ[t]


ω   RANSAC   MSAC   ours M1,1   ours M1,2   ours M2,1   ours M2,2   |   ω   ours M1,1   ours M1,2   ours M2,1   ours M2,2

(a) 0.60 4.34 0.84 0.58 0.76 1.16 1.28 0.60 14.77 18.91 25.84 20.28
0.50 1.58 1.49 0.23 0.31 1.14 0.34 0.50 11.42 12.73 25.10 12.45
0.40 0.53 0.44 0.28 0.24 0.88 0.28 0.40 10.95 10.43 19.56 9.45
0.20 0.50 0.70 0.22 0.23 0.54 0.25 0.20 8.69 7.54 7.96 6.53

σ   RANSAC   MSAC   ours M1,1   ours M1,2   ours M2,1   ours M2,2   |   σ   ours M1,1   ours M1,2   ours M2,1   ours M2,2
2.00 3.41 1.77 0.71 1.65 1.18 1.59 2.00 16.02 25.00 17.59 25.12
(b) 1.00 0.53 0.44 0.28 0.24 0.88 0.28 1.00 10.95 10.43 19.56 9.45
0.50 0.55 0.34 0.20 0.20 0.20 0.16 0.50 9.04 7.98 7.60 7.07
0.25 0.36 0.22 0.15 0.10 0.06 0.10 0.25 7.81 6.57 5.72 5.66

κ   RANSAC   MSAC   ours M1,1   ours M1,2   ours M2,1   ours M2,2   |   κ   ours M1,1   ours M1,2   ours M2,1   ours M2,2
3.00 2.25 1.16 0.49 1.15 1.09 1.14 3.00 12.15 21.18 13.77 20.22
2.00 0.88 0.56 0.24 0.55 1.06 0.71 2.00 9.74 12.68 20.34 13.42
(c) 1.50 0.72 0.45 0.30 0.29 0.98 0.34 1.50 10.16 10.64 21.18 9.91
1.00 0.50 0.52 0.38 0.26 0.58 0.25 1.00 13.36 11.35 14.95 9.96
0.50 0.95 0.87 0.50 0.41 0.29 0.39 0.50 22.81 17.80 13.80 16.14
0.25 1.07 1.15 0.96 0.66 0.45 0.61 0.25 21.72 23.78 19.68 23.39

† Experimentally, we have determined that, because of the way the ellipse datasets
are generated, 99.7% of the samples are within ±1.3σ instead of ±3σ.

Figures 1 and 2 report on the best- and the worst-case estimations among
the full collection of datasets, for our approach and the two estimation problems
considered with regard to MSAC; that is to say, the best case is the case for
which FM-based RANSAC outperforms MSAC the most, and the worst case is
the case in which MSAC outperforms FM-based RANSAC the most. Besides,
we report on several percentiles of the respective ε for all three methods. In both
figures, the colour code of the left plots is as follows: the true/estimated model
is indicated as gray/black lines; regarding MSAC, inliers/outliers are indicated
as blue/red dots; as for FM-based RANSAC, φi(ε(xj; M_Θ̂); Φ) is coded in gray
scale.
As can be observed, for the straight-lines estimation case, data samples are
correctly scored by our approach, and the estimated and true models are almost
identical even for the worst case, i.e. for the worst estimation, the error is not
significant. Regarding ellipse estimation and the best case, we can see that the
FM-based RANSAC scores correctly the inliers and hence manages to find the
ellipse, while MSAC cannot identify it correctly. As for the worst case, all three
variants fail to locate correctly the ellipse, though they all produce estimates
of the same quality. The percentile plots included in Fig. 1 and 2 for both esti-
mation problems provide more insight on the global performance of all three
methods, showing that M1,n outperforms M2,n in general and that the FM-
based RANSAC leads to significantly lower estimation errors than MSAC.

[Fig. 1 graphics: left panels show the best-case (top) and worst-case (bottom) fits, with samples shaded from φi = 0 to φi = 1; per-panel errors are (top) MSAC ε = 25.62°, M1,2 ε = 1.34°, M2,2 ε = 1.46° and (bottom) MSAC ε = 0.67°, M1,2 ε = 2.99°, M2,2 ε = 3.49°; the right panel plots the estimation error (°) against the probability (%) percentiles 50–95 for MSAC, M1,2 and M2,2.]

Fig. 1. Straight line fitting case: (top) best and (bottom) worst estimations found in
500 datasets for FM-based RANSAC in comparison with MSAC; (right) percentiles of
ε. The true models MΘ∗ are (top) 0.15x − 0.99y + 0.00 = 0 and (bottom) 0.91x −
0.41y + 0.00 = 0. (σ, ω) = (1, 0.4) and κ = 3 in all cases. (Color figure online)

[Fig. 2 graphics: left panels show the best-case (top) and worst-case (bottom) fits, with samples shaded from φi = 0 to φi = 1; per-panel errors are (top) MSAC ε = 46.42%, M1,2 ε = 0.01%, M2,2 ε = 0.01% and (bottom) MSAC ε = 54.06%, M1,2 ε = 88.89%, M2,2 ε = 89.61%; the right panel plots the estimation error (%) against the probability (%) percentiles 50–95 for MSAC, M1,2 and M2,2.]

Fig. 2. Ellipse fitting case: (top) best and (bottom) worst estimations found in 200
datasets for FM-based RANSAC in comparison with MSAC; (right) percentiles of ε.
The true models MΘ∗ are (top) 0.062x² + 0.02y² − 1 = 0 and (bottom) 0.045x² −
0.034xy + 0.045y² − 1 = 0. (σ, ω) = (1, 0.4) and κ = 1.3 in all cases. (Color figure
online)

6 Conclusions
This work introduces two new fuzzy metrics (FM) which have been successfully
embedded within a revised version of RANSAC, thus proving useful for robust
model estimation. Further, this revised version of RANSAC includes an iteratively
re-weighted least-squares stage for model refinement making use of the
same FM. By means of any of the two FMs considered, we avoid discriminating
between inliers and outliers, and instead make use of a compatibility value with
regard to the current model/hypothesis, provided by the FM itself for each data
sample. These compatibility values are aggregated to score the model against the
other models generated inside the main RANSAC loop. Experimental results show
good performance for the two FMs when embedded in the FM-based RANSAC,
which actually outperforms classical RANSAC.

References
1. Camarena, J., Gregori, V., Morillas, S., Sapena, A.: Fast detection and removal
of impulsive noise using peer groups and fuzzy metrics. J. Vis. Commun. Image
Represent. 19(1), 20–29 (2008)
2. Chistyakov, V.: Modular metrics spaces, I: basic concepts. Nonlinear Anal. 72,
1–14 (2010)
3. Chistyakov, V.V.: Metric Modular Spaces: Theory and applications. Springer,
Cham (2015). https://doi.org/10.1007/978-3-319-25283-4
4. Choi, S., Kim, T., Yu, W.: Performance evaluation of RANSAC family. In: Pro-
ceedings British Machine Vision Conference, pp. 42.1–42.11 (2009)
5. Chum, O., Matas, J., Kittler, J.: Locally optimized RANSAC. In: Michaelis, B.,
Krell, G. (eds.) DAGM 2003. LNCS, vol. 2781, pp. 236–243. Springer, Heidelberg
(2003). https://doi.org/10.1007/978-3-540-45243-0 31
6. Fischler, M.A., Bolles, R.C.: Random sample consensus. Commun. ACM 24(6),
381–395 (1981)
7. Gregori, V., Miñana, J., Morillas, S.: Some questions in fuzzy metric spaces. Fuzzy
Sets Syst. 204, 71–85 (2012)
8. Huber, P.J., Ronchetti, E.M.: Robust Statistics. Wiley, New York (2011)
9. Kacprzyk, J., Pedrycz, W.: Handbook of Computational Intelligence. Springer,
Heidelberg (2015). https://doi.org/10.1007/978-3-662-43505-2
10. Klement, E., Mesiar, R., Pap, E.: Triangular norms. Springer, Dordrecht (2000).
https://doi.org/10.1007/978-94-015-9540-7
11. Kramosil, I., Michalek, J.: Fuzzy metrics and statistical metric spaces. Kybernetika
11(5), 334–336 (1975)
12. Lee, J., Kim, G.: Robust estimation of camera homography using Fuzzy RANSAC.
In: Gervasi, O., Gavrilova, M.L. (eds.) ICCSA 2007. LNCS, vol. 4705, pp. 992–1002.
Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-74472-6 81
13. Miñana, J., Valero, O.: On indistinguishability operators, fuzzy metrics and mod-
ular metrics. Axioms 6(4), 34 (2017)
14. Morillas, S., Gregori, V., Peris-Fajarnés, G.: New adaptative vector filter using
fuzzy metrics. J. Electronic Imaging 16(3), 033007:1–15 (2007)
15. Morillas, S., Gregori, V., Peris-Fajarnés, G., Latorre, P.: Isolating impulsive noise
color images by peer group techniques. Comput. Vis. Image Underst. 110(1), 102–
116 (2008)
16. Morillas, S., Gregori, V., Peris-Fajarnés, G., Sapena, A.: Local self-adaptative fuzzy
filter for impulsive noise removal in color image. Signal Process. 8(2), 390–398
(2008)
17. Oluknami, P.O.: On the Sample Consensus Robust Estimate Paradigm: Compre-
hensive Survey and Novel Algorithms with Applications (2016), M.Sc thesis, Uni-
versity of KwaZulu-Natal, Durban (South Africa)
18. Pedraza, T., Rodrı́guez-López, J., Valero, O.: On some results in fuzzy metric
spaces. Inf. Sci. (2021)
19. Raguram, R., Chum, O., Pollefeys, M., Matas, J., Frahm, J.M.: USAC: a universal
framework for random sample consensus. IEEE Trans. Pattern Anal. Mach. Intell.
35(8), 2022–2038 (2013)
20. Rousseeuw, P.J.: Least median of squares regression. J. Am. Stat. Assoc. 79(388),
871–880 (1984)
21. Sherwood, H.: Characterizing dominates on a family of triangular norms. Aequa-
tiones Math. 27, 255–273 (1984)
22. Torr, P., Zisserman, A.: MLESAC: a new robust estimator with application to
estimating image geometry. Comput. Vis. Image Underst. 78(1), 138–156 (2000)
23. Watanabe, T., Kamai, T., Ishimaru, T.: Robust estimation of camera homography
by Fuzzy RANSAC algorithm with reinforcement learning. J. Adv. Comput. Intell.
Intell. Inform. 19(6), 833–842 (2015)
A New Detector Based on Alpha Integration
Decision Fusion

Addisson Salazar(B) , Gonzalo Safont, Nancy Vargas, and Luis Vergara

Institute of Telecommunications and Multimedia Applications, Universitat Politècnica de


València, Camino de Vera s/n, 46022 Valencia, Spain
[email protected]

Abstract. This paper presents a new detection method based on alpha integration
decision fusion. The detector incorporates a regularization element in the cost
function. This element is considered a measure of the smoothness of the signal
in graph signal processing. We theorize that minimizing this term reduces
the dispersion of the fused statistics, thus improving the separation
between the two hypotheses of the detection. To highlight the performance of
alpha integration methods and regularization classification, two experiments are
presented. The first one uses simulated data, and the proposed method is
compared with alpha integration without regularization. The second one consists
of the detection of ultrasound pulses buried in high background noise. In this latter
experiment, three single classifiers were implemented: support vector machine,
quadratic discriminant analysis, and random forest. The results obtained from those
classifiers were fused by using the mean, standard alpha integration, and alpha
integration with regularization. In all experiments, the advantages of the proposed
method were demonstrated.

Keywords: Decision fusion · Machine learning · Alpha integration · Graph


signal processing · Regularization · Classification · Ultrasounds

1 Introduction
Recent advances in data acquisition and data processing methods have opened several
research fields that seek to integrate all the available data efficiently. Thus,
decision fusion is a broad and rapidly growing area of research. It
has been named in different ways, for instance, sensor data fusion, decision fusion, multimodal
fusion, heterogeneous sensor fusion, mixture of experts, classifier combiners,
etc. [1, 2].
Fusion methods can be roughly organized into three classes depending on the consid-
ered assumptions: early fusion (i.e., feature-based), late fusion (i.e., decision-based) and
hybrid fusion [3, 4]. Early fusion integrates the features extracted from several sources
or modes (e.g., by concatenation of their values). On the other hand, late fusion com-
bines the results obtained by multiple single classifiers. This combination can be made
by using the full set of scores or posterior probabilities provided by the single classifiers


or the decisions made by those classifiers. In the first case, it is called soft fusion, while
in the second case, it is called hard fusion. There are several advantages in the fusion
of multiple classifiers, such as improving classification performance, increasing confi-
dence, and enhancing reliability [5]. Finally, hybrid fusion considers both early and late
fusion.
There are many methods that have been proposed to perform late fusion [1–6]. In this
paper, we will focus on alpha integration, a recent technique that has been successfully
implemented in several applications. The parameters of alpha integration can be learned
by optimizing the least mean squared error (LMSE) or the minimum probability of
error (MPE) criterion [6–9]. Amari first proposed it for integrating multiple stochastic
models by minimizing their alpha divergence [10, 11] and lately it has been developed to
perform optimal integration of scores in binary classification (detection) problems [6].
Essentially, alpha integration is a family of integrators that encompasses many existing
combinations as special cases of the alpha parameter. For instance, setting α = −1 would
result in the average of the integrated measurements; α = 1 would result in the product
of the integrated measurements; and very high (low) values of α would result in the
minimum (maximum) rule [12, 13].
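The role of α can be made concrete with a small sketch of the weighted α-mean that underlies alpha integration, following Amari's definition; the function name, the clipping constant and the example scores below are our own illustrative choices, not taken from the paper.

```python
import numpy as np

def alpha_mean(scores, weights, alpha):
    """Weighted alpha-mean of positive scores (Amari's alpha-integration).

    alpha = -1 gives the weighted arithmetic mean, alpha = 1 the weighted
    geometric mean (product rule), and large positive (negative) alpha
    approaches the minimum (maximum) rule.
    """
    s = np.clip(np.asarray(scores, dtype=float), 1e-12, None)  # avoid log(0)
    w = np.asarray(weights, dtype=float)
    if np.isclose(alpha, 1.0):                 # f(z) = log z
        return float(np.exp(np.sum(w * np.log(s))))
    p = (1.0 - alpha) / 2.0                    # f(z) = z**p, f^-1(u) = u**(1/p)
    return float(np.sum(w * s ** p) ** (1.0 / p))

# Example: two detector scores fused with equal weights
for a in (-1, 1, 15, -15):
    print(a, round(alpha_mean([0.9, 0.4], [0.5, 0.5], a), 3))
```

Moving α from −1 towards 1 and then towards large positive (negative) values shifts the fused score from the arithmetic mean towards the geometric mean and then towards the minimum (maximum) of the inputs.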
In this paper, we will propose a soft detector fusion that extends the alpha integration
method incorporating a regularization element in the cost function to be optimized. This
element links to the “smoothness” concept employed in graph theory [14–16]. It has
been studied, for instance, in constraints for semi-supervised learning [17, 18] and signal
processing on graphs [19–23].
The rest of the paper is organized as follows. Section 2 contains a review of the alpha
integration procedure. Section 3 develops the graph regularization method proposed for
alpha integration. Section 4 comprises the results of the experiments using synthetic and
ultrasound pulse data. Finally, the conclusions of the paper are included in Sect. 5.

2 Alpha Integration
The proposed method is based on optimal fusion by alpha integration, a recent tech-
nique that has been successfully implemented in several applications. The objective is to
improve the classification results by fusion of the scores (posterior probabilities) from
multiple single classifiers (or detectors in case of two-class classification), i.e., optimally
integrating those scores into a unique score. The parameters of alpha integration were
optimized to satisfy the least mean squared error (LMSE) criterion.
We will assume the observations are denoted by $x^j = [x_1^j, x_2^j, \ldots, x_M^j]^T$, with M being
the number of features and j = 1…N denoting the jth observation. The classification
cases were two-class, therefore the labels y j can be either 0 or 1. The gradient descent
algorithm considered to train alpha integration [6, 7, 11] is shown in the following
procedure (Algorithm 1). Note that NTRAIN is the number of samples available for
training.

Algorithm 1. Alpha integration procedure.

Step 0: Given a set of scores from D classifiers for NTRAIN data samples, $s^j = [s_1^j \ldots s_D^j]^T$, i = 1…D, j = 1…NTRAIN, and their associated labels $y^j$; the learning rates; and the starting values for the weights w and the alpha value.
Step 1: Apply alpha integration on the input scores.
Step 2: Determine the value of the LMSE cost function.
Step 3: Determine the value of the derivatives of the LMSE cost function.
Step 4: Update the values of the weights $w_i$, i = 1…D, and alpha.
Step 5: Until convergence, repeat from Step 1.
Step 6: Return the final values of the weights w and alpha.
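Since the analytic expressions of the table are given in [6, 7, 11] and are not reproduced here, the following sketch only illustrates the overall loop of Algorithm 1, under two explicit assumptions of ours: the integration step uses the weighted α-mean defined above, and the gradients of the LMSE cost are approximated numerically instead of using the analytic derivatives.

```python
import numpy as np

def alpha_fuse(S, w, alpha, eps=1e-12):
    """Weighted alpha-mean of the D classifier scores in each column of S (D x N)."""
    S = np.clip(S, eps, 1.0)
    if np.isclose(alpha, 1.0):
        return np.exp(w @ np.log(S))
    p = (1.0 - alpha) / 2.0
    return (w @ S ** p) ** (1.0 / p)

def train_alpha_integration(S, y, lr_w=0.05, lr_a=0.05, iters=500, h=1e-5):
    """Schematic LMSE training loop (steps 1-5 of Algorithm 1) with numerical gradients."""
    D = S.shape[0]
    w, alpha = np.full(D, 1.0 / D), -1.0               # starting values
    cost = lambda w_, a_: np.mean((y - alpha_fuse(S, w_, a_)) ** 2)
    for _ in range(iters):
        g_w = np.array([(cost(w + h * np.eye(D)[k], alpha) - cost(w, alpha)) / h
                        for k in range(D)])
        g_a = (cost(w, alpha + h) - cost(w, alpha)) / h
        w = np.clip(w - lr_w * g_w, 0.0, None)          # keep weights nonnegative (our choice)
        alpha = alpha - lr_a * g_a
    return w, alpha
```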

3 Graph Regularization for Alpha Integration


We have D different detectors working on the same hypotheses H1 and H0 , everyone
contributes with a statistic si . The individual statistics are linearly combined to obtain a

fused statistic x:
$$x = \sum_{k=1}^{K} s_k \, w_k = \mathbf{s}^T \mathbf{w} \qquad (1)$$

where $\mathbf{s} = [s_1 \ldots s_K]^T$, $\mathbf{w} = [w_1 \ldots w_K]^T$ and T denotes transposition. This
statistic will have conditional probability densities p(x/H1) and p(x/H0), which can be
used to implement a likelihood ratio test (LRT) to yield a final decision.

Let us assume that we have a set of labelled samples $\{(\mathbf{s}^{(n)}, y^{(n)})\}$, $n = 1 \ldots N$, where
$\mathbf{s}^{(n)} = [s_1^{(n)} \ldots s_K^{(n)}]^T$ is the vector formed by the statistics provided by the detectors, and
$y^{(n)}$ is the corresponding known binary decision ($y^{(n)} = 1$ if H1 is true and $y^{(n)} = 0$ if H0
is true). Given some coefficient vector, the fused statistics corresponding to the labelled
samples will be $x^{(n)} = \mathbf{s}^{(n)T}\mathbf{w}$, $n = 1 \ldots N$. Let us define the vectors $\mathbf{x} = [x^{(1)} \ldots x^{(N)}]^T$
and $\mathbf{y} = [y^{(1)} \ldots y^{(N)}]^T$. The optimum coefficients $\mathbf{w} = [w_1 \ldots w_K]^T$ will be obtained
by minimizing a cost function
$$\mathbf{w}^{lin}_{opt} = \min_{\mathbf{w}} \left\{ \|\mathbf{y} - \mathbf{x}\|^2 + \beta\, \mathbf{x}^T L\, \mathbf{x} \right\} \qquad (2)$$

The first term of (2) is proportional to the mean-square error (MSE) between the
fused statistic x(n) and the true label y(n) . Notice that, ultimately, the performance of
the final detector will depend on p(x/H1 ) and p(x/H0 ). Thus, by minimizing the MSE,
p(x/H1 ) is shifted to 1, while p(x/H0 ) shifts to 0. The second term is a regularization of
the MSE derived from a graph model having Laplacian matrix L as detailed below. As
we will see, minimizing this term will reduce the dispersion of the fused statistics, thus
improving the separation between p(x/H1 ) and p(x/H0 ). The real and positive constant
β defines a trade-off between both terms.
Let us consider the weighted graph G = {V, E, A}, where V represents the set of N
vertices of the graph, E represents the set of edges connecting the vertices and A is
the adjacency matrix. The element $a_{nm}$ of A is the weight of the edge connecting vertices n and m. We assign $x^{(n)}$ to vertex n, n = 1…N, thus forming
a signal on the graph. On the other hand, let us connect with a weight 1 those vertices of G
corresponding to the same true hypothesis, while keeping disconnected those vertices
corresponding to different true hypotheses, i.e.,
$$a_{nm} = a_{mn} = \begin{cases} 1 & \text{if } y^{(n)} = y^{(m)} \\ 0 & \text{if } y^{(n)} \neq y^{(m)} \end{cases} \qquad (3)$$
The Laplacian matrix is defined as $L = D - A$, where D is a diagonal matrix having
diagonal elements $d_{nn} = \sum_{m=1}^{N} a_{nm}$. It is straightforward to demonstrate that
$$\mathbf{x}^T L\, \mathbf{x} = \sum_{n=1}^{N} \sum_{m=1}^{N} a_{nm} \left(x^{(n)} - x^{(m)}\right)^2 \qquad (4)$$

So the Laplacian quadratic form is normally considered a measure of the smoothness


of the signal x on graph G. In fact, smoothness 0 is obtained if and only if x is a constant
signal. Considering the definition (3), Eq. (4) can be expressed in the form:
$$\mathbf{x}^T L\, \mathbf{x} = \sum_{n=1}^{N} \; \sum_{m \,:\, y^{(m)} = y^{(n)}} \left(x^{(n)} - x^{(m)}\right)^2 \qquad (5)$$

Hence, by minimizing the second term in (2), the statistics corresponding to the
same true hypotheses will reduce its dispersion, thus reducing the overlapping between
p(x/H1 ) and p(x/H0 ).
Let us compute the optimum coefficient vector $\mathbf{w}^{lin}_{opt}$. We define the matrix
$S = [\mathbf{s}^{(1)} \ldots \mathbf{s}^{(N)}]$ so that we can express $\mathbf{x} = S^T \mathbf{w}$. Hence the cost function to be minimized
is given by
$$J = \|\mathbf{y} - S^T\mathbf{w}\|^2 + \beta\,\mathbf{w}^T S L S^T \mathbf{w} = \mathbf{y}^T\mathbf{y} - 2\,\mathbf{w}^T S\mathbf{y} + \mathbf{w}^T S(I + \beta L)S^T\mathbf{w} \qquad (6)$$

and the corresponding derivative


$$\frac{\partial J}{\partial \mathbf{w}} = -2S\mathbf{y} + 2S(I + \beta L)S^T\mathbf{w} \qquad (7)$$
Then we can solve by equating (7) to 0:
$$\mathbf{w}^{lin}_{opt} = \left[S(I + \beta L)S^T\right]^{-1} S\,\mathbf{y} \qquad (8)$$
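A minimal numerical sketch of this closed-form solution is shown below, assuming that S stacks the detector statistics column-wise as in the text; the helper names and the toy data are our own.

```python
import numpy as np

def label_laplacian(y):
    """Laplacian of the graph that connects samples with the same label (Eq. 3)."""
    y = np.asarray(y).reshape(-1, 1)
    A = (y == y.T).astype(float)
    np.fill_diagonal(A, 0.0)                      # no self-loops
    D = np.diag(A.sum(axis=1))
    return D - A

def fit_linear_fusion(S, y, beta):
    """Closed-form coefficients of Eq. (8): w = (S (I + beta*L) S^T)^{-1} S y."""
    S, y = np.asarray(S, float), np.asarray(y, float)
    L = label_laplacian(y)
    N = S.shape[1]
    return np.linalg.solve(S @ (np.eye(N) + beta * L) @ S.T, S @ y)

# Toy usage: 3 detectors, 200 labelled samples
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
S = y + 0.3 * rng.standard_normal((3, 200))       # noisy detector statistics
w = fit_linear_fusion(S, y, beta=0.1)
x = S.T @ w                                       # fused statistic per sample
```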

The parameter β defines the degree of importance given to the smoothness of the
fused statistics with respect to the MSE.
Optimum values for α and $\mathbf{w}$ may be estimated from a set of labelled samples
$\{(\mathbf{s}^{(n)}, y^{(n)})\}$, $n = 1 \ldots N$, by minimizing the cost function (2):
$$J(\alpha, \mathbf{w}) = \|\mathbf{y} - \mathbf{x}\|^2 + \beta\,\mathbf{x}^T L\, \mathbf{x} \qquad (9)$$

Let us call $E(\alpha, \mathbf{w}) = \|\mathbf{y} - \mathbf{x}(\alpha, \mathbf{w})\|^2$ and $R(\alpha, \mathbf{w}) = \beta\,\mathbf{x}^T(\alpha, \mathbf{w})\, L\, \mathbf{x}(\alpha, \mathbf{w})$. Clearly
$\frac{\partial J}{\partial \alpha} = \frac{\partial E}{\partial \alpha} + \frac{\partial R}{\partial \alpha}$ and $\frac{\partial J}{\partial \mathbf{w}} = \frac{\partial E}{\partial \mathbf{w}} + \frac{\partial R}{\partial \mathbf{w}}$. The derivatives $\frac{\partial E}{\partial \alpha}$ and $\frac{\partial E}{\partial \mathbf{w}}$ have been calculated in
[6].
Considering (4) and the alpha integration definitions, after some operations we can
obtain the following derivatives $\frac{\partial R}{\partial \alpha}$ and $\frac{\partial R}{\partial w_k}$, which allow using iterative gradient algorithms:
$$\frac{\partial R}{\partial \alpha} = \beta \sum_{n=1}^{N}\sum_{m=1}^{N} a_{nm} \cdot 2\left(x^{(n)} - x^{(m)}\right)\left(\frac{\partial x^{(n)}}{\partial \alpha} - \frac{\partial x^{(m)}}{\partial \alpha}\right) \qquad (10)$$
$$\frac{\partial R}{\partial w_k} = \beta \sum_{n=1}^{N}\sum_{m=1}^{N} a_{nm} \cdot 2\left(x^{(n)} - x^{(m)}\right)\left(\frac{\partial x^{(n)}}{\partial w_k} - \frac{\partial x^{(m)}}{\partial w_k}\right) \qquad (11)$$
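Both derivatives can be evaluated directly from arrays holding the fused statistics and their partial derivatives, as in the following sketch (the array layout and names are our own choices):

```python
import numpy as np

def reg_gradients(x, dx_dalpha, dx_dw, A, beta):
    """Gradients of the regularization term R = beta * x^T L x, per Eqs. (10)-(11).

    x:         (N,)   fused statistics of the labelled samples
    dx_dalpha: (N,)   derivatives of each x^(n) w.r.t. alpha
    dx_dw:     (N, K) derivatives of each x^(n) w.r.t. each weight w_k
    A:         (N, N) adjacency matrix of the label graph (Eq. 3)
    """
    diff = x[:, None] - x[None, :]                  # x^(n) - x^(m)
    dR_dalpha = beta * np.sum(A * 2.0 * diff *
                              (dx_dalpha[:, None] - dx_dalpha[None, :]))
    dR_dw = np.empty(dx_dw.shape[1])
    for k in range(dx_dw.shape[1]):
        dk = dx_dw[:, k]
        dR_dw[k] = beta * np.sum(A * 2.0 * diff * (dk[:, None] - dk[None, :]))
    return dR_dalpha, dR_dw
```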

4 Results
4.1 Simulated Experiment
We have designed a Monte Carlo experiment (random sampling) to showcase the per-
formance of alpha integration methods and regularization classification. Each iteration
of the experiment simulated a case with two types of targets and two expert classifiers,
each keyed to a particular type of target. Each expert detected its corresponding targets
with 100% accuracy, but it had no particular performance on the other type of targets.
The data of each iteration were simulated as follows:

• Each iteration comprised 1024 samples: 924 noisy samples and 100 targets (50 of
type one and 50 of type two). Note that this is an imbalanced case, in order to better
approximate the experiment on real data.
• The scores of each expert were set to 1 for the targets it was specialized in, and the
rest of the scores were simulated as uniform noise in the range [0, 1/C], with C > 0. C
is a measure of the certainty in the scores provided by each expert: a high C signifies a
low rate of false alarms. C was the same for both experts. After generation, scores
higher than one were set to 1 (a sketch of this score generation appears after this list).
• Classification performance was measured using the balanced accuracy (BA) defined
as the average of the accuracies for each class, and considering binary classification
noise/target. In each case, the classification threshold was optimized as the optimum
point of the receiver operating characteristic (ROC) curve. This optimization is
performed separately for each classifier and for each iteration.
• We considered the following methods: the scores given by each expert (Score1 and
Score2); the mean and the weighted mean of the scores; and alpha integration using
two cost functions (least mean squared error and area under the curve), with and
without regularization. The regularization values were cross-validated to maximize
BA, and were βLMSE = 0.21 and βAUC = 5.28 · 10−7 .
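A rough sketch of this simulation and of the balanced-accuracy evaluation is given below; the Youden-index choice of the optimum ROC point and all identifiers are assumptions of ours, since the text does not fix them.

```python
import numpy as np
from sklearn.metrics import roc_curve

def simulate_scores(C, rng, n_noise=924, n_type1=50, n_type2=50):
    """One Monte Carlo iteration: labels and the two expert scores."""
    y = np.concatenate([np.zeros(n_noise), np.ones(n_type1 + n_type2)])
    t1 = np.arange(n_noise, n_noise + n_type1)            # targets of type one
    t2 = np.arange(n_noise + n_type1, len(y))              # targets of type two
    s1 = rng.uniform(0, 1.0 / C, size=len(y)); s1[t1] = 1.0
    s2 = rng.uniform(0, 1.0 / C, size=len(y)); s2[t2] = 1.0
    return y, np.minimum(s1, 1.0), np.minimum(s2, 1.0)

def balanced_accuracy(y, score, thr):
    pred = score >= thr
    return 0.5 * (np.mean(pred[y == 1]) + np.mean(~pred[y == 0]))

rng = np.random.default_rng(1)
y, s1, s2 = simulate_scores(C=2.0, rng=rng)
fused = 0.5 * (s1 + s2)                                    # plain mean fusion
fpr, tpr, thr = roc_curve(y, fused)
best = thr[np.argmax(tpr - fpr)]                           # Youden-style optimum point
print(balanced_accuracy(y, fused, best))
```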

Figure 1 shows the average results for 100 iterations of the Monte Carlo experiment.
The standard deviations were omitted for clarity; in all cases, they were well under
1%. All methods experienced increased performance with rising C, with some methods
peaking at C = 1 owing to the nature of the experiment. It can be seen that alpha
integration methods always yielded the best result, largely improving over classical fusion
methods. In fact, classical fusion methods were not able to improve the results of the
experts for highly uncertain cases (low C). The low BA value for the non-regularized
alpha integration method (LMSE) was due to the imbalanced classes, but was more than
compensated for by the regularization term, which brought it well above classical fusion
methods. The issue of imbalance between different categories of data could be alleviated
by balancing the a priori probability of the classes using undersampling or oversampling
methods, see for instance [24–27].
We can determine the effect of the performance indicator and the threshold from
Fig. 2, where we show the accuracy and balanced accuracy, before and after optimizing
the threshold. Note that the scales are different for the accuracy and for the balanced
accuracy. It can be seen that the basic alpha integration method decreased performance
after optimization. Given the differences between accuracy and balanced accuracy, it is
likely that the basic alpha integration method was being overly influenced by the noise
class, which is far more likely than the targets. The rest of the methods did not seem to
be as influenced by class imbalance, and the regularization term compensated for that
trend. In general, the optimization of the threshold improved the classification results
and produced results that are more relevant for our purposes.
Before optimizing the threshold, the best balanced accuracy was yielded by the mean;
afterwards, alpha integration methods always yielded the best balanced accuracy.

Fig. 1. Balanced accuracy for the simulated experiment, where SNR = 10 log(C).

4.2 Ultrasound Pulse Detection


This experiment corresponded to a case of flaw detection in materials using ultrasounds
(US). When inspected with US, any sufficiently large internal inhomogeneity in the
material produces a reflection that is captured by the recording device. These reflections
induce a change in the recorded signals that we attempted to detect using the proposed
method.
In most cases, it is assumed that the contribution of the defect is effectively indepen-
dent from that of the main contribution (e.g., the infrared image of the solar panel or the
backscattering of the sonic signal).
In this experiment, the data consisted of ultrasound targets buried in background
noise. The targets were modeled using Gaussian-modulated tones with random initial
phase, i.e., $x(t) = A \cdot \sin(2\pi f_c t + \phi_0) \cdot \exp\!\left(-\left(2(t-\tau)/T\right)^{\alpha}\right)$, where: A is the peak
amplitude; $f_c$, $\tau$ and $T$ are respectively the central frequency, time center, and duration
of the tone; and $\alpha$ is an even number that determines the shape of the envelope of the
pulse. We used $f_c$ = 20 kHz, T = 1 ms, α = 4, τ = 20, 100 or 150 ms, $\phi_0$ randomly
drawn from a uniform distribution in the range [0, 2π), and A calculated to obtain a
peak signal-to-noise ratio (PSNR) of 3 dB. The background noise dataset was modeled
by the following K distribution:
$$p(x) = \frac{2}{x\,\Gamma(L)\,\Gamma(\nu)} \left(\frac{L\nu x}{\mu}\right)^{\frac{L+\nu}{2}} K_{\nu-L}\!\left(2\sqrt{\frac{L\nu x}{\mu}}\right) \qquad (12)$$

This distribution can describe the statistics of the envelope of the backscattered
ultrasonic echo from a scattering medium [28]. The shape parameters were set to μ =
L = 1 and ν = 10. Finally, it was assumed that the data were sampled at 50 kHz. For
each iteration of the experiment, we generated four channels filled with background

Fig. 2. Comparison of accuracy (left) and balanced accuracy (right), before (up) and after (down)
setting the optimal threshold.

noise during 200 ms, generating a total of 1000 samples. This noise was obtained by
multiplying four independent K-distributed noise channels by a [4 × 4] mixture matrix.
Then, three ultrasound targets were buried into the noise at different positions from the
start of the signal. An example of one channel of the generated data is shown in Fig. 3.
Note that the targets (areas marked in different colors) are hardly distinguishable from
the background noise, which shows the difficulty of the problem. To further showcase
the results obtained, the experiments were repeated four times with different values of
PSNR: 3, 6, 10, and 15 dB. These values were selected in order to simulate difficult
detection cases, where targets are difficult to distinguish from the surrounding noise.
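The following sketch generates one channel of such data. The gamma-gamma compound used to draw K-distributed samples and the fixed pulse amplitude are assumptions of ours (the paper sets the amplitude from the target PSNR instead).

```python
import numpy as np

fs = 50_000                         # sampling frequency (Hz)
t = np.arange(0, 0.2, 1 / fs)       # 200 ms of signal

def gauss_tone(t, A, fc=20e3, tau=0.02, T=1e-3, a=4, phi0=0.0):
    """Gaussian-modulated tone x(t) = A sin(2*pi*fc*t + phi0) exp(-(2(t-tau)/T)^a)."""
    return A * np.sin(2 * np.pi * fc * t + phi0) * np.exp(-((2 * (t - tau) / T) ** a))

def k_noise(size, L=1.0, nu=10.0, mu=1.0, rng=None):
    """K-distributed samples via a gamma-gamma compound representation
    (an assumption here, not a detail stated in the paper)."""
    rng = rng or np.random.default_rng()
    texture = rng.gamma(shape=nu, scale=mu / nu, size=size)   # slowly varying power
    speckle = rng.gamma(shape=L, scale=1.0 / L, size=size)    # local fluctuation
    return texture * speckle

rng = np.random.default_rng(2)
noise = k_noise(t.shape, rng=rng)
pulse = gauss_tone(t, A=0.5, tau=0.02, phi0=rng.uniform(0, 2 * np.pi))  # A chosen arbitrarily
channel = noise + pulse             # one channel with a single buried target
```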

The methods implemented for comparison were the following single classifiers: support
vector machine (SVM), quadratic discriminant analysis (QDA), and random forest (RDF).
The results obtained from those classifiers were fused by using the mean,
standard alpha integration (α-LMSE) and alpha integration with regularization (α-LMSE
(REG)). For each value of PSNR, the average accuracy and kappa index were estimated.
Figure 4 shows the results for all the ultrasound detection experiments.

Fig. 3. Example of the simulated ultrasound data with ultrasonic pulses (marked in color) buried
in background noise. The pulses had a PSNR of 6 dB.

Figure 4 shows the performance, measured by accuracy and kappa index, of the
proposed alpha integration with regularization method (α-LMSE (REG)), which outperforms
all the single classifiers (SVM, QDA, RDF) and also the fusion
methods (mean and α-LMSE). The results of SVM and QDA are the lowest for every
PSNR value. The results of the mean are similar to those of RDF, except for the lowest
PSNR value, where the mean performs worse than RDF. The results of RDF, although
always below those of the proposed method, are closer to the α-integration fusion methods
for the lowest and highest PSNR values (3 and 15 dB); for the middle PSNR values
(6 and 10 dB), the differences are up to 1.3% in accuracy and 6% in kappa. This latter
index better reflects the real performance difference between those methods. Note that those
small detection differences are significant due to the difficulty of the detection problem shown in Fig. 3.

Fig. 4. Results of ultrasound pulse detection: accuracy and kappa for different SNR ratios.

5 Conclusion

A new detection method based on alpha integration has been proposed. The proposed
method extends the alpha integration method incorporating a regularization element in
the cost function to be optimized. This element is related to the “smoothness” concept
employed in graph theory. The performance of the proposed method was compared with
several classifiers to solve two detection problems: synthetic data and ultrasound pulse
detection. Three single classifiers (SVM, QDA, and RDF) and two fusion methods (the
mean and the standard alpha integration) were implemented. The results measured using
accuracy and kappa indexes demonstrate the advantages of the proposed method. There

are several open lines of research, for instance, extending the proposed method to the
multi-class problem, dynamic modelling, and approaching other real data applications,
see for instance, [29, 30].

Acknowledgement. This work was supported by Spanish Administration and European Union
under grant TEC2017-84743-P.

References
1. Atrey, P., Hossain, M., El Saddik, A., Kankanhalli, M.: Multimodal fusion for multimedia
analysis: a survey. Multimedia Syst. 16, 345–379 (2010)
2. Ross, A., Nandakumar, K.: Fusion, score-level. In: Li, S.Z., Jain, A. (eds.) Encyclopedia of
Biometrics, pp. 611–616. Springer, Boston (2009). https://doi.org/10.1007/978-0-387-73003-5_158
3. Yuksel, S.E., Wilson, J.N., Gader, P.D.: Twenty years of mixture of experts. IEEE Trans.
Neural Netw. Learn. Syst. 23(4), 1177–1193 (2012)
4. Khaleghi, B., Khamis, A., Karray, F.O., Razavi, S.N.: Multisensor data fusion: a review of
the state-of-the-art. Inf. Fusion 14(1), 28–44 (2013)
5. Mohandes, M., Deriche, M., Aliyu, S.: Classifiers combination techniques: a comprehensive
review. IEEE Access 6, 19626–19639 (2018)
6. Soriano, A., Vergara, L., Ahmed, B., Salazar, A.: Fusion of scores in a detection context based
on alpha-integration. Neural Comput. 27(9), 1983–2010 (2015)
7. Safont, G., Salazar, A., Vergara, L.: Multiclass alpha integration of scores from multiple
classifiers. Neural Comput. 31(4), 806–825 (2019)
8. Choi, H., Choi, S., Katake, A., Choe, Y.: Learning α-integration with partially-labeled data.
In: IEEE International Conference on Acoustic, Speech, and Signal Processing (ICASSP)
Proceedings, pp. 2058–2061. IEEE, Dallas (2010)
9. Choi, H., Choi, S., Choe, Y.: Parameter learning for alpha integration. Neural Comput. 25,
1585–1604 (2013)
10. Amari, S.: Integration of stochastic models by minimizing α-divergence. Neural Comput. 19,
2780–2796 (2007)
11. Amari, S.: Information Geometry and its Applications. Springer, Tokyo (2016). https://doi.org/10.1007/978-4-431-55978-8
12. Safont, G., Salazar, A., Vergara, L.: Vector score alpha integration for classifier late fusion.
Pattern Recogn. Lett. 136, 48–55 (2020)
13. Salazar, A., Safont, G., Vergara, L., Vidal, E.: Pattern recognition techniques for provenance
classification of archaeological ceramics using ultrasounds. Pattern Recogn. Lett. 135, 441–
450 (2020)
14. Spielman, D.: Spectral graph theory (Ch. 16). In: Naumann, U., Schnek, O. (eds.) Com-
binatorial Scientific Computing, pp. 1–23. Chapman and Hall/CRC Press, Boca Raton
(2012)
15. Merris, R.: Laplacian matrices of a graph: a survey. Linear Alg. Appl. 197, 143–176 (1994)
16. Zhang, X.D.: The Laplacian eigenvalues of graphs: a survey. In: Ling, G.D. (ed.) Linear
Algebra Research Advances, pp. 201–228. Nova Science Publishers Inc., New York (2007)
17. Zhou, D., Schölkopf, B.: A regularization framework for learning from graph data. In: ICML
Workshop on Statistical Relational Learning and Its Connections to Other Fields, Banff,
Alberta, Canada, pp. 132–137 (2004)

18. Belkin, M., Niyogi, P., Sindhwani, V.: Manifold regularization: a geometric framework for
learning from labeled and unlabeled examples. J. Mach. Learn. Res. 7, 2399–2434 (2006)
19. Shuman, D.I., Narang, S.K., Frossard, P., Ortega, A., Vandergheynst, P.: The emerging field
of signal processing on graphs: extending high-dimensional data analysis to networks and
other irregular domains. IEEE Sig. Process. Mag. 30, 83–98 (2013)
20. Sandryhaila, A., Moura, J.M.F.: Discrete signal processing on graphs: frequency analysis.
IEEE Trans. Sig. Process. 62, 3042–3054 (2014)
21. Belda, J., Vergara, L., Salazar, A., Safont, G.: Estimating the Laplacian matrix of Gaussian
mixtures for signal processing on graphs. Sig. Process. 148, 241–249 (2018)
22. Belda, J., Vergara, L., Safont, G., Salazar, A.: Computing the partial correlation of ICA models
for non-Gaussian graph signal processing. Entropy 21(1), 1–16 (2019). Article no. 22
23. Belda, J., Vergara, L., Safont, G., Salazar, A., Parcheta, Z.: A new surrogating algorithm by
the complex graph Fourier transform (CGFT). Entropy 21(8), 1–18 (2019). Article no. 759
24. Salazar, A., Safont, G., Vergara, L.: Semi-supervised learning for imbalanced classification
of credit card transaction. In: International Joint Conference on Neural Networks (IJCNN)
Proceedings, Article No. 8489755, pp. 4976–4982. IEEE, Rio de Janeiro (2018)
25. Salazar, A., Safont, G., Vergara, L.: Surrogate techniques for testing fraud detection algorithms
in credit card operations. In: International Carnahan Conference on Security Technology
(ICCST) Proceedings, Article No. 6986987, pp. 124–129. IEEE, Rome (2014)
26. Izonin, I., Tkachenko, R., Shakhovska, N., Lotoshynska, N.: The additive input-doubling
method based on the SVR with nonlinear kernels: small data approach. Symmetry 13(4),
1–28 (2021). Article no. 612
27. Salazar, A., Vergara, L., Safont, G.: Generative adversarial networks and Markov random
fields for oversampling very small training sets. Expert Syst. Appl. 163, 1–12 (2021). Article
no. 113819
28. Eltoft, T.: Modeling the amplitude statistics of ultrasonic images. IEEE Trans. Med. Imaging
25(2), (2006)
29. Salazar, A., Igual, J., Safont, G., Vergara, L., Vidal, A.: Image applications of agglomera-
tive clustering using mixtures of non-Gaussian distributions. In: International Conference on
Computational Science and Computational Intelligence (CSCI) Proceedings, pp. 459–463.
IEEE, Las Vegas (2015). Article no. 7424136
30. Safont, G., Salazar, A., Vergara, L., Gomez, E., Villanueva, V.: Multichannel dynamic
modeling of non-Gaussian mixtures. Pattern Recogn. 93, 312–323 (2019)
A Safe and Effective Tuning Technique for
Similarity-Based Fuzzy Logic Programs

Ginés Moreno(B) and José A. Riaza

Department of Computing Systems, UCLM, 02071 Albacete, Spain


{Gines.Moreno,JoseAntonio.Riaza}@uclm.es

Abstract. We have recently designed a symbolic extension of FASILL


(acronym of “Fuzzy Aggregators and Similarity Into a Logic Language”),
where some truth degrees, similarity annotations and fuzzy connectives
can be left unknown, so that the user can easily see the impact of their
possible values at execution time. By extending our previous results in
the development of tuning techniques not dealing yet with similarity rela-
tions, in this work we automatically tune FASILL programs by appropri-
ately substituting the symbolic constants appearing on their rules and
similarity relations with the concrete values that best satisfy the user’s
preferences. The approach has been proved correct under some safe con-
ditions and an online tool is provided to check its effectiveness.

Keywords: Fuzzy logic programs · Similarity · Tuning · Symbolic


execution

1 Introduction
In essence, Bousi∼Prolog [2] and MALP [3] represent two different ways for intro-
ducing fuzzy constructs into the logic language Prolog by embedding similar-
ity relations or using fuzzy connectives for dealing with truth degrees beyond
{true, false}, respectively. We have recently combined both approaches in the
design of FASILL [1], whose symbolic extension (inspired by our initial experi-
ences with MALP [4]) is called sFASILL [7].
After summarizing in Sect. 2 both the syntax and the operational semantics
of sFASILL, in Sect. 3 we detail our empowered tuning technique coping now with
similarity relations. Although there exist other approaches (somehow connected
with our preliminary work [4]) which are able to tune fuzzy operators [9–11],
none of them manage similarity relations as our current technique does.
The main goals achieved in this paper are:
1. Firstly, in Definition 7 we provide a safe characterization of the set of concrete
truth degrees, similarities and fuzzy connectives that can be used for replacing
symbolic constants on a given sFASILL program.
This work has been partially supported by the EU (FEDER), the State Research
Agency (AEI) of the Spanish Ministry of Science and Innovation under grant PID2019-
104735RB-C42 (SAFER).

Fig. 1. The online tool loading an sFASILL program.

2. Next, Theorem 1 establishes a crucial property for enabling the application of


a tuning process by proving the correspondences between the results achieved
at execution time before and after replacing symbolic constants on a symbolic
program or on a set of symbolic computed answers.
3. In Sect. 3 we describe how the system is able to automatically produce a set of
preconditions (although users can also introduce their own constraints) which
prevent the generation of unsafe symbolic substitutions at tuning time.
4. All these facts have been taken into account in Definition 8, where we detail
the algorithm for tuning sFASILL programs which makes use of thresholding
techniques for improving its efficiency.
5. Last but not least, we provide an online implementation which is freely avail-
able at https://dectau.uclm.es/fasill/sandbox (see Fig. 1).

Finally, in Sect. 4 we conclude and propose some lines of future work.



2 The FASILL Language and Its Symbolic Extension


In this work, given a complete lattice L, we consider a first order language
LL built upon a signature ΣL , that contains the elements of a countably infi-
nite set of variables V, function and predicate symbols (denoted by F and Π,
respectively) with an associated arity—usually expressed as pairs f /n or p/n,
respectively, where n represents its arity—, and the truth degree literals ΣLT and
connectives ΣLC from L. Therefore, a well-formed formula in LL can be either:

– A value v ∈ ΣLT , which will be interpreted as itself, i.e., as the truth degree
v ∈ L.
– p(t1 , . . . , tn ), if t1 , . . . , tn are terms over V ∪ F and p/n is an n-ary predicate.
This formula is called atomic (atom, for short).
– ς(e1 , . . . , en ), if e1 , . . . , en are well-formed formulas and ς is an n-ary connec-
tive with truth function [[ς]] : Ln → L.

Definition 1 (Complete Lattice). A complete lattice is a partially ordered


set (L, ≤) such that every subset S of L has infimum and supremum elements.
Then, it is a bounded lattice, i.e., it has bottom and top elements, denoted by ⊥
and ⊤, respectively.

Example 1. In this paper we use the lattice ([0, 1], ≤), where ≤ is the usual
ordering relation on real numbers, and three sets of conjunctions/disjunctions
corresponding to the fuzzy logics of Gödel, Łukasiewicz and Product (with dif-
ferent capabilities for modelling pessimistic, optimistic and realistic scenarios).
It is possible to include also other fuzzy connectives (aggregators) like the arith-
metical average @aver(x, y) ≜ (x + y)/2 or the linguistic modifier @very(x) ≜ x².
The central box in Fig. 1 shows some Prolog clauses modeling this lattice inside
our tool.

Definition 2 (Similarity Relation). Given a domain U and a lattice L with


a fixed t-norm ∧, a similarity relation R is a fuzzy binary relation on U, that is,
a fuzzy subset on U × U (namely, a mapping R : U × U → L) fulfilling the follow-
ing properties: reflexive ∀x ∈ U, R(x, x) = ⊤, symmetric ∀x, y ∈ U, R(x, y) =
R(y, x), and transitive ∀x, y, z ∈ U, R(x, z) ≥ R(x, y) ∧ R(y, z).
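For a finite domain, the three properties of Definition 2 can be checked mechanically. The following sketch is a possible rendering in Python, with Gödel's minimum as the default t-norm (an assumption of ours):

```python
import numpy as np

def is_similarity(R, tnorm=np.minimum, top=1.0, tol=1e-9):
    """Check reflexivity, symmetry and tnorm-transitivity of a finite relation R.

    R is an n x n matrix of degrees from the lattice; tnorm must broadcast
    element-wise (as np.minimum does).
    """
    R = np.asarray(R, dtype=float)
    reflexive = np.allclose(np.diag(R), top)
    symmetric = np.allclose(R, R.T)
    # R(x, z) >= tnorm(R(x, y), R(y, z)) for every intermediate y
    composed = np.max(tnorm(R[:, :, None], R[None, :, :]), axis=1)
    transitive = np.all(R >= composed - tol)
    return reflexive and symmetric and transitive
```

The choice of t-norm matters here: a finite relation may be transitive with respect to the Łukasiewicz t-norm while failing transitivity for the minimum.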

The fuzzy logic language FASILL relies on complete lattices and similarity
relations [1]. We are now ready for summarizing its symbolic extension where, in
essence, we allow some undefined values (truth degrees) and connectives in pro-
gram rules as well as in the associated similarity relation, so that these elements
can be systematically computed afterwards. The symbolic extension of FASILL
we initially presented in [7] is called sFASILL.
Given a complete lattice L, we consider an augmented signature Σ#_L producing an augmented language L#_L ⊇ L_L which may also include a number of
symbolic values and symbolic connectives which do not belong to L. Symbolic
objects are usually denoted as o# with a superscript # and, in our tool, their
identifiers always start with #. An L#-expression is now a well-formed formula

of L#_L which is composed of values and connectives from L as well as of symbolic
values and connectives. We let exp#_L denote the set of all L#-expressions
in L#_L. Given an L#-expression E, [[E]] refers to the new L#-expression obtained
after evaluating as much as possible the connectives in E. Particularly, if E does
not contain any symbolic value or connective, then [[E]] = v ∈ L.


not contain any symbolic value or connective, then [[E]] = v ∈ L.
In the following we consider symbolic substitutions that are mappings from
symbolic values and connectives to expressions over ΣLT ∪ ΣLC . We let sym(o# )
denote the symbolic values and connectives in o# . Given a symbolic substitution
Θ for sym(o# ), we denote by o# Θ the object that results from o# by replacing
every symbolic symbol e# by e# Θ.
Definition 3 (Symbolic Similarity Relation). Given a domain U and a
lattice L with a fixed —possibly symbolic— t-norm ∧, a symbolic similarity rela-
tion is a mapping R# : U × U → exp# L such that, for any symbolic substitution
Θ for sym(R# ), the result of fully evaluating all L-expressions in R# Θ, say
[[R# Θ]], is a similarity relation.
Definition 4 (Symbolic Rule and Symbolic Program). Let L be a com-
plete lattice. A symbolic rule over L is a formula A ← B, where the following
conditions hold:
– A is an atomic formula of LL (the head of the rule);
– ← is an implication from L or a symbolic implication;
– B (the body of the rule) is a symbolic goal, i.e., a well-formed formula of L#
L;

A sFASILL program is a tuple P # = Π # , R# , L where Π # is a set of symbolic


rules, R# is a symbolic similarity relation between the elements of the signature
Σ of Π # , and L is a complete lattice.
Example 2. Consider a symbolic sFASILL program P# = ⟨Π#, R#, L⟩ based on
lattice L = ([0, 1], ≤), where Π # is the set of symbolic rules in the box at the
top of Fig. 1 and the symbolic similarity relation R# is represented as a graph1
on U = {vanguardist, elegant, modern, metro, taxi, bus} as follows:

[Figure: graph representation of the symbolic similarity relation R# over U. Its edges are labelled with (possibly symbolic) similarity degrees; for instance, R#(modern, vanguardist) = 0.9, R#(taxi, bus) = s1# and R#(taxi, metro) = 0.5 &s2# s1#, while further edges carry closure expressions such as sup{0.5, s1# &s2# (s1# &s2# 0.5)} and sup{s0#, (s0# &s2# 0.9) &s2# 0.9}.]

¹ A matrix can also be used to represent this concept, as we will see in Sect. 3.

This symbolic similarity relation R# has been obtained after applying the closure
algorithm (with a symbolic t-norm & # s2 ) we initially introduced in [7] on the set
of symbolic similarity equations in the last box of Fig. 1.
As a logic language, sFASILL inherits the concepts of substitution, unifier and most
general unifier (mgu) from pure logic programming, but extending some of them
in order to cope with similarities, as Bousi∼Prolog [2] does, where the concept of
most general unifier is replaced by the one of weak most general unifier (w.m.g.u.).
One step beyond, in [7] we extended again this notion by referring to symbolic
weak most general unifiers (s.w.m.g.u.) and a symbolic weak unification algorithm
was introduced to compute them. Roughly speaking, the symbolic weak unification
algorithm states that two expressions (i.e., terms or atomic formulas) f (t1 , . . . , tn )
and g(s1 , . . . , sn ) weakly unify if the root symbols f and g are close with a certain
—possibly symbolic— degree (i.e. R# (f, g) = r = ⊥) and each of their arguments
ti and si weakly unify. Therefore, there is a symbolic weak unifier for two expres-
sions even if the symbols at their roots are not syntactically equal (f ≡ g).
More technically, the symbolic weak unification algorithm can be seen as
an reformulation/extension of the ones appearing in [12] (since now we man-
age arbitrary complete lattices) and [1,2] (because now we deal with symbolic
similarity relations). In essence, the symbolic weak most general unifier of two
expressions E1 and E2 , say wmgu# (E1 , E2 ) = σ, E, is the simplest symbolic sub-
stitution σ of E1 and E2 together with its symbolic unification degree E verifying
that E = R̂(E1 σ, E2 σ).
Example 3. Given the complete lattice L = ([0, 1], ≤) of Example 1 and
the symbolic similarity relation R# of Example 2, we can use the symbolic
t-norm &s2# for computing the following two symbolic weak most general
unifiers: wmgu#(modern(taxi), vanguardist(bus)) = ⟨{}, 0.9 &s2# s1#⟩ and
wmgu#(close_to(X, taxi), close_to(ritz, bus)) = ⟨{X/ritz}, s1#⟩.
In order to describe the procedural semantics of the sFASILL language, in the
following we denote by C[A] a formula where A is a sub-expression (usually an
atom) which occurs in the –possibly empty– context C[] whereas C[A/A ] means
the replacement of A by A in the context C[]. Moreover, Var(s) denotes the set
of distinct variables occurring in the syntactic object s and θ[Var(s)] refers to
the substitution obtained from θ by restricting its domain to Var(s). In the next
definition, we always consider that A is the selected atom in a goal Q, L is the
complete lattice associated to Π # and, as usual, rules are renamed apart:
Definition 5 (Computational Step). Let Q be a goal and σ a substitution.
The pair ⟨Q; σ⟩ is a state. Given a symbolic program ⟨Π#, R#, L⟩ and a (possibly
symbolic) t-norm ∧ in L, a computation is formalized as a state transition
system, whose transition relation → is the smallest relation satisfying these rules:

1) Successful step (denoted as →SS): if there is a rule A′ ← B ∈ Π# with wmgu#(A, A′) = ⟨θ, E⟩ and E ≠ ⊥, then ⟨Q[A], σ⟩ →SS ⟨Q[A/E ∧ B]θ, σθ⟩.

2) Failure step (denoted as →FS): if there is no rule A′ ← B ∈ Π# such that wmgu#(A, A′) = ⟨θ, E⟩ with E ≠ ⊥, then ⟨Q[A], σ⟩ →FS ⟨Q[A/⊥], σ⟩.

3) Interpretive step (denoted as →IS): ⟨Q; σ⟩ →IS ⟨[[Q]]; σ⟩, where Q is an L#-expression.
Definition 6 (Derivation and Symbolic Fuzzy Computed Answer). A
derivation is a sequence of arbitrary length ⟨Q; id⟩ →* ⟨Q′; σ⟩. When Q′ is an
L#-expression that cannot be further reduced, ⟨Q′; σ′⟩, where σ′ = σ[Var(Q)], is
called a symbolic fuzzy computed answer (sfca). Also, if Q′ is a concrete value
of L, we say that ⟨Q′; σ′⟩ is a fuzzy computed answer (fca).
The following example illustrates the operational semantics of sFASILL.
Example 4. Let P# = ⟨Π#, R#, L⟩ be the program from Example 2. It is possible
to perform the following derivation for P# and goal Q = good_hotel(x),
obtaining the sfca ⟨Q1; σ1⟩ = ⟨@s4#(s3#, 0), {x/ritz}⟩:

⟨good_hotel(x), id⟩  →SS(R4)
⟨@s4#(elegant(x1), @very(close(x1, metro))), {x/x1}⟩  →SS(R2)
⟨@s4#(s3#, @very(close(ritz, metro))), {x/ritz}⟩  →FS
⟨@s4#(s3#, @very(0)), {x/ritz}⟩  →IS
⟨@s4#(s3#, 0), {x/ritz}⟩

Apart from this derivation, there exists a second one ending with the alternative
sfca ⟨Q2; σ2⟩ = ⟨@s4#((s0# &s2# 0.9) &godel 0.9, @very((0.5 &s2# s1#) &godel 0.7)),
{x/hydropolis}⟩ associated to the same goal (observe the presence of symbolic
constants coming from the symbolic similarity relation, which contrasts with our
previous work [4]):

⟨good_hotel(x), id⟩  →SS(R4)
⟨@s4#(elegant(x1), @very(close(x1, metro))), {x/x1}⟩  →SS(R1)
⟨@s4#((s0# &s2# 0.9) &godel 0.9, @very(close(hydropolis, metro))), {x/hydropolis}⟩  →SS(R3)
⟨@s4#((s0# &s2# 0.9) &godel 0.9, @very((0.5 &s2# s1#) &godel 0.7)), {x/hydropolis}⟩

Now, let Θ = {s0#/0.8, s1#/0.8, &s2#/&luka, s3#/1.0, @s4#/@aver} be a symbolic
substitution that can be used for instantiating the previous sFASILL program in
order to obtain a non-symbolic, fully executable FASILL program. As we are going
to explain in the next section, this substitution can be obtained by our tuning
tool after introducing a couple of test cases (i.e., 0.4 -> good_hotel(hydropolis)
and 0.6 -> good_hotel(ritz)) representing the desired degrees for two goals,
according to the user preferences.

3 Tuning sFASILL Programs


Let us start this section by describing and illustrating a result similar to Theorem
1 proved in [4], but focusing now on the FASILL language instead of MALP.
Since its proof requires an additional effort due to the management of similarity
relations in the new setting, let us consider the following auxiliary definition.

Definition 7 (Safe Symbolic Substitution). Given a symbolic similarity


relation R# on a domain U and a symbolic substitution Θ, we say that Θ is a
safe symbolic substitution w.r.t. R# if, for all x, y ∈ U such that R# (x, y) is an
L#-expression containing at least a symbolic constant, then [[R#(x, y)Θ]] ≠ ⊥.

Let us remark that, beyond the simple case of assigning the ⊥ truth degree to at
least a symbolic constant, it is also possible to conceive other non-safe symbolic
substitutions linking symbolic constants to values bigger than ⊥. For instance,
this is the case of Θ = {s1#/0.4, &s2#/&luka, . . .} in our running example, because
if, in particular, we apply this symbolic substitution to R#(taxi, metro) =
0.5 &s2# s1#, we have that [[R#(taxi, metro)Θ]] = 0.5 &luka 0.4 = max(0, 0.5 + 0.4 −
1) = max(0, −0.1) = 0, which implies that Θ is not a safe symbolic substitution
w.r.t. R#.
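The safety test of Definition 7 is easy to mechanize once the symbolic entries of R# are represented as functions of the substitution. The sketch below reproduces the 0.5 &luka 0.4 computation above; the second entry and all identifiers are merely illustrative assumptions.

```python
def luka(x, y):                       # Lukasiewicz t-norm, one possible instance of &s2#
    return max(0.0, x + y - 1.0)

# Symbolic entries of R# written as functions of the substitution theta
# (a dict mapping symbolic constants/connectives to concrete values/functions).
symbolic_entries = {
    ("taxi", "metro"): lambda th: th["&s2"](0.5, th["s1"]),       # from the text above
    ("elegant", "vanguardist"): lambda th: th["s0"],              # hypothetical entry, for illustration
}

def is_safe(theta, entries, bottom=0.0):
    """Definition 7: no symbolic entry of R# may collapse to the bottom element."""
    return all(expr(theta) != bottom for expr in entries.values())

theta = {"s0": 0.8, "s1": 0.4, "&s2": luka}
print(is_safe(theta, symbolic_entries))   # False: 0.5 &luka 0.4 = 0
```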

Theorem 1. Given a sFASILL program P# = ⟨Π#, R#, L⟩ and a goal Q, then
for any safe symbolic substitution Θ for sym(P#), we have that ⟨v; θ⟩ is a fca
for Q in P#Θ iff there exists a sfca ⟨Q′, θ′⟩ for Q in P# and an interpretive
derivation ⟨Q′Θ; θ′⟩ →IS* ⟨v; θ′⟩, where θ′ is a renaming of θ.

Proof. (Sketch) For simplicity, we consider that the same fresh variables are used
for renamed apart rules in both derivations.
Consider the following derivations for goal Q w.r.t. programs P # and P # Θ,
respectively:

D_P# :  ⟨Q; id⟩ →SS/FS* ⟨Q′; θ⟩ →IS* ⟨Q′′; θ⟩
D_P#Θ : ⟨Q; id⟩ →SS/FS* ⟨Q′Θ; θ⟩ →IS* ⟨Q′′Θ; θ⟩

Our proof proceeds now in three stages:


SS
1. Firstly, observe that the sequences of symbolic successful/failure (  and/or
FS
 ) steps in DP # and DP # Θ exploit the whole set of atoms in both cases,
such that a program rule R is used in DP # iff the corresponding rule RΘ
is applied in DP # Θ and hence, the symbolic answers of the derivations are
Q ; θ and Q Θ; θ, respectively.
2. Then, we proceed by applying interpretive steps until reaching the sfca Q ; θ
in the first derivation DP # and it is easy to see that the same sequence of
interpretive steps is applied in DP # Θ thus leading to state Q Θ; θ, which is
not necessarily a sfca.

3. Finally, it suffices to instantiate the sfca Q ; θ in the first derivation DP #


with the symbolic substitution Θ, for completing both derivations with the
same sequence of interpretive steps until reaching the desired fca v; θ. 


Example 5. Consider again the program P# = ⟨Π#, R#, L⟩ from Example 2.
Let Θ = {s0#/0.8, s1#/0.8, &s2#/&luka, s3#/1.0, @s4#/@aver} be a safe symbolic
substitution. We can apply Θ to the sfca's from Example 4, obtaining the following
results: ⟨@s4#(s3#, 0)Θ, {x/ritz}⟩ ≡ ⟨@aver(1.0, 0), {x/ritz}⟩ for the first answer,
and ⟨@s4#((s0# &s2# 0.9) &godel 0.9, @very((0.5 &s2# s1#) &godel 0.7))Θ, {x/hydropolis}⟩ ≡
⟨@aver((0.8 &luka 0.9) &godel 0.9, @very((0.5 &luka 0.8) &godel 0.7)), {x/hydropolis}⟩
for the second answer.
Next, we have the following final interpretive derivation steps on the instantiated
sfca's: ⟨@aver(1.0, 0), {x/ritz}⟩ →IS ⟨0.5, {x/ritz}⟩ for the first case, and
⟨@aver((0.8 &luka 0.9) &godel 0.9, @very((0.5 &luka 0.8) &godel 0.7)), {x/hydropolis}⟩
→IS* ⟨0.395, {x/hydropolis}⟩ for the second case.
And now observe that, as Theorem 1 confirms, it is also possible to generate a
pair of derivations for good_hotel(x) in P#Θ leading to the same fuzzy computed
answers ⟨0.5; {x/ritz}⟩ and ⟨0.395; {x/hydropolis}⟩.

In what follows, while summarizing the automated technique for tuning


MALP fuzzy logic programs initially formulated in [4], we will adapt it to cope
with FASILL programs.
It is easy to see that introducing changes on the fuzzy components of a FASILL
program (namely, truth degrees and connectives appearing on its program rules
and/or its associated similarity relation) may affect—sometimes in an unexpected
way—the set of fuzzy computed answers for a given goal. Typically, a programmer
has a model in mind where some parameters have a clear value. For instance, the
truth value of a rule or a similarity equation might be statistically determined and,
thus, its value is easy to obtain. In other cases, though, the most appropriate values
and/or connectives depend on subjective notions and, thus, programmers do not
know how to obtain these values. In a typical scenario, we have an extensive set of
expected computed answers (i.e., test cases), so the programmer can follow a “try
and test” strategy. Unfortunately, this is a tedious and time consuming operation.
Actually, it might even be impractical when the program should correctly model
a large number of test cases.
The first action for initializing the tuning process in the online tool consists
in introducing a set of test cases as shown in Fig. 2. Each test case appears
in a different line with syntax: v −> Q, where v is the desired truth degree
for the fca associated to query Q (which obviously does not contain symbolic
constants). So, in our particular example, consider that we introduce the test
cases 0.4−> good hotel(hydropolis) and 0.6−> good hotel(ritz), as Fig. 2
illustrates.
An important novelty of the tuning technique adapted to FASILL w.r.t. our
original technique of [4], which is strongly related to its new capability for dealing

Fig. 2. Screenshot of the online tool showing the tuning process.

with symbolic similarity relations, is the fact that the system can now constrain
the tuning process by managing a set of preconditions. A precondition has the
form of a sFASILL goal involving symbolic constants and it must be satisfied by
all symbolic substitution considered at tuning time. A precondition is satisfied
when its further evaluation produces a truth degree different to ⊥. We distinguish
two types of preconditions.
– User-preconditions. We can edit this kind of constraints inside the test
cases box in the online tool. For instance, if the user introduces as precondition
the goal #s0 = 0 in our running example, the tuning process will never be
started, because our algorithm only works with safe symbolic substitutions
(see Definition 7), as we are going to see.
– Safe-preconditions. The system automatically produces a set of entries (not
shown to the user) in order to ensure that all symbolic substitutions con-
sidered at tuning time be safe. In our example, the four safe-preconditions
synthesised by our tool are obviously s0# and s1#, together with s0# &s2# 0.9
and 0.5 &s2# s1#, that is, all L#-expressions appearing in the symbolic similarity
relation R# which contain at most just one occurrence of the t-norm
associated to R#.

The following example illustrates that the verification of these automatically


generated safe-preconditions is crucial in order not to loose fca’s when applying
a symbolic substitution to a tuned sFASILL program and then executing its
instantiated FASILL version, as required in Theorem 1.

Example 6. Let P# be the sFASILL program of Example 2. After applying
the symbolic substitution Θ = {s0#/0.0, s1#/0.8, &s2#/&luka, s3#/1.0, @s4#/@aver}
(which is not safe since, in particular, s0# is replaced by 0.0), we get the following
FASILL program P = ⟨Π, R, L⟩:


Π = {  R1 : vanguardist(hydropolis) ← 0.9
       R2 : elegant(ritz) ← 1.0
       R3 : close(hydropolis, taxi) ← 0.7
       R4 : good_hotel(x) ← @aver(elegant(x), @very(close(x, metro)))  }

R             Vanguardist  Elegant  Modern  Metro  Taxi  Bus
Vanguardist        1          0       0.9      0     0     0
Elegant            0          1       0        0     0     0
Modern             0.9        0       1        0     0     0
Metro              0          0       0        1     0.3   0.5
Taxi               0          0       0        0.3   1     0.8
Bus                0          0       0        0.5   0.8   1

Now, for goal Q = good_hotel(x) this program wrongly produces the only fca
⟨Q1; σ1⟩ = ⟨0.5, {x/ritz}⟩. Note that if we apply this symbolic substitution
directly over the sfca's of Example 4, we get two fca's: ⟨Q1; σ1⟩ = ⟨0.5, {x/ritz}⟩
and ⟨Q2; σ2⟩ = ⟨0.045, {x/hydropolis}⟩. The problem here is that this last fca
for hydropolis would be missed when fully executing the instantiated FASILL
program, since elegant and vanguardist would not be considered similar
predicates.

Once the set of test cases has been appropriately customized, users simply
need to click on the Tuning button for proceeding with the tuning process (see
again Fig. 2). The precision of the technique depends on the set of symbolic
substitutions considered at tuning time. So, for assigning values to the sym-
bolic constants (starting with #), our tool takes into account all the truth
values defined into the members/1 predicate, which in our case is declared as
members([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]), as well as the set of connec-
tives defined in the lattice of Fig. 1, which in our running example coincides with
the @aver and @very aggregators plus the three well-known conjunction and dis-
junction connectives based on the so-called Product, Gödel and Łukasiewicz fuzzy
logics. Obviously, the larger the domain of values and connectives is, the more
precise the results are (but the algorithm is more expensive, of course).

In order to improve the efficiency of our tuning algorithm, in the follow-


ing definition we use thresholding techniques, which are quite standard in the
fuzzy logic area for prematurely disregarding useless computations leading to
non-significant answers. In essence, our approach simply extends the original
algorithm we presented in [4] for tuning MALP programs by requiring now the
extra checking of the preconditions associated to symbolic similarity relations
when selecting candidates for symbolic substitutions (step 2):2

Definition 8. (algorithm for thresholded tuning of sFASILL programs).

Input: a sFASILL program P # and a number of test cases vi → Qi , i = 1, . . . , k.


Output: a safe symbolic substitution Θτ .

1. For each test case vi → Qi, compute the sfca ⟨Q′i, θi⟩ for ⟨Qi, id⟩ in P#.
2. Then, consider a finite number of safe symbolic substitutions (satisfying all
   preconditions) for sym(P#), say Θ1, …, Θn, n > 0.
3. τ = ∞; For each safe symbolic substitution j ∈ {1, …, n} and while τ ≠ 0:
      z = 0; For each test case i ∈ {1, …, k} and while τ > z:
         compute ⟨Q′iΘj, θi⟩ →IS* ⟨vi,j; θi⟩;
         let z = z + distance(vi,j, vi).³
      if z < τ then { τ = z; Θτ = Θj }.
4. Finally, return the best safe symbolic substitution Θτ.

Note that the algorithm makes use of a threshold τ for determining when a
partial solution is acceptable. The value of τ is initialized to ∞ (in practice,
a very large number) in step 3. Then, this threshold is dynamically decreased
whenever we find a symbolic substitution with an associated deviation which is
lower than the current value of τ. Moreover, a partial solution is discarded as soon
as the cumulative deviation computed so far is greater than τ. In general, the
number of discarded solutions grows as the value of τ decreases, thus improving
the pruning power of thresholding.
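A schematic Python rendering of this thresholded loop is shown below (the actual tool is implemented in Prolog); here evaluate stands for the interpretive steps of step 3 and all identifiers are ours.

```python
import math

def tune(sfcas, test_values, substitutions, evaluate,
         distance=lambda a, b: abs(a - b)):
    """Thresholded tuning loop of Definition 8 (schematic sketch).

    sfcas[i]           symbolic fuzzy computed answer of the i-th test case
    test_values[i]     truth degree the user expects for that test case
    substitutions      iterable of *safe* symbolic substitutions to try
    evaluate(q, theta) fully evaluates a sfca once the substitution is applied
    """
    tau, best = math.inf, None
    for theta in substitutions:
        z = 0.0
        for q, v in zip(sfcas, test_values):
            z += distance(evaluate(q, theta), v)
            if z >= tau:              # thresholding: partial solution discarded
                break
        if z < tau:
            tau, best = z, theta
    return best, tau
```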
Figure 2 shows the returned safe symbolic substitution, global deviation and
time consumed when performing a tuning process with our online tool. The
system also enables the possibility of applying the obtained best symbolic sub-
stitution to the original sFASILL program and then safely executing its FASILL
version, as illustrated in Example 5.

² The set of safe symbolic substitutions always contains at least the one which assigns
⊤ to all symbolic constants in the sFASILL program to be tuned.
³ In our tool this operation is defined in the lattice box by means of a set of Prolog
clauses associated to predicate distance/3, which returns on its third parameter
a real number representing the distance between the truth degrees in the first and
second arguments. In our example, we simply use the following clause (see again
Fig. 1): distance(X, Y, Z) :- Z is abs(Y - X).

4 Conclusions and Future Work


The symbolic extension of the FASILL language based on symbolic similarity
relations we introduced in [7], has been used in this paper for developing an
effective tuning technique for FASILL programs. The new tool goes beyond our
preliminary version conceived only for MALP programs (see [4]) by safely sub-
stituting symbolic similarity/truth degrees and fuzzy connectives that best fit a
set of test cases provided by users a priori.
In the future we plan to improve the efficiency of our approach by using
unfolding techniques and SAT/SMT solvers along the lines of [6, 8], respectively, as
well as to embed the new capabilities commented so far into the tool for tuning
neural networks we introduced in [5].

References
1. Julián-Iranzo, P., Moreno, G., Penabad, J.: Thresholded semantic framework for
a fully integrated fuzzy logic language. J. Log. Algebr. Meth. Program. 93, 42–67
(2017)
2. Julián-Iranzo, P., Rubio-Manzano, C.: A declarative semantics for bousi∼prolog.
In: Proceedings of 11th International ACM SIGPLAN Conference on Principles
and Practice of Declarative Programming, PPDP 2009, pp. 149–160. ACM (2009)
3. Medina, J., Ojeda-Aciego, M., Vojtáš, P.: Similarity-based unification: a multi-
adjoint approach. Fuzzy Sets Syst. 146, 43–62 (2004)
4. Moreno, G., Penabad, J., Riaza, J.A., Vidal, G.: Symbolic execution and thresh-
olding for efficiently tuning fuzzy logic programs. In: Hermenegildo, M.V., Lopez-
Garcia, P. (eds.) LOPSTR 2016. LNCS, vol. 10184, pp. 131–147. Springer, Cham
(2017). https://doi.org/10.1007/978-3-319-63139-4_8
5. Moreno, G., Pérez, J., Riaza, J.A.: Fuzzy logic programming for tuning neural
networks. In: Fodor, P., Montali, M., Calvanese, D., Roman, D. (eds.) RuleML+RR
2019. LNCS, vol. 11784, pp. 190–197. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-31095-0_14
6. Moreno, G., Riaza, J.A.: An online tool for unfolding symbolic fuzzy logic pro-
grams. In: Rojas, I., Joya, G., Catala, A. (eds.) IWANN 2019. LNCS, vol. 11507, pp.
475–487. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20518-8_40
7. Moreno, G., Riaza, J.A.: Symbolic similarity relations for tuning fully integrated
fuzzy logic programs. In: Gutiérrez-Basulto, V., Kliegr, T., Soylu, A., Giese, M.,
Roman, D. (eds.) RuleML+RR 2020. LNCS, vol. 12173, pp. 150–158. Springer,
Cham (2020). https://doi.org/10.1007/978-3-030-57977-7_11
8. Moreno, G., Riaza, J.A.: Using SAT/SMT solvers for efficiently tuning fuzzy logic
programs. In: 2020 IEEE International Conference on Fuzzy Systems, FUZZ-IEEE
2020, Glasgow, UK, pp. 1–8. IEEE (2020, in press)
9. De Raedt, L., Kimmig, A.: Probabilistic (logic) programming concepts. Mach.
Learn. 100(1), 5–47 (2015)
10. Riguzzi, F., Swift, T.: The PITA system: tabling and answer subsumption for
reasoning under uncertainty. Theory Pract. Log. Program. 11(4–5), 433–449 (2011)
11. Sagonas, K.F., Swift, T., Warren, D.S.: XSB as an efficient deductive database
engine. In: Proceedings of ACM SIGMOD International Conference on Manage-
ment of Data, pp. 442–453. ACM Press (1994)
12. Sessa, M.I.: Approximate reasoning by similarity-based SLD resolution. Theoret.
Comput. Sci. 275(1–2), 389–426 (2002)
Predictive Ability of Response Surface
Methodology (RSM) and Artificial Neural
Network (ANN) to Approximate Biogas Yield
in a Modular Biodigester

Modestus O. Okwu1(B) , Lagouge K. Tartibu1 , Olusegun D. Samuel2 ,


Henry O. Omoregbee1 , and Anna E. Ivbanikaro1
1 Department of Mechanical and Industrial Engineering, University of Johannesburg,
Johannesburg, South Africa
[email protected], [email protected]
2 Department of Mechanical Engineering, Federal University of Petroleum Resources, Effurun,
Warri, Delta State, Nigeria
[email protected]

Abstract. This study presents the modelling and optimization of biogas production
from assorted substrates of poultry waste (PW) and cow dung (CD) using RSM
and ANN. A three-layered feedforward back-propagation ANN model and an RSM model were developed
to estimate the yield of biogas produced from a mixture of CD and PW droppings
in a bio-digester system in the ratio 1:2. At the first run, a maximum
biogas yield of 51.3% was achieved with 38:23 CD/PW within a retention time
of 9 days. The results showed that the coefficients of determination (R²) of the RSM
and ANN models were 0.9998 and 1.0, respectively. The root-mean-square errors (RMSE) of the
best RSM and ANN models were 0.0055 and 0.00022188, respectively. The study showed
that the ANN result is marginally better than that of the RSM model. This confirms
that biomass could be harnessed to help address the current global energy
crisis.

Keywords: ANN · RSM · Anaerobic digestion · Feedstock · Biogas yield

1 Introduction

The generation of biogas is an eco-friendly and successful way of reducing pollution and the greenhouse gas emissions that cause global warming [1]. Hence, researchers and strategists in the energy field tend to explore classical and global techniques to harvest large volumes of biogas from decomposable plants, animal waste and other forms of biodegradable municipal waste. Anaerobic digestion has been the preferred choice over aerobic treatment for converting waste mixtures into biogas because of its numerous advantages; in particular, it allows effective utilization of the up-flow anaerobic sludge blanket (UASB), a bioreactor that favors the use of immobilized cells and can achieve a relatively low hydraulic retention time [2–4].

© Springer Nature Switzerland AG 2021


I. Rojas et al. (Eds.): IWANN 2021, LNCS 12861, pp. 202–215, 2021.
https://doi.org/10.1007/978-3-030-85030-2_17

The use of the UASB bioreactor for anaerobic processing of waste mixtures has great advantages, such as efficient use of space, a lower energy requirement for biogas production, and the possibility to accommodate and process different feedstocks into biogas in the system (from medium to high loads) within hydraulic retention times of 3–48 h [5, 6].
Bioreactors are of different types, categories and sizes. As aforementioned, there are
other types meant for various processing like the field-scale landfill bioreactor and the
laboratory-scale anaerobic bioreactor, which can both be scaled up [7]. The production process of biogas is quite tedious; it depends on the nature of the bio-digester and on the feedstock used. Different feedstocks have been used for biogas production, such as crop residues [8], cow manure combined with food waste [9, 10], sewage [2], pepper and eggplant [11], and municipal and industrial waste [12–14]. Various techniques and practices have been applied to enhance the biodegradation of municipal solid waste (MSW), which include leachate recirculation, inoculum addition, pH control [15], mixing, moisture content (MC), particle size reduction, temperature control and addition of sufficient nutrients [16].
The analytical parameters in a biological degradation process are often non-linear
[16]. Artificial Neural Network (ANN) is an efficient means of dealing with problems
that are inherently non-linear. ANN can be used for modelling stochastic factors involved in the treatment of biodegradable waste, such as the temperature in the biodigester, the pressure and many more. ANN is equally relevant for predicting wastewater treatment plant performance as well as for modelling air pollution [18]. ANN is a predictive technique that mimics the human brain in carrying out its operation; it is a nature-inspired meta-heuristic technique built from artificial neurons, simple processing units interconnected in networks that transmit signals between one another. The sigmoid function acts as the controller of the network. The system is designed to operate as a black-box model that describes the relationship between the inputs and outputs of the system. Complementing mathematical models for simulating the bioreactor, ANN can be used to fine-tune process parameters, suggest feedstock addition with respect to time, track the concentrations of products and substrates, and elicit the desired output composition [16].
Gradient descent with adaptive learning rate, the multilayer perceptron (MLP) neural network with back-propagation (BP), and the Levenberg–Marquardt algorithm are the most widely used ANN models for the anaerobic sludge digester of a wastewater treatment plant [7]. The back-propagation algorithm is quite a popular algorithm for solving non-deterministic problems [8]. There are related research articles that describe the strength of computational intelligence for effective prediction of stochastic variables. ANN is one of the most popular computational intelligence techniques for prediction of input parameters under uncertainty [24].
ANN is a creative, data-driven algorithm with a versatile computing method that can capture non-linear and complex data. It relates input variables to output variables through a black-box learning methodology [27]. There are other creative algorithms with superior and versatile computing methods capable of solving deterministic and complex problems, such as RSM and other hybrid metaheuristic techniques that model the feedback interaction between input and response variables. The authors of [15] are of the opinion that a better prediction is possible by combining RSM and ANN as hybrid systems [28]. The response surface methodology (RSM) is likewise one of the popular soft computing techniques. In the literature on soft computational intelligence and energy sources, ANN and RSM have been used intensively for the prediction of experimental values with non-deterministic inputs [25, 26, 29]. In the research conducted by [23], ANN was compared with RSM in the modelling of waste coconut oil ethyl esters production, and the solution obtained in that research was good enough.
Reports proliferate on biogas production from reactors not designed for the tropics and on the corresponding modelling methodologies. However, to the best of the authors' knowledge, there is a gap in the knowledge of hybrid modelling tools for enhancing and approximating the biogas yield obtained from dual waste in a modular reactor intended for tropical use. This research is focused on the proximate analysis of biogas yield using the two soft computing techniques described above. Large amounts of animal waste and agricultural residues are increasingly being diverted for use as domestic fuel to displace fossil fuel and reduce environmental pollution, thereby reducing the emission of greenhouse gases. In this work, a biogas digester was developed for the tropics. Cow dung and poultry droppings were used as input components to a biodegradable biomass digester plant designed and fabricated for the purpose of producing biogas from these input components.

2 Materials and Methods


2.1 Experimental Design and Procedure (The Digester Plant)
The digester with a gas holder chamber was fabricated using an empty 10 kg gas cylinder.
Two holes of 100 mm (0.1 m) were drilled at the top and at the base of the empty gas cylinder for the collection of the gas produced in the digester chamber, where the substrates (cow dung and poultry droppings) are introduced. Also attached to the digester are the purification chambers, which are mainly for the removal of unwanted hydrogen sulphide (H2S) and carbon dioxide (CO2); these gases constitute 40–45% of the raw gas, and their removal ensures a higher energy content per unit volume and an improved calorific value. Figure 1 presents the schematics of the biomass digester plant developed to produce biogas from cow dung and poultry droppings.

Fig. 1. Digester plant used to produce biogas



2.2 Loading of the Digester with the Substrate

The study by Torsten et al. indicated that the loading rate of a bio-digester depends on factors such as the size of the digester, the operating system (whether batch or continuous), the energy requirement, the type of influent and the retention time [25]. The optimal solution derived from a pre-digester, where the substrate collection and mixing were done, was used as input to the digester for the experimental set-up. The ratio between cow dung and poultry droppings was 1:3. Water was added to the feedstock (cow dung, poultry droppings) to give the mixing ratio of 1:1:3 formulated for this work, and the mixture was fed to the digester and homogenized with an agitator. After charging the wastes into the bio-digester, all openings were closed for anaerobic digestion. The digester has a stirrer and was stirred manually three times daily to avoid scum formation and to ensure homogeneous dispersion of the materials. The anaerobic digestion ran for 30 days, and the gas holder cylinder was weighed with a digital scale to estimate the amount of gas produced. The temperature was measured three times daily. Readings were taken every day for 6 weeks and the average weekly temperature was recorded. The temperatures were taken with the aid of a mercury-in-glass thermometer via the thermometer duct provided, three times daily at around 8:00 a.m., 2:00 p.m. and 6:00 p.m.

2.3 Experimental Design

2.3.1 RSM Model


In the present study, a biogas production model was developed by statistical methods employing analysis of variance (ANOVA) and was optimized via Response Surface Methodology (RSM). A Central Composite Design (CCD) was chosen for this research, with biogas yield as the response. RSM was employed to model the variables of biogas production, and the effect of two independent parameters, namely the cow dung/poultry waste ratio and the retention time, on biogas yield was evaluated. A three-level, two-factor CCD consisting of 11 runs was utilized in this study. Table 1 highlights the ranges and levels of the independent parameters; the approach was implemented within the Design Expert software (version 7.0).

Table 1. Ratio of mixture table with biogas yield

Runs | Type   | Factor 1 A:CD:PW (w/w%) | Coded A | Factor 2 B:Time (days) | Coded B | Response 1 Biogas yield (%)
1    | Fact   | 25 | –1 |  7 | –1 | 49.34
2    | Fact   | 75 |  1 |  7 | –1 | 46.28
3    | Fact   | 25 | –1 | 11 |  1 | 51.32
4    | Fact   | 75 |  1 | 11 |  1 | 47.14
5    | Axial  | 25 | –1 |  9 |  0 | 49.86
6    | Axial  | 75 |  1 |  9 |  0 | 46.98
7    | Axial  | 50 |  0 |  7 | –1 | 51.34
8    | Axial  | 50 |  0 | 11 |  1 | 51.24
9    | Center | 50 |  0 |  9 |  0 | 51.32
10   | Center | 50 |  0 |  9 |  0 | 51.32
11   | Center | 50 |  0 |  9 |  0 | 51.32
12   | Center | 50 |  0 |  9 |  0 | 51.32
13   | Center | 50 |  0 |  9 |  0 | 51.32

CD:PW, cow dung and poultry waste ratio.

The data presented in Table 1 were analyzed with the aid of Eq. (1).
\[ Y_{\text{predicted}} = \beta_0 + \sum_{i=1}^{k}\beta_i x_i + \sum_{i=1}^{k}\beta_{ii} x_i^2 + \sum_{i=1}^{k}\sum_{j>i}^{k}\beta_{ij} x_i x_j + e \tag{1} \]

where Y_predicted is the predicted response (biogas yield); β0, βi, βii and βij are the regression coefficients; k is the number of factors studied and optimized in the experiment; and e is the random error. Average values from the repeated experimental runs were used to confirm accuracy.

2.3.2 ANN Model


The input variables were supplied to the neural network toolbox of MATLAB (2020 version). The dataset was divided into 70 percent for training, 15 percent for testing and 15 percent for validation. The flow chart of the back-propagation algorithm is presented in Fig. 2, and the architecture of the fitting tool is presented in Fig. 3. To obtain an appropriate solution, the weight parameters were continuously refined. The iteration was conducted over about twenty-five runs, with the coefficient of correlation (R) defining the degree of association or relationship between the variables concerned. A correlation value of 0 indicates the absence of a linear relationship, whereas 1 implies a perfect relationship between variables. A satisfactory result is an R2 value within the range of 0.7 to 1.0. Weight parameters were continuously iterated (refined) to achieve a model with the best possible fit. Figure 2 highlights the steps in the ANN modelling. Using the available experimental data, the Levenberg–Marquardt (LM) ANN fitting tool with a logistic sigmoid activation transfer function and a 4–10–1 topology (number of input layer nodes, neurons in the hidden layer, and output layer nodes) was implemented as shown in Fig. 3.
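As an illustration of this step, the sketch below approximates the described network in Python with scikit-learn; MATLAB's Levenberg–Marquardt fitting tool is not available there, so an L-BFGS-trained perceptron with one hidden layer of 10 logistic neurons is used instead, with the Table 1 values as data.

```python
# Approximate sketch of the biogas ANN: one hidden layer of 10 logistic neurons,
# trained on the Table 1 design points (CD:PW ratio, retention time) -> biogas yield.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error

X = np.array([[25, 7], [75, 7], [25, 11], [75, 11], [25, 9], [75, 9],
              [50, 7], [50, 11], [50, 9], [50, 9], [50, 9], [50, 9], [50, 9]], dtype=float)
y = np.array([49.34, 46.28, 51.32, 47.14, 49.86, 46.98,
              51.34, 51.24, 51.32, 51.32, 51.32, 51.32, 51.32])

X_std = StandardScaler().fit_transform(X)       # scaling helps the small network converge
ann = MLPRegressor(hidden_layer_sizes=(10,), activation='logistic',
                   solver='lbfgs', max_iter=10000, random_state=1)
ann.fit(X_std, y)

pred = ann.predict(X_std)
print("R^2 :", round(r2_score(y, pred), 4))
print("RMSE:", round(mean_squared_error(y, pred) ** 0.5, 5))
```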

[Fig. 2 flowchart: input layer → hidden layers → output layer; the back-propagation algorithm updates the weights from the error between the ANN output and the desired output, and the simulation ends when the error falls below a threshold or the number of cycles exceeds the limit.]

Fig. 2. Flow chart for the ANN modelling for biogas production

Fig. 3. Architecture of the ANN Model

3 Results
3.1 Regression Model for the Biogas Production
Table 2 summarizes the ANOVA results for the biogas yield obtained from the experimental matrix. As suggested by the data reported in Table 1, the quadratic model is the highest-order significant model that is not aliased. Hence, the actual and predicted yields of biogas produced are modelled with the aid of Eqs. (2) and (3), respectively.

\[ \text{Biogas yield} = 51.30 - 1.69A + 0.46B - 2.84A^2 + 0.033B^2 - 0.28AB \tag{2} \]

\[ \text{Biogas yield} = 39.41480 + 0.43689\,(CD{:}PW) + 0.36092\,(Time) - 4.53959 \times 10^{-3}\,(CD{:}PW)^2 + 8.18966 \times 10^{-3}\,(Time)^2 - 5.6 \times 10^{-3}\,(CD{:}PW)(Time) \tag{3} \]

Table 2. ANOVA for the RSM model

Source      | Sum of squares | DF | Mean square | F-value | Prob > F | Remark
Model       | 44.42    | 5  | 8.88     | 64.48  | < 0.0001 | significant
A           | 17.07    | 1  | 17.07    | 123.9  | < 0.0001 | significant
B           | 1.25     | 1  | 1.25     | 9.08   | 0.0196   | significant
A^2         | 22.23    | 1  | 22.23    | 161.39 | < 0.0001 | significant
B^2         | 2.96E-03 | 1  | 2.96E-03 | 0.022  | 0.8875   | not significant
AB          | 0.31     | 1  | 0.31     | 2.28   | 0.1751   | not significant
Residual    | 0.96     | 7  | 0.14     |        |          |
Lack of fit | 0.96     | 3  | 0.32     |        |          |
Pure error  | 0        | 4  | 0        |        |          |
Cor total   | 45.38    | 12 |          |        |          |
R^2 = 0.9788; Adj. R^2 = 0.9636

3.2 Effect of Biogas Parameters from RSM Plot

The response surface and contour plots of biogas yield versus the cow dung/poultry waste ratio and the retention time are depicted in Figs. 4(a) and (b). As observed, the yield of biogas varied between 46.28 and 51.34% as the cow dung/poultry waste (CD/PW) ratio varied from 25 to 75 wt.% and the retention time from 7 to 11 days. The maximum biogas yield (51.73%) was obtained at a CD/PW of 38.23 within a retention time of 10 days.

3.2.1 Parameter Optimization


The optimal biogas methanation conditions were estimated by solving the regression models given in Eqs. (2) and (3) under the criterion of maximizing biogas yield, as depicted in Fig. 5. It can be observed that a cow dung/poultry waste ratio of 49.5 within 9 days resulted in the optimal biogas yield of 51.3%.
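As a quick consistency check, Eq. (3) can be evaluated at these optimal settings; a plain-Python sketch is shown below.

```python
# Evaluate the actual-factor model of Eq. (3) at the reported optimum (CD:PW = 49.5, 9 days).
def biogas_yield(cd_pw, time):
    return (39.41480 + 0.43689 * cd_pw + 0.36092 * time
            - 4.53959e-3 * cd_pw**2 + 8.18966e-3 * time**2
            - 5.6e-3 * cd_pw * time)

print(round(biogas_yield(49.5, 9), 2))   # ~51.33, consistent with the reported optimum of about 51.3%
```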

Fig. 4. (a) Response surface graph and (b) contour plot for interaction between cow dung/poultry
waste ratio and retention time

Fig. 5. The optimal conditions for biogas production

3.3 ANN Analysis


The mean square error (MSE) and R-values obtained for the available dataset are presented in Table 3. Considering the ANN results in Table 3, the MSE values for training, validation and testing were initially quite high, so the output obtained in the first iteration was not satisfactory. An R-value close to unity combined with low MSE values indicates a model that is appropriate and good for prediction. The iteration performance was therefore monitored until the best values of R and MSE were obtained, as shown in Table 3, with a best training R value of 0.999.

Table 3. ANN result for MSE and R value of the biogas experiment

Samples    | N | MSE            | R
Training   | 9 | 2.07938 × 10^-6 | 0.999
Validation | 2 | 8.45870 × 10^-3 | 1.0
Testing    | 2 | 1.86422 × 10^-1 | 1.0

Table 4. Comparison of the statistical indices of the RSM and ANN models

Variables RSM ANN


R 0.9995 1.0
R2 0.9998 1.0
RMSE 0.0055 0.00022188

3.4 Best Value of the Predicted Biogas Yield

To achieve the best values presented in Table 3, the iteration was performed several times. The best solution for training, testing and validation was observed at the twenty-fifth iteration, with the lowest MSE values and the highest R values. The straight lines shown in Fig. 6 represent the linear relationships between the output and the target data used in this study. The obtained correlation coefficients (R) are 0.999 (training), 1.000 (testing) and 1.000 (validation). Therefore, in terms of correlation, the prediction of the ANN is significant, with an overall value of 0.9969, as presented in Table 3 and Fig. 6. The best validation performance can be traced in the curve shown in Fig. 7, where the validation check was performed at different epochs; the best validation performance of the network was obtained at a mean square error (MSE) of 0.00845 at epoch 3. Also, the error histogram for the training, validation and testing of the dataset presented in Table 1 is shown in Fig. 8. The results obtained at the end of the iteration process demonstrate a good agreement between the experimental biogas yield and the ANN predictions, with minimal error, as shown in Figs. 7 and 8.

Fig. 6. Regression result for training, testing and validation of dataset



Fig. 7. Validation check at different epoch

Fig. 8. Error histogram plot

The prediction indices for the models are highlighted in Table 4. As can be observed, the ANN possesses a higher regression coefficient and lower errors compared to the RSM values. Also obtained were values of R, R squared, RMSE, SEP, MAE and AAD, respectively, which clearly show the accuracy of the ANN model and its superiority to the RSM model.

4 Discussion

4.1 ANN Versus Response Surface Methodology

The input variables, in terms of the ratio of the mixture, and the output response predicted by ANN and RSM are presented in Table 5.
Table 5. Ratio of mixture with biogas yield, ANN and RSM predicted results

Runs | Type   | Factor 1 A:CD:PW (w/w%) | Coded A | Factor 2 B:Time (days) | Coded B | Response 1 Biogas yield (%) | RSM predicted | RSM residual | ANN predicted | ANN residual
1  | Fact   | 25 | –1 |  7 | –1 | 49.34 | 49.45 | –0.11 | 49.4701 | –0.1301
2  | Fact   | 75 |  1 |  7 | –1 | 46.28 | 46.63 | –0.35 | 46.2805 | –0.0005
3  | Fact   | 25 | –1 | 11 |  1 | 51.32 | 50.92 |  0.40 | 51.3205 | –0.0005
4  | Fact   | 75 |  1 | 11 |  1 | 47.14 | 46.99 |  0.15 | 47.1439 | –0.0039
5  | Axial  | 25 | –1 |  9 |  0 | 49.86 | 50.15 | –0.29 | 49.8593 |  0.0007
6  | Axial  | 75 |  1 |  9 |  0 | 46.98 | 46.78 |  0.20 | 46.3694 |  0.6106
7  | Axial  | 50 |  0 |  7 | –1 | 51.34 | 50.88 |  0.46 | 51.3389 |  0.0011
8  | Axial  | 50 |  0 | 11 |  1 | 51.24 | 51.79 | –0.55 | 51.2400 |  0.0000
9  | Center | 50 |  0 |  9 |  0 | 51.32 | 51.3  |  0.018 | 51.3192 |  0.0008
10 | Center | 50 |  0 |  9 |  0 | 51.32 | 51.3  |  0.018 | 51.3192 |  0.0008
11 | Center | 50 |  0 |  9 |  0 | 51.32 | 51.3  |  0.018 | 51.3192 |  0.0008
12 | Center | 50 |  0 |  9 |  0 | 51.32 | 51.3  |  0.018 | 51.3192 |  0.0008
13 | Center | 50 |  0 |  9 |  0 | 51.32 | 51.3  |  0.018 | 51.3192 |  0.0008

CD:PW, cow dung and poultry waste ratio.

In determining the dominance and predictive capacity of the modelling techniques, statistical variables such as the correlation coefficient (R), the coefficient of determination (R2) and the root-mean-square error (RMSE) are applied. The equations for computing these statistical variables are presented in Eqs. (4)–(6).
\[ R = \frac{\sum_{m=1}^{n}\left(Y_{pred,m}-\bar{y}_{pred}\right)\left(Y_{exp,m}-\bar{y}_{exp}\right)}{\sqrt{\sum_{m=1}^{n}\left(Y_{pred,m}-\bar{y}_{pred}\right)^{2}\sum_{m=1}^{n}\left(Y_{exp,m}-\bar{y}_{exp}\right)^{2}}} \tag{4} \]

\[ R^{2} = 1-\frac{\sum_{i=1}^{n}\left(Y_{i,p}-Y_{i,e}\right)^{2}}{\sum_{i=1}^{n}\left(Y_{i,p}-Y_{e,ave}\right)^{2}} \tag{5} \]

\[ RMSE = \sqrt{\frac{\sum_{i=1}^{n}\left(Y_{i,e}-Y_{i,p}\right)^{2}}{n}} \tag{6} \]
Y_pred represents the predicted value of the samples, Y_exp is the experimental value, Y_i is the observed value of the samples, Y_e is the estimated value of the samples, and n represents the number of samples. A, B and C denote independent variables, and βij is the coefficient of the linear terms.
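A minimal sketch (NumPy assumed) of computing these indices from the experimental and RSM-predicted yields of Table 5 is shown below; Eq. (5) is coded exactly as written above.

```python
# Compute R, R^2 and RMSE (Eqs. 4-6) for a pair of experimental/predicted yield vectors.
import numpy as np

def indices(y_exp, y_pred):
    y_exp, y_pred = np.asarray(y_exp, float), np.asarray(y_pred, float)
    r = np.corrcoef(y_exp, y_pred)[0, 1]                                            # Eq. (4)
    r2 = 1 - np.sum((y_pred - y_exp) ** 2) / np.sum((y_pred - y_exp.mean()) ** 2)   # Eq. (5), as written
    rmse = np.sqrt(np.mean((y_exp - y_pred) ** 2))                                  # Eq. (6)
    return r, r2, rmse

exp_y = [49.34, 46.28, 51.32, 47.14, 49.86, 46.98, 51.34, 51.24] + [51.32] * 5
rsm_y = [49.45, 46.63, 50.92, 46.99, 50.15, 46.78, 50.88, 51.79] + [51.30] * 5
print([round(v, 4) for v in indices(exp_y, rsm_y)])
```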

5 Conclusions
In this study, the experimental biogas yield obtained from a developed biodigester was modeled and optimized. The substrates fed into the biodigester system contained poultry waste (PW) and cow dung (CD) in the ratio 1:2. Three-layered feedforward BP ANN and RSM models were developed to estimate the yield of biogas produced from the mixture of CD and PW. The inputs of the models were the mixing ratio of CD and PW and the retention time. The output of the models was the biogas yield (BY). The results showed that the coefficients of determination (R2) of the best RSM and ANN models were 0.9998 and 1.0, respectively. The root-mean-square errors (RMSE) for the best RSM and ANN models were 0.0055 and 0.00022188, respectively. The results revealed that RSM and ANN both gave good predictions, although the results obtained from the ANN model were marginally better than those of RSM. The expected methane content of the CD and PW gas produced after the experiment is 71.95%. The system, if scaled up, will be very useful for electricity generation and off-grid power.

Conflicts of Interest. The authors declare no conflict of interest.

References
1. Okwu, M.O., Samuel, O.D., Otanocha, O.B., Ojo, E., Ogugu, T., Balogun, P.: Design and
development of a bio-digester for production of biogas from dual waste. World J. Eng.
Emerald. Insight. 17(2), 247–260 (2020). https://doi.org/10.1108/WJE-07-2018-0249
2. Chen, R., Nie, Y., Tanaka, N., Niu, Q., Li, Q., Li, Y.: Enhanced methanogenic degradation of
cellulose-containing sewage via fungi-methanogens syntrophic association in an anaerobic
membrane bioreactor. Bioresour. Technol. 245, 810–818 (2017)

3. Jha, P., Kana, E.B.G., Schmidt, S.: Can artificial neural network and response surface method-
ology reliably predict hydrogen production and COD removal in an UASB reactor? Int. J.
Hydrogen Energy 42, 18875–18883 (2017)
4. Ren, T.T., Mu, Y., Ni, B.J., Yu, H.Q.: Hydrodynamic up-flow of anaerobic sludge blanket
reactors. AIChE J. 55, 516–528 (2009)
5. Jung, K.W., Kim, D.H., Kim, S.H., Shin, H.S.: Bioreactor design for continuous dark
fermentative hydrogen production. Bioresour. Technol. 102, 8612–8620 (2011)
6. Liu, Z., Lu, F., Zheng, H., Zhang, C., Wei, F., Xing, X.H.: Enhanced hydrogen production
in a UASB reactor by retaining microbial consortium onto carbon nanotubes (CNTs). Int. J.
Hydrogen Energy 37, 10619–10626 (2012)
7. Behera, S.K., Meher, S.K., Park, H.-S.: Artificial neural network model for predicting methane
percentage in biogas recovered from a landfill upon injection of liquid organic waste. Clean
Technol. Environ. Policy 17(2), 443–453 (2014). https://doi.org/10.1007/s10098-014-0798-4
8. Machesa, M.G.K., Tartibu, L.K., Tekweme, F.K., Okwu, M.O., Ighravwe, D.E.: A neural network-based prediction of oscillatory heat transfer coefficient in a thermo-acoustic device heat exchanger. In: IEEE International Conference (icABCD) (2020). https://doi.org/10.1109/icABCD49160.2020.9183877
9. Xing, B.-S., et al.: Cow manure as additive to a DMBR for stable and high-rate digestion of
food waste: Performance and microbial community. Water Res. 168, 115099 (2020)
10. Dahiya, S., Kumar, A.N., Shanthi Sravan, J., Chatterjee, S., Sarkar, O., Mohan, S.V.: Food
waste biorefinery: sustainable strategy for circular bioeconomy. Bioresour. Technol. 248, 2–12
(2018)
11. Hamraoui, K., Gil, A., El Bari, H., Siles, J.A., Chica, A.F., Martín, M.A.: Evaluation of
hydrothermal pretreatment for biological treatment of lignocellulosic feedstock (pepper plant
and eggplant). Waste Manage. 102, 76–84 (2020)
12. Pecorini, I., Baldi, F., Carnevale, E.A., Corti, A.: Biochemical methane potential tests of
different autoclaved and microwaved lignocellulosic organic fractions of municipal solid
waste. Waste Manage. 56, 143–150 (2016)
13. Pellera, F.-M., Gidarakos, E.: Microwave pretreatment of lignocellulosic agroindustrial waste
for methane production. J. Environ. Chem. Eng. 5, 352–365 (2017)
14. Pellera, F.-M., Gidarakos, E.: Chemical pretreatment of lignocellulosic agroindustrial waste
for methane production. Waste Manage. 71, 689–703 (2018)
15. Xu, S.Y., Lam, P.H., Karthekeyan, O.P., Wong, J.W.C.: Optimization of food waste hydrolysis
in leach bed coupled with methanogenic reactor: effect of pH and bulking agent. Bioresour.
Technol. 102(4), 3702–3708 (2011)
16. Nair, V.V., Dhar, H., Kumar, S., Thalla, A.K., MukherJee, S., Wong, J.W.C.: Artificial neural
network based modeling to evaluate methane yield from biogas in a laboratory-scale anaerobic
bioreactor. Biores. Technol. 217, 90–99 (2016)
17. Karaca, F., Alagha, O., Erturk, F.: Statistical characterization of atmospheric PM10 and PM2.5
concentrations at a non-impacted suburban site of Istanbul, Turkey. Chemosphere 59, 118–3
(2005)
18. Samuel, O.D., Okwu, M.O.: Comparison of response surface methodology (RSM) and artifi-
cial neural network (ANN) in modelling of waste coconut oil ethyl esters production. Energy
Sour. Part A Recov. Utiliz. Environ. Effects 41(9), 1049–1061 (2018). https://doi.org/10.1080/
15567036.2018.1539138
19. Ozkaya, B., Demir, A., Bilgili, M.S.: Neural network prediction model for the methane fraction
in biogas from field-scale landfill bioreactors. Environ. Model Softw. 22(6), 815–822 (2007)
20. Machesa, M.G.K., Tartibu, L.K., Tekweme, F.K., Okwu, M.O., Ighravwe, D.E.: Performance prediction of a Stirling heat engine using an artificial neural network model. In: IEEE International Conference (icABCD) (2020). https://doi.org/10.1109/icABCD49160.2020.9183890

21. Machesa, M.G.K., Tartibu, L.K., Okwu, M.O., Tekweme, F.K.: Performance prediction of
oscillatory heat transfer coefficient in heat exchangers of thermoacoustic systems. In: Inter-
national Mechanical Engineering Congress & Exposition, Calvin L. Rampton, Salt Palace
Convention Center, Salt Lake, USA (2019)
22. Samuel, O.D., Okwu, M.O., Semiu, A., Adeniran, S.A.: Production of fatty acid ethyl esters
from rubber seed oil in hydrodynamic cavitation reactor: Study of reaction parameters and
some fuel properties. Indust. Crops Prod. Sci. Direct. 141, 1–13 (2019). https://doi.org/10.
1016/j.indcrop.2019.111658
23. Samuel, O., Okwu, M.: Comparison of response surface methodology (RSM) and artificial
neural network (ANN) in modelling of waste coconut oil ethyl esters production. Energy Sour.
Part A Recov. Utiliz. Environ. Effects 41(9), 1049–1061 (2019). https://doi.org/10.1080/155
67036.2018.1539138
24. Edwiges, T., et al.: Influence of chemical composition on biochemical methane potential of
fruit and vegetable waste. Waste Manage. 71, 618–625 (2018)
25. Torsten, F., et al.: Farm-Scale Biogas Plant, pp. 1–9. Krieg & Fischer Ingenieure GmbH,
Göttingen (2002)
26. Santos, O.O., Jr., Maruyama, S.A., Claus, T., de Souza, N.E., Matsushita, M., Visentainer,
J.V.: A novel response surface methodology optimization of base-catalyzed soybean oil
methanolysis. Fuel 113, 580–585 (2013)
27. Gupta, A., Sharma, D.S.D.: A survey on stock market prediction using various algorithms.
Int. J. Comput. Technol. App. 5, 530–533 (2014)
28. Ebrahimpour, A., Rahman, R., Ch’ng, D.E., Basri, M., Salleh, A.: A modeling study by
response surface methodology and artificial neural network on culture parameters optimiza-
tion for thermostable lipase production from a newly isolated thermophilic Geobacillus sp.
strain ARM. BMC Biotechnol. 8(1), 96 (2008). https://doi.org/10.1186/1472-6750-8-96
29. Nwufo, O.C., Okwu, M., Nwaiwu, C.F., Igbokwe, J.O., Martin, O., Nwafor, I., Anyanwu,
E.E.: The application of artificial neural network in prediction of the performance of spark
ignition engine running on ethanol-petrol blends. Int. J. Eng. Technol. 12, 15–31 (2017).
https://doi.org/10.18052/www.scipress.com/IJET.12.15
Biosignals Processing
Analysis of Electroencephalographic
Signals from a Brain-Computer Interface
for Emotions Detection

Beatriz Garcı́a-Martı́nez1,2(B) , Antonio Fernández-Caballero1,2,3 ,


Arturo Martı́nez-Rodrigo4,5 , and Paulo Novais6
1 Departamento de Sistemas Informáticos, Escuela Técnica Superior de Ingenieros Industriales, Universidad de Castilla-La Mancha, Albacete, Spain
[email protected]
2 Instituto de Investigación en Informática de Albacete, Universidad de Castilla-La Mancha, Albacete, Spain
3 CIBERSAM (Biomedical Research Networking Centre in Mental Health), Madrid, Spain
4 Research Group in Electronic, Biomedical and Telecommunication Engineering, Facultad de Comunicación, Universidad de Castilla-La Mancha, Cuenca, Spain
5 Instituto de Tecnologı́as Audiovisuales de Castilla-La Mancha, Universidad de Castilla-La Mancha, Cuenca, Spain
6 Algoritmi Center, Department of Informatics, Universidade do Minho, Braga, Portugal

Abstract. Despite living in a digital society, the relation between


humans and automatic systems is still far from being similar to the
interaction among humans. In order to solve the lack of emotional intel-
ligence of those systems, many works have designed algorithms for an
automatic recognition of emotions through the assessment of physio-
logical signals, with special interest in electroencephalography (EEG).
However, the complexity of professional EEG recording devices limits
the possibility to develop and test these algorithms in real life scenar-
ios, out of laboratory conditions. On the contrary, the use of wearable
brain-computer interfaces could solve this limitation. For this reason,
the present work analyzes EEG signals recorded with a BCI device
for the off-line classification of emotional states. Concretely, the spec-
tral power in the different frequency bands of the EEG spectrum has
been computed and assessed to discern between high and low levels of
valence and arousal. Results reported an interesting classification perfor-
mance of the BCI device in all frequency bands, being beta waves those
which reported the best outcomes, 68.21% of accuracy for valence and
72.54% for arousal. In addition, the application of a sequential forward
selection approach before the classification step revealed the relevance
of frontal areas for valence detection and posterior regions for arousal
identification.

Keywords: Emotion recognition · Electroencephalography · Spectral


power · Brain-computer interface
c Springer Nature Switzerland AG 2021
I. Rojas et al. (Eds.): IWANN 2021, LNCS 12861, pp. 219–229, 2021.
https://doi.org/10.1007/978-3-030-85030-2_18

1 Introduction
Nowadays, we live in a digitized world in which the interaction between humans
and machines is in constant expansion [11]. However, the interplay with sys-
tems is still far from being comparable to human intercommunication due to the
lack of emotional intelligence of those human-machine interfaces (HMIs) [23].
As a result, machines are still not able to interpret human emotions and decide
which actions to execute accordingly [11]. Emotions play a key role in a variety of scenarios, influencing cognition, perception, rational decision making, and also basic processes of communication and interaction among humans [3]. For this reason, it becomes essential to endow those systems with the capability of identifying the emotional state of the user and generating a proper response in order to humanize the relation between people and machines [23]. This is precisely the purpose of affective computing, which has emerged for the development of emotional models for the automatic recognition of emotions [23].
Nevertheless, the identification of emotional states is not an easy task. One of
the reasons is the high intercorrelation of emotions and the absence of a standard
model for their definition [30]. Indeed, many theories can be found in the literature
containing a number of discrete emotions ranging from a few basic feelings (hap-
piness, sadness, anger, fear, surprise and disgust) [9] to dozens of complex states
derived from the combination of the basic ones [28]. However, the circumplex
model of affect proposed by Russell is the most widely used [26]. It consists of
a bidimensional model in which all emotions are distributed according to two
parameters called valence and arousal. Valence measures the degree of pleasant-
ness or unpleasantness produced by a stimulus, whereas arousal is determined
by the activation or relaxation that a stimulus provokes [26]. Hence, the location
of each emotional state in the circumplex model depends on its associated level
of valence and arousal.
In recent years, the identification of emotional states has been mainly based on the assessment of physiological variables, with special relevance of electroencephalography (EEG), which represents the electrical activity of the brain generated by neural connections [1]. The reason is that EEG measures the source
of all bodily responses, whereas the rest of physiological signals represent the
secondary effects of the brain activity spread by the central nervous system [8].
In addition, the assessment of the spectral features of EEG can reveal inter-
esting information related to the emotional state of the individual. Precisely,
the EEG spectrum (0.5–45 Hz) is divided into five frequency bands, namely
delta (0.5–4 Hz), theta (4–8 Hz), alpha (8–13 Hz), beta (13–30 Hz) and gamma
(30–45 Hz) [15]. The prominence of these frequencies varies according to the
emotional state, thus it is possible to identify different emotions by means of the
evaluation of EEG recordings from a spectral point of view [16].
The main advantages of EEG with respect to other neuroimaging techniques
are the non-invasiveness and the high temporal resolution [27]. Nevertheless,
medical EEG recording devices are expensive, require long preparation times,
and their portability is quite reduced given the great amount of wires (one per
electrode) needed for the recording process [16]. Therefore, the use of those

systems in non-laboratory environments is considerably restricted. As a solu-


tion, different wearable devices and brain-computer interfaces (BCIs) have been
recently introduced to solve those limitations. These BCIs are wireless, portable,
simple to use, and relatively less expensive than medical devices, thus enhancing
the possibility of registering EEG signals in real-life situations, out of laboratory
conditions [16].
In the present work, EEG signals recorded with a BCI device are assessed for
the recognition of different emotional states. Concretely, the aim of this study
is to discern between different levels of valence and arousal by analyzing the
spectral features of the EEG signals recorded with a well-known BCI called
Emotiv EPOC headset. Moreover, the relevance of each frequency band in the
EEG spectrum will be evaluated, in order to discover the information reported
by the different brain rhythms under different emotional conditions.

2 Methods

2.1 Database and BCI Device

With the purpose of guaranteeing the reproducibility of this study, the EEG record-
ings were extracted from a publicly available database called DREAMER [19]. This
dataset was specifically designed for the recognition of emotions through EEG and
electrocardiogram (ECG) signals recorded with wearable, wireless and off-the-shelf
devices, with the aim of integrating affective computing techniques in everyday
scenarios. Their experiment consisted of registering EEG and ECG of a total of
23 healthy participants (14 males, 9 females) with ages ranging between 22 and 33
years (average 26.6) during the visualization of 18 film clips with emotional content.
Concretely, the emotions elicited were happiness, calmness, amusement, excite-
ment, anger, disgust, fear, sadness and surprise. The duration of the clips ranged
from 65 to 393 s (average duration 199 s). It is important to remark that before each
film clip, a neutral clip of 60 s of length was presented in order to return the partic-
ipant to a neutral state and establish a baseline. After each visualization, subjects
rated their level of valence and arousal, among other emotional parameters, using
self-assessment manikins graphically representing an intensity scale from 1 to 5.
More information about this database can be found in [19].
The EEG wearable device used for the recording of the brain signals contained
in DREAMER dataset was Emotiv EPOC, shown in Fig. 1(a). This wireless BCI
includes 16 gold-plated contact sensors that measure 14 EEG channels and 2
references (CMS and DRL). Those 14 channels are AF3, F7, F3, FC5, T7, P7,
O1, O2, P8, T8, FC6, F4, F8, and AF4, located according to the International
Standard 10–20 system for electrode positioning [20]. Their locations on the scalp can be observed in Fig. 1(b). The sampling frequency of this device is 128 Hz.


Fig. 1. (a) Emotiv EPOC headset. (b) Locations of electrodes (in blue) and references
(in gray) in the Emotiv EPOC headset. (Color figure online)

2.2 EEG Preprocessing


Raw EEG signals usually contain noise and interferences that may hide the
neural information. Hence, it is important to preprocess the EEG signals before
the application of any kind of analysis. In this work, this procedure was developed
with EEGLAB, which is a Matlab toolbox specifically created for the analysis
of EEG time series [6]. Signals were firstly referenced to the common average by
computing the mean potential of all electrodes and subtracting it from each single
channel [4]. After that, two forward/backward high-pass and low-pass filtering
approaches were applied at cutoff frequencies of 4 and 45 Hz, respectively. In this
sense, the emotional information of interest contained in theta, alpha, beta and
gamma frequency bands was maintained [15]. Moreover, these cutoff frequencies
were also useful for removing baseline and power line interferences. After that, a
blind source separation technique called independent component analysis (ICA)
was applied for removing artifacts not rejected in previous preprocessing steps.
These artifacts are produced by either physiological sources (i.e., heartbeats,
facial movements, or eye blinks) or technical reasons (i.e., electrode pops, or bad
contact of the electrodes over the scalp). Once the independent components were
computed, those related to artifacts were eliminated, thus only remaining the
information related to the brain activity.
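A minimal sketch of this preprocessing chain, assuming MNE-Python and a NumPy array with the 14 Emotiv EPOC channels already extracted from DREAMER (placeholder random data is used here), could look as follows; the exact filter design and the components marked as artifactual remain choices made by the analyst.

```python
# Common average reference, 4-45 Hz band-pass and ICA-based artifact removal with MNE-Python.
import numpy as np
import mne

ch_names = ['AF3', 'F7', 'F3', 'FC5', 'T7', 'P7', 'O1',
            'O2', 'P8', 'T8', 'FC6', 'F4', 'F8', 'AF4']
info = mne.create_info(ch_names, sfreq=128, ch_types='eeg')

eeg = np.random.randn(14, 128 * 60) * 1e-5          # placeholder data in volts (one 60 s clip)
raw = mne.io.RawArray(eeg, info)

raw.set_eeg_reference('average')                     # reference to the common average
raw.filter(l_freq=4.0, h_freq=45.0)                  # band-pass keeping theta-gamma

ica = mne.preprocessing.ICA(n_components=13, random_state=0)  # rank reduced by 1 after average referencing
ica.fit(raw)
ica.exclude = []                                     # indices of artifactual components, chosen by inspection
clean = ica.apply(raw.copy())                        # EEG with artifact components removed
```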

2.3 Feature Extraction


As aforementioned, the 18 film clips used as stimuli had different durations.
For consistency of the analyses carried out in this work, the same length was
chosen for all the clips. Furthermore, although each stimulus had a main target
emotional state, many different emotions could be elicited for longer exposure
times. In order to avoid this issue, only the last 60 s of each film clip were finally
selected for their analysis, thus discarding the rest of the samples recorded for
each stimulus as in [19].

For the spectral assessment of EEG recordings, the power spectral density
(PSD) was calculated by means of the Welch’s periodogram, using a Hamming
window of 2 s length with 50% of overlapping and 256 points of resolution. After
that, the spectral power of the whole band (4–45 Hz), SPall , was computed for
each EEG channel as the area under the PSD curve contained within the whole
spectrum:

\[ SP_{all} = \sum_{f=4\,\mathrm{Hz}}^{45\,\mathrm{Hz}} |PSD(f)| \tag{1} \]

The procedure for obtaining the spectral power of the frequency bands sepa-
rately, SPθ , SPα , SPβ , and SPγ , was similar as for the whole spectrum. However,
in order to preserve the fluctuations among participants, the resulting power in
each frequency band was normalized by the power computed for the whole spec-
trum:
\[ SP_{B} = \frac{1}{SP_{all}} \sum_{f=f_1}^{f_2} |PSD(f)|, \tag{2} \]

where B is the frequency band, and f1 and f2 its lower and higher frequency,
respectively.
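The following sketch, assuming SciPy/NumPy and a single-channel 60 s segment sampled at 128 Hz (placeholder data here), illustrates Eqs. (1) and (2): a Welch periodogram with a 2 s Hamming window, 50% overlap and 256 points, followed by the whole-band power and the normalized band powers.

```python
# Welch PSD and normalized spectral powers for one EEG channel.
import numpy as np
from scipy.signal import welch

fs = 128
signal = np.random.randn(fs * 60)                    # placeholder: last 60 s of one channel

freqs, psd = welch(signal, fs=fs, window='hamming',
                   nperseg=2 * fs, noverlap=fs, nfft=256)

def band_power(f1, f2):
    mask = (freqs >= f1) & (freqs <= f2)
    return np.sum(np.abs(psd[mask]))

sp_all = band_power(4, 45)                           # Eq. (1): whole-band power
bands = {'theta': (4, 8), 'alpha': (8, 13), 'beta': (13, 30), 'gamma': (30, 45)}
sp_norm = {name: band_power(f1, f2) / sp_all for name, (f1, f2) in bands.items()}  # Eq. (2)
print(sp_norm)
```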

2.4 Classification Procedure


Two different classification schemes were conducted, one for arousal detection,
and another for valence identification. Precisely, the classification approach was
implemented to discern between high and low levels of the corresponding emo-
tional dimension, i.e. high arousal vs. low arousal (HA vs. LA), and high valence
vs. low valence (HV vs. LV). As the participants’ ratings ranged from 1 to 5,
the group of low level contained samples with a rating value of 1 or 2, whereas
the group of high level was conformed by samples with rating values of 4 and
5. Samples rated with a value of 3 were discarded in this analysis because of
representing a neutral midline between the two target classes. According to this criterion, the samples included in each group were HA = 181, LA = 114, HV = 163,
and LV = 161.
The classification procedure was similar for valence and arousal schemes and
for the different frequency bands. First of all, the samples were rearranged under
a K-fold cross-validation approach to avoid overfitting of the classifier. More pre-
cisely, samples were randomly divided into 10 equally-sized folds, checking that
each fold was representative of the whole samples set and contained a balanced
number of samples from both high and low groups [18]. After that, a sequen-
tial forward selection (SFS) technique was implemented to choose the subset
of EEG channels that minimized the misclassification rate in each of the 10
folds. The criterion model was a decision tree classifier in which the growth of
the nodes was stopped either when they only contained samples from one of
the groups of study or when the number of samples was less than 20% of the
whole dataset. In addition, the node splitting criterion was an impurity-based
Gini index. Hence, channels selected by the SFS approach at least 30% of the

iterations in the cross-validation process were used as input features to train


the decision tree classification model. Finally, test samples were introduced in
the resulting model and its performance was evaluated according to the true
positive (TP), true negative (TN), false positive (FP) and false negative (FN)
rates. Concretely, TP and TN are the positive and negative samples correctly
identified, respectively. FP is the rate of negative samples incorrectly identified
as positive, whereas FN is the number of positive samples incorrectly classified
as negative. With these values, different performance parameters can be com-
puted. For instance, sensitivity (Se) is the rate of positives correctly identified
out of all positive samples:
\[ Se = \frac{TP}{TP + FN} \tag{3} \]
On the other hand, specificity (Sp) represents the rate of negatives correctly
detected out of all negative samples:
\[ Sp = \frac{TN}{TN + FP} \tag{4} \]
In this work, samples with a high level of either valence or arousal, i.e., HV
and HA, were considered as positive class, whereas samples with a low level of
valence or arousal, LV and LA, conformed the negative class. Hence, Se was the
rate of high samples correctly detected as high, while Sp represented the amount
of low samples correctly identified as low. Finally, the classification accuracy
(Acc) was obtained as the total number of correctly classified samples:

\[ Acc = \frac{TP + TN}{TP + TN + FP + FN} \tag{5} \]
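As a rough illustration of this pipeline, the sketch below (scikit-learn assumed; placeholder random data instead of the real spectral-power features, and a simplified scheme in which the forward selection is run once over the cross-validation folds rather than separately per fold) shows sequential forward selection with a decision-tree criterion model followed by a 10-fold evaluation.

```python
# Sequential forward selection of channels with a decision tree, then 10-fold accuracy.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(324, 14))            # placeholder: one spectral-power feature per channel
y = rng.integers(0, 2, size=324)          # placeholder labels: 1 = high, 0 = low

# Gini-based tree; nodes stop splitting below ~20% of the samples, as described above
tree = DecisionTreeClassifier(criterion='gini', min_samples_split=0.2, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

sfs = SequentialFeatureSelector(tree, n_features_to_select=3, direction='forward', cv=cv)
sfs.fit(X, y)
selected = np.flatnonzero(sfs.get_support())
print("Selected channel indices:", selected)

acc = cross_val_score(tree, X[:, selected], y, cv=cv, scoring='accuracy')
print("Mean accuracy: %.2f%%" % (100 * acc.mean()))
```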

3 Results

Figure 2 represents the mean spectral power in all brain areas obtained for the
different groups of study from all frequency bands. Concretely, the first two
columns show the average values for the arousal detection scheme, with LA
results in the first column and HA outcomes in the second column. Similarly, the
last two columns show the mean spectral power obtained for LV and HV groups,
respectively. As can be observed, for both arousal and valence approaches, the
groups with a low level of the corresponding emotional dimension (i.e., LA and
LV) presented a generalized increase of spectral power in all brain regions with
respect to the groups of high level (HA and HV) for all the frequency bands.
Only for beta band the spectral power was similar in HA and LA groups, and
a bit higher in HV with respect to LV in frontal and parieto-occipital areas. In
the same manner, results for the whole spectrum in valence approach reported
a slightly higher power for HV than for LV in posterior regions.
In addition, Table 1 shows the EEG channels selected by the SFS approach
at least 30% of the iterations of the cross-validation process. In the HA vs. LA
scheme, the number of channels selected ranged between 2 and 5, depending on

Fig. 2. Representation of the averaged spectral power for LA, HA, LV and HV groups
from all frequency bands.

the frequency band under study. Furthermore, it can be observed that those cho-
sen channels were located in all brain regions, although with a special relevance
in the posterior half of the brain, central, temporal, parietal and occipital lobes,
from both hemispheres. On the other hand, the number of channels selected at
least 30% of iterations for the HV vs. LV approach ranged between 1 and 3.
In this case, these channels were mainly located in frontal and fronto-central
regions of both brain hemispheres.

Table 1. Channels selected by the SFS approach at least 30% of iterations of the
cross-validation process for each frequency band and for the whole spectrum.

Frequency band | Channels selected (Arousal) | Channels selected (Valence)
Theta     | T7, O2               | F8
Alpha     | F8, FC6, O2          | F7, FC5
Beta      | F7, FC5, T7, P7, P8  | F4, FC5, P7
Gamma     | AF4, FC6, T7, T8, P8 | F3, F4, FC5
All bands | F7, P7               | AF4

These channels selected by the SFS approach were used as input features in
the decision tree classification model designed for each frequency band and for
both arousal and valence schemes. Results of performance of these classification
procedures are included in Table 2. In the arousal case, Acc values obtained
for the different frequency bands ranged between 66.78% (whole spectrum) and
72.54% (beta band). On the other hand, the classification of valence groups
reported Acc outcomes between 59.57% (theta and alpha bands) and 68.21%
(beta band). Beta band provided the best results when discerning between high
and low levels of the two emotional dimensions. Furthermore, it is interesting to
note that Se and Sp values are quite unbalanced for both valence and arousal
approaches. Concretely, Sp results are notably higher than Se in both cases, thus
demonstrating a good performance for the detection of low levels of valence (70–
95%) and arousal (73–95%), but more difficulties for the identification of high
levels (22–66% for valence and 30-63% for arousal).

Table 2. Classification performance of HA vs. LA and HV vs. LV schemes for the different frequency bands and for the whole spectrum.

Frequency band | Arousal Se (%) | Arousal Sp (%) | Arousal Acc (%) | Valence Se (%) | Valence Sp (%) | Valence Acc (%)
Theta     | 30.70 | 95.03 | 70.17 | 23.60 | 95.09 | 59.57
Alpha     | 58.77 | 80.66 | 72.20 | 22.36 | 96.32 | 59.57
Beta      | 63.16 | 78.45 | 72.54 | 66.26 | 70.19 | 68.21
Gamma     | 49.12 | 86.74 | 72.20 | 56.52 | 72.39 | 64.51
All bands | 56.14 | 73.48 | 66.78 | 41.72 | 81.99 | 61.73

4 Discussion and Conclusions


In this study, the spectral properties of EEG signals have been evaluated for
the identification of different emotional states. Concretely, the spectral power

of the different frequency bands of the EEG spectrum has been computed to
discern between high and low levels of valence and arousal. The SFS approach
used for the selection of the EEG channels that minimized the misclassification
rate reported that frontal channels were the most relevant for valence detection,
whereas electrodes located in the posterior half of the brain provided more infor-
mation for arousal identification. Interestingly, the relationship between these
brain areas and the two emotional dimensions has already been described in the
scientific literature [2,7,13]. Other studies have also highlighted the relevance
of these regions in emotional processes in healthy subjects [10,21,29], patients
with disorder of consciousness [14], or children with autism [24]. The relevance
of these areas could be reinforced by the possible existence of anatomical con-
nections between those regions [22] and the functional relation between them in
opposite hemispheres, such that the activation of the frontal lobe in one hemi-
sphere could occur together with an activation in posterior areas in the other
hemisphere [5,25].
Classification results obtained for the case of detection of high and low valence
ranged between 59 and 68%, while in the case of high and low arousal these
results achieved values between 66 and 72%. Although all frequency bands
reported a similar performance, the best classification accuracy was reported
by beta band in both valence and arousal schemes. In addition, this frequency
band also presented the most balanced relationship between Se and Sp outcomes,
thus showing a proportional capability to detect both high and low levels of the
corresponding emotional dimension. Beta rhythm, typically related to an alert
state of mind, has already reported interesting information in previous works
of emotions recognition. For instance, beta band has been associated with emo-
tional stress, presenting a lower beta activity for stressed than for non-stressed
subjects [12]. In a similar line, an increase in beta waves, together with emo-
tional problems and childhood trauma, has been associated with hyperarousal
cases [17].
This study has corroborated the capability to discern between different emo-
tional states by means of the analysis of EEG signals recorded with wearable and
off-the-shelf devices. Therefore, it opens the possibility of recognizing emotions
out of laboratory conditions, in real life scenarios, with the purpose of adapting
and improving the wellbeing of the individual and its interaction with automatic
systems. Furthermore, one of the main strengths of this work is the use of the
publicly available dataset DREAMER, thus guaranteeing the reproducibility of
the experiments and allowing to directly compare the results of applying different
methodologies of analysis.
In addition, it is interesting to highlight that the classification results
obtained in this study could be improved by means of using more input features
in the classification model, i.e., using the 14 EEG channels available instead of
only those electrodes selected by the SFS approach. Another option could be
the application of more complex machine learning or deep learning techniques.
Although these options could report higher classification outcomes, the clinical
interpretation of the results would be blurred by the complexity of the models,

thus not being able to control the contribution of each brain region to the final
performance. On the contrary, the implementation of a decision tree classifier
with a reduced number of input features previously selected by an SFS approach
allows to give a clinical interpretation of the results obtained, thus revealing new
findings about the brain’s behavior under different emotional conditions.

Acknowledgments. This work was partially supported by Spanish Ministerio de


Ciencia, Innovación y Universidades, Agencia Estatal de Investigación (AEI)/European
Regional Development Fund (FEDER, UE) under EQC2019-006063-P, PID2020-
115220RB-C21, and 2018/11744 grants, and by Biomedical Research Networking Cen-
tre in Mental Health (CIBERSAM) of the Instituto de Salud Carlos III. Beatriz Garcı́a-
Martı́nez holds FPU16/03740 scholarship from Spanish Ministerio de Educación y For-
mación Profesional.

References
1. Alarcao, S.M., Fonseca, M.J.: Emotions recognition using EEG signals: a survey.
IEEE Trans. Affect. Comput. 10(3), 374–393 (2017)
2. Alia-Klein, N., et al.: Trait anger modulates neural activity in the fronto-parietal
attention network. PLoS ONE 13(4), e0194444 (2018)
3. Coan, J.A., Allen, J.J.B.: Handbook of Emotion Elicitation and Assessment.
Oxford University Press, Oxford (2007)
4. Cohen, M.X.: Analyzing Neural Time Series Data: Theory and Practice. MIT
Press, Cambridge (2014)
5. Davidson, R.J.: Affect, cognition, and hemispheric specialization. In: Emotion,
Cognition, and Behavior, pp. 320–365. Cambridge University Press, New York
(1988)
6. Delorme, A., Makeig, S.: EEGLAB: an open source toolbox for analysis of single-
trial EEG dynamics including independent component analysis. J. Neurosci. Meth-
ods 134(1), 9–21 (2004)
7. Dolcos, F., Cabeza, R.: Event-related potentials of emotional memory: encoding
pleasant, unpleasant, and neutral pictures. Cogn. Affect. Behav. Neurosci. 2(3),
252–263 (2002)
8. Egger, M., Ley, M., Hanke, S.: Emotion recognition from physiological signal anal-
ysis: a review. Electron. Notes Theor. Comput. Sci. 343, 35–55 (2019)
9. Ekman, P.: An argument for basic emotions. Cogn. Emot. 6(3–4), 169–200 (1992)
10. Garcı́a-Martı́nez, B., Martı́nez-Rodrigo, A., Zangróniz, R., Pastor, J.M., Alcaraz,
R.: Symbolic analysis of brain dynamics detects negative stress. Entropy 19(5),
196 (2017)
11. Han, J., Zhang, Z., Schuller, B.: Adversarial training in affective computing and
sentiment analysis: recent advances and perspectives. IEEE Comput. Intell. Mag.
14(2), 68–81 (2019)
12. Hayashi, T., Okamoto, E., Nishimura, H., Mizuno-Matsumoto, Y., Ishii, R., Ukai,
S.: Beta activities in EEG associated with emotional stress. Int. J. Intell. Comput.
Med. Sci. Image Process. 3(1), 57–68 (2009)
13. Heller, W., Nitschke, J.B.: The puzzle of regional brain activity in depression and
anxiety: the importance of subtypes and comorbidity. Cogn. Emot. 12(3), 421–447
(1998)

14. Huang, H., et al.: An EEG-based brain computer interface for emotion recognition
and its application in patients with disorder of consciousness. IEEE Trans. Affect.
Comput. (2019)
15. Ismail, W.W., Hanif, M., Mohamed, S., Hamzah, N., Rizman, Z.I.: Human emotion
detection via brain waves study by using electroencephalogram (EEG). Int. J. Adv.
Sci. Eng. Inf. Technol. 6(6), 1005–1011 (2016)
16. Jebelli, H., Hwang, S., Lee, S.: EEG signal-processing framework to obtain high-
quality brain waves from an off-the-shelf wearable EEG device. J. Comput. Civ.
Eng. 32(1), 04017070 (2018)
17. Jin, M.J., Kim, J.S., Kim, S., Hyun, M.H., Lee, S.H.: An integrated model of emo-
tional problems, beta power of electroencephalography, and low frequency of heart
rate variability after childhood trauma in a non-clinical sample: a path analysis
study. Front. Psych. 8, 314 (2018)
18. Jung, Y., Hu, J.: A K-fold averaging cross-validation procedure. J. Nonparametric
Stat. 27(2), 167–179 (2015)
19. Katsigiannis, S., Ramzan, N.: DREAMER: a database for emotion recognition
through EEG and ECG signals from wireless low-cost off-the-shelf devices. IEEE
J. Biomed. Health Inform. 22(1), 98–107 (2018)
20. Klem, G.H., Lüders, H.O., Jasper, H.H., Elger, C.: The ten-twenty electrode system
of the International Federation. Electroencephalogr. Clin. Neurophysiol. 52, 3–6
(1999)
21. Martı́nez-Rodrigo, A., Garcı́a-Martı́nez, B., Alcaraz, R., González, P., Fernández-
Caballero, A.: Multiscale entropy analysis for recognition of visually elicited nega-
tive stress from EEG recordings. Int. J. Neural Syst. 29(02), 1850038 (2019)
22. Nauta, W.J.: Neural associations of the frontal cortex. Acta Neurobiol. Exp. 32(2),
125–140 (1972)
23. Poria, S., Cambria, E., Bajpai, R., Hussain, A.: A review of affective computing:
from unimodal analysis to multimodal fusion. Inf. Fusion 37, 98–125 (2017)
24. Portnova, G., Maslennikova, A., Varlamov, A.: Same music, different emotions:
assessing emotions and EEG correlates of music perception in children with ASD
and typically developing peers. Adv. Autism 4(3), 85–94 (2018)
25. Rubia, K.: The neurobiology of meditation and its clinical effectiveness in psychi-
atric disorders. Biol. Psychol. 82(1), 1–11 (2009)
26. Russell, J.A.: A circumplex model of affect. J. Pers. Soc. Psychol. 39(6), 1161–1178
(1980)
27. Sanei, S.: Adaptive Processing of Brain Signals. Wiley, Hoboken (2013)
28. Schröder, M., Cowie, R.: Towards emotion-sensitive multimodal interfaces: the
challenge of the European Network of Excellence HUMAINE. In: Adapting the
Interaction Style to Affective Factors Workshop in Conjunction with User Modeling
(2005)
29. Soroush, M.Z., Maghooli, K., Setarehdan, S.K., Nasrabadi, A.M.: Emotion recog-
nition through EEG phase space dynamics and Dempster-Shafer theory. Med.
Hypotheses 127, 34–45 (2019)
30. Valenza, G., Lanata, A., Scilingo, E.P.: The role of nonlinear dynamics in affective
valence and arousal recognition. IEEE Trans. Affect. Comput. 3(2), 237–249 (2012)
A Fine Dry-Electrode Selection
to Characterize Event-Related Potentials
in the Context of BCI

Vinicio Changoluisa1,2, Pablo Varona1, and Francisco B. Rodriguez1(B)

1 Grupo de Neurocomputación Biológica, Dpto. de Ingeniería Informática, Escuela
Politécnica Superior, Universidad Autónoma de Madrid, 28049 Madrid, Spain
[email protected], {pablo.varona,f.rodriguez}@uam.es
2 Grupo de Investigación en Electrónica y Telemática (GIETEC),
Universidad Politécnica Salesiana, Quito, Ecuador

Abstract. A brain-computer interface (BCI) detects brain activity and


converts it to external commands, facilitating the interaction with exter-
nal devices. One way to implement a BCI is through event-related poten-
tials (ERP), which are positive or negative voltage deflections detected
by electroencephalography (EEG) through conductive electrodes. A very
promising dry-electrode technology has been adopted in recent years; it
is much easier and faster to set up and is also useful for daily-life
applications. Its disadvantage, however, is a lower signal-to-noise ratio
compared with traditional wet-electrode technology. Thus, we hypothe-
sized that an appropriate selection of dry electrodes allows the recovery
of much more information than the traditional standard electrodes and
therefore improves the BCI performance. This work shows the impor-
tance of electrode selection to obtain a better detection of the ERPs of
the EEG signal with a minimum number of electrodes in a personalized
manner. To illustrate this problem, we designed a BCI experiment based
on P300-ERPs with a dry-electrode wireless EEG system, and we evalu-
ated its performance with two electrode selection methodologies designed
for this purpose in 12 subjects. The experimental analysis of this work
shows that our electrode selection methodology allows the P300-ERPs to
be detected with greater precision than a standard electrode set choice.
Besides, this minimum electrode selection methodology allows dealing
with the well-known problem of inter- and intrasubject variability of
the EEG signal, thus customizing the optimal selection of electrodes for
each individual. This work contributes to the design of more friendly
BCIs through a reduction in the number of electrodes, thus promoting
more precise, comfortable, and lightweight equipment for real-life BCI
applications.

Keywords: Event-related potentials · Low-cost BCI · Electrode selection ·
Inter- and intrasubject variability · Bayesian linear discriminant analysis ·
EEG signal · Oddball paradigm

© Springer Nature Switzerland AG 2021
I. Rojas et al. (Eds.): IWANN 2021, LNCS 12861, pp. 230–241, 2021.
https://doi.org/10.1007/978-3-030-85030-2_19

1 Introduction
A Brain-computer Interface (BCI) detects brain activity and converts it into
external commands facilitating the interaction with external devices. This tech-
nology was initially conceived for clinical use [34]; currently it is widely accepted
in various fields such as education, entertainment [4,20,26], or even military
use. Electroencephalography (EEG) is the most widely used neuronal activity
detection technique in BCIs, not only for its temporal resolution (milliseconds)
[25], but also for its relatively low cost and comparative safety, since the user is not
exposed to strong electric or magnetic fields [12]. Electrical brain activity is
acquired through conductive electrodes placed on the scalp in the region under
study. These can be of the wet type, which implies placing a conductive gel
or paste to achieve an optimal impedance and record high-quality signals. A
relatively new alternative is the use of dry electrodes that remedies the dis-
advantages of wet electrodes such as the long set-up time, dirty hair, periodic
replenishment, and irritation in some cases [12,19]. Although the performance of
dry electrodes has been increasingly enhanced, this technology still has problems
to overcome, such as a lower signal-to-noise ratio; offset, drift and phase issues; and high sensi-
tivity to motion artifacts [12]. Dry-electrode systems show promise [22], but this
young technology still needs to be refined. Thus, dry electrode systems together
with the design of new algorithms that improve event detection precision are
considered fundamental elements that will make it possible to meet the challenges of future
BCIs: more precise, comfortable, adaptable to the subject, easy to use and install
[4,20,22,26].
One of the ways to implement a BCI is through event-related potentials
(ERP), which are positive or negative voltage deflections that can be detected
in electroencephalograms. There are several types of ERPs such as the N200,
P300, P500, N2pc, etc. P300 is one of the most used ERPs in the context of BCIs.
This type of ERP is characterized by appearing 300 ms after the onset of a stimulus,
which can be visual or auditory [31]. The oddball paradigm generates a P300 ERP:
two types of stimuli are presented randomly, one frequently and the
other infrequently (target), and the latter is the one that generates the ERP. Although
P300 is named after its prominent positive deflection, there are other accompanying
positive and negative deflections [9,24], which play an
important role in the study of the brain and cognition and strongly influence the
accuracy of BCIs [3,7,11,13,17,27,32,33]. These positive or negative electrical
deflections related to P300 can be identified in different brain regions such as
the frontal, central and posterior areas [3,7,13,17,24], although the latter region is the
one with the highest activity, as has been reported from the seminal article
[10] to the most recent BCIs [3,7,13,17,30]. Thus, an unofficial standard
electrode configuration has been established, prioritizing the posterior side with
the common presence of the Pz, Oz, PO electrodes and the Fz, Cz electrodes
in the frontal and central regions, respectively. Despite the good performance
of these electrodes in the standard configuration, there is evidence that proper
electrode selection improves the accuracy in P300-based BCI [5–8,21]. Thus, this
reinforces the evidence that more electrodes do not mean better precision, and

in some cases, it is even considered detrimental to performance [3,13]. This remains a
latent problem, and several efforts are being concentrated on identifying the electrodes that
give the best performance [26,35]. A proper selection of electrodes is affected by
the non-stationary nature of neuronal activity [16], which is also associated with
P300 amplitude and latency variability [15]. Ignoring this variability impairs
the detection of ERPs and therefore affects the precision of ERP-based BCIs.
This variability occurs both in the same subject from trial to trial [23], called
intra-subject variability, and between subjects, called inter-subject variability [18].
Although for many years this variability has been seen as “a nuisance to be
controlled for” that impairs the detection of the signal under study, it can also be
seen as an opportunity [14,28] that can help to improve the accuracy and
personalized control of BCIs [18].
Considering this intrinsic inter- and intrasubject variability of the EEG signal
and the low signal-to-noise ratio of dry electrode systems, we hypothesized that the
appropriate selection of dry electrodes allows the recovery of much more infor-
mation than the traditional standard choice of electrodes and therefore improves
the BCI performance between days. To investigate the variability and accuracy
of EEG-based BCI with a dry-electrode system, we designed an experiment of
P300-based BCIs with a wireless EEG system and evaluated the performance
with two electrode selection methodologies in 12 subjects. We compared a stan-
dard electrode configuration widely used in P300-based BCIs with that arising
from a new electrode selection method based on the area under the curve (AUC)
that has had good results with wet electrodes [7]. Thus, the objective of this work
is to show the importance of dry electrode selection through a recently developed
method [7], which adds a new way of scoring the different electrodes based on
the variability measured with the variance, considering the lower signal quality of
dry-electrode systems compared to a wet-electrode system.
This article is organized as follows: Sect. 2 describes the experiment, the elec-
trode selection, and the classification methods used. Section 3 shows the results
achieved and the last section the discussion and conclusions.

2 Materials and Methods

2.1 Participants

Twelve volunteer subjects (age 27.1 ± 3.57 SD, 2 female) without any history


of neurological or psychiatric illness participated in this experiment. Prior to
the experiment, the details of the experimental protocol were explained to the
participants, who were asked to read and sign the written informed consent. Permission of the ethics
committee of the Autonomous University of Madrid was obtained within the
Spanish Government Grant TIN2017-84452-R.

Experimental Protocol. Our experiment was based on that carried out by


Hoffmann et al. [13]. Six images were presented one by one in random order, and
the user was asked to silently count the times when a target image was repeated.

Each presentation of an image lasted 100 ms, and during the following 300 ms
no image was presented. The presentation of six images was called a trial and
was repeated 15 times, which was called a run. Our stimulation module was
programmed so that the last image of a trial could not be repeated as the first
one of the next. This condition prevents the characteristics of the brain activity
associated with the same image from overlapping. In order to analyze inter- and
intra-subject variability, we designed the experiment to be executed on 3 different
days. Every day 2 sessions were executed, each session with 6 runs. The delay
period between the first day and the second was one day, and between the second
and the third, three days.

2.2 EEG Recording and Preprocessing

Data were acquired using the wireless EEG device Enobio with 8 passive dry electrodes
positioned according to the 10–20 international electrode placement system: Pz,
Cz, Oz, P4, P3, P8, P7 and Fpz. The signal was digitized with a sampling rate of
500 Hz. In offline preprocessing, the data were filtered with a sixth-order forward-
backward Butterworth bandpass filter, and cut-off frequencies were set to 1.0 Hz
and 12.0 Hz, following previous tests [13]. These data were downsampled from 500 Hz
to 32 Hz, and each run was standardized to a mean of 0 and standard deviation of
1. An epoch of 1,000 ms from the onset of the stimulus was considered for our
analysis.
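
The preprocessing chain just described can be sketched in Python as follows; this is an
illustrative reconstruction, not the original code, and the array layout (channels × samples),
the variable names and the per-channel standardization are assumptions.

import numpy as np
from scipy.signal import butter, filtfilt, resample_poly

FS_RAW, FS_NEW = 500, 32       # original and target sampling rates (Hz)

def preprocess_run(raw, onsets_s):
    # raw: (n_channels, n_samples) array at 500 Hz; onsets_s: stimulus onsets in seconds
    # sixth-order forward-backward Butterworth band-pass, 1.0-12.0 Hz
    b, a = butter(6, [1.0, 12.0], btype="bandpass", fs=FS_RAW)
    filtered = filtfilt(b, a, raw, axis=1)
    # downsample 500 Hz -> 32 Hz (rational factor 16/250)
    down = resample_poly(filtered, up=16, down=250, axis=1)
    # standardize the run to zero mean and unit standard deviation
    down = (down - down.mean(axis=1, keepdims=True)) / down.std(axis=1, keepdims=True)
    # cut 1000-ms epochs (32 samples at 32 Hz) from each stimulus onset
    n_epoch = FS_NEW
    starts = [int(t * FS_NEW) for t in onsets_s]
    return np.stack([down[:, s:s + n_epoch] for s in starts])  # (stimuli, channels, 32)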

2.3 Description of the MaxAUC Method: Continuous ERP


Characterization

Our maxAUC method was proposed in [7] as a way to characterize the ERPs
within the signal, which preserves temporospatial information and proved to
contribute to an optimal selection of electrodes. This methodology benefits from
the continuous calculation of the AUC in each epoch of the EEG signal related to
the presented stimuli. In each epoch (six in the present work) of a trial, a sliding
window was configured and the AUC was calculated, which keeps track of the
temporal and spatial information through the continuous measurement of the
area under the curve in ERP components. Finally, as a result of this continuous
tracking, we obtain a hit vector ĥ. Two hit vectors can be obtained: one for the
maximum AUC and the other for the minimum ones. In this work, we use the
hit vector related to the minimum AUCs, since we showed in the work [7] that
they characterized the P300 wave more adequately.
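
A minimal Python sketch of this continuous tracking is given below. It reflects our
reading of the procedure in [7]: the area under the curve is computed at every position
of a sliding window within an epoch, and the extreme (maximum and minimum) windowed
areas of each electrode are kept; collected over stimuli, these form the hit vectors ĥ.
The window length is illustrative, not the value used in [7].

import numpy as np

def auc_track(epoch_ch, win=8):
    # epoch_ch: 1-D EEG epoch of one electrode; the (signed) area of each
    # sliding-window position, approximated here by the sample sum
    return np.array([epoch_ch[s:s + win].sum()
                     for s in range(len(epoch_ch) - win + 1)])

def epoch_extreme_areas(epoch, win=8):
    # epoch: (n_channels, n_samples); keep, per electrode, the largest and the
    # smallest windowed areas (stacked over stimuli, they build the hit vectors)
    tracks = np.array([auc_track(ch, win) for ch in epoch])
    return tracks.max(axis=1), tracks.min(axis=1)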

2.4 Electrode Selection Methods

For the selection of electrodes, two methods were considered: the standard choice
and the selection of maxAUC-N. Below we show the details of each one. For their
comparison, we will use different numbers of electrodes: from a minimal choice of
one electrode, up to seven electrodes, as we will see below.

Standard Electrode Selection (SES): Named for its wide use by default.
This electrode system gives priority to the posterior brain region where a greater
presence of signal has been evidenced, which contributes to improving the precision
of P300-based BCIs [3,10,13,17]; further detail can be found in Table 1 of our
previous work [7]. Following this methodology, we categorized the electrodes into
5 sets according to the number of electrodes in the analysis. First set: Pz; second
set: Pz, Cz; third set: Pz, Cz, Oz; fourth set: Pz, Cz, Oz, P8, P7; and fifth set:
Pz, Cz, Oz, P4, P3, P8, P7.
Electrode selection by the maxAUC-N method was raised as an alterna-
tive to SES. This methodology takes advantage of the characterization of the
EEG signal, explained in the previous section, to select electrodes. The hit vec-
tor ĥ is scored with a metric summarizing the property of each electrode. In this
work, we use two types of score:
i. AUC as score in electrode selection: to score each electrode, we measure the
area in each one, considering the good results achieved in our previous work
[7].
ii. Variance as score in electrode selection: additionally, in this work, we also
incorporate the variance as a metric to score our hit vector ĥ. We must
remember that P300 is accompanied by other components such as P1, N1,
P2, N2 [29,32,33] which contribute to the detection of the target signal. All
these components can cause the variance of the vector ĥ to increase when
there is a P300-ERP, and therefore that is the motivation to use this as a
score.
The electrode sets are obtained as follows. For the one-electrode set, the one
with the highest score is chosen. For the rest of the sets (2, 3, 5 and 7), we divide
the electrodes into two groups: F/C, which corresponds to electrodes of the frontal
and central regions, and P/O, which corresponds to the parietal and occipital regions.
We choose the electrode with the best score in the F/C region and the rest from the
P/O region, considering the scores obtained. With this last selection, we value
the scientific evidence of the importance of the posterior region, manifested from
the seminal articles [3,10,13,17]. This selection methodology is the one used in
[7], where it is explained in more detail.
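
As an illustration, the set construction can be coded as below; the frontal/central vs
parietal/occipital grouping follows the montage of Sect. 2.2, while hit_matrix (one
minimum-AUC hit vector per electrode) and the scoring functions are hypothetical names
for the quantities described above.

import numpy as np

CHANNELS = ["Pz", "Cz", "Oz", "P4", "P3", "P8", "P7", "Fpz"]
FC = {"Cz", "Fpz"}                    # frontal/central group
PO = set(CHANNELS) - FC               # parietal/occipital group

def select_set(hit_matrix, n_electrodes, score="var"):
    # score every electrode by the variance or by the area of its hit vector
    scores = hit_matrix.var(axis=1) if score == "var" else np.abs(hit_matrix).sum(axis=1)
    order = np.argsort(scores)[::-1]  # best-scored electrodes first
    if n_electrodes == 1:
        return [CHANNELS[order[0]]]
    best_fc = next(CHANNELS[i] for i in order if CHANNELS[i] in FC)
    best_po = [CHANNELS[i] for i in order if CHANNELS[i] in PO]
    return [best_fc] + best_po[:n_electrodes - 1]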

2.5 Classification
Target signal recognition in P300 ERPs can be thought of as a binary classifica-
tion problem: P300 vs non-P300 signals. Bayesian Linear Discriminant Analysis
(BLDA) was proposed by Hoffmann et al. [13] for such problems in P300-based
BCI. BLDA is an extension of Fisher’s Linear Discriminant Analysis (FLDA) [2]
but runs the regression in a Bayesian framework, allowing it to automatically
estimate the degree of regularization. In this way, BLDA prevents overfitting
in noisy and high-dimensional datasets. BLDA considers that class labels t can be
expressed as a weighted sum of the features in the corresponding feature vec-
tor x, and it is assumed that this linear dependence is corrupted by a certain

amount of Gaussian noise n: t = w^T x + n. The class probability can be obtained


by calculating the probability of the target value during the training process.
Class labels indicate the target (P300) and non-target (non-P300) as a num-
ber y ∈ {1, −1}. The EEG signal of each stimulus is a feature vector which is
represented as x ∈ R^D, where D indicates the number of features. The weight
vectors w ∈ R^D are obtained by solving the parameter selection problem using
maximum-likelihood estimates [13]. Once the w parameters are adjusted to dif-
ferent classes, this Bayesian formalism allows us to calculate the probability
that a new feature vector x has class label y = 1 or y = −1. The probabilistic
model used is expressed in terms of a predictive distribution, for simplicity, with
a Gaussian form that can be characterized by its mean and its variance. Here
only the mean value of the predictive distribution was assessed for the decision
making.
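
The regression view described above can be illustrated with the short Python sketch
below. Note that BLDA estimates the degree of regularization automatically within its
Bayesian framework [13]; for brevity this stand-in uses a fixed ridge penalty alpha, so
it is a simplified surrogate rather than the exact classifier employed here.

import numpy as np

def fit_linear_weights(X, y, alpha=1.0):
    # X: (n_trials, n_features) feature vectors; y: labels in {+1, -1}
    # regularized least-squares fit of t = w^T x + n
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

def predictive_mean(X_new, w):
    # mean of the predictive distribution: the larger the score, the more
    # P300-like the epoch; in the speller, the stimulus whose epochs obtain
    # the largest average score is selected as the target
    return X_new @ w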

2.6 Cross-Validation
The classifier BLDA was validated using a K-fold cross-validation scheme, where
K is equal to the number of sessions in each analysis. We present three types of
analysis in this work:
i. Sessions by days (three days): We have two sessions (each one with six
runs) per subject. We train the model with one session and test with the other.
These results were averaged for each subject on each day. This analysis
allows us to observe how the detection of P300s varies from day to day for all
subjects.
ii. Two-day sessions: The model was trained with three sessions and tested
with one.
iii. Three-day sessions: The model was trained with five sessions and tested
with one. For the two- and three-day sessions, the test results were averaged
and one accuracy was obtained for each subject. This analysis and the pre-
vious one allow us to observe how the combination of data from two and
three days helps us improve the information related to the target stimulus for
all users.
The validation was executed for each set of electrodes (1, 2, 3, 5, 7) considered
in our analysis.
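
A compact sketch of this session-wise validation (leave-one-session-out over the K
sessions entering each analysis) could look as follows; fit and accuracy stand for the
BLDA training and evaluation routines and are placeholders.

import numpy as np

def session_cross_validation(sessions_X, sessions_y, fit, accuracy):
    # sessions_X / sessions_y: one feature matrix and one label vector per session;
    # train on K-1 sessions, test on the held-out one, and average the accuracies
    accs = []
    for k in range(len(sessions_X)):
        train_X = np.vstack([X for i, X in enumerate(sessions_X) if i != k])
        train_y = np.concatenate([y for i, y in enumerate(sessions_y) if i != k])
        model = fit(train_X, train_y)
        accs.append(accuracy(model, sessions_X[k], sessions_y[k]))
    return float(np.mean(accs))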

2.7 Statistical Analysis


We use the Wilcoxon signed-rank test (WSR test, p < 0.05) to assess whether
the difference between the accuracies provided by our maxAUC-N method and
that of the standard electrode selection was statistically significant.
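
For reference, this paired comparison can be run directly with SciPy; acc_maxauc and
acc_standard would hold, for a given electrode-set size, the accuracies obtained by the
two selection methods on the same test cases.

from scipy.stats import wilcoxon

def compare_accuracies(acc_maxauc, acc_standard, alpha=0.05):
    # two-sided Wilcoxon signed-rank test on paired accuracy values
    statistic, p_value = wilcoxon(acc_maxauc, acc_standard)
    return p_value, p_value < alpha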

3 Results
For a study of inter- and intrasubject variability, we analyzed the accu-
racy on each day. The cross-validation was carried out according to Sect. 2.6.
236 V. Changoluisa et al.

Figure 1 shows how our electrode selection method outperforms the standard
selection method on every day of the experiment. Our method obtains the greatest advantage with
the minimum number of electrodes. We can see that one electrode can exceed
the performance of 7 standard electrodes (see 2nd day in Fig. 1). The new pro-
posal to score the feature vector corresponding to each electrode according to
the variance is superior in all the electrode sets compared to the selection of
standard electrodes.

Fig. 1. Accuracy per day: Each panel corresponds to the analysis of two sessions
per day of 12 subjects. The accuracy achieved by our electrode selection methodology
(blue and red line) is superior to the traditional methodology and adapts to the daily
features of each subject. (Color figure online)

The results of applying the Wilcoxon signed-rank test to the accuracy of each
set of electrodes show significant differences in Table 1, especially in the data
sets with fewer electrodes and in those for which the variance was used to select them.
The second day shows a significant difference in all electrode sets. The set of
seven electrodes on the first and third days did not show significant differences,
nor did the five-electrode set when scored with the AUC.
We analyzed the performance of the electrode selection by gathering data from
several days. A first analysis joined two days (four sessions), as explained in
Sect. 2.6. The accuracy achieved with our selection of electrodes exceeds in all

Table 1. p-values of Wilcoxon test by days. The table shows the significant
differences in the accuracies achieved on each set of electrodes day by day.

          Number of electrodes in each set
          1              2              3              5              7      Score
1st day   7.49 × 10^−6   8.48 × 10^−7   0.01           0.23           0.45   AUC
2nd day   1.68 × 10^−10  1.49 × 10^−9   1.14 × 10^−5   6.60 × 10^−3   0.01
3rd day   6.75 × 10^−7   6.3 × 10^−3    21.3 × 10^−3   0.73           0.29
1st day   8.55 × 10^−10  2.21 × 10^−9   7.89 × 10^−4   2.97 × 10^−2   0.40   VAR
2nd day   1.06 × 10^−14  7.54 × 10^−9   1.74 × 10^−7   2.10 × 10^−3   0.01
3rd day   2.81 × 10^−15  6.57 × 10^−4   8.61 × 10^−6   5.8 × 10^−3    0.53

cases compared to the standard selection. Panels B and C in Fig. 2 show how
many times the accuracy achieved by our selection of electrodes exceeds the
traditional selection. In panel B we can see that with three electrodes our method
was better in all subjects. Panels D, E, F of Fig. 2 show a joint analysis of three
days (six sessions). We can see that our method maintains an advantage over the
standard methodology. Even when considering seven electrodes, there is a slight
improvement. In panel E, we can see that with two electrodes our method was
better in all subjects. Please note in panels B, C, E, and F, that our selection
method always outperforms the standard electrodes, i.e. the blue and red bars
are always above 50% of the total cases for all electrode sets (especially for few
electrodes).
The statistical analysis shows us that, like the daily analysis, when we join
sessions of different days, there are also significant differences between the accu-
racies of each electrode selection methodology, see Table 2.

Table 2. p-values of Wilcoxon test with two and three days. The table shows
the significant differences in the accuracies achieved on each set of electrodes with the
sessions of two days and three days.

            Number of electrodes in each set
            1              2              3              5              7      Score
Two-days    1.76 × 10^−6   7.79 × 10^−7   5.84 × 10^−4   0.03           0.09   AUC
Three-days  4.62 × 10^−14  9.71 × 10^−6   9.02 × 10^−4   0.18           0.40
Two-days    9.41 × 10^−10  5.70 × 10^−9   3.70 × 10^−6   0.02           0.08   VAR
Three-days  1.24 × 10^−14  4.24 × 10^−8   1.47 × 10^−5   7.90 × 10^−3   0.44

Fig. 2. Accuracy joining two and three days. Panel A shows the accuracy achieved
when the sessions of days one and two (of the 12 subjects) are joined. The accuracy of two
electrodes selected with the maxAUC method (red and blue) is sufficient to outperform
the standard selection (black). Panels B and C show how many times the maxAUC
method exceeded the standard; the maxAUC method always exceeds the standard with
three electrodes (see panel B). Panels D, E and F show the comparison of accuracy joining
the sessions of the three days. (Color figure online)

4 Discussion and Conclusions


A methodology to improve the management of inter- and intrasubject variability
and precision in P300-based BCIs with a dry-electrode system is presented in
this work. Our study shows the importance of selecting electrodes appropriately,
even more so in systems with few or a minimum number of electrodes and with
the dry-electrodes technology, which generates a poor signal quality compared
with wet electrode systems [12]. This advantage is achieved through the use of
a vector of characteristics obtained with the calculation of the AUC that has
shown good results with wet electrodes [7]. This method, which consists of a

personalized adaptation, is robust between different day sessions and reaches


the maximum significant difference with few electrodes compared to traditional
standard electrode methodology.
Our analysis shows that one properly selected electrode can outperform the
set of seven standard electrodes (see Fig. 1). The maxAUC-N method shows an
advantage on all days analyzed, although on the third day the accuracy is lower with 5
and 7 electrodes when the scoring metric is the AUC. We show that
electrodes scored with the variance feature perform better than those scored
with AUC. We hypothesize that our maxAUC method continuously characterizes
multiple components such as N1, P1, P2 and N2, which are detected when scoring
with the variance. Abundant evidence has shown that the presence of other
components influences the performance of the P300-based BCI [1,3,7,13,17,27,
29,32,33], several of them highlighting the importance of N1 [1,29] or N2 [3,11,
13,17,27,32,33]. Further studies are needed to explain the nature of the precision
improvement and the different ways to take advantage of our feature vector. The
maxAUC method characterizes the temporal evolution of ERPs, recognizing the
features related to the target stimuli in each subject and adapting between days,
which contributes to the challenge of developing algorithms that attempt to
model the BCI user [20].
The results obtained in this work are based on a new data set with the EEG
signals of 12 subjects recorded in the context of a dry electrode BCI, and lay the
foundations for exploring new algorithms with the aim of improving adaptability
and performance in dry electrode systems. Finally, this work contributes to the
design of more friendly BCIs through a reduction in the number of electrodes,
thus promoting more precise, comfortable and lightweight equipment for real-life
BCI applications [25].

Acknowledgments. The authors thank Vanessa Salazar for her collaboration in


acquiring data from 4 of the 12 subjects as part of her master thesis. This
work was funded by Spanish projects of Ministerio de Economı́a y Competi-
tividad/FEDER TIN2017-84452-R, PGC2018-095895-B-I00, PID2020-114867RB-I00
(http://www.mineco.gob.es/), Predoctoral Research Grants 2015-AR2Q9086 of the
Government of Ecuador through the Secretarı́a de Educación Superior, Ciencia, Tec-
nologı́a e Innovación (SENESCYT) and Universidad Politécnica Salesiana 041-02-2021-
04-16.

References
1. Bianchi, L., Sami, S., Hillebrand, A., Fawcett, I.P., Quitadamo, L.R., Seri, S.:
Which physiological components are more suitable for visual ERP based brain-
computer interface? A preliminary MEG/EEG study. Brain Topography 23(2),
180–185 (jun 2010)
2. Bishop, C.M.: Pattern recognition and machine learning (2006)
3. Blankertz, B., Lemm, S., Treder, M., Haufe, S., Müller, K.R.: Single-trial analysis
and classification of ERP components - a tutorial. NeuroImage 56(2), 814–825
(2011)

4. Nam, C.S., Nijholt, A., Lotte, F. (eds.): Brain-Computer Interfaces Handbook: Tech-
nological and Theoretical Advances. CRC Press (2018)
5. Changoluisa, V., Varona, P., Rodriguez, F.B.: How to reduce classification error
in ERP-based BCI: Maximum relative areas as a feature for p300 detection. In:
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial
Intelligence and Lecture Notes in Bioinformatics). vol. 10306 LNCS, pp. 486–497.
Springer Verlag (2017)
6. Changoluisa, V., Varona, P., Rodriguez, F.B.: An electrode selection approach
in P300-based BCIs to address inter- and intra-subject variability. In: 2018 6th
International Conference on Brain-Computer Interface (BCI 2018),
pp. 1–4. Institute of Electrical and Electronics Engineers Inc. (mar 2018)
7. Changoluisa, V., Varona, P., Rodriguez, F.B.: A Low-Cost Computational Method
for Characterizing Event-Related Potentials for BCI Applications and beyond.
IEEE Access 8, 111089–111101 (2020)
8. Colwell, K.A., Ryan, D.B., Throckmorton, C.S., Sellers, E.W., Collins, L.M.: Chan-
nel selection methods for the P300 Speller. Journal of Neuroscience Methods
232(Supplement C), 6–15 (2014)
9. Eimer, M.: The N2pc component as an indicator of attentional selectivity. Elec-
troencephalography and Clinical Neurophysiology 99(3), 225–234 (sep 1996)
10. Farwell, L.A., Donchin, E.: Talking off the top of your head: toward a mental pros-
thesis utilizing event-related brain potentials. Electroencephalography and Clinical
Neurophysiology 70(6), 510–523 (1988)
11. Frenzel, S., Neubert, E., Bandt, C.: Two communication lines in a 3 × 3 matrix
speller. Journal of Neural Engineering 8(3), 036021 (jun 2011)
12. Habibzadeh Tonekabony Shad, E., Molinas, M., Ytterdal, T.: Impedance and Noise
of Passive and Active Dry EEG Electrodes: A Review. IEEE Sensors Journal
20(24), 14565–14577 (dec 2020)
13. Hoffmann, U., Vesin, J.M., Ebrahimi, T., Diserens, K.: An efficient P300-based
brain-computer interface for disabled subjects. Journal of Neuroscience Methods
167(1), 115–125 (jan 2008)
14. van Horn, J.D., Grafton, S.T., Miller, M.B.: Individual variability in brain activity:
A nuisance or an opportunity? Brain Imaging and Behavior 2(4), 327–334 (dec
2008)
15. Intriligator, J., Polich, J.: On the relationship between EEG and ERP variability.
International Journal of Psychophysiology 20(1), 59–74 (jun 1995)
16. Kaplan, A.Y., Fingelkurts, A.A., Fingelkurts, A.A., Borisov, S.V., Darkhovsky,
B.S.: Nonstationary nature of the brain activity as revealed by EEG/MEG:
Methodological, practical and conceptual challenges. Signal Processing 85(11),
2190–2212 (nov 2005)
17. Krusienski, D.J., Sellers, E.W., McFarland, D.J., Vaughan, T.M., Wolpaw, J.R.:
Toward enhanced P300 speller performance. Journal of Neuroscience Methods
167(1), 15–21 (2008)
18. Li, F., Tao, Q., Peng, W., Zhang, T., Si, Y., Zhang, Y., Yi, C., Biswal, B., Yao,
D., Xu, P.: Inter-subject P300 variability relates to the efficiency of brain net-
works reconfigured from resting- to task-state: Evidence from a simultaneous event-
related EEG-fMRI study. NeuroImage 205, 116285 (jan 2020)
19. Lopez-Gordo, M., Sanchez-Morillo, D., Valle, F.: Dry EEG Electrodes. Sensors
14(7), 12847–12870 (jul 2014)

20. Lotte, F., Jeunet, C., Mladenovic, J., N’Kaoua, B., Pillette, L.: A BCI challenge
for the signal-processing community: considering the user in the loop. In: Signal
Processing and Machine Learning for Brain-Machine Interfaces, pp. 143–172. Insti-
tution of Engineering and Technology (sep 2018)
21. Mccann, M.T., Thompson, D.E., Syed, Z.H., Huggins, J.E.: Electrode subset selec-
tion methods for an EEG-based P300 brain-computer interface. Disability and
Rehabilitation: Assistive Technology 10(3), 216–220 (may 2015)
22. McFarland, D.J., Wolpaw, J.R.: EEG-based brain–computer interfaces (dec 2017)
23. Ouyang, G., Hildebrandt, A., Sommer, W., Zhou, C.: Exploiting the intra-subject
latency variability from single-trial event-related potentials in the P3 time range:
A review and comparative evaluation of methods (apr 2017)
24. Polich, J.: Updating P300: An integrative theory of P3a and P3b. Clinical Neuro-
physiology 118(10), 2128–2148 (oct 2007)
25. Ramadan, R.A., Vasilakos, A.V.: Brain computer interface: control signals review.
Neurocomputing 223, 26–44 (feb 2017)
26. Rashid, M., Sulaiman, N., P. P. Abdul Majeed, A., Musa, R.M., Ahmad, A.F., Bari,
B.S., Khatun, S.: Current Status, Challenges, and Possible Solutions of EEG-Based
Brain-Computer Interface: A Comprehensive Review (2020)
27. Riccio, A., Schettini, F., Simione, L., Pizzimenti, A., Inghilleri, M., Olivetti-
Belardinelli, M., Mattia, D., Cincotti, F.: On the Relationship Between Attention
Processing and P300-Based Brain Computer Interface Control in Amyotrophic
Lateral Sclerosis. Frontiers in Human Neuroscience 12, 165 (may 2018)
28. Seghier, M.L., Price, C.J.: Interpreting and Utilising Intersubject Variability in
Brain Function (jun 2018)
29. Shishkin, S.L., Ganin, I.P., Basyul, I.A., Zhigalov, A.Y., Kaplan, A.Y.: N1 wave in
the P300 BCI is not sensitive to the physical characteristics of stimuli. In: Journal
of Integrative Neuroscience. vol. 8, pp. 471–485. Imperial College Press (dec 2009)
30. Sirvent Blasco, J.L., Iáñez, E., Úbeda, A., Azorı́n, J.M.: Visual evoked potential-
based brain-machine interface applications to assist disabled people. Expert Sys-
tems with Applications 39(9), 7908–7918 (jul 2012)
31. Sutton, S., Braren, M., Zubin, J., John, E.R.: Evoked-potential correlates of stim-
ulus uncertainty. Science 150(3700), 1187–1188 (nov 1965)
32. Treder, M.S., Schmidt, N.M., Blankertz, B.: Gaze-independent brain-computer
interfaces based on covert attention and feature attention. Journal of Neural Engi-
neering 8(6), 066003 (dec 2011)
33. Treder, M.S., Blankertz, B.: (C)overt attention and visual speller design in an ERP-
based brain-computer interface. Behavioral and Brain Functions 6(1), 28 (may
2010)
34. Vidal, J.J.: Toward Direct Brain-Computer Communication. Annual Review of
Biophysics and Bioengineering 2(1), 157–180 (1973)
35. Yadav, D., Yadav, S., Veer, K.: A comprehensive assessment of Brain Computer
Interfaces: Recent trends and challenges. Journal of Neuroscience Methods 346,
108918 (dec 2020)
Detection of Emotions
from Electroencephalographic Recordings
by Means of a Nonlinear Functional
Connectivity Measure

Beatriz García-Martínez1,2(B), Antonio Fernández-Caballero1,2,3,
Raúl Alcaraz4, and Arturo Martínez-Rodrigo5,6

1 Departamento de Sistemas Informáticos, Escuela Técnica Superior de Ingenieros
Industriales, Universidad de Castilla-La Mancha, Albacete, Spain
[email protected]
2 Instituto de Investigación en Informática de Albacete,
Universidad de Castilla-La Mancha, Albacete, Spain
3 CIBERSAM (Biomedical Research Networking Centre in Mental Health),
Madrid, Spain
4 Research Group in Electronic, Biomedical and Telecommunication Engineering,
Escuela Politécnica de Cuenca, Universidad de Castilla-La Mancha, Cuenca, Spain
5 Research Group in Electronic, Biomedical and Telecommunication Engineering,
Facultad de Comunicación, Universidad de Castilla-La Mancha, Cuenca, Spain
6 Instituto de Tecnologías Audiovisuales de Castilla-La Mancha,
Universidad de Castilla-La Mancha, Cuenca, Spain

Abstract. The brain has been typically assessed as a group of inde-


pendent structures focused on the realization of determined processes
separately. Nevertheless, recent findings have confirmed the existence
of interconnections between all brain regions, thus demonstrating that
the brain works as a network. These areas can be interconnected either
physically, by anatomical links, or functionally, through functional asso-
ciations created for a coordinated development of mental tasks. In this
sense, the assessment of functional connectivity is crucial for discovering
new information about the brain’s behavior in different scenarios. In the
present study, the nonlinear functional connectivity metric cross-sample
entropy (CSE) is applied in the research field of emotion recognition
from EEG recordings. Concretely, CSE is computed to discern between
four different emotional states. The results obtained indicated that the
strongest coordination appears in intra- and inter-hemispheric interac-
tions of central, parietal and occipital brain regions, whereas associations
between left frontal and temporal lobes with the rest of areas show the
most dissimilar dynamics, and thus a more uncoordinated activity. In addi-
tion, coordination is globally higher under emotional conditions of high
arousal/low valence (like fear or distress) and low arousal/high valence
(such as relaxation or calmness).

Keywords: Electroencephalography · Emotion recognition · Functional
connectivity · Cross-sample entropy
© Springer Nature Switzerland AG 2021
I. Rojas et al. (Eds.): IWANN 2021, LNCS 12861, pp. 242–252, 2021.
https://doi.org/10.1007/978-3-030-85030-2_20

1 Introduction
The brain can be considered as the processing unit responsible for all tasks
developed by every part of the body [37]. Consequently, numerous studies have
analyzed the brain with the purpose of discovering new insights about the per-
formance of this essential organ under a wide variety of scenarios [42]. The tradi-
tional perspective of analysis has been focused on the association of a particular
function or process with a determined anatomical system, thus considering the
brain as an ensemble of independent structures [47]. Nonetheless, it has been
demonstrated that the brain works as a network in which all regions are inter-
connected for the development of every mental process [3]. This connectivity of
brain areas can be anatomical (given by the existence of physical and structural
links between areas) or functional (determined by the statistical dependencies
among separate non-physically linked regions) [32]. The organization of the brain
in functional networks denotes a synchronized activity of separated regions for
the development of different mental processes [41]. Therefore, it is essential to
evaluate the interactivity among brain areas for properly describing the mecha-
nisms of information processing of the brain [11].
The assessment of this connectivity has been mainly focused on the appli-
cation of linear measures of correlation and coherence of functional connections
among different areas [4,6]. However, the interactions between neurons in the
brain present a completely nonlinear and nonstationary behavior [3,15]. Conse-
quently, the application of linear measures for the evaluation of connectivity may
not provide a complete characterization of mental processes [14]. In this sense,
the use of nonlinear methodologies like the permutation disalignment index, trans-
fer entropy, or cross-mutual information, would report complementary informa-
tion about connectivity and functional interactions in the brain not revealed by
linear techniques [18,39]. Indeed, these nonlinear measures have already been
computed for the study of mental disorders like epilepsy [26], Alzheimer's disease [29],
schizophrenia [30] or depression [20], among others.
The evaluation of the brain connectivity would also provide interesting infor-
mation in the research field of emotion recognition. Indeed, in the last years
a number of studies have focused on the application of connectivity metrics
for that purpose [1,27,28]. One of those metrics is the well-known cross-sample
entropy (CSE), which estimates the functional connectivity between brain areas
through the evaluation of the repetitiveness of sequences within two brain time
series extracted from different regions [36]. The CSE has already reported valu-
able outcomes when applied to the identification of emotional states of calm and
distress [16].
As a result, in this work CSE is applied for the recognition of four different
emotional states from EEG recordings. More concretely, the emotions detected
correspond to the four quadrants of the circumplex emotional model of Russell.
It is a well-known scheme of distribution of emotions according to two dimensions
namely valence (i.e., degree of pleasantness or unpleasantness of a stimulus) and
arousal (i.e., level of activation or relaxation produced by a stimulus), as can be
observed in Fig. 1. Therefore, the application of CSE would provide new infor-

mation about the nonlinear interactivity dynamics between brain areas under
these four emotional states, namely HAHV (high arousal/high valence), HALV
(high arousal/low valence), LAHV (low arousal/high valence) and LALV (low
arousal/low valence).

[Fig. 1 depicts the valence/arousal plane: HALV (anger, fear, nervousness...) and
HAHV (happiness, pleasure, excitement...) in the high-arousal half, and LALV
(sadness, boredom, depression...) and LAHV (calmness, serenity, relaxation...) in
the low-arousal half.]

Fig. 1. Representation of the four quadrants of the valence/arousal model proposed by
Russell [40].

2 Methods

2.1 Database

The EEG signals analyzed in this study were extracted from the publicly avail-
able Database for Emotion Analysis using Physiological Signals (DEAP) [25].
This dataset contains a total of 1,280 samples from 32 healthy subjects between
19 and 37 years old (mean age 26.9, 50% male) during an emotional experiment.
More precisely, these participants visualized a total of 40 emotional videoclips of
1-min length, while EEG and other physiological variables were recorded. Con-
cretely, the EEG recordings were obtained with 32 channels distributed over
the scalp following the International Standard 10–20 system for positioning of
electrodes [22]. After each visualization, subjects rated their level of valence and

arousal by means of self-assessment manikins (SAM), a graphical scale repre-


senting nine intensity levels of the aforementioned emotional parameters [31].
The samples in the DEAP database covered the whole valence/arousal space.
However, a subset of those samples was selected and grouped in four classes fol-
lowing these criteria: samples in HAHV group presented arousal and valence
ratings ≥6; HALV samples were those with arousal ≥6 and valence ≤4; samples
in LAHV had an arousal ≤4 and valence ≥6; finally, LALV group contained sam-
ples with arousal and valence ≤4. Hence, the samples in the borderline between
two groups were discarded, such that only the trials with a strongly elicited
emotion were chosen in this study. The final number of samples in each group
was 267 in HAHV, 101 in HALV, 154 in LAHV and 124 in LALV.
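
These selection criteria can be written as a simple rule; the sketch below is merely
illustrative, with valence and arousal denoting the 1-9 SAM ratings of one trial.

def emotion_group(valence, arousal):
    # returns the class of a trial, or None for discarded borderline ratings
    if arousal >= 6 and valence >= 6:
        return "HAHV"
    if arousal >= 6 and valence <= 4:
        return "HALV"
    if arousal <= 4 and valence >= 6:
        return "LAHV"
    if arousal <= 4 and valence <= 4:
        return "LALV"
    return None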

2.2 EEG Preprocessing


The raw EEG signals often contain non-desired noise and interferences that need
to be eliminated before further analyses. In this work, the preprocessing proce-
dure was developed using the Matlab toolbox EEGLAB, created for the analysis
and processing of EEG recordings [10]. Firstly, signals were downsampled from the
original sampling frequency of 512 Hz to 128 Hz. Then, all electrodes were referenced
to the average of all channels by means of computing the mean potential of all
electrodes and subtracting it from every single channel [9]. Furthermore, two
forward/backward high-pass and low-pass filtering approaches were applied at
4 and 45 Hz of cutoff frequency, respectively. In this sense, baseline and power
line interferences were removed, while the frequency bands of interest were main-
tained [21]. Subsequently, other types of interferences were rejected by means
of a blind source separation technique called independent component analysis
(ICA). Concretely, this method is used for the elimination of artifacts that can
be produced by either physiological factors (like facial and muscular movements,
heart bumps, etc.) or technical aspects (such as electrode pops, or bad contacts
of the electrodes over the scalp). Once the independent components were com-
puted, those corresponding to artifacts were rejected, and thus only the neural
information was maintained in the EEG recordings. Finally, channels contami-
nated with high-amplitude noise were eliminated and reconstructed through the
interpolation of adjacent channels [34].
Although the duration of the trials in the DEAP database was 60 s, only
the last 30 s were selected for assessment in this study [19,25]. With the
purpose of avoiding the effect of amplitude differences between EEG channels,
the original time series x were normalized and transformed into y as follows:
y = (x − x̄)/σ,          (1)
where x̄ and σ are the mean and standard deviation of the signal x, respectively. Normalized
recordings were then divided into six non-overlapped equally-sized segments of
5 s of length (N = 3840 samples). CSE was computed for each segment to
compare the similarity of patterns between each EEG channel and the rest of
signals by pairs. This process was repeated for all the segments, and the final

CSE result for each pair of channels was calculated as the average of the partial
values obtained for the six segments.
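
A minimal Python sketch of this normalization and segmentation step is shown below;
trial is assumed to be an artifact-free (channels × samples) array sampled at 128 Hz,
and the names are illustrative.

import numpy as np

FS = 128                                     # sampling rate after downsampling (Hz)

def normalize_and_segment(trial, n_segments=6):
    last30 = trial[:, -30 * FS:]             # keep only the last 30 s of the trial
    mean = last30.mean(axis=1, keepdims=True)
    std = last30.std(axis=1, keepdims=True)
    y = (last30 - mean) / std                # Eq. (1), applied per channel
    return np.split(y, n_segments, axis=1)   # six equally-sized 5-s segments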

2.3 Cross-Sample Entropy (CSE)


CSE is a variation of cross-approximate entropy for the improvement of its rela-
tive consistency in different conditions [38]. Concretely, this index estimates the
degree of dissimilarity or asynchrony by comparing signals from two different
yet intertwined variables [35,38]. In addition, as CSE assesses both dominant
and secondary patterns in the data, it allows to quantify changes in the under-
lying dynamics that are not measurable in amplitudes or peak occurrences [46].
Mathematically, having two signals x1(n) and x2(n) of N samples in length,
CSE quantifies the frequency or conditional regularity of patterns in x1 similar
to a given pattern of x2 of length m within a tolerance r. A reduced number of
pattern matches is interpreted as a greater asynchrony among the two signals
and is represented with larger CSE values [38].
The calculation of CSE starts with the creation of N − m vectors of length
m samples from signals x_1(n) and x_2(n) [38]:

X^m_{1,i} = \{x_1(i), x_1(i+1), \ldots, x_1(i+m-1)\} \quad \text{and} \qquad (2)

X^m_{2,j} = \{x_2(j), x_2(j+1), \ldots, x_2(j+m-1)\}, \qquad (3)

with i and j ranging from 1 to N − m in both cases. Then, these vectors are
conformed by m consecutive samples of x_1 and x_2 starting at the i-th and j-th
points, respectively. On the other hand, the distance d^m_{ij} between X^m_{1,i} and X^m_{2,j}
is obtained as the maximum absolute difference in their respective scalar components:

d^m_{ij} = d[X^m_{1,i}, X^m_{2,j}] = \max_{k \in (0, m-1)} |x_1(i+k) - x_2(j+k)| \qquad (4)

The probability of having patterns from x_2 similar to another from x_1 of
length m, within a tolerance r, is then defined as

\phi^{m}(r)(x_1 \| x_2) = \frac{1}{N-m} \sum_{i=1}^{N-m} \frac{1}{N-m} \sum_{j=1}^{N-m} \Theta\!\left(r - d^{m}_{ij}\right), \qquad (5)

where \Theta(x) is defined as the Heaviside function [38]:

\Theta(x) = \begin{cases} 1, & \text{if } x \geq 0, \\ 0, & \text{if } x < 0. \end{cases} \qquad (6)

In the same manner, this probability is calculated as well considering patterns
of length m + 1:

\phi^{m+1}(r)(x_1 \| x_2) = \frac{1}{N-m} \sum_{i=1}^{N-m} \frac{1}{N-m} \sum_{j=1}^{N-m} \Theta\!\left(r - d^{m+1}_{ij}\right). \qquad (7)

CSE is then the negative natural logarithm of the conditional probability
\phi^{m+1}/\phi^{m} [38]:

CSE(m, r, N)(x_1 \| x_2) = -\lim_{N \to \infty} \ln\!\left(\frac{\phi^{m+1}(r)(x_1 \| x_2)}{\phi^{m}(r)(x_1 \| x_2)}\right), \qquad (8)

Since N is a finite value, CSE is finally estimated as [38]

CSE(m, r, N)(x_1 \| x_2) = -\ln\frac{\phi^{m+1}(r)(x_1 \| x_2)}{\phi^{m}(r)(x_1 \| x_2)}. \qquad (9)
It is important to note that this algorithm is direction independent, which
means that CSE(m, r, N )(x1 ||x2 ) is equal to CSE(m, r, N )(x2 ||x1 ). The reason
is that φm just considers the number of pairs of vectors from the two signals
matching within r, which is independent of the signal selected as either the
template or the target [38]. On the other hand, the selection of proper values
of m and r parameters is essential for a correct computation of CSE. Hence,
following the recommendations of the authors of this metric, values of m = 2
and r = 0.2 were chosen in this study [36].
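
As a rough illustration of Eqs. (2)-(9), the Python function below computes CSE between
two equal-length normalized segments with the parameters adopted here (m = 2, r = 0.2;
after Eq. (1) the tolerance is already expressed in units of the signal standard
deviation). It is a didactic, fully vectorized sketch; for long segments a looped or
chunked version is preferable in terms of memory.

import numpy as np

def cross_sample_entropy(x1, x2, m=2, r=0.2):
    N = len(x1)

    def phi(length):
        # template vectors of the given length from each signal (Eqs. 2-3)
        X1 = np.array([x1[i:i + length] for i in range(N - m)])
        X2 = np.array([x2[j:j + length] for j in range(N - m)])
        # Chebyshev distance between every pair of templates (Eq. 4)
        d = np.max(np.abs(X1[:, None, :] - X2[None, :, :]), axis=2)
        # fraction of template pairs matching within the tolerance r (Eqs. 5 and 7)
        return np.mean(d <= r)

    return -np.log(phi(m + 1) / phi(m))      # Eq. (9)

As described in Sect. 2.2, the value obtained for a pair of channels is then averaged
over the six segments of the trial.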

3 Results
The mean values of CSE from each pair of channels obtained for the four emo-
tional groups of study are depicted in Fig. 2. In these images, blue colors are
used for the representation of the lowest values of CSE, thus indicating a higher
similarity of the dynamics of two brain regions and a more synchronized activity
between them. Contrarily, red colors represent the highest CSE outcomes, which
correspond to a dissimilar and asynchronized behavior among brain areas.
It can be observed that, in all the cases, the most synchronized regions (those
with the lowest CSE values) appear in the posterior half of the brain, i.e., central,
parietal and occipital lobes. In addition, this synchronized behavior is present
within each hemisphere separately, and also in the connections between both
hemispheres. Therefore, the intra- and inter-hemispheric interactions among
these brain lobes present strongly coordinated and similar dynamics.
On the other hand, the highest levels of CSE appear in the interactions
between left frontal and temporal channels with the rest of brain areas, indi-
cating an asynchronized behavior and a high dissimilarity of the dynamics that
interconnect those regions with the rest of the brain. Concretely, the uncoordina-
tion between left frontal region (including channels FP1, AF3, F3, F7 and FC5)
and other lobes is more relevant for the emotions with a low level of valence
(i.e., HALV and LALV). On the other hand, the dissimilarity in the dynamics
between left temporal lobe (channel T7) and the rest of areas is more prominent
for the groups of emotions with a high valence (i.e., HAHV and LAHV). In this
regard, no great differences are observed depending on the arousal level of the
emotional states assessed.
In general terms, the coordinated activity of the brain follows the same afore-
mentioned tendencies for the four groups of emotions under study. Nevertheless,

[Fig. 2 consists of four 32 × 32 channel-by-channel CSE maps, one per emotional
group (HALV, HAHV, LALV, LAHV); both axes list the 32 EEG channels and the
color scale (CSE level) ranges from 0.55 to 0.85.]

Fig. 2. Representation of the mean CSE values obtained from all pairs of EEG chan-
nels for the four groups of emotions. Red colors represent the most dissimilar dynamics
between different regions. Blue colors are used for the representation of highly coordi-
nated pairs of channels. White cells represent the interaction of one channel with itself,
which is not applicable (N/A) with CSE. (Color figure online)

it can be seen that HALV and LAHV groups present a generalized lower level of
CSE, represented by a greater presence of cold colors in the corresponding maps.
This indicates a more self-coordinated performance of the brain and more simi-
larities in the dynamics between all regions when experiencing emotional states
contained in these two groups (concretely, fear, anger or distress, in the case of
HALV, and relaxation, serenity or satisfaction, in the LAHV group, among other
emotions).

4 Discussion and Conclusions

The connectivity and coordinated performance of different brain regions have


been studied in various scenarios [8,13,44,48]. In this sense, the computation
of CSE for the analysis of EEG signals would report interesting information in
the emotions recognition research field. Precisely, this metric has been selected
because of the relevance of the results reported by similar measures applied for
the detection of emotions from EEG signals from isolated brain areas [17]. In
addition, CSE has also provided notable insights when discerning between calm
and distress emotional states [16]. For this reason, CSE has been applied for the
first time for the identification of the emotions contained in the four quadrants
of the valence/arousal space, with the purpose of discovering new information
about the coordinated activity in the brain under these emotional conditions.
The highest similarities (i.e., the strongest coordination) appeared in the
posterior half of the brain, especially involving the intra- and interhemispheric
connections of parietal regions. Contrarily, the interactions between frontal lobes,
especially the left one, with the rest of brain areas have reported the most dis-
similar dynamics, thus being the most uncoordinated. The role of these brain
regions has been already related to emotional processes. Indeed, the parietal lobe
has been associated with the processing of aspects related to the arousal dimen-
sion of emotions, whereas the activity in the frontal area has been identified with
both valence and arousal dimensions [2,12].
In addition, a generalized increase of self-coordination has been detected for
the emotional groups HALV (including fear, anger, distress, or alertness) and
LAHV (with emotions like relaxation, calmness, serenity, or satisfaction). In the
case of HALV group, it could be interpreted as a mechanism of defense of the
brain, which increments its coordinated activity among areas for preparing the
body to face negative stimuli that could pose a threat to the integrity of the
subject [23,24]. On the other hand, during LAHV conditions, the brain con-
nections may increase their coordination for an enhancement of self-consciousness,
thus improving attentional and cognitive control processes [5].
However, the fact that the coordination is higher for those emotional states
does not imply a lack of coordination under other emotional conditions. Con-
trarily, the activity among separate areas is continuously coordinated, although
with a variable intensity depending on the requirements of the mental process
developed [33,43]. In the same manner, the connections in the brain of an idle
individual are also certainly synchronized despite not performing cognitive

tasks [45]. This is justified by the existence of a “default system” that presents
a higher activity under resting states and without the presence of stimuli [7].

Acknowledgments. This work was partially supported by Spanish Ministerio de


Ciencia, Innovación y Universidades, Agencia Estatal de Investigación (AEI)/European
Regional Development Fund (FEDER, UE) under EQC2019-006063-P, PID2020-
115220RB-C21, and 2018/11744 grants, and by Biomedical Research Networking Cen-
tre in Mental Health (CIBERSAM) of the Instituto de Salud Carlos III. Beatriz Garcı́a-
Martı́nez holds FPU16/03740 scholarship from Spanish Ministerio de Educación y For-
mación Profesional.

References
1. Al-Shargie, F., Tariq, U., Alex, M., Mir, H., Al-Nashash, H.: Emotion recogni-
tion based on fusion of local cortical activations and dynamic functional networks
connectivity: an EEG study. IEEE Access 7, 143550–143562 (2019)
2. Alia-Klein, N., et al.: Trait anger modulates neural activity in the fronto-parietal
attention network. PloS one 13(4), e0194444 (2018)
3. Anzellotti, S., Coutanche, M.N.: Beyond functional connectivity: investigating net-
works of multivariate representations. Trends Cogn. Sci. 22, 258–269 (2018)
4. Aydın, S., Demirtaş, S., Tunga, M.A., Ateş, K.: Comparison of hemispheric asym-
metry measurements for emotional recordings from controls. Neural Comput. Appl.
30(4), 1341–1351 (2017)
5. Barrós-Loscertales, A., Hernández, S.E., Xiao, Y., González-Mora, J.L., Rubia,
K.: Resting state functional connectivity associated with Sahaja Yoga Meditation.
Front. Hum. Neurosci. 15, 65 (2021)
6. Breakspear, M.: Nonlinear phase desynchronization in human electroencephalo-
graphic data. Hum. Brain Mapp. 15(3), 175–198 (2002)
7. Buckner, R.L., Andrews-Hanna, J.R., Schacter, D.L.: The brain’s default network:
anatomy, function, and relevance to disease. Annals New York Acad. Sci. 1124,
1–38 (2008)
8. Cai, L., Wei, X., Wang, J., Yu, H., Deng, B., Wang, R.: Reconstruction of functional
brain network in Alzheimer’s disease via cross-frequency phase synchronization.
Neurocomputing 314, 490–500 (2018)
9. Cohen, M.X.: Analyzing Neural Time Series Data: Theory and Practice. MIT
Press, Cambridge (2014)
10. Delorme, A., Makeig, S.: EEGLAB: an open source toolbox for analysis of single-
trial EEG dynamics including independent component analysis. J. Neurosci. Meth-
ods 134(1), 9–21 (2004)
11. Deshpande, G., Santhanam, P., Hu, X.: Instantaneous and causal connectivity in
resting state brain networks derived from functional MRI data. Neuroimage 54(2),
1043–1052 (2011)
12. Dolcos, F., Cabeza, R.: Event-related potentials of emotional memory: encoding
pleasant, unpleasant, and neutral pictures. Cogn. Affect. Behav. Neurosci. 2(3),
252–263 (2002)
13. Fan, M., Chou, C.A.: Detecting abnormal pattern of epileptic seizures via temporal
synchronization of EEG signals. IEEE Trans. Biomed. Eng. 66(3), 601–608 (2018)
14. Farokhzadi, M., Hossein-Zadeh, G.A., Soltanian-Zadeh, H.: Nonlinear effective con-
nectivity measure based on adaptive neuro fuzzy inference system and Granger
causality. Neuroimage 181, 382–394 (2018)

15. Friston, K.J.: Book review: brain function, nonlinear coupling, and neuronal tran-
sients. Neuroscientist 7, 406–418 (2001)
16. Garcı́a-Martı́nez, B., Fernández-Caballero, A., Alcaraz, R., Martı́nez-Rodrigo, A.:
Cross-sample entropy for the study of coordinated brain activity in calm and dis-
tress conditions with electroencephalographic recordings. Neural Comput. Appl.
33, 9343–9352 (2021)
17. Garcı́a-Martı́nez, B., Fernández-Caballero, A., Zunino, L., Martı́nez-Rodrigo, A.:
Recognition of emotional states from EEG signals with nonlinear regularity-and
predictability-based entropy metrics. Cogn. Comput. 13(2), 403–417 (2021)
18. Garcı́a-Martı́nez, B., Martı́nez-Rodrigo, A., Alcaraz, R., Fernández-Caballero, A.:
A review on nonlinear methods using electroencephalographic recordings for emo-
tion recognition. IEEE Trans. Affect. Comput. (2019)
19. Garcı́a-Martı́nez, B., Martı́nez-Rodrigo, A., Zangróniz, R., Pastor, J.M., Alcaraz,
R.: Application of entropy-based metrics to identify emotional distress from elec-
troencephalographic recordings. Entropy 18(6), 221 (2016)
20. Hasanzadeh, F., Mohebbi, M., Rostami, R.: Graph theory analysis of directed
functional brain networks in major depressive disorder based on EEG signal. J.
Neural Eng. 17(02), 026010 (2020)
21. Ismail, W.W., Hanif, M., Mohamed, S., Hamzah, N., Rizman, Z.I.: Human emotion
detection via brain waves study by using electroencephalogram (EEG). Int. J. Adv.
Sci. Eng. Inf. Technol. 6(6), 1005–1011 (2016)
22. Klem, G.H., Lüders, H.O., Jasper, H., Elger, C., et al.: The ten-twenty electrode
system of the International Federation. Electroencephal. Clin. Neurophysiol. 52(3),
3–6 (1999)
23. Knyazev, G.G.: Cross-frequency coupling of brain oscillations: an impact of state
anxiety. Int. J. Psychophysiol. 80(3), 236–245 (2011)
24. Knyazev, G.G., Savostyanov, A.N., Levin, E.A.: Alpha synchronization and anx-
iety: Implications for inhibition vs. alertness hypotheses. Int. J. Psychophysiol.
59(2), 151–158 (2006)
25. Koelstra, S., et al.: DEAP: a database for emotion analysis using physiological
signals. IEEE Trans. Affect. Comput. 3(1), 18–31 (2012)
26. Lee, A., Litt, B., Pathmanathan, J.: Normalized transfer entropy used as an infor-
mational transfer measure of ictal pathophysiology in patients undergoing stereo-
EEG for epilepsy surgery (P4.5-023). Neurol. 92(15 Supplement) (2019)
27. Li, P., Liu, H., Si, Y., Li, C., Li, F., Zhu, X., Huang, X., Zeng, Y., Yao, D., Zhang,
Y., et al.: EEG based emotion recognition by combining functional connectivity
network and local activations. IEEE Trans. Biomed. Eng. 66(10), 2869–2881 (2019)
28. Liu, X., Li, T., Tang, C., Xu, T., Chen, P., Bezerianos, A., Wang, H.: Emotion
recognition and dynamic functional connectivity analysis based on EEG. IEEE
Access 7, 143293–143302 (2019)
29. Mammone, N., et al.: Permutation disalignment index as an indirect, EEG-based,
measure of brain connectivity in MCI and AD patients. Int. J. Neural Syst. 27(05),
1750020 (2017)
30. Min, B., et al.: Prediction of individual responses to electroconvulsive therapy in
patients with schizophrenia: machine learning analysis of resting-state electroen-
cephalography. Schizophr. Res. 216, 147–153 (2019)
31. Morris, J.D.: Observations SAM: the Self-Assessment Manikin - an efficient cross-
cultural measurement of emotional response. J. Advert. Res. 35(6), 63–68 (1995)
32. O’Reilly, C., Lewis, J.D., Elsabbagh, M.: Is functional brain connectivity atypical in
autism? a systematic review of EEG and MEG studies. PLoS ONE 12(5), e0175870
(2017)
252 B. Garcı́a-Martı́nez et al.

33. Park, H.J., Friston, K.: Structural and functional brain networks: from connections
to cognition. Science 342(6158), 1238411 (2013)
34. Pedroni, A., Bahreini, A., Langer, N.: Automagic: standardized preprocessing of
big EEG data. Neuroimage 200, 460–473 (2019)
35. Pincus, S.M.: Irregularity and asynchrony in biologic network signals. Methods
Enzymol. 321, 149–82 (2000)
36. Pincus, S.M.: Assessing serial irregularity and its implications for health. Ann. N.
Y. Acad. Sci. 954, 245–67 (2001)
37. Popper, K.R., Eccles, J.C.: The Self and its Brain. Springer, Heidelberg (2012).
https://doi.org/10.1007/978-3-642-61891-8
38. Richman, J.S., Moorman, J.R.: Physiological time-series analysis using approxi-
mate entropy and sample entropy. Am. J. Physiol. Heart Circ. Physiol. 278(6),
H2039–H2049 (2000)
39. Rodrı́guez-Bermúdez, G., Garcia-Laencina, P.J.: Analysis of EEG signals using
nonlinear dynamics and chaos: a review. Appl. Math. Inf. Sci. 9(5), 2309 (2015)
40. Russell, J.A.: A circumplex model of affect. J. Pers. Soc. Psychol. 39(6), 1161–1178
(1980)
41. Sala-Llonch, R., Bartrés-Faz, D., Junqué, C.: Reorganization of brain networks in
aging: a review of functional connectivity studies. Front. Psychol. 6, 663 (2015)
42. Sanei, S.: Adaptive Processing of Brain Signals. Wiley, Hoboken (2013)
43. Trujillo, L.T., Peterson, M.A., Kaszniak, A.W., Allen, J.J.: EEG phase synchrony
differences across visual perception conditions may depend on recording and anal-
ysis methods. Clin. Neurophysiol. 116(1), 172–189 (2005)
44. Tu, P.C., et al.: Reduced synchronized brain activity in schizophrenia during view-
ing of comedy movies. Sci. Rep. 9(1), 1–11 (2019)
45. Perez Velazquez, J.L., Erra, R.G., Wennberg, R., Dominguez, L.G.: Correlations of
cellular activities in the nervous system: physiological and methodological consid-
erations. In: Velazquez, J., Wennberg, R. (eds.) Coordinated Activity in the Brain.
Springer Series in Computational Neuroscience, vol. 2. Springer, New York (2009).
https://doi.org/10.1007/978-0-387-93797-7 1
46. Veldhuis, J.D., Pincus, S.M., Garcia-Rudaz, M.C., Ropelato, M.G., Escobar, M.E.,
Barontini, M.: Disruption of the joint synchrony of luteinizing hormone, testos-
terone, and androstenedione secretion in adolescents with polycystic ovarian syn-
drome. J. Clin. Endocrinol. Metab. 86(1), 72–9 (2001)
47. Zola-Morgan, S.: Localization of brain function: the legacy of Franz Joseph Gall
(1758–1828). Annu. Rev. Neurosci. 18(1), 359–383 (1995)
48. Zuchowicz, U., Wozniak-Kwasniewska, A., Szekely, D., Olejarczyk, E., David, O.:
EEG phase synchronization in persons with depression subjected to transcranial
magnetic stimulation. Front. Neurosci. 12, 1037 (2019)
P300 Characterization Through Granger
Causal Connectivity in the Context
of Brain-Computer Interface Technologies

Vanessa Salazar1(B) , Vinicio Changoluisa1,2 , and Francisco B. Rodriguez1(B)


1 Grupo de Neurocomputación Biológica, Dpto. de Ingeniería Informática, Escuela Politécnica Superior, Universidad Autónoma de Madrid, 28049 Madrid, Spain
[email protected], [email protected]
2 Grupo de Investigación en Electrónica y Telemática (GIETEC), Universidad Politécnica Salesiana, Quito, Ecuador
[email protected]

Abstract. The analysis of connectivity in brain networks has been


widely researched and it has been shown that certain cognitive processes
require the integration of distributed brain areas. Functional connectiv-
ity attempts to statistically quantify the interdependence between these
brain areas. In this paper, we propose an analysis of functional connec-
tivity in the Event-Related Potential (ERP) context, more specifically on
the P300 component using the Granger Causality measure. To this end,
we propose a methodology that consists in quantifying the causality in
the P300 and non-P300 signals in the context of Brain-Computer Inter-
faces (BCIs). Causality is calculated using two approaches: i) using stan-
dard electrodes and, ii) using electrodes selected using Bayesian Linear
Discriminant Analysis and sequential forward electrode selection (BLDA-
FS). Based on this analysis, it is shown that the Granger Causality metric is able to reveal a significant connectivity difference between P300 and non-P300 signals. The electrodes selected using BLDA-FS were found to
be more discriminative in this regard. Studying functional connectivity
using Granger Causality allowed us to identify the changes in connec-
tivity detected during the presence of a target stimulus compared to a
non-target stimulus. This additional information about the connectivity
differences found can be incorporated as a new feature in further stud-
ies, allowing for better detection of the P300 signal and consequently
improving the performance of P300-based BCIs.

Keywords: Brain networks · Standard electrodes · Functional


connectivity · Event-related potential · Bayesian linear discriminant
analysis · Sequential forward electrode selection · EEG signal · Oddball
paradigm · Inter-subject variability

1 Introduction
Due to the complexity of understanding brain activity, and despite significant advances in existing computational analyses and experimental techniques, it is
not yet fully understood how cognitive functions are generated in the brain.
Studies suggest a common behavior of brain areas integrated in networks that
require a coordinated flow of information [4,14,18,20,23]. Event-Related Poten-
tial (ERP) enables the recording of neural activity associated with cognitive
processes. It is manifested by electrical potentials generated by the brain in rela-
tion to a given stimulus, and it is therefore necessary to capture these electrical
potentials with high precision in order to use them as control signals in BCIs.
The P300 signal is an ERP component that is considered one of the most rel-
evant to this type of technology [9]. Several studies have shown that different
brain areas are involved in the generation of a P300 signal [16,22]. Based on this
evidence, the need arises to use analytic methods with the goal of understanding
the relationships between the structure and dynamics of networks in the context
of ERP generation and, more specifically, the P300 ERP, in order to identify
measures of connectivity between neuronal regions.
Functional connectivity measures can be used for the classification of visually evoked responses in the context of BCI, as shown in [5,12,21,25], which report significant differences in the characterization of connectivity between the two conditions: target and non-target trials. In this paper, we compute functional connectivity measures using Granger causality to determine and characterize the existing connectivity differences between P300 and non-P300 signals. As far as we know, this
functional connectivity metric, which quantifies the direction of interaction, has
never been used to characterize the P300 wave in the context of BCI technologies.
The goal of this study is that these results can serve as a new feature for P300
detection in future studies that, when combined with existing methods, can
improve the performance of BCI technologies. The dataset used for this analysis belongs to a BCI system, and the experiment used to generate it is based on the oddball paradigm [1], which consists of presenting a set of visual stimuli, one of which is a target stimulus that elicits the P300 signal, while all other images are non-target stimuli. The experimental characteristics are explained in detail in
[11].

2 Materials and Methods


2.1 Dataset: Characteristics and Structure
A dataset of a P300-based BCI system from [11] was used to develop this work,
using a six-choice P300 paradigm for data collection. The experiment was con-
ducted on a population of 8 subjects, 4 of whom were disabled (see [11] for
specific details of participants). The visual stimuli considered for the experiment consist of six images shown to the participants: a television, a telephone, a lamp, a door, a window, and a radio. These images were shown in random order during a 100 ms period, followed by a 300 ms period in which nothing was shown until the next image was presented, one of these images being the target stimulus for each trial. Considering the display time of the image and the time when no image is shown, an Inter-Stimulus Interval (ISI) of 400 ms can be defined. The
experiment consisted of 4 sessions, two per day, with an interval of less than two

Fig. 1. Dataset structure for one subject considering 2 experimental days with 2 sessions each, and each session with 6 runs.

weeks between days. Each of the sessions had six runs, one run for each of the six
selected images. The sequence of flashes was block-randomized, and the number
of blocks was randomly set between 20 and 25 (each of these blocks is called a
trial). On average, 22.5 target images and 22.5 × 5 = 112.5 non-target images
were shown (135 trials). A session consists of approximately 810 trials (6 runs ×
6 images × 22.5 trials) and the dataset for each subject consists of an average of
3,240 trials (4 sessions × 810 trials). For this analysis, we took 20 trials for each
image presented in the run and neglected the rest. EEG signals were recorded
at a sampling frequency of 2048 Hz from 32 electrodes positioned according to the
international 10–20 system. A simple scheme of the dataset structure is shown
in Fig. 1.

2.2 Preprocessing of EEG Data

Data processing techniques used in this study are based on the preprocessing
performed in [11]. The mean value of the two electrodes corresponding to the
mastoids was used as a reference. Then, a sixth-order forward-backward Butter-
worth bandpass filter was applied. Cutoff frequencies of 1.0 Hz and 12.0 Hz were
set. To scale the data, it was normalized considering the mean and standard deviation. After filtering, the data were downsampled to 32 Hz by keeping every 64th sample of the filtered data (the original EEG sampling rate being 2048 Hz). Single trials
of 1000 ms duration were extracted, with 600 ms of each trial overlapping with
the first 600 ms of the following trial. In order to reduce the effects of biological
artifacts such as blinks, eye movements, muscle activity of the subject that may
affect the EEG data, data from each electrode were winsorized. This statistical
transformation technique consists of clipping the values below the 10th percentile and above the 90th percentile to those limits.
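As a rough illustration of this preprocessing chain, the sketch below applies the same steps to a NumPy array of raw EEG with shape (channels, samples) recorded at 2048 Hz. The function name, the array layout and the mastoid channel indices are illustrative assumptions; the actual pipeline follows [11].

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess_eeg(raw, fs=2048, mastoid_idx=(0, 1)):
    """Sketch of the chain described above: mastoid re-referencing, 6th-order
    forward-backward Butterworth band-pass (1-12 Hz), z-score normalisation,
    downsampling to 32 Hz and winsorisation at the 10th/90th percentiles."""
    # Re-reference to the mean of the two mastoid electrodes
    eeg = raw - raw[list(mastoid_idx)].mean(axis=0)
    # Zero-phase (forward-backward) Butterworth band-pass filter
    b, a = butter(6, [1.0, 12.0], btype="bandpass", fs=fs)
    eeg = filtfilt(b, a, eeg, axis=1)
    # Normalise each channel to zero mean and unit standard deviation
    eeg = (eeg - eeg.mean(axis=1, keepdims=True)) / eeg.std(axis=1, keepdims=True)
    # Downsample to 32 Hz by keeping every 64th sample
    eeg = eeg[:, ::64]
    # Winsorise: clip each channel to its 10th and 90th percentiles
    lo = np.percentile(eeg, 10, axis=1, keepdims=True)
    hi = np.percentile(eeg, 90, axis=1, keepdims=True)
    return np.clip(eeg, lo, hi)
```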

2.3 Connectivity Analysis

As mentioned earlier, connectivity analysis has been increasingly studied because


of the information it provides about network dynamics. Several studies have
shown that connectivity can be useful in detecting a P300 signal, and several
metrics have been used for this purpose, such as PLV [12], PLI [5], coherence
[21], etc. However, none of these functional connectivity metrics is able to quantify the direction of interaction. For this reason, it was decided to explore the use of Granger causality, which is able to quantify the direction of interaction, to analyze the dynamics of the network in P300 signals.

2.4 Linear Granger Causality: Theory

The concepts of causality are generally related to the idea of cause and effect: a
variable X1 is causal for a variable X2 , i.e. X1 is the cause of X2 or vice versa.
Granger causality (GC) [8,10], on the other hand, is a statistical concept of
causality based on predictions. It does not test for a true cause-effect relationship,
but attempts to infer whether the past behavior of a time series X1 can predict
the behavior of a time series X2 . Given two time series X1 (t), X2 (t), we intend
to predict X1 (t + 1) based on the past terms of X1 (t). Then, we try to predict
X1 (t + 1) using past terms of X1 (t) and X2 (t). If the second prediction proves
to be more successful, then the past of X2 (t) seems to contain information that
is useful in predicting X1 (t + 1), which is not in the past X1 (t). Therefore, X2 (t)
“G-causes” (has information flow with) X1 (t + 1) if: i) X2 (t) precedes X1 (t + 1),
and ii) X2 (t) contains useful information for predicting X1 (t + 1) that is not
contained in other variables.
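To make the prediction-based definition concrete, the following sketch estimates pairwise Granger causality from X2 to X1 by comparing the residual variance of two least-squares autoregressive models. It only illustrates the idea; the computations reported in this paper rely on the multivariate MVGC toolbox [2] described below, and the model order p is an arbitrary illustrative choice.

```python
import numpy as np

def granger_causality(x1, x2, p=5):
    """Pairwise Granger causality from x2 to x1: compare the residual variance
    of an AR(p) model of x1 built from its own past (restricted) with a model
    that also uses the past of x2 (full).  GC = ln(var_restricted / var_full)."""
    n = len(x1)
    # Lagged design matrices for the targets at t = p, ..., n-1
    past_x1 = np.column_stack([x1[p - k - 1:n - k - 1] for k in range(p)])
    past_x2 = np.column_stack([x2[p - k - 1:n - k - 1] for k in range(p)])
    target = x1[p:]
    # Restricted model: x1 predicted from its own past only
    beta_r, *_ = np.linalg.lstsq(past_x1, target, rcond=None)
    var_r = np.var(target - past_x1 @ beta_r)
    # Full model: x1 predicted from the past of both x1 and x2
    full = np.hstack([past_x1, past_x2])
    beta_f, *_ = np.linalg.lstsq(full, target, rcond=None)
    var_f = np.var(target - full @ beta_f)
    return np.log(var_r / var_f)
```

Swapping the two arguments gives the influence in the opposite direction, which is what allows this metric to quantify the direction of interaction.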

2.5 Linear Granger Causality: Calculation

After preprocessing, and due to the pursued goal of differentiating the existing
causality between the P300 signal and the non-P300 signal, the dataset was
divided into two groups: i) the first group containing the trials of the P300
signal, i.e., those generated by the target stimulus, and ii) the second group of
non-P300 trials, i.e., those generated by non-target stimuli. To get an intuitive idea of the visual differences between P300 and non-P300 signals, the data from the Cz channel of the 4 healthy subjects (subjects 5 to 8) are used. For this analysis, all runs of each session and all sessions of each subject were averaged, so that a large number of trials contribute to the average. In this way, the behavior of both signals can be clearly visualized, as shown in Fig. 2.
For the calculation of Granger Causality the MVGC Toolbox is used [2]. As a
first approximation, the data is analyzed using all sessions for each subject. After
stacking the trials corresponding to the 6 runs of each session and considering
the 4 sessions, the structure of the dataset is: 32 × 32 × 480 for the P300 signal
and 32 × 32 × 2400 for the non-P300 signal, which corresponds to electrodes ×
samples × trials.

Fig. 2. P300 ERP characterization, Cz channel corresponding to 4 subjects; the peak of the P300 signal occurs at approximately 400 ms. Black line represents target trials and red line represents non-target trials. (Color figure online)

Collinearity and Non-stationarity: In estimating Granger Causality, two problems that arise with time series have been identified: collinearity and non-stationarity, both described in detail in [2]. Collinearity occurs when there are linear relationships among the time series; in this case, there are 32 time series corresponding to each of the electrodes where the EEG signals were measured. In order to reduce the possibility of linear dependencies, it was decided to limit the set of variables analyzed, leaving only a small group of EEG signals that provide more information about the P300 event.
For the present analysis, stationarity was achieved by splitting the data into overlapping windows of shorter duration, as suggested by the authors of the MVGC tool in their paper [2]. To define these windows, several overlapping window sizes were proposed. These values are chosen independently for each analyzed subject in order to adapt the connectivity calculation to the individual characteristics of each subject. Within each window, the GC calculation is performed and one causality value is defined per window. These individual values are later used to estimate causality for the entire signal.

Electrode Selection: The use of electrode selection techniques was proposed


because we hypothesized that electrodes that allow detection of a P300 signal
with greater accuracy might also allow better detection of the possible causal
dependencies between electrodes in the generation of a P300 signal. There is evidence that the electrodes located in the occipital and parietal lobes, considering the international 10/20 system, are the ones that provide more information for P300 detection [3,11,13,17,24], achieving high precision in the identification of the signals generated by the target stimuli. In this context, the first approach considered was the combination of the 4 standard electrodes used in [15,19]: Fz, Cz, Pz, Oz. As a second approach,
an electrode selection for each subject is proposed using BLDA-FS to search for
the electrodes that provide the highest accuracy in P300 detection. This approach is considered to analyze whether the GC metric is able to characterize the differences between the two conditions, P300 and non-P300, when the electrodes
used for the analysis are adapted to the specific characteristics of the ERPs of
each subject. Therefore, electrode selection was performed using BLDA-FS by
testing electrode combinations and finding the highest values for classification
accuracy, as well as selecting the electrode combination with the best results.
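A minimal sketch of the sequential forward selection idea is shown below, with ordinary LDA from scikit-learn used as a stand-in for the BLDA classifier; it does not reproduce the exact BLDA-FS procedure of [6,7,11], and the non-adjacency constraint used later in this work is not enforced.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def forward_electrode_selection(trials, labels, n_electrodes=6):
    """Greedy sequential forward selection of electrodes.  'trials' has shape
    (n_trials, n_channels, n_samples); at each step the electrode that most
    improves cross-validated classification accuracy is added."""
    n_channels = trials.shape[1]
    selected, remaining = [], list(range(n_channels))
    while len(selected) < n_electrodes:
        best_ch, best_score = None, -np.inf
        for ch in remaining:
            subset = selected + [ch]
            # Flatten the chosen channels into one feature vector per trial
            feats = trials[:, subset, :].reshape(len(trials), -1)
            score = cross_val_score(LinearDiscriminantAnalysis(),
                                    feats, labels, cv=5).mean()
            if score > best_score:
                best_ch, best_score = ch, score
        selected.append(best_ch)
        remaining.remove(best_ch)
    return selected
```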

3 Results
In this section, we present the results according to the two proposed approaches
for electrode selection: i) selection of standard electrodes, and ii) electrode selection using BLDA-FS.

3.1 GC Calculation Using Standard Electrodes


As mentioned earlier, an analysis with overlapping windows is proposed to eliminate non-stationarity without affecting the accuracy of the model fitting. The selection of the appropriate window size is an important step, and for this purpose different window and overlap sizes were tested: i) window sizes of 188 ms, 219 ms, 250 ms and 281 ms, and ii) overlaps of 31 ms, 63 ms, 94 ms, 125 ms and 156 ms; these window and overlap sizes were proposed as a result of a previous analysis.
To quantify the causality value calculated in each window, three measures were
proposed: i) the sum of GC values considered significant per window, ii) the
mean of p-values considered significant per window, and iii) the sum of signifi-
cant values per window. Figure 3 shows a representative example of the result
of the GC calculation for the window from 156 ms to 344 ms. From this case, it
can be seen that there are 9 values that are considered significant for the P300
signal condition and 3 values that are considered significant for the non-P300
signal condition. Once the GC results are obtained and the significant values
are known, the three proposed measures given for each window are calculated.
A curve is obtained to parameterize the P300 signal, the non-P300 signal, and
the difference between them. In the case of the analyzed example, a window of
188 ms and an overlap of 31 ms were chosen, with 27 points per curve. Figure 4
shows an example of the behavior of the 3 proposed measures characterizing the
connectivity differences under the two conditions, calculated for subject 6. From
the observation of this figure, the area under the curve was proposed to quantify
the causality under the two conditions. From here, only the first proposed mea-
sure is considered, i.e., the sum of the GC magnitudes considered significant, as
it is the measure that best captures the characteristics of the connectivity of the
signal by considering not only whether the connectivity is present or not, but
also at what level it is present. For this study, the values whose p-value is less
than 5% are considered significant. From the area calculation, a single value is
obtained for each overlapping window and the overlap used. Figure 5 shows the area results obtained for both signals with an overlap of 31 ms.
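As an illustration of the first measure and of the area computation used below, the following sketch sums the significant GC values of each window and integrates the resulting curve; the GC and p-value matrices are filled with random numbers standing in for the per-window estimates produced by the MVGC toolbox, and the window spacing is only indicative.

```python
import numpy as np

def significant_gc_sum(gc, pvals, alpha=0.05):
    """First proposed measure: sum of the GC values whose p-value is below alpha."""
    return gc[pvals < alpha].sum()

def gc_area(window_times, window_sums, t_min=0.220, t_max=0.670):
    """Trapezoidal area under the per-window curve, restricted to a latency
    interval [t_min, t_max] given in seconds."""
    t, y = np.asarray(window_times), np.asarray(window_sums)
    m = (t >= t_min) & (t <= t_max)
    t, y = t[m], y[m]
    return np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(t))

# Illustrative use with random matrices standing in for real GC estimates
rng = np.random.default_rng(0)
window_times = np.arange(0.0, 0.8, 0.031)   # one point per overlapping window
window_sums = [significant_gc_sum(rng.random((4, 4)) * 0.02, rng.random((4, 4)))
               for _ in window_times]
print(gc_area(window_times, window_sums))
```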

[Figure 3 here: GC, p-value and significance matrices (from/to electrodes Fz, Cz, Pz, Oz) for subject 6 in the 156–344 ms window, for the P300 and non-P300 conditions.]

Fig. 3. GC values, overlapping window from 156 ms to 344 ms. The first column
corresponds to GC values, the second column corresponds to probability values cal-
culated from GC, the third column corresponds to significant values, i.e. probability
values <0.05. Significant values are colored black.

Finally, the causality curves are used to determine in which interval of the
signal the causality can be maximized. To do this, a lower and upper threshold
must be set to limit the curve at the beginning and end of the signal, respectively.
An interval from 250 ms to 750 ms is taken as a reference, since this is the approximate range in which the latency variation of the P300 signal is detected. From this range, several values are proposed to define both the lower (θ_min) and upper (θ_max) thresholds: θ_min = [220 ms, 255 ms, 290 ms] and θ_max = [590 ms, 630 ms, 670 ms]. All possible combinations
between these thresholds are tested for each subject, and those that maximize
the causality values for the P300 signal are selected. After the final window size
is determined, the GC value within that window is calculated and the result is
the GC result for that subject. Figure 6 shows the results for all subjects. As can
be seen, for the selected electrodes, almost all subjects satisfy the premise that
the functional connectivity found in the P300 signal is greater than that found
in the non-P300 signal. As expected, the average area values for connectivity
calculated for the P300 signal are higher than the average area values estimated
for the non-P300 signal. It is observed that only subject 4 shows a behavior different from the other subjects. As explained in [11], this subject belongs to the group with disabilities and is the only one with two disabilities, while the other disabled subjects have one; the connectivity results obtained could be attributed to this.

Fig. 4. Results of the area calculation for subject 6 with overlapping windows of 188 ms
and an overlap of 31 ms. The first column shows the area curve obtained by addition
of the significant GC values. The second column shows the area curve obtained by
averaging the significant P-values. The third column shows the sum of the significant
values. W represents the window size, while O represents the overlap.

3.2 GC Calculation Using BLDA-FS Electrodes


To analyze the connectivity adapted to the characteristics of each subject, the BLDA-FS approach is used, as explained previously. It should be noted that one of the
conditions for electrode selection was that the selected sensors are not adjacent,
in order to reduce the probability of collinearity problems. The process carried

[Figure 5 here: GC area values for subject 6, P300 vs. non-P300, for window sizes of 188 ms, 219 ms, 250 ms and 281 ms.]

Fig. 5. Area values comparison calculated for different overlapping window sizes of
subject 6 using an overlap of 31 ms. Color blue represents the P300 connectivity values,
while color orange represents non-P300 connectivity values. (Color figure online)

[Figure 6 here: area values per subject (subjects 1 to 8) and mean area for P300, non-P300 and their difference (P3 - non-P3), using the 4 standard electrodes.]

Fig. 6. Final area values for subjects 1 to 8 after window selection, and mean GC
results for all subjects considering 4 standard electrodes (differences are statistically
significant, p-value = 0.0368).

out in the previous case is repeated: we first proceed to the selection of the overlapping window and the overlap size, and then the causality values per window are quantified. From here it is possible to choose the general window size, which in this case was the 220 to 670 ms window for all subjects. Figure 7 illustrates these results. From both analyses, we can conclude that the connectivity found in the P300 signal is larger than in the non-P300 signal in most cases. It is important
to note that subject 4 again exhibits a behavior related to connectivity that is
different from the others, as shown in the previous section. This behavior could
be attributed to the specific cognitive disabilities of this subject, as we have
already commented [11]. On the other hand, in Figs. 6 and 7, it can be clearly

[Figure 7 here: area values per subject (subjects 1 to 8) and mean area for P300, non-P300 and their difference (P3 - non-P3), using the 6 BLDA-FS electrodes.]

Fig. 7. Final area values for subjects 1 to 8 after window selection, and mean GC
results for all subjects considering 6 BLDA-FS electrodes with a p-value = 0.0497.

observed that the behavior of the first four subjects, who are the subjects with some kind of disability, differs from the rest, with lower connectivity results in these subjects compared to the others.

4 Conclusions and Discussion


In this paper, functional connectivity in P300 ERP signals was evaluated. For
this analysis, Granger Causality was used as a measure to quantify the connec-
tivity direction. This calculation was performed by limiting the number of elec-
trodes to eliminate the problem of collinearity, according to two approaches for
electrode selection: i) standard electrodes and ii) electrodes selected by BLDA-
FS. Moreover, the dataset was divided into overlapping windows to eliminate the
non-stationarity problem. In a first approximation, the calculation of functional
connectivity based on four standard electrodes is analyzed. In this analysis, as
can be seen in Fig. 6, it was found that after the presentation of a stimulus, i.e.
when a target image is presented and a P300 signal is generated, higher connectivity is detected; conversely, less activity was detected when the subject is in the resting state. The second approach considered, using electrodes selected with
BLDA-FS, showed similar results to the previous case, where higher causality
values are detected when a target stimulus is presented, as can be seen in Fig. 7.
The causality results are higher for the electrodes selected with BLDA-FS than
when standard electrodes are used. However, as an alternative to the proposed
electrode selection, other electrode selection techniques can be used in future
studies to analyze whether connectivity results improve as a function of better
electrode selection. In [6,7], novel selection techniques are proposed that have
obtained favorable results and advantages in terms of the number of electrodes
and accuracy of results in detecting P300 ERPs.
Using the Granger Causality measure to analyze functional connectivity, we could establish that there is indeed a clear difference between activity during
a non-target trial (which tends to be lower) and brain activity results during a
target trial (where results are higher), showing that the brain areas analyzed are
closely connected and that there is a greater flow of information during the devel-
opment of cognitive functions. This distinction is clear; however, these results
would need to be examined in a larger study with more subjects in the future.
The results obtained in this study using Granger Causality are consistent with
those of [5,12,22,25], who, by using alternative measures (which are unable to
quantify the direction of interaction), show that it is possible to clearly observe
high synchronization between brain regions during a target trial, but no signifi-
cant synchronization was observed during a non-target trial. Several connectivity
studies have focused on going beyond connectivity and analyzing the existing
information flow direction during cognitive processes. In this way, connectivity
and information flow direction characteristics obtained with the GC metric could
be used as a new feature for the classification of visually evoked responses in the
context of P300 ERPs. These new features in combination with the features
already present in BCIs could improve the accuracy of these technologies.

Acknowledgment. This work was funded by Spanish projects of Ministerio


de Economı́a y Competitividad/FEDER TIN2017-84452-R, PID2020-114867RB-I00
(http://www.mineco.gob.es/), postgraduate research grants CZ03-000292-2018, 2015-
AR2Q9086 of the Government of Ecuador through SENESCYT and Universidad
Politécnica Salesiana 041-02-2021-04-16.

References
1. Alexander, J.E., et al.: P300 hemispheric amplitude asymmetries from a visual
oddball task. Psychophysiology 32(5), 467–475 (1995)
2. Barnett, L., Seth, A.K.: The MVGC multivariate granger causality toolbox: a new
approach to granger-causal inference. J. Neurosci. Methods 223, 50–68 (2014)
3. Blankertz, B., Lemm, S., Treder, M., Haufe, S., Müller, K.R.: Single-trial analysis
and classification of ERP components - a tutorial. NeuroImage 56(2), 814–825
(2011)
4. Bressler, S.L., Menon, V.: Large-scale brain networks in cognition: emerging meth-
ods and principles. Trends Cogn. Sci. 14(6), 277–290 (2010)
5. Chang, W., Wang, H., Lu, Z., Liu, C.: A concealed information test system based
on functional brain connectivity and signal entropy of audio-visual ERP. IEEE
Trans. Cogn. Dev. Syst. 12(2), 361–370 (2020)
6. Changoluisa, V., Varona, P., Rodrı́guez, F.: An electrode selection approach in
P300-based BCIs to address inter- and intra-subject variability. In: 2018 6th Inter-
national Conference on Brain-Computer Interface (BCI), pp. 1–4 (2018)
7. Changoluisa, V., Varona, P., Rodrı́guez, F.D.B.: A low-cost computational method
for characterizing event-related potentials for BCI applications and beyond. IEEE
Access 8, 111089–111101 (2020)
8. Ding, M., Chen, Y., Bressler, S.L.: Granger causality: Basic theory and application
to neuroscience (2006)
9. Donchin, E., Coles, M.G.H.: Is the p300 component a manifestation of context
updating? Behav. Brain Sci. 11(3), 357–374 (1988)
10. Granger, C.W.J.: Investigating causal relations by econometric models and cross-
spectral methods. Econometrica: J. Econom. Soc. 37(3), 424–438 (1969)
11. Hoffmann, U., Vesin, J.M., Ebrahimi, T., Diserens, K.: An efficient p300-based
brain-computer interface for disabled subjects. J. Neurosci. Methods 167(1), 115–
125 (2008)
12. Kabbara, A., Khalil, M., El-Falou, W., Eid, H., Hassan, M.: Functional brain con-
nectivity as a new feature for p300 speller. PLoS One 11(1), e0146282 (2016)
13. Krusienski, D.J., et al.: A comparison of classification techniques for the p300
speller. J. Neural Eng. 3(4), 299–305 (2006)
14. Li, Y., et al.: Brain anatomical network and intelligence. PLoS Comput. Biol. 5(5),
e1000395 (2009)
15. Piccione, F., et al.: P300-based brain computer interface: reliability and perfor-
mance in healthy and paralysed participants. Clin. Neurophysiol. 117(3), 531–537
(2006)
16. Polich, J., Margala, C.: P300 and probability: comparison of oddball and single-
stimulus paradigms. Int. J. Psychophysiol. 25(2), 169–176 (1997)
17. Qin, Y., et al.: Classifying four-category visual objects using multiple ERP com-
ponents in single-trial ERP. Cogn. Neurodynamics 10(4), 275–285 (2016)
18. Rubinov, M., Sporns, O.: Complex network measures of brain connectivity: uses
and interpretations. NeuroImage 52(3), 1059–1069 (2010)
19. Serby, H., Yom-Tov, E., Inbar, G.: An improved p300-based brain-computer inter-
face. IEEE Trans. Neural Syst. Rehabil. Eng. 13(1), 89–98 (2005)
20. Sporns, O., Tononi, G., Kötter, R.: The human connectome: a structural descrip-
tion of the human brain. PLoS Comput. Biol. 1(4), e42 (2005)
21. Thee, K.W., Nisar, H., Yeap, K.H., Soh, C.S.: Evaluation of oddball cases: sin-
gle trial EEG connectivity study based on p300 and motor response. In: 2018
12th International Conference on Signal Processing and Communication Systems
(ICSPCS), pp. 1–6 (2018)
22. Tian, Y., Liang, S., Yao, D.: Attentional orienting and response inhibition: insights
from spatial-temporal neuroimaging. Neurosci. Bull. 30(1), 141–152 (2014)
23. Varela, F., Lachaux, J.P., Rodriguez, E., Martinerie, J.: The brainweb: phase syn-
chronization and large-scale integration. Nat. Rev. 2(4), 229–239 (2001)
24. Vidal, J.: Real-time detection of brain events in EEG. Proc. IEEE 65(5), 633–641
(1977)
25. Wang, H., Chang, W., Zhang, C.: Functional brain network and multichannel anal-
ysis for the p300-based brain computer interface system of lying detection. Expert
Syst. Appl. 53, 117–128 (2016)
Feature and Time Series Extraction
in Artificial Neural Networks for Arousal
Detection from Electrodermal Activity

Roberto Sánchez-Reolid1,2 , Francisco López de la Rosa2 ,


Daniel Sánchez-Reolid2 , Marı́a T. López1,2 ,
and Antonio Fernández-Caballero1,2,3(B)
1 Departamento de Sistemas Informáticos, Universidad de Castilla-La Mancha, 02071 Albacete, Spain
[email protected]
2 Instituto de Investigación en Informática de Albacete, Universidad de Castilla-La Mancha, 02071 Albacete, Spain
3 CIBERSAM (Biomedical Research Networking Center in Mental Health), 28016 Madrid, Spain

Abstract. The detection of arousal is very important given its great


implication on daily well-being. In this regard, the use of artificial neu-
ral networks and other classifiers applied to physiological signals has
increased considerably. Different architectures for arousal detection using
electrodermal activity are presented in this paper. Moreover, two dif-
ferent strategies are analysed and compared. The first one is based on
the collection of 21 features (temporal, morphological, statistical and
frequential), whereas the second uses the processed EDA data (phasic component data) directly with different machine learning algorithms. The first approach offers F1-scores of 92.02% and 90.95% for a multilayer perceptron and a one-dimensional convolutional network, respectively. For
the second scenario, it has been found that the best F1-scores are 91.02%
and 88.12% for bilateral long short-term memory and long short-term
memory, respectively.

Keywords: Arousal detection · Electrodermal activity · Artificial


neural networks · Feature extraction · Time series extraction

1 Introduction
Arousal can be described as the activation of the organism on a psycho-
physiological level. This activation can range from high activation to deep
sleep [1,20]. For this reason, this term covers states that vary from over-activation, in the case of intense emotions or stress, to under-activation, in the case of a calm state that is optimal for carrying out specific tasks [23]. One of the physiolog-
ical variables related to arousal variation is electrodermal activity (EDA). This
signal is strongly connected to the sympathetic nervous system’s responses via
the sudomotor system.

There is a need for tools that are able to discern between the different levels
of arousal. Fortunately, new techniques appearing every day based on machine learning (ML) and deep learning (DL) allow us to process physiological variables to infer neurophysiological states [5,10,16]. In this regard, there is a growing use of artificial neural networks (ANNs) [19,24].
The aim of this paper is to establish whether there are any differences in
the results obtained using classical methods based on the extraction of signal
features (time, morphology, statistics and frequency domain) [21,22] compared
to those obtained by directly introducing the processed EDA signals into
a classifier (time-series approach) for several classification models [7,18,25].

2 Materials

2.1 Acquisition Device

One of the most important issues in this analysis is the acquisition phase. This
process must be carried out as accurately as possible. In our case, the physi-
ological variable electrodermal activity (EDA) has been used to determine the
level of arousal of the participants. Normally, the skin responds to stimuli by
causing a variation in skin sweat. EDA signals are measured by sensing the
potential when a constant low current is placed between two metal electrodes
(usually chrome-silver electrodes). Therefore, when perspiration is produced, the
conductance increases (the resistance decreases) and, when sweating disappears,
the conductance reduces and, consequently, the skin resistance to current flow
increases.
For the acquisition of these data, the Empatica E4 device has been used [9]. It
is a commercial wearable device that is responsible for acquiring biosignals in real
time from the participant. The biosignals collected are: electrodermal activity
(EDA), blood volume pressure (BVP), acceleration (ACC) and skin temperature
(SKT). To avoid signal artefacts, the device must be properly attached to the
wrist.

2.2 Dataset and Experimental Design

The experiment was conducted on a total of 39 older adults (26 women and 13
men) with an average age of M = 68.51 (SD = 6.66). All participants were volun-
teers and belonged to Universidad de la Experiencia, Albacete, Spain (academic
courses for older adults). All the volunteers were in good health and had no
history of drug abuse or psychological or psychiatric disorders. In order to par-
ticipate in the experiment, volunteers had to fill out a consent form explaining
how the experiment was to be conducted. Due to the design of the experiment,
people with vision and hearing problems were not excluded.
The experiment was conducted in a comfortable and controlled environment.
It was performed by placing the volunteer in a comfortable seat and showing the
video clips on a 27-inch monitor with a pleasant audio level. Physiological signals
were collected by placing the capture device in the non-dominant hand. Once
the experiment started, no interaction with the participant took place [4]. From
the experimental design viewpoint, a series of videos with a length of 147 s were
shown with the aim of eliciting different high and low arousal emotional states.
Concretely, we were interested in producing great arousal variation (high arousal
or excitement vs low arousal state or calm). In order to change the state of mind
produced by each clip and induce a neutral state, a series of distracting tasks
were placed between the clips. Physiological signals were collected for subse-
quent analysis, with a sampling frequency of 4 Hz for the EDA. Further information on the procedure can be consulted in a recent paper [17].

2.3 Evaluation Metrics


All metrics used during the comparison are in terms of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN):
– The precision (P) refers to the ratio of successful positive predictions.

P = TP / (TP + FP)   (1)

– The recall (R) is defined as the proportion of positive cases captured.

R = TP / (TP + FN)   (2)

– The F1-score is a measure of the accuracy of a test. It is defined as the harmonic mean between precision and recall.

F1-score = (2 × P × R) / (P + R) × 100   (3)
– The area under the curve (AUC) and receiver operating characteristics (ROC)
curve are performance measurements for classification problems at various
threshold settings. ROC is a probability curve and AUC represents a degree
or measure of separability.
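For reference, these four metrics can be computed with scikit-learn as in the short example below; the label and score arrays are arbitrary toy values used only to illustrate the calls.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])    # ground-truth arousal labels
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])    # hard predictions of a classifier
y_score = np.array([0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1])  # predicted scores

precision = precision_score(y_true, y_pred)     # TP / (TP + FP)
recall = recall_score(y_true, y_pred)           # TP / (TP + FN)
f1 = f1_score(y_true, y_pred) * 100             # harmonic mean, expressed in %
auc = roc_auc_score(y_true, y_score)            # area under the ROC curve
print(precision, recall, f1, auc)
```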

2.4 Statistical Analysis of EDA Classical Feature Extraction


In order to test whether there are statistically significant differences between the
signals corresponding to high and low arousal conditions, the non-parametric
Mann-Whitney U test was used. The statistical significance p-value was consid-
ered significant only when p < 0.05.

3 Methods
This section shows the different methods used to evaluate the effectiveness of
the classifiers used during this comparison. It has been divided into EDA signal
processing techniques and the different methods for obtaining the features that
will be used by the evaluated classifiers.

3.1 Electrodermal Activity Processing

Electrodermal activity (EDA) is a physiological variable that effectively reflects


the changes in activation that occur in the human body. The variation and level
of the activation are not the same for all people. It depends on race, gender,
physical state and age. Therefore it is necessary to know how these signals are
to be processed to generalise this research.
One of the first steps applied to the signal is filtering and artefact removal. To minimise noise, the signal has been filtered with a cut-off frequency of 4 Hz. Moreover, a Gaussian filter has been applied to smooth the signal. Finally,
the artefacts have been eliminated. To carry out this process, Ledalab, a Matlab
framework, has been used [14]. Then, a separation process called deconvolution
or decomposition has to be carried out. This process consists of separating the skin conductance (SC) into its two constituent components, called
tonic (SCL) and phasic (SCR) [3], as it can be seen in Eqs. (4), (5) and (6). The
first component is the skin conductance level (SCL), also known as the tonic,
and the second is known as the skin conductance response (SCR), commonly
referred to as the phasic.

SC = SCL + SCR = SCtonic + SCphasic (4)

SC = SCtonic + Driverphasic ∗ IRF (5)


SC = (Drivertonic + Driverphasic ) ∗ IRF (6)
where IRF is the impulse response function and ∗ is the convolution operator.
SC / IRF = DriverSC   (7)
where the operator / corresponds to the continuous deconvolution operation [3].

DriverSC = Drivertonic + Driverphasic (8)


Driverphasic = SC / IRF − Drivertonic   (9)
The Driverphasic is in charge of normalising all the data obtained from the
different participants. It is regarded as the most effective signal for determining
an individual’s response to a stimulus.
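The sketch below is a very rough stand-in for this processing stage, assuming the skin conductance arrives as a 1-D NumPy array sampled at 4 Hz: it smooths the signal and separates a slowly varying tonic level from the phasic remainder with a long median filter. It is not the continuous deconvolution of Eqs. (5)-(9) performed by Ledalab; the window and sigma values are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d, median_filter

def decompose_eda(sc, fs=4, smooth_sigma=2.0, tonic_window_s=10.0):
    """Crude tonic/phasic split: Gaussian smoothing, then a long median filter
    as the tonic estimate (SCL) and the remainder as the phasic part (SCR)."""
    sc = np.asarray(sc, dtype=float)
    sc_smooth = gaussian_filter1d(sc, sigma=smooth_sigma)   # noise attenuation
    win = int(tonic_window_s * fs) | 1                      # odd window length
    tonic = median_filter(sc_smooth, size=win)              # slow trend (SCL)
    phasic = sc_smooth - tonic                              # fast responses (SCR)
    return tonic, phasic
```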

3.2 Feature Extraction

Classical Feature Extraction. One fundamental aspect is how to obtain the


features to feed the subsequent classifiers. Parameters are obtained from the pha-
sic component (SCR), due to the fact that this component measures the partici-
pant's state of arousal. Traditionally, signals have been processed to obtain temporal, frequency, morphological and statistical features. In this sense, there is a large body of literature showing which parameters to look for in order to quantify

Table 1. Features obtained from skin conductance response (SCR)

Analysis Features
Temporal M, SD, MA, MI, DR, D1, D2, D1M, D2M, D1SD, D2SD
Morphological AL, IN, AP, RMS, IL, EL
Statistical SK, KU, MO
Frequency F1, F2, F3

the signals of EDA [2,22]. Table 1 details all the features used to measure each segment of the SCRDriver.
The parameters in the temporal domain provide information regarding the
mean values and their variability over a given period of time. These parameters
are the mean value (M), standard deviation (SD), maximum and minimum peak
value (MA and MI) and dynamic range (DR). Other features used are the first
and second derivative (D1, D2), their means (D1M, D2M) and their standard
deviations (D1SD and D2SD). The use of these indicators is based on the idea that an intense stimulus will produce a steeper gradient than a less intense one. After a peak in the gradient, the time needed for the recovery results in a smoother slope of opposite sign.
From a morphological perspective, the following parameters have been used:
arc length (AL), integral area (IN), normalised mean power (AP), root mean
square (RMS), energy and perimeter ratio (EL) and perimeter and area ratio
(IL). In this sense, these features are responsible for displaying the shape of
the signal (SCRDriver ). Statistical parameters have also been used to provide
information on the distribution and variability of the data series. These are
skewness (SK), kurtosis (KU) and momentum (MO). In the frequency domain,
the fast Fourier transform (FFT) for bandwidths F1 (0.1, 0.2), F2 (0.2, 0.3)
and F3 (0.3, 0.4) has been computed. The aim of these parameters is to find
a quantitative characterisation that allows us to identify in a simple way the
different states of the participant.
In this sense, it should be highlighted that statistically significant differences
were found in the time domain for MA (p < 0.012), MI (p < 0.0345), DR (p <
0.007) and D1M (p < 0.003). In addition, significant differences in morphological
features were found for AL (p < 0.042), IN (p < 0.022) and EL (p < 0.002). For
the statistical parameters, KU (p < 0.0001) and SK (p < 0.005) show statistically
significant differences. Lastly, no statistically significant differences were found
in the frequency domain for any feature.
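As an illustration, the sketch below computes a subset of the Table 1 features on one phasic segment; the exact definitions used in the paper (for instance AP, IL, EL or MO) are not reproduced, and the implementations shown are simple approximations.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def scr_features(seg, fs=4):
    """Subset of the Table 1 features for one SCR (phasic driver) segment."""
    seg = np.asarray(seg, dtype=float)
    d1, d2 = np.diff(seg), np.diff(seg, n=2)          # 1st and 2nd derivatives
    dt = 1.0 / fs
    spectrum = np.abs(np.fft.rfft(seg)) ** 2
    freqs = np.fft.rfftfreq(len(seg), d=dt)
    return {
        "M": seg.mean(), "SD": seg.std(),
        "MA": seg.max(), "MI": seg.min(), "DR": seg.max() - seg.min(),
        "D1M": d1.mean(), "D1SD": d1.std(), "D2M": d2.mean(), "D2SD": d2.std(),
        "AL": np.sum(np.sqrt(dt ** 2 + d1 ** 2)),     # arc length
        "IN": np.sum(seg) * dt,                       # integral area
        "RMS": np.sqrt(np.mean(seg ** 2)),
        "SK": skew(seg), "KU": kurtosis(seg),
        "F1": spectrum[(freqs >= 0.1) & (freqs < 0.2)].sum(),
        "F2": spectrum[(freqs >= 0.2) & (freqs < 0.3)].sum(),
        "F3": spectrum[(freqs >= 0.3) & (freqs < 0.4)].sum(),
    }

features = scr_features(np.random.randn(60))          # e.g. a 15 s segment at 4 Hz
```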

Feature Extraction for Time-Dependent Algorithms. In this second pro-


posal, the parameters directly extracted from the phasic signal are used. Signal
segments sized to the same length as those used in the classical method are
introduced directly. This can be achieved by means of the sliding window (WS)
segmentation method [6,15]. This method consists of creating consecutive sub-
segments from a signal of longer temporal length. These segments overlap each
other so that between one segment and the next there are data in common. The
sliding window algorithm is compelling due to its great simplicity.
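A minimal sketch of this segmentation is given below; the window length and step are illustrative values, not the ones used in this study.

```python
import numpy as np

def sliding_windows(signal, win_len, step):
    """Split a 1-D phasic signal into overlapping sub-segments of 'win_len'
    samples, advancing 'step' samples each time (step < win_len gives overlap)."""
    signal = np.asarray(signal)
    starts = range(0, len(signal) - win_len + 1, step)
    return np.stack([signal[s:s + win_len] for s in starts])

# Example: 21 s of EDA at 4 Hz split into 10 s windows with 50% overlap
segments = sliding_windows(np.random.randn(84), win_len=40, step=20)
print(segments.shape)   # (3, 40)
```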

3.3 Binary Classifiers Used


In our case, it must be understood that we want to compare different binary
classifiers. Accordingly, this section shows the different types of classifiers used
in the study. In this sense, classifiers based on metric distance, clustering, trees
and neural networks of different types have been employed. They have been
grouped into two main groups: classical binary classifiers and artificial neu-
ral networks (ANNs). The first group has been subdivided into four classifica-
tion groups: support vector machines, clustering algorithms, decision trees and
ensemble methods, while the second group has been subdivided into four differ-
ent types of topologies and configurations of ANNs. In most cases, a parameter
search method called grid-search has been used. This technique allows us to per-
form iterations on the topology and parameters of the model in order to find the
optimal one.
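As an illustration of this grid-search strategy, the snippet below tunes the C and γ hyperparameters of an RBF support vector machine with scikit-learn; the toy data and the candidate grids are assumptions, although the grid includes values close to the optima reported later (C = 0.75, γ = 420).

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy data standing in for the 21-feature vectors and binary arousal labels
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 21))
y = rng.integers(0, 2, size=120)

# Exhaustive search over candidate values of C and gamma with 5-fold CV
param_grid = {"C": [0.25, 0.5, 0.75, 1.0], "gamma": [1e-3, 1e-2, 324, 420]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, scoring="f1", cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```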

Artificial Neural Networks for Binary Classification. Different ANN


topologies have been selected (see Table 2). These configurations include per-
ceptron (MLP), convolutional (1D-CNN) networks and also time-dependent net-
works such as long short-term memory (LSTM), bilateral long short-term mem-
ory (BiLSTM) and gated recurrent unit (GRU) [11,13,26].
The setups employed for this experiment have been developed based on pre-
vious studies. The multilayer perceptron with backpropagation (MLP-BP) con-
sists of an input layer, three fully-connected hidden layers and a softmax output
layer. A dropout layer (DP) is established between each of them. Secondly, for
the one-dimensional convolutional neural network (1D-CNN), a total of 11 layers are arranged: 3 convolutional layers with 3 max-pooling layers between them, 2 fully connected layers with 1 dropout layer in between and, lastly, 1 softmax layer [8].
An 8-layer model has been proposed for the LSTM architecture, consisting of
an input layer, 3 layers of LSTM cells followed by two fully connected layers with
a dropout layer in between and a softmax layer as the output layer. Similarly,
for the Bilateral LSTM network model the previous model has been taken as
a basis, replacing the layers containing LSTM cells by bilateral LSTM cells
(BiLSTM) [12]. Finally, for a gated recurrent units network (GRU) the same
reasoning as in the two previous models has been followed, but using GRU cells
for each of the hidden layers. In each case, the grid-search algorithm will look
for the number of cells in each of the layers, the optimiser (Adam or SGD) and
the learning rate that lead to the optimal configuration.

Other Classical Binary Classifiers. As can be seen in Table 3, seven dif-


ferent configurations have been selected for support vector machines (SVMs).
Three polynomial configurations (linear, quadratic, cubic), one radial configura-
tion (Rbf) and three different Gaussian configurations (fine, medium and coarse)
have been used. For decision trees (DT), three different setups were used (fine,

Table 2. ANNs’ topologies and parameters

Classifier Layers configuration


MLP-BP I+FC1+DP+FC2+DP+FC3+DP+Sf
1D-CNN I+C1+MP+C2+MP+C3+MP+FC1+DP+FC2+SF
LSTM I1+LSTM1+LSTM2+LSTM3+FC1+DP+FC2+SF
BiLSTM I1+BiLSTM1+BiLSTM2+BiLSTM3+FC1+DP+FC2+SF
GRU I1+GRU1+GRU2+GRU3+FC1+DP+FC2+SF

medium and coarse): the fine setup with 100 splits, the medium with 30 and the coarse with 5, all using Gini's criterion. Moreover, for tree ensemble methods, three setups have been chosen: random forest, boosted and RUS boosted. In addition, for clustering methods (KNN), we selected 5 different configurations: fine, medium and coarse with 2, 10 and 100 neighbours; cosine with 20 neighbours and an angular metric; and weighted with 20 neighbours and the Manhattan metric.

Table 3. Classical binary classifiers’ configurations and parameters

Classifier Type Configuration Setup parameters


SVM Polynomial Linear, quadratic, cubic C, γ
SVM Gaussian Radial, fine, medium, coarse C, γ
Tree Decision tree Fine (100), Medium (30), Coarse (5) Gini’s criterion
Ensemble Tree Random forest, boosted, RUS Gini’s criterion
KNN K-neighbours Fine (2), Medium (10), Coarse (100) Euclidean metric
KNN K-neighbours Cosine (20) Angular metric
KNN K-neighbours Weighted (20) Manhattan metric

4 Results

Different deep learning frameworks such as Keras, Tensorflow 2.0 and Pytorch
have been used for the design and implementation of the different models and
architectures. All the results have been obtained using a cluster composed of an
Intel Xeon Extreme 10990k of 20 cores, 64 gigabytes of RAM and two graphics
cards Nvidia Quadro P5000 with 11 GB of VRAM working in parallel.
In this respect, a training process of 200 epochs has been carried out. Depend-
ing on the type of classifier, most of the models converge between 50 and 100
epochs. To ensure that the different topologies are consistent with the training
and the test, another 100 repetitions have been performed for each. Since a grid-
search process has been used to locate the optimal hyperparameters for each
of the models, early stopping conditions have been added. In addition, all the results are stored for later analysis.

4.1 Optimised Parameters for Different Classifiers

In order to create models that are as efficient as possible, a process of selection


and optimisation of the different parameters has been carried out. The initial
parameters selected for Tree, Ensemble, KNN already comply with the specifi-
cations, without any serious alteration of the results obtained by grid-search. In
this sense, only the parameters C, γ have been calculated for SVM, obtaining
for them the values C = 0.75 and γ = 420 for a classical approach and C = 0.87
and γ = 324 for a procedure based on temporal parameters.
Table 4 shows the configurations that have been found to perform well for the
two paths mentioned. At this point, and after performing different iterations on
the models, it has been obtained that there is no great variation in the number of
units needed for each layer in the mentioned neural networks, with the exception
of the number of data in the input, which has changed from the classical features
strategy to the raw signal. In this case, for the input of all models based on a
classical parameter approach, the number of units of the input layer corresponds
to 21, matching with the number of parameters. On the other hand, for a time-
based parameter strategy, we have to enter 80, 86, 92, and 100 units for 1D-CNN,
LSTM, BiLSTM and GRU, respectively.

Table 4. Optimised parameters for each neural network

Classifier: Layers configuration (Parameters)
MLP-BP: I(*)+FC1(64)+DP(0.6)+FC2(32)+DP(0.5)+FC3(8)+DP(0.43)+Sf(2) (Opt: Adam, LR: 1.06 × 10^-7)
1D-CNN: I(*)+C1+MP(2)+C2+MP(2)+C3+MP(2)+FC1(100)+DP(0.5)+FC2(100)+Sf(2) (Opt: Adam, LR: 1.06 × 10^-7)
LSTM: I(*)+LSTM1(48)+LSTM2(64)+LSTM3(12)+FC1(90)+DP(0.32)+FC2(100)+SF(2) (Opt: SGD, LR: 1.87 × 10^-4)
BiLSTM: I(*)+BiLSTM1(82)+BiLSTM2(34)+BiLSTM3(22)+FC1(34)+DP(0.65)+FC2(67)+SF(2) (Opt: SGD, LR: 1.87 × 10^-4)
GRU: I(*)+GRU(64)+GRU(30)+GRU(26)+FC1(100)+DP(0.5)+FC2(84)+SF(2) (Opt: SGD, LR: 1.87 × 10^-4)

4.2 Consumption Time for Each of the Trained Models

Apart from the classification results, another aspect considered is the training
time needed for each of the predictive models. The times corresponding to the
best performing models have been selected. The model that takes the least time to train in the classical approach is the ensemble Tree method, at 5 min. In contrast, the
most time-consuming model is the BiLSTM with a time of 32 min. On the other
hand, using this second approach, the model that took the least time to train
was the Tree model with 4 min, while the model that took the longest to train
was the GRU-based model with 22 min.

4.3 Results from Different Classifiers


Once the different classifiers have been trained, the following results have been
obtained. In the following table (see Table 5) we have the results for the two
approaches discussed above using the same dataset.

Table 5. F1-score (mean and standard deviation) and AUC value of the different
classifiers

Classic approach Time-dependent


Classifier Type F1-score (%) AUC F1-score (%) AUC
SVM Linear 75.0(0.02) 0.76(0.01) 54.2(0.5) 0.56(0.12)
SVM Quadratic 76.4(0.01) 0.76(0.01) 76.00(0.68) 0.79(0.03)
SVM Cubic 81.00(0.78) 0.76(0.32) 62.06(0.68) 0.53(0.0)
SVM Radial 86.10(0.01) 0.81(0.20) 75.35(0.0) 0.67(0.03)
SVM Fine Gaussian 75.5(0.12) 0.76(0.04) 68.35(0.02) 0.70(0.0)
SVM Medium Gaussian 74.8(0.09) 0.76(0.12) 69.35(0.01) 0.72(0.1)
SVM Coarse Gaussian 76.6(0.04) 0.76(0.31) 69.22(0.11) 0.73(0.01)
Tree Fine 61.0(1.82) 0.60(0.001) 75.35(0.0) 0.57(0.03)
Tree Medium 72.6(2.36) 0.73(0.0) 71.85(0.0) 0.69(0.05)
Tree Gaussian 68.0(2.82) 0.69(0.001) 75.35(0.0) 0.75(0.007)
Ensemble Random forest 79.3(0.14) 0.78(0.10) 77.8(0.14) 0.78(0.10)
Ensemble Boosted 77.8(0.14) 0.78(0.10) 79.8(0.04) 0.79(0.02)
Ensemble RUS 76.3(0.18) 0.79(0.15) 81.0(0.01) 0.76(0.01)
KNN Fine 74.10(0.31) 0.82(0.001) 71.01(0.01) 0.87(0.03)
KNN Medium 79.19(0.79) 0.86(0.02) 79.0(0.01) 0.79(0.03)
KNN Coarse 78.57(0.86) 0.78(0.00) 65.35(0.01) 0.68(0.03)
KNN Cosine 73.22(0.75) 0.89(0.10) 65.05(0.02) 0.67(0.03)
KNN Weighted 71.30(0.52) 0.82(0.10) 68.66(0.19) 0.57(0.03)
MLP Ours 92.02(0.17) 0.93(0.0) 81.95(0.59) 0.78(0.01)
1D-CNN Ours 90.95(0.59) 0.78(0.01) 81.83(0.69) 0.83(0.01)
LSTM Ours 84.95(0.03) 0.82(0.01) 88.12(0.13) 0.90(0.06)
BiLSTM Ours 85.12(0.59) 0.84(0.00) 91.02(0.09) 0.92(0.0)
GRU Ours 86.65(0.59) 0.86(0.00) 86.07(0.63) 0.89(0.16)

Within the group of support vector machines (SVMs), among the polynomial-type
kernels the best performing is the cubic kernel with the standard-features approach,
with an F1-score of 81.0%. Among the Gaussian-type kernels, the radial (RBF) kernel
achieves the highest F1-score, 86.10%. These good results are largely due to the fact
that these algorithms handle a large number of parameters very efficiently. For the
second approach, the results are worse: 76% and 69.35% for the quadratic and medium
Gaussian kernels, respectively.
Continuing with the group of decision trees (DT), the best performing variant is
the medium tree for the first approach and the Gaussian tree for the second, with
results of 72.6% and 75.35%, respectively. In this case, and for the same signal length,
both can be used with similar results. Following the same reasoning for the tree
ensemble methods, similar results are obtained with the random forest and RUS
methods for the two case studies, 79.3% and 81%, respectively. For the k-nearest
neighbour (KNN) classifiers, the best variant in both cases is KNN medium, with
79.19% and 79.0%, respectively.
Finally, the artificial neural network-based algorithms are analysed. Within these
models, the multilayer perceptron-based architecture reaches a very high performance
of 92.02% in detecting arousal variations under the first approach. Under the second
approach, the best performing model is the BiLSTM-based model with 91.02%, which
is largely due to the time-dependent operation of these networks. For all other networks
the results remain above 81%, so we can conclude that these algorithms distinguish
the different states well.

5 Conclusions
In this article, a comparison between using features (temporal, morphological,
statistical and frequential) and segments of raw phasic electrodermal activity signals
(time-series approach) with different machine learning algorithms has been performed.
Different architectures and classifiers have been proposed in order to determine which
strategy works best with EDA signals. For our first scenario, based on 21 features
derived from time, morphology, statistics and frequency-dependent magnitudes, any
classical binary classifier such as SVM, KNN, Tree and their ensembles (random forest,
RUS, boosted) works well, with F1-scores of 86%, 79.19%, 72.6% and 79.3%, respectively.
It also works well with network architectures such as the MLP-BP and one-dimensional
convolutional networks, with performances of 92% and 90.95%. For the second proposal,
as expected, only the time-dependent networks achieve remarkable results.
Our contribution lies in determining whether these network topologies are suitable
for detecting levels of arousal variation in contrast with other binary classification
models. In addition, we have identified which algorithms perform best for the two
proposed scenarios. It can be concluded that the time-series approach (our second
scenario), as expected, works really well with LSTM, BiLSTM and GRU, as EDA
signals are intrinsically time-dependent.

Acknowledgements. This work has been partially supported by Spanish Ministerio
de Ciencia e Innovación, Agencia Estatal de Investigación (AEI)/European Regional
Development Fund (FEDER, UE) under EQC2019-006063-P and PID2020-115220RB-
C21 grants, and by CIBERSAM of the Instituto de Salud Carlos III. Roberto Sánchez-
Reolid holds the BES-2017-081958 scholarship from the Spanish Ministerio de Educación y
Formación Profesional.

Deep Learning
Context-Aware Graph Convolutional
Autoencoder

Asma Sattar(B) and Davide Bacciu(B)

Dipartimento di Informatica, Università di Pisa, L.Go B. Pontecorvo 3, Pisa, Italy


[email protected], [email protected]

Abstract. Recommendation problems can be addressed as link predic-


tion tasks in a bipartite graph between user and item nodes, labelled
with rating on edges. Existing matrix completion approaches model the
user’s opinion on items by ignoring context information that can instead
be associated with the edges of the bipartite graph. Context is an impor-
tant factor to be considered as it heavily affects opinions and preferences.
Following this line of research, this paper proposes a graph convolutional
auto-encoder approach which considers users’ opinion on items as well
as the static node features and context information on edges. Our graph
encoder produces a representation of users and items from the perspec-
tive of context, static features, and rating opinion. The empirical analysis
on three real-world datasets shows that the proposed approach outper-
forms recent state-of-the-art recommendation systems.

Keywords: Context-aware recommendation · Deep learning for


graphs · Graph neural networks

1 Introduction
Recommender systems provide a methodology to recognize user’s requirements,
and foresee interest by mining information on the history of users and their inter-
actions with items. These systems aim to suggest interesting items to the user.
Nowadays, recommender frameworks have been effectively deployed in numer-
ous application settings, e.g., movie recommendation on Netflix and Movielens,
friend recommendations on Facebook, and product recommendation on Amazon
and Ebay, etc. Collaborative Filtering (CF) is the most widely used approach
for recommendation tasks [8,22]. The core idea behind CF based approaches is
that similar users will have similar behaviour and they will hence like similar
items/services. These approaches leverage user-item interactions, e.g., rating,
clicks, and purchases, to predict user’s preferences. The recent renewed popu-
larity of Graph Neural Networks (GNNs) [1] promoted their exploitation also
in the development of recommendation systems [17]. Many of these approaches
treat recommendation tasks as link prediction in bipartite graphs via matrix
completion [2,23]. An adjacency matrix is used to represent a bipartite graph
between user and item nodes, where matrix entries represent the edges between

nodes. Recently, many researchers contributed towards the development of GNN


based Collaborative filtering for modelling user-item interactions in the form of
a message passing neural network between user and item nodes [15,16].
Despite the popularity of CF based approaches, their effectiveness is limited
by the fact that they neglect the static features (user’s and item’s profile) and
dynamic context information of users and items. The term dynamic reflects
the fact that user choices change with time and are highly dependent on the
context under which they interact with the item. For example, weather and time
information highly impact the choice of users in restaurant recommendation,
while the user’s company influences which movie they are most likely to watch.
As such, it is critical to develop context-aware recommender systems that can
adequately accommodate the user’s static features as well as dynamic context
information while making predictions [12].
In [6], context-aware recommendation problems are mapped to tensor com-
pletion tasks, inspired by the matrix completion view of CF, but it suffers
from high complexity. SocialMF [5] proposes a matrix factorization approach
by integrating the social context information as a trust factor between users
for social recommendation. Following SocialMF, several solutions put forward
a deep learning-based matrix factorization approach for context-aware recom-
mendation tasks [4,9,20], but these models are generally characterized by high
computational costs at prediction time. Existing approaches for context-aware
recommendation are unable to capture dynamic user-item-context deep interac-
tion and ignore the fact that the same person can behave differently when inter-
acting with the same item under different context [10]. It is therefore reasonable
to expect an improvement in the quality of personalized recommendations when

[Figure 1 content: a bipartite graph between users (U1–U3) and items (V1–V3); each edge is labelled with a rating r and a set of context attributes C, i.e. (u_id, v_id, c_id) = r; nodes carry static attributes — User: u_id, age, gender; Item: v_id, director, genre; Context: c_id.]
Fig. 1. The data used for building context-aware recommender systems takes the form
of a 3D matrix between user, item and context. This can be represented as a bipartite
graph between user and item nodes. The edges are labelled with context and rating
information.

incorporating dynamic context information. This is the key focus and motivation
underlying this work.
We introduce a novel GNN-based matrix completion approach that effec-
tively integrates three kinds of information from the bipartite graph (Fig. 1):
(i) User’s opinion/rating on items, (ii) Context information on edges between
user and items, (iii) Static features of users and items. In particular, we propose
a context-aware graph convolutional autoencoder for matrix completion. Our
graph convolutional autoencoder learns from rating information, static features,
and dynamic context to produce final embeddings for user and item nodes from
different perspectives. We explicitly tackle the importance of context features
for individual users. The final embedding generated from the graph encoder is
given as input to the decoder, which attempts to reconstruct the matrix with
minimum loss. In summary, the key contributions of this paper are as follows:
– We highlight the limitation of the existing supervised deep learning
approaches for matrix completion tasks;
– We present a method to incorporate context information for graph neural
network based matrix completion tasks;
– We propose a novel graph convolutional autoencoder, for integrating context
information for link prediction in bipartite graphs, set to capture the impor-
tance of context for individual user and items, during recommendations.
– We provide an empirical validation of the effectiveness of our approach on
three real-world datasets.

2 Related Work
Most of the work in the field of context-aware recommender systems has been
dedicated to the improvement of matrix factorization (MF) approaches [4,9,14],
which is a typical methodology for recommendation models based on latent
factors [19]. Despite the good performance of MF, the primary drawback is their
inability to capture the deep interactions between users and items. These matrix
factorization based techniques can predict unseen items that a user may have
interest in, but model building is expensive and often requires to trade between
accuracy and scalability [13].
Graph neural network (GNN) solutions leverage the structural information
of graphs for recommendation tasks [17]. In bipartite graphs user-item interac-
tions are used to learn node representations for the users. GCMC [2] extends the
graph-based approach to bipartite graphs with rating on edges. STAR-GCN [23],
instead, stacks multiple identical GCN encoder-decoders combined with inter-
mediate supervision to improve the final prediction performance. Both STAR-
GCN and GCMC treat all neighbours of a node equally and message propagation
depends only on the neighbouring node. IGMC [24] provides an inductive matrix
completion way for recommendation tasks without using side information.
The previous deep learning based CF approaches [18,25] discard collabora-
tive signals which are hidden in user-item interaction and are unable to have
collaborative filtering effects. In NGCF [15], the embedding propagation layer

exploits a user-item bipartite graph and encodes high-order connectivity signals


into representation. This has been further improved by an attention mechanism
for selecting neighbours during embedding propagation. GCF-YA [21] is a deep
graph neural network implementation of collaborative filtering, based on infor-
mation propagation and attention mechanism to predict missing links between
users and items. GraphRec [3] tackles social recommendation by aggregating the
historical behaviour of individuals from user-item and user-user bipartite graph
for recommendation.
The GNN-based approaches surveyed above only consider rating information on the edges
between users and items and static node features, ignoring the importance of con-
text features. In the following, we show how it is possible to extend such approaches
to consider dynamic and time-varying contextual features influencing recommen-
dations.

3 Problem Definition
In this work, we build our approach by considering a 3D rating matrix $M_{uvc} \in \mathbb{R}^{N_u \times N_v \times N_c}$,
where $N_u$ is the total number of users in the system, $N_v$ is the total number of items
and $N_c$ is the total number of different contexts (as shown in Fig. 1). Typically, the
rating scale ranges from one to five stars, such that $M_{uvc} \in \{1, \dots, 5\}^{N_u \times N_v \times N_c}$.
Here, the context comprises dynamic attributes such as location, time, mood, weather,
company, etc. Note that the importance of contextual attributes varies from person
to person and item to item; hence, context selection is an important issue. Both users
and items can be represented by a multitude of characteristics: examples of static user
features are age, gender, etc. Let $N_{F_u}$ and $N_{F_v}$ represent the total number of features
of users and items, respectively.
Given such data, the recommendation problem is then cast as a task aiming
to predict the existence of a labelled link between a user and an item. This
work aims to introduce context information to matrix completion tasks with
mechanisms for finding which context attributes are important for a target user
and item. Details of the learning model are discussed in Sect. 4.

4 Context-Aware Graph Neural Network


We extend the link prediction problem in bipartite user-item graphs to consider
matrix completion leveraging context information on graph edges. Our solu-
tion, dubbed context-aware graph convolutional matrix completion (cGCM C F ),
extends the graph convolutional autoencoder in [2] (GCM C+f eat, in the fol-
lowing). GCM C+f eat operates on a 2D rating matrix between users and items
with side node features, ignoring context features on edges. As shown in Fig. 2,
GCM C+f eat is the innermost block (depicted in grey) inside the graph encoder,
while cGCM C F (in purple) is our proposed extension that operates on user-
item-context interaction mapped in a 3D matrix. Overall, the proposed archi-
tecture has two components: a graph encoder and a decoder. The graph encoder
consists of two graph convolutional neural network layers and two dense neural

[Figure 2 content: input data (user-item opinion matrix A of size Nu×Nv; user/item feature matrices UF and VF; user-item-context matrix Auvc of size Nu×Nv×Nc; user/item context-importance matrices UC and VC) feeds the graph encoder, made of weight-sharing GCN layers and dense layers (the grey GCMC+feat block plus the cGCMC_F extension); the concatenated embeddings Zu and Zv go to the decoder, which outputs the reconstructed Nu×Nv matrix.]

Fig. 2. High-level architecture of the proposed context-aware graph convolutional


autoencoder. From left to right: the green box shows the various form of input data,
the blue box is a graph encoder with four neural network layers operating on different
types of input data, and the right-most box is a decoder for reconstructing the user-item
matrix. (Color figure online)

network layers. Each layer operates on different data to produce a representation


of users and items with respect to opinion, context, and feature information, as
explained in Sect. 4.1. The decoder is, instead, used for link prediction in the
bipartite graph and its details are discussed in Sect. 4.2.

4.1 Encoder
The first component of our model is a graph convolutional encoder. Its inputs
are the following matrices:
User×Item Rating Matrix (A): the matrix A represents the user's opinion
on items. It is composed of sub-matrices $A_r \in \mathbb{R}^{N_u \times N_v}$, defined as
$A_r[u][v] = 1 \iff (u, v) = r$, with $r \in \{1, 2, \dots, R\}$.
User×Item×Context Matrix ($A_{uvc}$): the binary matrix $A_{uvc} \in \mathbb{R}^{N_u \times N_v \times N_c}$
represents the context under which a user has provided a specific rating to items.
User×Context Matrix ($U_C$): the matrix $U_C \in \mathbb{R}^{N_u \times N_c}$ denotes the
importance of context for individual users. It leverages the information from the
above matrices and gives more weight (α) to the context in which a user has
given the maximum rating, compared to the context under which the user has
rated less.
Item×Context Matrix ($V_C$): we use $A_r$ in a similar way as for $U_C$ above,
giving more importance to the context attributes under which an item is rated
maximally. Here, $V_C \in \mathbb{R}^{N_v \times N_c}$.
User×Feature Matrix ($U_F$): this matrix, in $\mathbb{R}^{N_u \times N_{F_u}}$, represents normalized
static feature attributes for users.
Item×Feature Matrix ($V_F$): this matrix, in $\mathbb{R}^{N_v \times N_{F_v}}$, represents normalized
static feature attributes for items.

Next, we explain how the graph encoder operates on the matrices defined above
to learn the representations of users and items with respect to rating, context
and static features.
Modelling User's Opinion. User opinion on items is encoded in the adjacency
matrix A, representing a bipartite graph between users and items. For modelling
the user opinion, we have used a graph convolutional layer with a weight sharing
mechanism [18] across the locations of the graph. Our choice of weight sharing
policy allows differentiating convolutional weights based on the edge types: the
number of weight matrices depends on the rating level on the edge. Let R denote
the number of available rating levels; given an edge labelled with rating r, we use
the rating-specific parameter matrix $W_r$ for propagation of the message from that
edge. We found this customized weight sharing mechanism for spectral graph
convolution on the adjacency matrix effective compared to using a single global
weight matrix. Details of this spectral convolutional layer are given in the following:


$z_u^o = \mathrm{GCN}\big(X_v, \{A_i\}_{i=0}^{R}\big) = \sigma\Big(\sum_{i=0}^{R} \tilde{A}_i X_v W_i^v\Big)$   (1)

$z_v^o = \mathrm{GCN}\big(X_u, \{A_i^T\}_{i=0}^{R}\big) = \sigma\Big(\sum_{i=0}^{R} \tilde{A}_i^T X_u W_i^u\Big)$   (2)

where $X_u$ and $X_v$ are the one-hot unique vectors for the user and item nodes, R is
the maximal rating a user can give to an item, $W_i^u$ and $W_i^v$ are the R trainable
weight matrices, and σ is a non-linear activation function such as ReLU. The matrices
$\tilde{A}_i$ and $\tilde{A}_i^T$ are the normalized adjacency matrix $A_i$ and its transpose, respectively.

$\tilde{A}_i = D^{-1/2} A_i D^{-1/2} \quad \forall\, i = 0, \dots, R$   (3)

where D is the diagonal degree matrix (so that $D^{-1/2}$ carries the inverse square roots
of the node degrees on its diagonal). Similarly, $A_i^T$ is normalized to obtain $\tilde{A}_i^T$ (using Eq. (3)).
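As an illustration of Eqs. (1)–(3), the following is a minimal dense PyTorch sketch of the rating-specific convolution; the one-hot inputs, dense matrices and initialisation are assumptions made for readability (a practical implementation would presumably use sparse operations).

```python
import torch
import torch.nn as nn

def normalize(adj: torch.Tensor) -> torch.Tensor:
    # Eq. (3): D^{-1/2} A_i D^{-1/2}, with separate user- and item-side degrees
    du = adj.sum(dim=1).clamp(min=1.0).pow(-0.5)
    dv = adj.sum(dim=0).clamp(min=1.0).pow(-0.5)
    return du.unsqueeze(1) * adj * dv.unsqueeze(0)

class OpinionGCN(nn.Module):
    """Eqs. (1)-(2): one trainable weight matrix per rating level."""
    def __init__(self, n_users: int, n_items: int, n_ratings: int, hidden: int):
        super().__init__()
        self.Wv = nn.ParameterList([nn.Parameter(torch.randn(n_items, hidden) * 0.01)
                                    for _ in range(n_ratings)])
        self.Wu = nn.ParameterList([nn.Parameter(torch.randn(n_users, hidden) * 0.01)
                                    for _ in range(n_ratings)])

    def forward(self, A_per_rating, Xu, Xv):
        # A_per_rating: list of (Nu, Nv) binary matrices, one per rating level
        zu = sum(normalize(A) @ Xv @ W for A, W in zip(A_per_rating, self.Wv))
        zv = sum(normalize(A).T @ Xu @ W for A, W in zip(A_per_rating, self.Wu))
        return torch.relu(zu), torch.relu(zv)

# usage with one-hot node vectors: Xu = torch.eye(Nu), Xv = torch.eye(Nv)
```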
Modelling Contextual Features. We have normalized all context attributes
in Auvc by dividing each context attribute with the total count of contexts rated
by the user. After that, we have accumulated the normalized context information
between users and items to get $A_c \in \mathbb{R}^{N_u \times N_v}$:

$A_c[u][v] = \sum_{i=0}^{N_c^{uv}} \frac{c_i^{uv}}{N_c^{uv}}$   (4)

where u and v are user and item indexes in the matrix, $N_c^{uv}$ represents the
count of occurrences of context c when user u has rated item v, and $c_i^{uv}$ denotes the
individual context value under which user u has rated item v.
For modelling this context information, we have proposed a spectral graph
convolution that operates on Ac with the same propagation rule used for mod-
eling user’s opinion (Eq. (1) and (2)) but using a single weight matrix. We

call the user and item representations with respect to context attributes $z_u^{c_1}$ and
$z_v^{c_1}$, respectively. The same user behaves differently for the same item when the
context changes, making the context a dynamic attribute. Similarly, items get
different ratings when the context changes. To model this type of dynamic
relation between users and items under varying context, we performed a statistical
analysis of the (training-only portion of the) data to identify the effect of an
importance factor α that gives more weight to the favourite context of users
and items. The extracted user preferences are stored in $U_C$:
$U_C[u][c] = \sum_{i,j}^{N_u,\, N_c^{uv_i}} A_{uvc}[u][v_i][c_j] \cdot \alpha[r], \quad r \in \{1, \dots, R\}$   (5)

where $N_u$ here denotes the neighbours of user u and $N_c^{uv}$ represents the number of
context attributes under which the user provides opinion r. Similarly, $V_C$ is obtained
using Eq. (5). Both matrices are normalized to have values between 0 and 1. We
add a simple dense neural network layer to process this information. The weight
matrices chosen for this purpose are initialized randomly and uniformly, and node
dropout is applied to the hidden layers to prevent overfitting. The operations of this
layer are defined as:
$z_u^{c_2} = \sigma(U_C W_3^c + b_c), \qquad z_v^{c_2} = \sigma(V_C W_4^c + b_c)$   (6)
We have integrated $z_u^{c_1}$ with $z_u^{c_2}$, and $z_v^{c_1}$ with $z_v^{c_2}$, to get the final representations
of user and item associated with the context attributes:

$z_u^c = \sigma([z_u^{c_1} \oplus z_u^{c_2}] W_5^c + b_c)$   (7)

$z_v^c = \sigma([z_v^{c_1} \oplus z_v^{c_2}] W_6^c + b_c)$   (8)

where W denotes trainable weight matrices and b is a bias.
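To make Eq. (5) concrete, the following is a hedged NumPy sketch of how the per-user context importance $U_C$ could be assembled; the row-wise normalisation to [0, 1] is an assumption about the unspecified normalisation step, and $V_C$ would be built analogously over items.

```python
import numpy as np

def context_importance(A_uvc: np.ndarray, ratings: np.ndarray, alpha: np.ndarray) -> np.ndarray:
    """Sketch of Eq. (5): weight the contexts of each observed rating by alpha[r].

    A_uvc  : (Nu, Nv, Nc) binary tensor of context attributes per observed rating
    ratings: (Nu, Nv) array, 0 = unobserved, 1..R = rating level
    alpha  : length R+1 array with alpha[r1] < alpha[r2] whenever r1 < r2
    """
    Nu, Nv, Nc = A_uvc.shape
    U_c = np.zeros((Nu, Nc))
    for u in range(Nu):
        for v in np.nonzero(ratings[u])[0]:          # neighbours of user u
            U_c[u] += A_uvc[u, v] * alpha[int(ratings[u, v])]
    # assumed normalisation so that every row lies in [0, 1]
    U_c /= np.clip(U_c.max(axis=1, keepdims=True), 1e-9, None)
    return U_c
```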
Modelling Static Content Features. We represent the static features of the user
and item nodes with $U_F$ and $V_F$, respectively. These features are not directly fed
into the graph convolutional layer, as this leads to suboptimal performance when
the user-item content information is too sparse. Therefore, we introduce a separate
dense neural network layer to get the representation of users and items with respect
to static features:

$z_u^f = \sigma(U_F W_3^f + b_f)$   (9)

$z_v^f = \sigma(V_F W_4^f + b_f)$   (10)

where $W_3^f$ and $W_4^f$ are trainable weight matrices and $b_f$ is a bias.
We concatenate the user representations from the rating/opinion (Eq. (1)), context
(Eq. (7)) and feature (Eq. (9)) perspectives and pass them to another dense hidden
layer to get the final user embedding:

$z_u = \sigma([z_u^o \oplus z_u^c \oplus z_u^f] W_6 + b)$   (11)

Similarly, the item representations from the rating/opinion, context and feature
perspectives are concatenated to get the final item embedding:

$z_v = \sigma([z_v^o \oplus z_v^c \oplus z_v^f] W_7 + b)$   (12)
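A minimal PyTorch sketch of this fusion step (Eqs. (11)–(12)) is given below; the per-view dimensions are free parameters, and the output size of 75 simply mirrors the encoder embedding size reported in the experimental setup.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Concatenate opinion, context and feature views and project to the final embedding."""
    def __init__(self, d_opinion: int, d_context: int, d_feature: int, d_out: int = 75):
        super().__init__()
        self.fc = nn.Linear(d_opinion + d_context + d_feature, d_out)

    def forward(self, z_o: torch.Tensor, z_c: torch.Tensor, z_f: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.fc(torch.cat([z_o, z_c, z_f], dim=-1)))

# the same kind of head (with its own weights, W6 vs. W7) is applied to user and item representations
```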

4.2 Decoder
The embedding of the user-item interaction is inputted to the bilinear decoder to
reconstruct the rating matrix (Â) between users and items. Here, we have treated
each rating as a separate class and addressed the problem as a classification task.
Our decoder produces probability distributions for all possible classes and it is
defined as
$\hat{A}_{ij} = \sum_{r \in R} r \cdot p(\hat{A}_{ij} = r), \qquad p(\hat{A}_{ij} = r) = \frac{e^{u_i Q_r v_j^T}}{\sum_{k \in R} e^{u_i Q_k v_j^T}}$   (13)

where the $Q_r$ are R trainable matrices of dimension D × D, D is the hidden dimension
of the user and item embeddings obtained from the encoder, and R is the set of possible
rating levels.
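The following PyTorch sketch illustrates this bilinear decoder; reading off the expected rating from the softmax distribution is an assumption consistent with the RMSE objective in Eq. (14), and the batched einsum form is ours rather than the authors'.

```python
import torch
import torch.nn as nn

class BilinearDecoder(nn.Module):
    """Eq. (13): per-rating bilinear scores followed by a softmax over rating levels."""
    def __init__(self, dim: int, n_ratings: int):
        super().__init__()
        self.Q = nn.Parameter(torch.randn(n_ratings, dim, dim) * 0.01)   # one Q_r per rating level

    def forward(self, zu: torch.Tensor, zv: torch.Tensor) -> torch.Tensor:
        # zu: (Nu, D) user embeddings, zv: (Nv, D) item embeddings
        logits = torch.einsum('ud,rde,ve->uvr', zu, self.Q, zv)           # u_i Q_r v_j^T for all pairs
        probs = torch.softmax(logits, dim=-1)                             # p(A_ij = r)
        levels = torch.arange(1, probs.size(-1) + 1, dtype=probs.dtype)
        return (probs * levels).sum(dim=-1)                               # expected rating per (u, v)
```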
We have tested our model under two settings, referred to as cGCMC and cGCMC_F.
cGCMC models the context effect together with the opinion matrix, while cGCMC_F
combines the context effect with the opinion as well as the static node features
(Eq. (11) and (12)).
Rating Prediction and Model Training. Our model is trained in an end-to-end
fashion by minimizing the root mean square error between the actual ($A_{ij}$) and
reconstructed ($\hat{A}_{ij}$) ratings, that is

$RMSE = \sqrt{\sum_{i,j} \frac{(\hat{A}_{i,j} - A_{i,j})^2}{n}}$   (14)

5 Experimental Setup
5.1 Datasets
We have evaluated the performance of our proposed algorithms cGCMC and
cGCMC_F on three real-world, publicly available datasets with context information
for movie and travel recommendations.
LDOS-CoMoDa1 is a famous dataset for movie recommendation. It con-
sists of 268 users, 4381 movies, 2287 ratings (scale 1–5) and 12 context variables.
The context information includes Time, Location, Day-type, Decision, Weather,
Mood, Season, End Emotion, Interaction, Physical, Social. Besides this informa-
tion, LDOS-CoMoDa also has static features for users and movies.
DaDePaulMovie2 is a movie dataset collected by researchers of the DePaul
University, with ratings acquired by survey. Students have been asked to rate
movies subject to 3 context variables: location, time, and companion information.
The dataset contains 79 movies, 97 users and 2720 ratings (scale 1–5).
Travel-STS3 contains information about the places visited by the tourist.
It is composed of 249 places, 325 users who rated (scale: 1–5) the visited place,
and 14 context variables.
1 https://www.lucami.org/en/research/ldos-comoda-dataset/
2 https://cran.r-project.org/web/packages/contextual/vignettes/
3 https://github.com/irecsys/CARSKit/blob/master/context-aware data sets/

5.2 Evaluation Methodology

Predictive performance is evaluated in terms of mean absolute error (MAE) and
root mean square error (RMSE). Note that even a small improvement in MAE
or RMSE can have a significant effect on the quality of the top few recommendations.
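For reference, the two metrics can be computed as in the short NumPy sketch below (a generic implementation, not tied to any particular library used by the authors).

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between actual and predicted ratings."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    """Root mean square error between actual and predicted ratings."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```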
The empirical analysis compares the performance of our approach against
several link prediction algorithms from literature:

– GraphRec^{uu}_{uv} [3] models user-user, user-item, and item-user relations for link
prediction in a bipartite graph. This algorithm leverages the user's rating on items
as well as the user's social relationships for item recommendation.
– GCMC [2] models user’s opinion leveraging the rating matrix between users
and items for matrix completion task.
– GCMC+feat [2] extended GCM C by integrating static features inside the
user and item nodes for link prediction in a bipartite graph.
– SocialMF [5] uses a matrix factorization approach to recommendations
incorporating trust information on the social side.
– SVD++ [19] is a collaborative filtering algorithm based on matrix factor-
ization approach, which is based on implicit feedback of items without con-
sidering explicit rating opinion of users.
– PMF [11] is a matrix factorization approach for sparse datasets. It uses rating
information to capture collaborative signals between users and items.
– BiasedMF [7] is an improvement to traditional matrix factorization and it
incorporates bias for user, item, and global bias factors.

We have implemented our proposed algorithms (cGCMC and cGCMC_F) and
all baseline algorithms in PyTorch4. For each dataset, we have used 60% of the data as
training set, 20% as hold-out validation set and 20% as hold-out test set (ran-
domly sampled). We have tested different configurations of cGCMC by varying
the embedding size for user opinion representation (do ∈ [600, 500, 400, 300]), for
user static feature representation (df ∈ [5, 10, 15, 20, 25]), and for user contextual
representation (dc1 ∈ [50, 100, 150, 200, 250] for GCN and dc2 ∈ [5, 10, 15, 20, 25]
for the dense). Batch size is chosen in [40, 80, 120, 150, 200]. The last layer of
the encoder is set to produce embeddings of size 75. Node dropout (Pdrop ) is
defined to be the probability to randomly drop all outgoing messages from spe-
cific nodes to train under denoising setup, where Pdrop ∈ [0.3, 0.4, 0.5, 0.6, 0.7].
The importance factor α ∈ [0.2, 0.3, 0.5, 0.7, 0.8] ∀ r ∈ R, is initially defined
with random values, considering the fact: α[r1 ] < α[r2 ] ⇐⇒ r1 < r2 . Any
set of initial values can be used if it satisfies the following fact: the context in
which the user gives more rating should have more weight. All neurons use ReLU
nonlinearity and Adam is employed as the optimization algorithm. The model is
trained for 200 epochs. The best value for each hyper-parameter is shown in bold
in Table 1. For baseline algorithms, all parameters are initialized as mentioned
in the corresponding papers.

4
https://github.com/asmaAdil/cGCMC.

Table 1. Test-set performance comparison with state-of-the-art algorithms. Best results are marked in bold.

Algorithm            LDOS-CoMoDa                   DePaul                        Travel-STS
                     MAE            RMSE           MAE           RMSE            MAE            RMSE
cGCMC                0.938 ± 0.01   1.15 ± 0.01    1.04 ± 0.01   1.21 ± 0.01     0.96 ± 0.02    1.17 ± 0.02
cGCMC_F              0.918 ± 0.01   1.127 ± 0.01   NA            NA              0.932 ± 0.02   1.14 ± 0.02
GCMC                 1.12 ± 0.01    1.33 ± 0.01    1.18 ± 0.00   1.42 ± 0.00     1.09 ± 0.02    1.32 ± 0.02
GCMC+feat            1.001 ± 0.01   1.24 ± 0.01    NA            NA              0.95 ± 0.01    1.23 ± 0.01
SocialMF             0.96 ± 0.01    1.28 ± 0.02    1.06 ± 0.01   1.29 ± 0.01     1.12 ± 0.01    1.46 ± 0.01
SVD++                1.10 ± 0.01    1.45 ± 0.01    1.17 ± 0.02   1.40 ± 0.01     1.20 ± 0.01    1.36 ± 0.02
PMF                  1.38 ± 0.00    1.75 ± 0.00    1.19 ± 0.01   1.44 ± 0.01     1.14 ± 0.00    1.49 ± 0.00
BiasedMF             1.46 ± 0.02    1.78 ± 0.02    1.20 ± 0.02   1.46 ± 0.02     1.13 ± 0.01    1.45 ± 0.01
GraphRec^{uu}_{uv}   1.16 ± 0.02    1.32 ± 0.02    1.25 ± 0.03   1.45 ± 0.03     1.20 ± 0.02    1.36 ± 0.02

6 Performance Comparison

Table 1 compares the performance of our approach with other state-of-the-art
algorithms (NA stands for not applicable). The basic matrix factorization approaches
PMF and BiasedMF perform worst on almost every dataset. The reason is
that both approaches rely only on rating information on edges while mapping
interactions, without considering any side context or content information. The
SocialMF, SVD++ and GraphRec^{uu}_{uv} approaches capture such side information
in terms of social trust information or by using implicit feedback. While
they show improved performance compared to the matrix factorization approaches,
they are clearly below our methods because of the advantageous effect of context
learning. When comparing our algorithm with the GNN-based approaches (GCMC
and GCMC+feat), we can note a significant improvement in performance motivated
by the capability of providing context-aware recommendations. Overall,
our method outperforms all baselines on all datasets, providing sufficient grounding
to state the importance of being able to take dynamic context into consideration
to provide accurate recommendations.

Fig. 3. Effect of importance factor α on cGCMC in terms of RMSE.



Impact of Context Modeling. Our proposed approach uses the importance
factor α to learn which context is more important for the target user and item
when predicting the link between them. We hence perform an ablation study to
demonstrate the effectiveness of α. As already noted, context importance varies
from person to person. Figure 3 demonstrates the positive effect of incorporating
this importance factor in our model, which is clearly due to prioritizing (giving
more weight to) those context attributes that are important for users and items.

7 Conclusion

Throughout this paper, we highlighted the importance of modelling and processing
context information on the edges of bipartite graphs representing user-item
interactions in recommendation systems. To this end, we have proposed a novel
approach based on a graph convolutional autoencoder for matrix completion
tasks. Our graph encoder leverages context information by producing context-
aware user and item representations. Furthermore, the bilinear decoder predicts
the labelled edges between the user and item. We discussed mechanisms to sin-
gle out which context is more important for a particular user and item pair. An
empirical comparison between the proposed model and state-of-the-art works
showed a significant performance improvement of our context-aware approach.
In this work, a cumulative approach is used to represent the context infor-
mation between users and items. As a future work, we plan to explore the use of
separate embeddings for user and item contexts. We also plan to allow multiway
interactions between user and items, to capture more realistically the dynamic
behaviour of bipartite graphs.

References
1. Bacciu, D., Errica, F., Micheli, A., Podda, M.: A gentle introduction to deep
learning for graphs. Neural Netw. 129, 203–221 (2020). https://doi.org/10.1016/
j.neunet.2020.06.006
2. Berg, R.V.D., Kipf, T.N., Welling, M.: Graph convolutional matrix completion.
arXiv preprint arXiv:1706.02263 (2017)
3. Fan, W., et al.: Graph neural networks for social recommendation. In: The World
Wide Web Conference, pp. 417–426 (2019)
4. He, X., Chua, T.S.: Neural factorization machines for sparse predictive analytics.
In: Proceedings of the 40th International ACM SIGIR Conference on Research and
Development in Information Retrieval, pp. 355–364 (2017)
5. Jamali, M., Ester, M.: A matrix factorization technique with trust propagation for
recommendation in social networks. In: Proceedings of the Fourth ACM Conference
on Recommender Systems, pp. 135–142 (2010)
6. Karatzoglou, A., Amatriain, X., Baltrunas, L., Oliver, N.: Multiverse recommen-
dation: n-dimensional tensor factorization for context-aware collaborative filtering.
In: Proceedings of the Fourth ACM Conference on Recommender Systems, pp. 79–
86 (2010)

7. Koren, Y., Bell, R., Volinsky, C.: Matrix factorization techniques for recommender
systems. Computer 42(8), 30–37 (2009)
8. Yin, C., Shi, L., Sun, R., Wang, J.: Improved collaborative filtering recommen-
dation algorithm based on differential privacy protection. J. Supercomput. 76(7),
5161–5174 (2020)
9. Lian, J., Zhou, X., Zhang, F., Chen, Z., Xie, X., Sun, G.: xDeepFM: combining
explicit and implicit feature interactions for recommender systems. In: Proceedings
of the 24th ACM SIGKDD International Conference on Knowledge Discovery &
Data Mining, pp. 1754–1763 (2018)
10. Liu, H., Zhang, H., Hui, K., He, H.: Overview of context-aware recommender
system research. In: 3rd International Conference on Mechatronics, Robotics and
Automation. Atlantis Press (2015)
11. Mnih, A., Salakhutdinov, R.R.: Probabilistic matrix factorization. In: Advances in
Neural Information Processing Systems, pp. 1257–1264 (2008)
12. Shi, Y., Larson, M., Hanjalic, A.: Collaborative filtering beyond the user-item
matrix: a survey of the state of the art and future challenges. ACM Comput. Surv.
(CSUR) 47(1), 1–45 (2014)
13. Takács, G., Pilászy, I., Németh, B., Tikk, D.: Scalable collaborative filtering
approaches for large recommender systems. J. Mach. Learn. Res. 10, 623–656
(2009)
14. Unger, M., Bar, A., Shapira, B., Rokach, L.: Towards latent context-aware recom-
mendation systems. Knowl.-Based Syst. 104, 165–178 (2016)
15. Wang, X., He, X., Wang, M., Feng, F., Chua, T.S.: Neural graph collaborative
filtering. In: Proceedings of the 42nd International ACM SIGIR Conference on
Research and Development in Information Retrieval, pp. 165–174 (2019)
16. Wang, X., Jin, H., Zhang, A., He, X., Xu, T., Chua, T.S.: Disentangled graph
collaborative filtering. In: Proceedings of the 43rd International ACM SIGIR Con-
ference on Research and Development in Information Retrieval, pp. 1001–1010
(2020)
17. Wu, S., Zhang, W., Sun, F., Cui, B.: Graph neural networks in recommender
systems: a survey. arXiv preprint arXiv:2011.02260 (2020)
18. Wu, Y., Liu, H., Yang, Y.: Graph convolutional matrix completion for bipartite
edge prediction. In: KDIR, pp. 49–58 (2018)
19. Xian, Z., Li, Q., Li, G., Li, L.: New collaborative filtering algorithms based on
SVD++ and differential privacy. Math. Probl. Eng. 2017, 1–14 (2017)
20. Xin, X., Chen, B., He, X., Wang, D., Ding, Y., Jose, J.: CFM: convolutional factor-
ization machines for context-aware recommendation. IJCAI 19, 3926–3932 (2019)
21. Yin, R., Li, K., Zhang, G., Lu, J.: A deeper graph neural network for recommender
systems. Knowl.-Based Syst. 185, 105020 (2019)
22. Zarzour, H., Maazouzi, F., Soltani, M., Chemam, C.: An improved collaborative
filtering recommendation algorithm for big data. In: Amine, A., Mouhoub, M.,
Ait Mohamed, O., Djebbar, B. (eds.) CIIA 2018. IAICT, vol. 522, pp. 660–668.
Springer, Cham (2018). https://doi.org/10.1007/978-3-319-89743-1_56
23. Zhang, J., Shi, X., Zhao, S., King, I.: Star-GCN: stacked and reconstructed graph
convolutional networks for recommender systems. In: The 28th International Joint
Conference on Artificial Intelligence, pp. 4264–4270 (2019)
24. Zhang, M., Chen, Y.: Inductive matrix completion based on graph neural networks.
In: International Conference on Learning Representations (2020)
25. Zheng, L., Lu, C.T., Jiang, F., Zhang, J., Yu, P.S.: Spectral collaborative filtering.
In: Proceedings of the 12th ACM Conference on Recommender Systems, pp. 311–
319 (2018)
Development and Implementation of a Neural
Network-Based Abnormal State Prediction
System for a Piston Pump

Mauricio Andrés Gómez Zuluaga1(B) , Ahmad Ordikhani2 , Christoph Bauer2 ,


and Peter Glösekötter3
1 University of Applied Science Münster, Stegerwaldstr. 39, 48565 Steinfurt, Germany
[email protected]
2 Timmer GmbH, Dieselstraße 37, 48485 Neuenkirchen, Germany
3 University of Applied Science Münster, Stegerwaldstr. 39, 48565 Steinfurt, Germany

[email protected]

Abstract. Piston pumps play a key role in factory automation and their avail-
ability is very critical for the smooth running of production processes. Modern
installations, such as production plants and machines, are becoming increasingly
complex. Therefore, the probability of a complete system failure due to a single
critical component also increases. Maintenance processes with intelligent devices
are therefore very important to achieve maximum economic efficiency and safety.
Periodic or continuous monitoring of system components provides key informa-
tion about the current physical state of the system, enabling early detection of
emerging failures. Knowledge of future failures makes it possible to move from
the concept of preventive maintenance to intelligent predictive maintenance. In
this way, consequential damage and complete system failure can be avoided, max-
imizing system availability and safety. This paper reflects the development and
implementation of a neural network system for abnormal state prediction of piston
pumps. After a short introduction into piston pumps and their potential abnormal
states, statistical and periodical analysis are presented. Then the design and imple-
mentation of suitable neural networks are discussed. Finally, a conclusion is drawn
and the observed accuracies as well as potential next steps are discussed.

Keywords: Artificial intelligence · Internet of things · Computer science ·


Piston pump

1 Introduction
This paper shows the implementation of an abnormal state detection system based on
neural networks for a piston pump. For this implementation, several analyses of the
available information were performed, so that this information was expressed in fea-
tures that would allow the development and conception of a model that could identify
the possible abnormal states of the piston pump. With the help of a small smart box,
periodical data of the operating pressure and the angular speed under which the piston


pump operates can be obtained at a frequency of 100 Hz, which somewhat limits the
information available. Not only does the limited information suppose a relative problem
for the implementation of the system, but the fact that there are only two sources of
information (pressure and angular speed) increases the difficulty even more. These two
limitations make it necessary to analyze the available information with the greatest depth
to create a correct characterization of the possible states that the piston pump may have.

2 Piston Pump
A variety of piston pump types are available on the market, fitting different applications.
Piston pumps essentially consist of two units: the drive unit and the pump unit. The
drive unit consists of a 3-phase AC motor whose speed is controlled by a variable-frequency
drive (also known as an adjustable-frequency drive). The control follows the V/f
characteristic, so that the torque is kept constant. Behind the motor there is a gearbox
with a fixed reduction, which reduces the speed and increases the torque. The second unit
is the pump unit. This unit has two pistons, which oscillate alternately up and down,
thus conveying the medium to be pumped. The pistons are pushed up by the power of the
motor and are guided back down by springs. Valves above the piston chamber are open
when the fluid flows in the direction of flow (piston moves upwards) and are closed when
it flows back (piston moves downwards). Additional valves in the intake area work exactly
the opposite way and only let the medium pass through when it flows into the pump.
Figure 1 shows an exemplary piston pump from Timmer GmbH [2].
The piston pumps were developed especially for larger paint supply systems. In
addition to a small, compact design that allows the pumps to be integrated without
much effort, especially in retrofit projects, the pumps of the series also offer many other
advantages.

Fig. 1. Piston pump [5]



This piston pump presents a pressure profile with a stable and easily identifiable
behavior, which is proportional to the frequency and nominal pressure with which the
pump operates, see Fig. 2. The pressure profile shows a clearly periodic behavior, as
well as noise associated with the piston change in each cycle and with the precision of
the analog-to-digital converter integrated into the intelligent system that performs the
pump measurements. In the diagram it can be seen that each period corresponds to one
of the pistons of the pump.

Fig. 2. Pressure profile for a piston pump under normal operating conditions.

2.1 Abnormal States of the Piston Pump

In the presence of anomalous operating conditions, either due to configuration errors


or external errors, the pressure profile presented in Fig. 2 shows noticeable changes in
its behavior. This piston pump most commonly shows four abnormal operating states,
which are shown below (Figs. 3, 4, 5 and 6).
Like the pump pressure profile under normal operating parameters, these four cases
have the same basic characteristics, which allows the use of different mathematical tools
to characterize their behavior.

Fig. 3. Wrong direction of rotation of the pistons.

Fig. 4. One piston jam.



Fig. 5. Air volume in system.

Fig. 6. Defective seal.



3 Features
3.1 Statistical Analysis

From the collected patterns, several analyses were carried out, which allowed us to
identify behaviors and particularities of each of the operating states that the piston
pump may present. Among the tools used were statistical, stochastic, mathematical and
frequency analysis tools.
Figure 7 shows a box plot with different samples of the patterns collected for each of the
states of the piston pump. As an example, the diagram shows three samples for each of
the operating states at a nominal pressure of 15 bar and different angular velocities. It is
possible to observe trends among the samples for each state, as well as possible problems
for their characterization.

Fig. 7. Boxplot of the States.

• Standard operation → 1
• Wrong direction of rotation → 2
• One piston jam → 3
• Air volume in system → 4
• Defective seal → 5

With this simple analysis it is possible to find trends among the patterns of each
state, which yields a first set of features from which to build a model for classifying
the patterns.

• Interquartile range (IQR): the difference between the upper and lower quartile:

$IQR = Q_3 - Q_1 = q_n(0.75) - q_n(0.25)$   (1)

• IQR proportion: the ratio of the distance between the upper quartile and the median to the distance between the median and the lower quartile:

$\mathrm{proportion}_{IQR} = \frac{Q_3 - \mathrm{median}}{\mathrm{median} - Q_1} = \frac{q_n(0.75) - \mathrm{median}}{\mathrm{median} - q_n(0.25)}$   (2)

• Outliers: values outside 1.5 times the upper and lower quartiles (see the sketch after this list).
By combining these characteristics with those obtained from a basic statistical analysis
(mean, standard deviation, etc.), it is possible to construct a simple neural network that
classifies each of the states with a good degree of accuracy. However, due to the unstable
behavior of two of the states when the pump operates at low pressures, these
characteristics are insufficient to describe the model, which requires extending this set
of characteristics using periodic and frequency analysis.

Fig. 8. Abnormal states for a pressure of 15 bar and 2 bar.

Figure 8 shows different tests taken from the possible operating states that the
piston pump may present. These tests consist of two groups, with constant pressures of
15 bar and 2 bar and angular speed values that vary between 10 rpm and 60 rpm. Each
of the diagrams shows the distribution of the previously presented features in terms of
the coefficient of variation.

3.2 Periodic Analysis


By performing a periodic analysis of each of the whole periods of the pressure profile, it is
possible to increase the number of characteristics. Ideally, each of the periods should not
show variations in any of the characteristics already presented (mean value, standard
deviation, etc.), but because operating conditions are not ideal, appreciable variations
appear, see Fig. 2. The new features obtained through the periodic analysis are:

• Period factor: the ratio between the period obtained by the sensor integrated in the
system and the period calculated analytically by an autocorrelation-based algorithm
applied to the pressure profile:

$T_{factor} = \frac{T_{sensor}}{T_{calculated}}$   (3)
• Standard deviation of the basic statistical features: each of the basic features is
determined for each of the whole periods of the pressure profile, and the standard
deviation of each of these features across periods is calculated.
• Mean value factor: assuming that the mean value is constant within each period, it
is divided by the calculated period:

$\mu_\tau = \frac{\mu}{\tau}$   (4)
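The following is a hedged sketch of how the analytical period in Eq. (3) could be estimated from the autocorrelation of a pressure window at the 100 Hz sampling rate; picking the first local maximum after lag zero is an assumption, not necessarily the exact algorithm used.

```python
import numpy as np

def estimate_period(pressure: np.ndarray, fs: float = 100.0) -> float:
    """Estimate the period (in seconds) of one pressure window via autocorrelation."""
    x = pressure - pressure.mean()
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]        # one-sided autocorrelation
    for lag in range(1, len(acf) - 1):                        # first peak after lag 0
        if acf[lag - 1] < acf[lag] >= acf[lag + 1]:
            return lag / fs
    return float("nan")

# period factor of Eq. (3): T_factor = T_sensor / estimate_period(window)
```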

Fig. 9. Periodic analysis for abnormal states.

Figure 9 shows a new distribution of the tests according to the coefficient of
variation. In this case, a better grouping of the data can be observed for the cases of air
volume in the system and defective seal. It should be noted that the distributions shown
are given only in terms of the coefficient of variation, whereas the neural network
takes all possible distributions into consideration for the correct grouping of the states, so
it is necessary to obtain linearly independent characteristics for the conceptualization of
the model.

4 Design and Implementation of the Neural Network


During the implementation phase, different neural network topologies as well as
different activation functions were tested; in the end, a recurrent network with a sigmoid
activation function proved most suitable for the problem, since the aim was to output a
relative probability for determining the state of the pump.
Figure 10 shows the accuracy of the first implemented model as a function of the number
of hidden layers of the recurrent structure. This model reaches an accuracy between 90%
and 95% for relatively small learning rates and a small number of epochs until the error
function converges (the error function used to adjust the weights of this model was the
least-squares error). However, although the model is stable for such learning rates, it is
quite inflexible when it comes to fitting or training, since small learning rates imply a
small effect of new training patterns on the model, which becomes a problem later on.
One solution is to increase the number of training epochs in proportion to the learning
rate, but this does not guarantee a better model, since a high or excessive number of
epochs requires more computational power and large numbers of training patterns.

Fig. 10. Accuracy for the model for different number of hidden layers.

As already mentioned, this model meets the expected requirements by being able to
determine the operating state with acceptable accuracy by means of a pressure profile.
However, in the case of two of the five possible operating states of the piston pump,
the model loses accuracy since these states are hardly recognizable at low pressures.
The first model was the product of a continuous and periodic analysis of the states,
which grew as the patterns were collected, so its eventual complexity was not estimated
at the outset. From this model, a second model was implemented using the open-source
Python library PyTorch. This second model is largely equivalent to the first one, with
some differences in its structure, the most relevant being the cross-entropy loss function
used to adjust the weights and the topology of the neural network.
The second model shows higher stability and accuracy than the first one, even for an extremely simple configuration in which only 5 neurons are required to detect the 5 possible operating states. By progressively increasing the number of neurons it is possible to achieve remarkable improvements of the model, owing to the changes made with respect to the first model and to the PyTorch library, which allows much faster computations and thus facilitates the search for appropriate model parameters. It should be clarified that increasing the number of neurons disproportionately is not recommended, since the higher the number of neurons, the more difficult it becomes for the model to find a stable state of the weights, as can be seen in the figure below (Fig. 11).

Fig. 11. Accuracy of the second model for different numbers of neurons.

5 Conclusion

The results presented in this paper give a general description of the implementation and conceptualization of a model for determining anomalous operating states of a piston pump. The first model largely fulfilled the requirements; its accuracy of between 90% and 95% made it possible to further analyze the characteristics obtained from the analyzed pressure profiles and to expand their number using other tools for the analysis of the information. This first model also allowed the analysis of different topologies and connections of basic neural networks, as well as different functions and algorithms for the adjustment of the network weights (such as the least squares function). In the first tests performed with the recurrent model, the following aspects were identified:
• Complex topologies do not necessarily improve the behavior of the model; for example, a topology with different numbers of neurons in each hidden layer, or a convolutional network, did not show an improvement of the model.
• Searching empirically for features to feed the network is not a good idea: based on the results obtained, several of these features showed a strong linear dependence on other features, which did not improve the model but only caused longer computation times.
• Designing a neural network with a very small learning rate is a great disadvantage, since decreasing it increases the number of patterns needed to train the system and at the same time increases the computation time (for the first model, the time required to train the neural network ranged from approximately 5 to 40 min).

The implementation of the second model using tools already available in the PyTorch library made it possible to reduce the complexity of the first model and to concentrate the efforts not on improving the neural network structure or the number of features fed into the network, but on using more efficiently the algorithms that adjust the weights to minimize the error function.
The results obtained from the implementation of this system are the basis for future implementations on other similar systems, such as diaphragm pumps. Not only are new implementations of the model described in this paper foreseen, but also improvements of this model using more complex tools that help to improve its behavior, as well as its dynamism and learning capacity in the face of new, unknown states that a piston pump may present.

References
1. Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach. Pearson, London (2020)
2. Ordikhani Seyedlar, A.: Development and implementation of an intelligent real-time data analyzer, event detector and a web-based IoT monitoring system for piston pumps. Steinfurt (2018)
3. Maladkar, K.: 6 Types of Artificial Neural Networks Currently Being Used in Machine Learning. Analytics India Magazine, 15 Jan 2018. https://analyticsindiamag.com/6-types-of-artificial-neural-networks-currently-being-used-in-todays-technology/. Accessed 9 Oct 2020
4. PyTorch: https://pytorch.org/docs/stable/index.html. Accessed 5 Oct 2020
5. Timmer GmbH: Flüssigkeitspumpen (2018). https://timmer-pumps.com/kolbenpumpen-elektrisch/
6. Maladkar, K.: 6 Types of Artificial Neural Networks Currently Being Used in Machine Learning. Analytics India Magazine, 15 Jan 2018. https://analyticsindiamag.com/6-types-of-artificial-neural-networks-currently-being-used-in-todays-technology/. Accessed 8 Oct 2020
7. missinglink.ai: 7 Types of Neural Network Activation Functions: How to Choose? https://missinglink.ai/guides/neural-network-concepts/7-types-neural-network-activation-functions-right/. Accessed 26 Oct 2020
8. CS231n: Convolutional Neural Networks for Visual Recognition (2020). https://cs231n.github.io/neural-networks-1/. Accessed 27 Oct 2020
9. missinglink.ai: Neural Network Bias: Bias Neuron, Overfitting and Underfitting (2020). https://missinglink.ai/guides/neural-network-concepts/neural-network-bias-bias-neuron-overfitting-underfitting/. Accessed 27 Oct 2020
10. Jain, A.K., Mao, J.: Artificial neural networks: a tutorial. Computer 29, 31–44 (1996)
11. Zhang, A., Lipton, Z.C., Li, M., Smola, A.J.: Dive into Deep Learning (2020). https://d2l.ai/chapter_multilayer-perceptrons/backprop.html. Accessed 28 Oct 2020
12. Brownlee, J.: A Gentle Introduction to Cross-Entropy for Machine Learning. Machine Learning Mastery (2020). https://machinelearningmastery.com/cross-entropy-for-machine-learning/. Accessed 07 Dec 2020
13. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint. https://arxiv.org/abs/1412.6980. Accessed 08 Dec 2020
14. Brownlee, J.: How to Calculate the KL Divergence for Machine Learning. Machine Learning Mastery (2020). https://machinelearningmastery.com/divergence-between-probability-distributions/. Accessed 08 Dec 2020
Iterative Adaptation to Quantization
Noise

Dmitry Chudakov1,2(B) , Sergey Alyamkin1,2(B) , Alexander Goncharenko1,2(B) ,


and Andrey Denisov1,2
1 Novosibirsk State University, Novosibirsk, Russia
2 Expasoft LLC, Novosibirsk, Russia
{d.chudakov,a.goncharenko,a.denisov}@expasoft.tech, [email protected]
https://expasoft.com/

Abstract. Quantization allows accelerating neural networks significantly, especially on mobile processors. Existing quantization methods either require training the neural network from scratch or cause a significant accuracy drop for the quantized model. Low-bit quantization (e.g., 4- or 6-bit) is a much more resource-consuming problem than 8-bit quantization and requires a significant amount of labeled training data. We propose a new low-bit quantization method for mobile neural network architectures that does not require training from scratch or a large amount of labeled training data and avoids a significant accuracy drop.

Keywords: Neural networks · Quantization · Distillation · Machine learning

1 Introduction
Neural networks are widely used in modern audio and video software solutions, including image classification, object detection, and segmentation. Neural networks are extremely demanding on computing resources, not only at the training stage but also at the inference stage. In order to use neural networks in large-scale software solutions, it is necessary to apply different approaches for network compression and latency reduction. Post-training quantization is one of the major approaches for improving neural network speed and reducing size. The main advantages of quantization over other methods of neural network acceleration are:
1. Low-bit integer calculations are much faster than analogous calculations in single-precision floating-point numbers.
2. Quantization significantly reduces the amount of memory needed to store the weights of the neural network model. In the case of 8-bit quantization, the size of the neural network is reduced by 4 times in comparison with a 32-bit representation.
3. The low-bit representation of the neural network weights allows more data to fit in the CPU cache, which reduces the frequency of accesses to main memory. This leads to lower power consumption and faster inference.
4. The implementation of floating-point calculations at the hardware level is more difficult than the implementation of integer calculations, and therefore low-cost microcontrollers often support only integer calculations.

2 Related Work
2.1 Quantization Scheme
First of all, let us consider the quantization scheme proposed in [1]. The quanti-
zation scheme describes the process of translation of the original neural network
into a quantized one. This scheme uses asymmetric linear quantization. That
is, the conversion of quantized integer values of weights and activations into
floating-point numbers is affine:

r = S(q − Z) (1)

where r is the original floating-point number, S and Z are the scaling and shift coefficients, and q is the quantized integer. S is represented as a real number and Z as an integer, which allows the quantized zero to be matched exactly to the real one. In modern neural network computation methods, the convolution procedure is represented as a matrix multiplication, using a vector representation of the image. Thus, in quantized form, the convolution operation can be represented as:
S_3 \left( q_3^{(i,k)} - Z_3 \right) = \sum_{j=1}^{N} S_1 \left( q_1^{(i,j)} - Z_1 \right) S_2 \left( q_2^{(j,k)} - Z_2 \right)    (2)

where i, j and k are the row and column indices in the matrix representation of the linear convolution operator, and Z_l, S_l, q_l are the shift factor, scaling factor and quantized value. The lower index l = 1, 2, 3 corresponds to the first multiplier, the second multiplier, and the result of the convolution operation, respectively. By opening the brackets, it is easy to obtain:
q_3^{(i,k)} = Z_3 + M \sum_{j=1}^{N} \left( q_1^{(i,j)} - Z_1 \right) \left( q_2^{(j,k)} - Z_2 \right)    (3)

M := \frac{S_1 S_2}{S_3}    (4)

M = 2^{-n} M_0    (5)
where M is the re-quantization coefficient. The authors of the scheme found experimentally that this coefficient lies in the range [0, 1], which means that it can be represented as a 32-bit integer M_0 together with a fixed shift n; this type of calculation can be implemented efficiently on the processor. After the multiplication of 8-bit numbers, the results are accumulated in a 32-bit container, so the subsequent multiplications by the re-quantization coefficient and the shift additions are done with 32-bit integers.
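For illustration, the following NumPy sketch implements this asymmetric affine scheme with naive minimax thresholds; the function name and the choice of 4 bits are illustrative assumptions and not the implementation of [1].

import numpy as np

def quantize_dequantize(r, n_bits=8):
    # Asymmetric affine quantization r = S(q - Z) with minimax thresholds.
    a, b = r.min(), r.max()                       # naive minimax thresholds
    n = 2 ** n_bits                               # number of quantization levels
    S = (b - a) / (n - 1)                         # scaling coefficient
    Z = np.round(-a / S)                          # shift, so that real zero maps to an integer
    q = np.clip(np.round(r / S + Z), 0, n - 1)    # quantized integers
    r_hat = S * (q - Z)                           # dequantized (discretized) real values
    return q.astype(np.int32), r_hat

weights = np.random.randn(1000).astype(np.float32)
q, w_hat = quantize_dequantize(weights, n_bits=4)

# Empirical quantization noise, the quantity QN discussed in Sect. 3.
qn = np.mean((weights - w_hat) ** 2)
print(f"4-bit quantization noise: {qn:.6f}")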
2.2 Calculation of Quantization Thresholds

The described quantization scheme does not fix the way of obtaining the scaling and shift coefficients. The correct selection of those coefficients is the key to preserving the initial neural network quality. The naive minimax approach suggests taking the maximum and minimum values of the weights and activations as thresholds.

clamp(r; a, b) := min(max(r, a), b)    (6)

s(a, b, n) := \frac{b - a}{n - 1}    (7)

q(r; a, b, n) := \left\lfloor \frac{\mathrm{clamp}(r; a, b) - a}{s(a, b, n)} \right\rceil    (8)
where n is the number of quantization intervals (2^b for a bit width of b), a = min(W) is the left quantization threshold, b = max(W) is the right quantization threshold, W are the weights of the neural network, and r is the original real number. The naive approach does not take into account the probabilistic nature of the weights and possible outliers, which may cause a significant drop in accuracy. In order to reduce the effect of outliers in a single filter of a convolutional layer on the entire layer, it was proposed [10] to use vector quantization for the weights.
Nvidia's TensorRT library [12] proposes a time-consuming iterative method to find suitable quantization thresholds based on the Kullback-Leibler divergence [7], which requires collecting activation statistics for all layers. The Intel Artificial Intelligence Products Group [3] proposed an analytical method for selecting quantization thresholds by assuming that the weights follow a Laplace distribution and minimizing the discretization error of the weights, Eqs. (9)-(12). Here X is the original distribution of weights, Q(X) is the discretized distribution of weights, α is the quantization threshold, Δ is the interval size, q_i are the discretized weight values, and M is the number of bits (giving 2^M intervals):
QN_1 = \int_{-\infty}^{-\alpha} f(x) \cdot (x + \alpha)^2 \, dx    (9)

QN_2 = \sum_{i=0}^{2^M - 1} \int_{-\alpha + i\Delta}^{-\alpha + (i+1)\Delta} f(x) \cdot (x - q_i)^2 \, dx    (10)

QN_3 = \int_{\alpha}^{\infty} f(x) \cdot (x - \alpha)^2 \, dx    (11)

E\left[ (X - Q(X))^2 \right] = QN_1 + QN_2 + QN_3    (12)
A method was proposed in [2] to adjust the quantization thresholds during the training procedure with gradient descent, approximating the gradient through the non-differentiable rounding function with a constant equal to 1.
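A minimal PyTorch sketch of this straight-through trick with trainable thresholds is given below; it only illustrates the idea and is not the FAT implementation itself (the initial thresholds, bit width and optimizer settings are assumptions).

import torch
import torch.nn as nn

def round_ste(x: torch.Tensor) -> torch.Tensor:
    # Straight-through estimator: the forward pass rounds, while the backward
    # pass treats the gradient of the rounding as a constant equal to 1.
    return x + (torch.round(x) - x).detach()

class FakeQuant(nn.Module):
    # Simulated quantization node with trainable thresholds a and b.
    def __init__(self, a_init: float, b_init: float, n_bits: int = 4):
        super().__init__()
        self.a = nn.Parameter(torch.tensor(float(a_init)))
        self.b = nn.Parameter(torch.tensor(float(b_init)))
        self.levels = 2 ** n_bits - 1

    def forward(self, r: torch.Tensor) -> torch.Tensor:
        s = (self.b - self.a) / self.levels                    # scale derived from the thresholds
        r_clamped = torch.minimum(torch.maximum(r, self.a), self.b)
        q = round_ste((r_clamped - self.a) / s)
        return self.a + s * q                                  # differentiable w.r.t. a and b

x = torch.randn(1000)
fq = FakeQuant(a_init=x.min().item(), b_init=x.max().item())
opt = torch.optim.Adam(fq.parameters(), lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = torch.mean((fq(x) - x) ** 2)    # quantization noise used as the training signal
    loss.backward()
    opt.step()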
2.3 Quantization Aware Training


The scheme proposed in [1] also includes a way of training a neural network with quantization taken into account, using a differentiable real-valued approximation of the quantized network. Weights and activations are stored as single-precision floating-point numbers, and all calculations are performed in single precision. Special nodes are placed before each operation to simulate the quantized model by discretizing the weights. These nodes perform quantization and dequantization of the received input, thus obtaining discretized real numbers in the same value range and approximating the noise introduced by quantization. The gradients of the rounding operations are approximated by a constant equal to one. The advantage of this method is its high accuracy compared to quantization using the minimax approach. The disadvantage of this approach is that it leads to an increase in training time.

3 Method Description
The proposed neural network quantization method improves the FAT [2] method.
As a result of the threshold adjusting training process, the accuracy increases
because of quantization noise reduction.

QN = E\left[ (X - Q(X))^2 \right]    (13)

Here we call QN the quantization noise, which is the expectation of the squared deviation of the original distribution X (weights or activations) from its discretized value Q(X). Since we have a finite and relatively small number of intervals, it is not possible to reduce this noise to zero. Choosing a small number of intervals (e.g., 6-bit or 4-bit quantization) results in a significant increase in noise. Also, relatively small noise can strongly influence quantization accuracy in the case of small neural networks (e.g., EfficientNet [5]). The proposed algorithm, in addition to reducing quantization noise, adapts the neural network to deal with the resulting noise. This is achieved through iterative adaptation to the quantization noise. The algorithm consists of four stages (a simplified sketch is given after the list):

1. The algorithm iteratively quantizes the network from the first layer to the last.
2. For the selected layer, the initial quantization thresholds are calculated using the minimax approach.
3. Using the FAT [2] method, the thresholds are optimized for this layer.
4. The quantized part of the neural network is frozen, and the non-quantized part is trained using distillation [11] to minimize the error for the given problem. The distillation approach minimizes the deviation of the output of the "student" neural network from the output of the "teacher" neural network. The advantage of this approach is that the data do not need to be labeled. The distillation process optimizes the Euclidean distance between the outputs of the quantized and original neural networks.
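The following PyTorch sketch illustrates these stages on a toy fully-connected network; a per-layer minimax quantization stands in for the FAT threshold optimization of stage 3 for brevity, and the model, data and hyperparameters are placeholders rather than the actual experimental setup.

import copy
import torch
import torch.nn as nn

def fake_quant(w: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    # Stage 2 (simplified): minimax fake quantization of a weight tensor.
    # Stage 3 (FAT threshold optimization) is omitted in this sketch.
    a, b = w.min(), w.max()
    s = (b - a) / (2 ** n_bits - 1)
    return a + s * torch.round((w - a) / s)

teacher = nn.Sequential(nn.Linear(16, 32), nn.ReLU(),
                        nn.Linear(32, 32), nn.ReLU(),
                        nn.Linear(32, 10))
student = copy.deepcopy(teacher)
unlabeled = torch.randn(512, 16)                 # distillation requires no labels

layers = [m for m in student if isinstance(m, nn.Linear)]
for layer in layers:                             # stage 1: quantize layer by layer
    with torch.no_grad():
        layer.weight.copy_(fake_quant(layer.weight))
    layer.weight.requires_grad_(False)           # stage 4: freeze the quantized part
    layer.bias.requires_grad_(False)

    trainable = [p for p in student.parameters() if p.requires_grad]
    if not trainable:
        break
    opt = torch.optim.Adam(trainable, lr=5e-5)
    for _ in range(100):                         # adapt the remaining float part by distillation
        idx = torch.randint(0, len(unlabeled), (64,))
        x = unlabeled[idx]
        # Euclidean distance between student and teacher outputs.
        loss = torch.norm(student(x) - teacher(x).detach(), dim=1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()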
4 Experiments and Results


To conduct the experiments, we developed our own framework that allows us to accurately simulate a quantized neural network. In this way, we could measure the accuracy of a quantized neural network. The disadvantage of this approach is that we cannot measure the real gain in inference speed or, for example, power consumption. The issue of the real acceleration of neural networks under quantization is studied in detail in [1].

4.1 Measurements on the ImageNet Task

A number of experiments were conducted for different neural network architectures trained for classification on the ImageNet dataset [4]. The following algorithms were compared: naive MinMax, Fast Adjustable Thresholds (FAT), and Iterative Quantization Noise Adaptation (IQNA).

Table 1. Results of quantization by different approaches. The obtained accuracy drop is given in brackets. The quantization parameters are the bit depth and type (vector or scalar).

Architecture Quantization parameters MinMax, % FAT, % IQNA, %


ResNet50 8, Scalar 73.9 (−1.5) 73.8 (−1.6) 72.0 (−3.4)
ResNet50 4, Vector 1.0 (−74.4) 2.0 (−73.4) 66.4 (−9.0)
MobileNetV2 8, Vector 71.4 (−0.5) 71.5 (−0.4) 71.0 (−0.9)
EfficientNet-B0 8, Scalar 4.6 (−72.7) 74.8 (−2.2) 75.9 (−1.1)
EfficientNet-B0 8, Vector 72.0 (−5.0) 76.5 (−0.5) 76.0 (−1.0)
EfficientNet-B0 6, Vector 0.1 (−76.9) 1.1 (−75.9) 75.6 (−1.4)
EfficientNet-B0 4, Vector 0.1 (−76.9) 0.3 (−76.7) 62.3 (−14.7)
MobileNetV2 4, Vector 0.2 (−71.7) 0.3 (−71.6) 62.3 (−9.6)

From the results obtained it is clear (Table 1) that the minimax method leads to a significant drop in accuracy, since it completely ignores the probabilistic nature of the weights and activations of the neural network; for scalar quantization the resulting outliers can reduce the accuracy to zero. The FAT method avoids significant accuracy degradation for 8-bit quantization, but for 6- or 4-bit quantization the network accuracy drops to zero. For the iterative adaptation method, an Adam [6] optimizer with a learning rate of 5 × 10^{-5} and parameters β1 = 0.9 and β2 = 0.999 was used in the pre-training phase of the regular network. In this case, the choice of training parameters is not so important; they should be selected based on the parameters used for training a regular non-quantized neural network on the given task. We took the parameters carefully selected in [2]. At each iteration, a random part of the training data
of 100,000 objects was taken and training took place during one epoch. The running time of the algorithm is about 12 h for MobileNetV2 on an Nvidia GeForce 1080 Ti.
The method proved effective for quantization of networks with a low number of bits, when it is not possible to reduce the quantization noise sufficiently. Thus it was possible to quantize the MobileNetV2 [8] and EfficientNet-B0 [5] networks at 4 bits, keeping the accuracy drop below 20%. On the other hand, for neural networks with a large number of parameters and for 8-bit quantization, the iterative adaptation method can give worse results than the basic FAT method, as seen in the example of ResNet-50 [9].

4.2 Quantization-Induced Noise Measurements

From the obtained measurements it is clear that the method of iterative adaptation to quantization noise significantly reduces the noise in comparison with the naive minimax approach and the FAT method (Table 2), in some cases by an order of magnitude. Figure 1 demonstrates how the method of iterative adaptation to quantization noise shrinks the noise peaks into a single central peak at the zero point. Figure 2 shows how the method of iterative adaptation to quantization noise allows us to use a small number of intervals more efficiently and to approximate the original activation distribution more accurately.

Table 2. Quantization-induced noise, measured using the MobileNetV2 neural network with 4-bit vector quantization as an example.

Convolutional layer number MinMax FAT IQNA


3 15.3 27.74 3.76
4 1.5 2.33 0.268
5 0.56 0.7 0.07
6 30.93 20.9 2.29
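Per-layer noise values of the kind reported in Table 2 can be obtained with forward hooks; the sketch below assumes two structurally identical PyTorch models, a floating-point one and its quantized counterpart, and is only meant as an illustration of the measurement.

import torch
import torch.nn as nn

def layer_output_noise(model_fp, model_q, x):
    # Mean squared difference between corresponding convolution outputs of the
    # float and the quantized model, i.e. the per-layer quantization noise.
    outs_fp, outs_q, hooks = [], [], []
    for m_fp, m_q in zip(model_fp.modules(), model_q.modules()):
        if isinstance(m_fp, nn.Conv2d):
            hooks.append(m_fp.register_forward_hook(
                lambda mod, inp, out, store=outs_fp: store.append(out.detach())))
            hooks.append(m_q.register_forward_hook(
                lambda mod, inp, out, store=outs_q: store.append(out.detach())))
    with torch.no_grad():
        model_fp(x)
        model_q(x)
    for h in hooks:
        h.remove()
    return [torch.mean((a - b) ** 2).item() for a, b in zip(outs_fp, outs_q)]

# Toy example: a small convolutional model and a copy standing in for its quantized version.
net = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(), nn.Conv2d(8, 8, 3))
net_q = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(), nn.Conv2d(8, 8, 3))
net_q.load_state_dict(net.state_dict())
print(layer_output_noise(net, net_q, torch.randn(1, 3, 32, 32)))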

Fig. 1. Quantization-induced noise values at the output of the convolutional layers after
applying MinMax, FAT, and IQNA methods for the MobileNetV2 neural network after
4-bit vector quantization.
Fig. 2. Histograms of activations after the 9th convolutional layer in MobileNetV2 after applying different quantization methods (from left to right, from top to bottom: original non-quantized, MinMax, FAT, IQNA).

5 Conclusion

The minimax approach can give acceptable accuracy for modern architectures only when vector quantization is applied. The FAT method allows scalar quantization of modern efficient neural network architectures. Nevertheless, for 8-bit scalar quantization of EfficientNet-B0 the accuracy degradation is significant, while 6-bit quantization leads to an accuracy equal to 0. The proposed method of iterative adaptation to quantization noise proved to be promising for low-bit quantization problems and allowed us to increase the accuracy in cases where the accuracy degradation was significant. Moreover, the method itself is not bound to any particular quantization scheme and can be generalized to other schemes and use another quantization algorithm as a base algorithm, for example, quantization to low-precision floating-point numbers.
References
1. Jacob, B., et al.: Quantization and training of neural networks for efficient integer-
arithmetic-only inference. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 2704–2713 (2018)
2. Goncharenko, A., Denisov, A., Alyamkin, S., Terentev, E.: Fast adjustable thresh-
old for uniform neural network quantization. Int. J. Comput. Inf. Eng. 13(9),
495–499 (2019)
3. Banner, R., Nahshan, Y., Hoffer, E., Soudry, D.: Post-training 4-bit quantization
of convolution networks for rapid-deployment. arXiv preprint arXiv:1810.05723
(2018)
4. Deng, J., et al.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE
Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, June
2009
5. Tan, M., Le, Q.: EfficientNet: rethinking model scaling for convolutional neural net-
works. In: International Conference on Machine Learning, pp. 6105–6114. PMLR,
May 2019
6. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint
arXiv:1412.6980 (2014)
7. Kullback, S.: Information Theory and Statistics. John Wiley and Sons, Inc., New York (1959)
8. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV2:
inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 4510–4520 (2018)
9. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 770–778 (2016)
10. Krishnamoorthi, R.: Quantizing deep convolutional networks for efficient inference:
a whitepaper. arXiv preprint arXiv:1806.08342 (2018)
11. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network.
arXiv preprint arXiv:1503.02531 (2015)
12. Migacz, S.: GPU Technology Conference, pp. 10–30 (2017)
A BERT Based Approach for Arabic POS
Tagging

Rakia Saidi1(B) , Fethi Jarray1,2 , and Mahmud Mansour3


1 LIMTIC Laboratory, UTM University, Tunis, Tunisia
[email protected], [email protected]
2 Higher Institute of Computer Science of Medenine, Tunis, Tunisia
3 University of Tripoli, Tripoli, Libya

Abstract. Large pre-trained language models, such as BERT, have recently achieved state-of-the-art performance in different natural language processing tasks. However, BERT-based models for the Arabic language are less abundant than for other languages. This paper aims to design a grammatical tagging system for texts in the Arabic language using BERT. The main goal is to label an input sentence with the most likely sequence of tags at the output. We also build a large corpus by combining the available corpora, such as the Arabic WordNet and the Quranic Arabic Corpus. The accuracy of the developed system reached 91.69%. Our source code and corpus are available on GitHub upon request.

Keywords: Part of speech · Tagger · Arabic text · BERT · Transformer

1 Introduction

Part-of-speech (POS) tagging is a preprocessing phase for higher-order Natural Language Processing (NLP) tasks that involves automatically assigning syntactic category labels to text tokens. POS tagging is still a hot research topic since it is a core component of many NLP tasks such as named entity recognition and speech recognition. Indeed, the information that a grammatical tag conveys is useful to determine the syntax, or even the semantics, of the treated word. Practically, a POS tagger is a program that assigns each word of a text a tag or a category from a previously predefined set. The complexity of POS tagging strongly depends on the characteristics of the treated language (its morphology). For European languages like English and French, the grammatical tagging task is relatively well mastered. Several taggers have been designed for these languages, achieving good tagging rates that vary from 95% to 98%. However, the accuracy of POS tagging is around 90% for the Arabic language, with a large variability between taggers since each tagger uses a different corpus and a different tagset. For example, the Arabic Stanford Part-Of-Speech Tagger is trained on the Arabic Penn Treebank (ATB) corpus and achieves an accuracy of 96.26% on the test set. However, this
accuracy is not maintained over other test corpora [34]. The main challenge of POS tagging for the Arabic language is not only the derivational and inflectional morphology of Arabic but also the lack of a standard research methodology in which each new work improves on the previous ones. Unfortunately, each research group builds its own corpus and its own method and generally reports good accuracy, but the models do not generalize well to other corpora.
In this paper, we aim to address the issue of normalization of Arabic POS tagging by combining multiple corpora into a single one and training a BERT model on it. We take advantage of the BERT model for its contextual representation of natural language and its ability to model POS.
Our contribution is twofold. First, we create a large Arabic POS corpus by merging the publicly available corpora. Second, we build a BERT-based model for Arabic POS tagging. We evaluate the performance of this approach by i) globally assessing its accuracy on a test set and ii) locally measuring its accuracy on each sub-corpus. The experimental results show that the global accuracy is about 91.69%.
The remainder of this paper is organized as follows. Section 2 presents the related work on Arabic POS tagging. Section 3 describes our POS tagging approach. The experiments are presented in Sect. 4. Finally, Sect. 5 concludes the paper and outlines future research directions.

2 Related Work
Automatic POS tagging is a complex task for the Arabic language because of its agglutinative nature, its inflectional richness, and the absence of vowels in the majority of written Arabic texts. In general, Arabic POS tagging approaches can be classified into five categories: rule-based [2–4], statistical [5–11], traditional machine learning [12–16], metaheuristic [17] and deep learning [18,19]. Concerning the latter approach, Abu-Malloh et al. [18] developed an Arabic POS tagger using an artificial neural network and achieved an accuracy of 89.04% on the testing dataset. Muaidi [19] proposed another neural network for POS tagging of Arabic sentences, trained it with the Levenberg-Marquardt learning algorithm, and achieved an accuracy of 90.21% on the test part. As far as the authors know, there are no published works on the use of BERT-based models for Arabic POS tagging.
Table 1 summarizes the accuracy and the corpus used for each method. We note that most approaches reach a high accuracy and claim to beat previous approaches. However, it is not possible to compare these methods because neither the datasets nor the codes are available. Moreover, it is not worth implementing a method which is not detailed enough. This challenge motivates us to establish a global corpus for Arabic POS tagger evaluation.
Table 1. Summary of the state-of-the-art approaches. The accuracy is measured on the dataset; UNS stands for unspecified; ML stands for machine learning; NA stands for not available.

Ref | Category | Tagset | Dataset | Accuracy | Corpus
Debili et al. | Rule-based | 3 tagsets (264, 1584 and 1730) | UNS | 97.51% | NA
Zemirli et al. | Rule-based | 35 tags | test: 5,563 | 97.99% | NA
Btouch et al. | Rule-based | 3 tags | training: 6,491, test: 793 | UNS | NA
Kubler et al. | Statistics | 993 | 500,000 (ATB) | 94.37% | Proprietary
Biadsy et al. | Statistics | 52 | UNS | 98.22% | NA
Gahbiche-Braham | Statistics | UNS | UNS | UNS | NA
Suryawati et al. | Statistics | UNS | UNS | 96.28% | NA
Kadim et al. | Statistics | 118 | UNS | 96.28% | NA
Kadim et al. | Statistics | 3 | Nemlar | UNS | Proprietary
Diab et al. [12] | ML | 24 | ATB | 95.49% | Proprietary
Aldarmaki and Diab [13] | ML | ERTS2-Tagset | LDC (ATB) | 97.15% | Proprietary
Zribi et al. [14] | ML | 69 macrotags; 223 microtags; 465 composed tags | training: 18,000, test: 1,200 | 97.54% for macrotags, 98.35% for microtags | NA
Tlili-Guiassa [15] | ML | UNS | UNS | 85% | NA
Alashqar [16] | ML | 9 and 33 | 77,430 (Quranic) | ≈ 80% | Available
Ali and Jarray [17] | Metaheuristic | UNS | UNS | ≈ 91% | NA
Abu-Malloh et al. [18] | Deep learning | 18 | 20,620 | 89.04% | NA
Muaidi [19] | Deep learning | 189 | 24,810 | 90.21% | NA

Although many datasets are used for training, only two are freely available: the Quranic Arabic Corpus and the Arabic WordNet.

3 Proposed Architecture

In this section, we describe the techniques we have used for Arabic POS tagging, as illustrated in Fig. 1. The architecture contains two blocks: a BERT block, and dense and classification layers.
Fig. 1. Proposed BERT-based architecture for Arabic POS tagging.

3.1 Transformer and BERT Models

Transformer-based architectures [31] have outperformed recurrent neural networks (RNNs) and their extensions with LSTM and GRU in various NLP tasks. A Transformer can be considered an encoder-decoder-based neural network. It is essentially based on a self-attention mechanism that generates a contextual embedding of each word. The main novelty of Transformer-based models is that they are entirely based on the attention mechanism, without any need for recurrent connections, and are therefore capable of parallel processing.
BERT [22] stands for Bidirectional Encoder Representations from Transformers. BERT's model architecture is a multi-layer bidirectional Transformer encoder. It is designed to condition on both left and right context in all layers in order to pretrain deep bidirectional representations from unlabeled text. In practice, we use BERT to determine a vector representation of each word of a sentence. There are four versions of the pre-trained original BERT, depending on the scale of the model architecture: BERT-Mini, BERT-Medium, BERT-Base and BERT-Large. There are also multilingual and domain-specific versions of BERT, depending on the training corpus and the architecture (e.g., the number of layers). Concerning the Arabic language, there are three domain-specific pretrained BERT models for Modern Standard Arabic (MSA): AraBERT [27], ARBERT [28] and Arabic BERT [29]. These models have been tested in multiple NLP tasks such as
Word Sense Disambiguation [36] and sentiment analysis [39]. The multilingual mBERT [35] can also handle Arabic texts since it is trained on multiple languages, including Arabic. We use these models as word representations in our proposed architecture. Table 2 shows the characteristics of each BERT model used in this paper.

Table 2. Configuration of BERT models used in this paper

Model Layers Heads Hidden Parameters Vocab


Arabic BERT [29] 12 12 768 110M 23000
AraBERT [27] 12 12 768 110M 64000
mBERT [35] 12 12 768 110M 119547

In this paper, we propose three Arabic POS taggers, POS-AraBERT, POS-Arabic-BERT and POS-mBERT, which are based on AraBERT [27], Arabic BERT [29] and mBERT [35], respectively.
The input to the BERT model is the input sentence, with sentences separated by a separator token SEP. In addition, the first token of an input sequence is a special classification token CLS. The final hidden state corresponding to the CLS token is used as the representation of the entire sentence for classification tasks.
We recall that the tokenization preprocessing is incorporated into BERT by default. The BERT tokenizer splits the sentence into tokens and inserts the special tokens CLS and SEP at the right positions. Even if it is possible to apply an "external" tokenizer to each sentence before feeding it to BERT, we should not explicitly add the special tokens, because the BERT tokenizer will automatically insert them.
Contrary to classification tasks, where the BERT representation of the CLS token encodes the input sentence, for POS tagging the BERT representation of each input token is fed into the same fully-connected layers to output the part-of-speech tag of that token. Moreover, due to the WordPiece tokenizer, we must establish a correspondence between subwords (word pieces) and labels. For a word-level tokenizer, there is a one-to-one correspondence between input tokens and labels. However, when using a wordpiece tokenizer such as the BERT tokenizer, each word may be split into multiple tokens, so we need a "token mapping" strategy from wordpieces to labels. The original BERT paper chooses the representation of the first sub-token as the input to the subsequent layers and ignores the representations of the other sub-tokens. Practically, we can implement this strategy by assigning the word label to the first subword and a dummy label 'X' to the other subwords; when computing the loss function, we ignore the 'X' labels on the sub-tokens. We could also assign the label of the word to the last wordpiece representation, or even propagate the word label to all the subwords and compute an average representation of the wordpieces. In this contribution we adopt the first-wordpiece representation and leave the other mapping strategies for future research.
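A minimal sketch of this first-subword mapping with a HuggingFace fast tokenizer is shown below; the mBERT checkpoint name, the example sentence and its tags are illustrative, and in practice the dummy positions would be masked out of the loss (e.g., with an ignore index).

from transformers import AutoTokenizer

# Example tokenizer; any of the Arabic checkpoints could be used instead of mBERT.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

words = ["الكتاب", "على", "الطاولة"]      # a pre-split sentence (one token per word)
labels = ["N", "P", "N"]                  # one POS tag per word (illustrative tags)

encoding = tokenizer(words, is_split_into_words=True)

aligned, previous = [], None
for word_id in encoding.word_ids(batch_index=0):
    if word_id is None:                   # special tokens [CLS] and [SEP]
        aligned.append("X")
    elif word_id != previous:             # first wordpiece keeps the real tag
        aligned.append(labels[word_id])
    else:                                 # remaining wordpieces get the dummy tag 'X'
        aligned.append("X")
    previous = word_id

print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
print(aligned)                            # the 'X' positions are ignored in the loss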
The output of the BERT block is fed to the dense layer.

3.2 Dense and Classification Layers

This block maps the BERT token representations to POS tags. There are mainly three different techniques for fine-tuning a pretrained BERT model: 1) train the entire architecture, 2) train some layers and freeze the others, 3) freeze the entire architecture, add untrained layers of neurons at the end, and train the new model so that only the weights of the added layers are updated during training. In this paper, we adopt the latter technique: we freeze all the layers of BERT during fine-tuning and append a dense layer and a classification layer with a softmax activation function to output the tag sequence of the input sentence. We also place dropout regularization on the dense layer to prevent overfitting.
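The sketch below shows how this frozen-encoder head could be written in PyTorch with the transformers library; the checkpoint name, the hidden size of the dense layer and the ReLU activation are assumptions, and the softmax is folded into the cross-entropy loss rather than applied explicitly.

import torch
import torch.nn as nn
from transformers import AutoModel

class BertPosTagger(nn.Module):
    # Frozen BERT encoder followed by a dense layer, dropout and a tag classifier.
    def __init__(self, model_name: str, n_tags: int, dropout: float = 0.2):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        for p in self.bert.parameters():          # freeze all BERT layers
            p.requires_grad = False
        hidden = self.bert.config.hidden_size
        self.dense = nn.Linear(hidden, hidden)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden, n_tags)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        x = self.dropout(torch.relu(self.dense(out.last_hidden_state)))
        return self.classifier(x)                 # per-token logits over the tagset

# 47 tags as in the merged Arabic POS corpus; mBERT is used here as an example.
model = BertPosTagger("bert-base-multilingual-cased", n_tags=47)
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-6)
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)  # masks the dummy sub-token labels

Only the dense and classification layers contribute trainable parameters, which matches the fine-tuning strategy adopted above.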

4 Experiments

4.1 Setup

Datasets. We merge the two publicly available datasets, the Arabic WordNet (AWN) [37] and the Quranic Arabic Corpus (QAC) [38], into a single corpus called the Arabic POS corpus. AWN is a lexical resource for Modern Standard Arabic. It was constructed according to the material of the Princeton WordNet. It is structured around elements called synsets, which are lists of synonyms with pointers linking them to other synsets. Each word is annotated with its corresponding POS tag.
QAC stands for the Quranic Arabic Corpus, which has been manually created and annotated. Each word is composed of a stem, prefix, and suffix. It consists of a sequence of words that are characterised by morphemes (prefixes, stems, suffixes) and their corresponding tag sequences (prefix, stem, suffix tags).
We intend to enrich this corpus with other Arabic POS corpora. Currently, the Arabic POS corpus contains 151,710 words annotated with 47 tags. The tagset is composed of the following tags: {"P", "N", "PN", "DET", "ADJ", "PRON", "V", "CONJ", "REL", "NEG", "INL", "DEM", "REM", "ACC", "EQ", "CIRC", "RES", "T", "PRO", "PREV", "INC", "SUP", "AMD", "SUB", "INTG", "LOC", "COND", "EMPH", "VOC", "RSLT", "EXL", "EXP", "CAUS", "FUT", "CERT", "PRP", "ANS", "RET", "EXH", "INT", "IMPV", "COM", "SUR", "AVR", "IMPN", "ADV", "R"}.
The statistics of the different datasets and tagsets are shown in Table 3.
Table 3. Statistics of the datasets and tagsets

Corpus Dataset size Tagset size


AWN 23.481 5
QAC 128,220 45
Arabic POS 151,701 47

Metric. Our tagging system is evaluated in terms of tagging accuracy.


Baselines. We compare our BERT-based Arabic POS tagging approach with recent neural network approaches, Abu-Malloh et al. [18] and Muaidi [19], and with traditional approaches [16]. We refer the reader to [16] for the details of the hyperparameters of all the traditional methods.
Hyper-parameters. We split the dataset into 80%, 10%, and 10% for training, validation, and testing, respectively. We train the BERT models using the Adam optimizer [30] with an initial learning rate of 10^{-6}, since the dataset is not very large. We set the mini-batch size and dropout rate to 16 and 0.2, respectively. We train the BERT-based models for a maximum of 100 epochs and perform early stopping if the accuracy on the validation set does not improve for 10 consecutive iterations. To extract the representation of a word, we average the last four layers of its first subword. The learning objective is to minimize the cross-entropy loss between the predicted labels and the gold labels.
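As an illustration of the word representation described here, the snippet below averages the last four hidden layers at the position of a word's first subword; the checkpoint and the single-word example are only placeholders.

import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-multilingual-cased"             # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True)

enc = tokenizer(["مثال"], is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**enc).hidden_states    # embeddings plus one entry per layer

# Position 0 is [CLS], so the first subword of the word sits at position 1.
last_four = torch.stack(hidden_states[-4:])       # shape: (4, batch, seq_len, hidden)
word_repr = last_four[:, 0, 1, :].mean(dim=0)     # averaged representation of the word
print(word_repr.shape)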

4.2 Results and Analysis

Table 4 compares our three BERT based approaches with each other.

Table 4. Accuracy of our three POS taggers

Tagger Accuracy on AWN Accuracy on QAC Accuracy on Arabic POS


POS-AraBERT 69.07% 91.07% 88.96%
POS-Arabic-BERT 87.16% 91.60% 91.11%
POS-mBERT 92.60% 91.70% 91.69%

Table 4 shows that the mBERT model slightly outperforms Arabic BERT and AraBERT. This may be because mBERT is trained on different domains, and because of a hidden knowledge transfer between the different languages used by mBERT. Comparing the multilingual mBERT model and language-specific BERT models is an ongoing research theme in NLP.
Table 5 compares our contributions with the two most recent Arabic POS tagging works based on neural networks, Abu-Malloh et al. [18] and Muaidi [19].
We note that we take the accuracy published by these works, since their corpora are not available. This table reveals that our approaches based on mBERT and Arabic BERT representations are better than the previous neural network approaches. This may be due to the fact that they use a static encoding of Arabic words, where each letter is assigned a fixed code, contrary to the contextual embedding of BERT.

Table 5. Comparison with the most recent neural network based Arabic POS tagging works. NA stands for not available dataset.

Approach Corpus size Trainset Testset Accuracy


Abu-Malloh et al. [18] 16,672 (NA) 13,337 (NA) 3,335 (NA) 89.04%
Muaidi [19] 24,810 (NA) 19,848 (NA) 4,962 (NA) 90.21%
POS-AraBERT 151,701 121,360 15,170 88.96 %
POS-Arabic-BERT 151,701 121,360 15,170 91.11%
POS-mBERT 151,701 106,191 121,360 91.69%

We note that it is very difficult to compare our approaches with previous approaches that share neither their code nor their corpus. Therefore, we compare our proposed models against Arabic POS taggers that are trained on the QAC corpus for different tagsets of size 9, 33 and 45.
Table 6 compares our proposed models against the Arabic POS taggers trained on QAC [16].

Table 6. Comparison against Arabic POS tagger trained on QAC. NS stands for not
specified.

Approach Accuracy: 9 Tags Accuracy: 33 Tags Accuracy: 45 Tags


Unigram [16] 82.5% 80.4% NS
Bigram [16] 82.3% 80.5% NS
Trigram [16] 82.4% 80.3% NS
Brill [16] 83.2% 80.9% NS
HMM [16] 77.5% 75.2% NS
POS-AraBERT 93.65% 91.20% 91.07%
POS-Arabic-BERT 94.08% 91.76% 91.60%
POS-mBERT 94.32% 91.71% 91.70%

As a general remark, we note that for all methods, the larger the tagset, the lower the accuracy, because the multi-class classification task becomes more difficult. We note that our approaches based on the different BERT models outperform the
traditional approaches for both the 9- and 33-tag tagsets. This is because POS tagging is a highly context-dependent task and the BERT-based models are strongly context-sensitive.
Finally, the results show that our method outperforms the recent neural network based and the traditional POS tagging models. We also note that, by using the multilingual mBERT word representation, we obtain better results than by using language-specific BERT models such as AraBERT and Arabic BERT.

5 Conclusion

This paper empirically investigates the advantages of using the BERT model for end-to-end Arabic POS tagging. We have also started an initiative for collecting and merging Arabic POS corpora. We demonstrate that the BERT-based models outperform state-of-the-art Arabic POS tagging approaches. We also show that the multilingual BERT model gives better results than the Arabic-specific BERT models. As a continuation of this research, we want to integrate a module for the automatic extraction of Arabic multiword expressions.

References
1. Cheraghui, M.A., Hoceini, Y., Abbas, M.: Une Approche de Désambiguïsation Morpho-syntaxique de la Langue Arabe Basée sur l'Aide Multicritère à la Décision (2012)
2. Debili, F., Tahar, Z., Souissi, E.: Analyse automatique vs analyse interactive: un
cercle vertueux pour la voyellation, l’étiquetage et la lemmatisation de l’arabe,
Journal Traitement Automatique des Langues Naturelles, Toulouse, France, pp.
347–356 (2002)
3. Zemirli, Z., Khabet, S.: TAGGAR: Un analyseur morphosyntaxique destiné à la
synthèse vocale de textes arabes voyellés. Journal Traitement Automatique de
l’Arabe Fès (2004)
4. Btoush, M.H., Alarabeyyat, A., Olab, I.: Rule based approach for Arabic part of
speech tagging and name entity recognition. Int. J. Adv. Comput. Sci. Appl. 7(6),
331–335 (2016)
5. Kubler, S., Mohamed, A., Ahmed, M.: Automatic tagging of Arabic text: from raw
text to base phrase chunks. Journal Natural Language Engineering: Short papers,
pp. 521–548 (2012)
6. Biadsy, F., Saabni, R., El-Sana, J.: Segmentation-free online Arabic handwriting
recognition. Int. J. Pattern Recogn. Artif. Intell. 25, 1009–1033 (2011)
7. Gahbiche-Braham, S., Bonneau-Maynard, H., Lavergne, T., Yvon, F.: Robust part-
of-speech tagging of Arabic text. In: Joint Segmentation and POS Tagging for
Arabic Using a CRF-Based Classifier, pp. 2107–2113 (2012)
8. Suryawati, E., Munandar, D., Riswantini, D., Abka, A. F., Arisal, A.: POS-Tagging
for informal language (study in Indonesian tweets). In: Journal of Physics: Confer-
ence Series, vol. 971, no. 1, p. 012055. IOP Publishing (2018)
9. Kadim, A., Lazrek, A.: Bidirectional HMM-based Arabic POS tagging. Int. J.
Speech Technol. 19, 303–312 (2016)
10. Darwish, K., et al.: Multi-dialect Arabic POS tagging: a CRF approach. In: Pro-
ceedings of the Eleventh International Conference on Language Resources and
Evaluation (LREC 2018), May 2018
11. Kadim, A., Lazrek, A.: Parallel HMM-based approach for Arabic part of speech
tagging. Int. Arab J. Inf. Technol. 15(2), 341–351 (2018)
12. Diab, M., Hacioglu, K., Jurafsky, D.: Automatic tagging of Arabic text: from raw
text to base phrase chunks. In: Proceedings of HLT-NAACL 2004: Short Papers,
pp. 149–152 (2004)
13. Aldarmaki, H., Diab, M.: Robust part-of-speech tagging of Arabic text. In: Pro-
ceedings of the Second Workshop on Arabic Natural Language Processing, pp.
173–182 (2015)
14. Zribi, C.B.O., Torjmen, A., Ahmed, M.B.: A multi-agent system for POS-tagging
vocalized Arabic texts. Int. Arab J. Inf. Technol. 4(4), 322–329 (2007)
15. Tlili-Guiassa, Y.: Hybrid method for tagging Arabic text. J. Comput. Sci. 2(3),
245–248 (2006)
16. Alashqar, A.M.: A comparative study on Arabic POS tagging using Quran corpus.
In: 2012 8th International Conference on Informatics and Systems (INFOS), pp.
NLP-29. IEEE (2012)
17. Ali, B.B., Jarray, F.: Genetic approach for Arabic part of speech tagging. arXiv
preprint arXiv:1307.3489 (2013)
18. Abu-Malloh, R., Al-Serhan, H.M., Ibrahim, O.B., Abu-Ulbeh, W.: Arabic part-
of-speech tagger: an approach based on neural network modeling. Master’s thesis,
AlBalqa Applied University (2010)
19. Muaidi, H.: Levenberg-Marquardt learning neural network for part-of-speech tagging of Arabic sentences. WSEAS Trans. Comput. 13, 300–309 (2014)
20. Tsai, H., Riesa, J., Johnson, M., Arivazhagan, N., Li, X., Archer, A.: Small and
practical BERT models for sequence labeling. arXiv preprint arXiv:1909.00100
(2019)
21. Ralethe, S.: Adaptation of deep bidirectional transformers for Afrikaans language.
In: Proceedings of The 12th Language Resources and Evaluation Conference, pp.
2475–2478 (2020)
22. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of
deep bidirectional transformers for language understanding. arXiv preprint
arXiv:1810.04805 (2018)
23. Nivre, J., Abrams, M., Agić, Ž., et al.: Universal Dependencies 2.2,
LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Lin-
guistics (ÚFAL), Faculty of Mathematics and Physics, Charles University (2018)
24. Puttkammer: Afrikaans NCHLT Annotated Text Corpora. South African Lan-
guage Resource Management Agency, Potchefstroom, 1.0, ISLRN 139-586-400-050-
9 (2014)
25. Augustinus, L., et al.: AfriBooms: an online treebank for Afrikaans. In: Proceed-
ings of the Tenth International Conference on Language Resources and Evaluation
(LREC 2016), pp. 677–682. ELRA; Paris (2016)
26. Tamburini, F.: UniBO@ KIPoS: Fine-tuning the Italian “BERTology” for PoS-
tagging Spoken Data (2020)
27. Antoun, W., Baly, F., Hajj, H.: AraBERT: transformer-based model for Arabic
language understanding. arXiv preprint arXiv:2003.00104 (2020)
28. Abdul-Mageed, M., Elmadany, A., Nagoudi, E.M.B.: ARBERT & MARBERT:
Deep Bidirectional Transformers for Arabic. arXiv preprint arXiv:2101.01785
(2020)
29. Safaya, A., Abdullatif, M., Yuret, D.: KUISAIL at SemEval-2020 task 12: BERT-
CNN for offensive speech identification in social media. In: Proceedings of the
Fourteenth Workshop on Semantic Evaluation, pp. 2054–2059 (2020)
30. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint
arXiv:1412.6980 (2014)
31. Vaswani, A., et al.: Attention is all you need. arXiv preprint arXiv:1706.03762
(2017)
32. Al-Serhan, H.M.: Extraction of Arabic word roots: an approach based on compu-
tational model and multi-backpropagation neural networks. Doctoral dissertation,
De Montfort University (2008)
33. Jawahar, G., Sagot, B., Seddah, D.: What does BERT learn about the structure
of language?. In: ACL 2019–57th Annual Meeting of the Association for Compu-
tational Linguistics (2019)
34. AbuZeina, D., Abdalbaset, T.M.: Exploring the Performance of Tagging for the
Classical and the Modern Standard Arabic. Advances in Fuzzy Systems (2019)
35. Libovický, J., Rosa, R., Fraser, A.: How language-neutral is multilingual BERT?.
arXiv preprint arXiv:1911.03310 (2019)
36. El-Razzaz, M., Fakhr, M.W., Maghraby, F.A.: Arabic gloss WSD using BERT.
Appl. Sci. 11(6), 2567 (2021)
37. Arabic WordNet. http://globalwordnet.org/resources/arabic-wordnet/awn-browser/. Accessed 20 Mar 2021
38. Quranic Arabic Corpus. https://corpus.quran.com/download/default.jsp. Accessed 28 Apr 2021
39. Chouikhi, H., Chniter, H., Jarray, F.: Arabic sentiment analysis using BERT model.
In: International Conference on Computational Collective Intelligence. Springer,
Cham (2021, accepted paper)
Facial Expression Interpretation in ASD
Using Deep Learning

Pablo Salgado1 , Oresti Banos2 , and Claudia Villalonga2(B)


1 iungo Education SAS, Bogota, Colombia
[email protected]
2 Research Centre for Information and Communications Technology, University of Granada, Granada, Spain
{oresti,cvillalonga}@ugr.es

Abstract. People with autism spectrum disorder (ASD) are known to


show difficulties in the interpretation of human conversational facial
expressions. With the recent advent of artificial intelligence, and more
specifically, deep learning techniques, new possibilities arise in this con-
text to support people with autism in the recognition of such expres-
sions. This work aims at developing a deep neural network model capa-
ble of recognizing conversational facial expressions which are prone to
misinterpretation in ASD. To that end, a publicly available dataset of
conversational facial expressions is used to train various CNN-LSTM
architectures. Training results are promising; however, the model shows
limited generalization. Therefore, better conversational facial expressions
datasets are required before attempting to build a full-fledged ASD-
oriented support system.

Keywords: Autism · AI · Deep learning · Emotions · Facial expression

1 Introduction
The last decade has witnessed an explosive growth in the field of deep learning in AI. Since the presentation in 2012 of AlexNet [1], a convolutional neural network (CNN) developed for image recognition, deep learning has achieved an impressive record in cognitive tasks. One such task is human facial expression recognition, with a prominent relevance in the affective computing area. In other application areas of deep learning, such as natural language processing, automatic translation, or speech recognition, recurrent neural networks (RNNs) have proven to learn how to solve complex problems involving sequences of data over time. Soon enough it was envisioned that the combination of these architectures, i.e., CNN-RNN architectures such as the convolutional neural network - long short-term memory (CNN-LSTM) combination, could learn from video sequences, for example to remove commercials or to detect human activity.

The success in applying such AI techniques to complex cognitive tasks poses the question of whether AI can also assist people with ASD in improving their quality of life. One main challenge for people with autism is being able to interpret the social cues that humans normally use in their daily nonverbal communication. More specifically, facial expressions most often convey essential information to interlocutors, information which is usually misinterpreted by people with autism. Hence, having the capacity to decode such information for them is highly relevant to facilitate the communication between people with and without autism.
In the light of this issue, this work aims to develop a deep neural network model, namely a CNN-LSTM, to classify conversational facial expressions in order to support people with ASD. As is usual in AI research, data must be collected and preprocessed before it is fed to the AI algorithms for learning. Searching for a suitable dataset was therefore the first step. Before feature extraction, preprocessing of the dataset was required. For the facial feature extraction, some pretrained CNNs were used. In the final classification step, an LSTM was used. After ten trials with different CNN-LSTM settings, it was found that the publicly available datasets seem to be insufficient to solve the problem at hand. The lack of enough data for training and testing results in difficulties in finding appropriate settings to improve the proposed deep learning models.

2 State of the Art

There exist some previous works that have applied AI techniques to detect human facial expressions in the context of ASD. At least two projects were active at the time this document was written. The first project is a commercial one, known as Brain PowerTM [2]. Among other applications, this project offers Emotion CharadesTM [3], which uses deep learning for human emotion recognition. The mobile app is designed to run on Google Smart Glasses [4] as an augmented reality game where children with ASD try to guess the emotion on the faces of surrounding people. If the child interprets emotions correctly, she is awarded points; otherwise, she is encouraged to try again. This system can only recognize six basic human emotions: sadness, happiness, fear, anger, surprise and disgust.
The second project is driven by Stanford University in California. The solution, known as The Autism Glass Project [5], consists of an Android app that works with a pair of Google Smart GlassesTM, which in turn feed the video to the mobile app for deep learning emotion recognition. In response, the detected emotion is sent to the glasses, which inform the child of the detected emotion. Additionally, the child is encouraged to interpret the captured faces in the app with an emotion. Several papers [6–10] have been published about this project, reporting the improvements experienced by children using the smart glasses and the app. This system can only recognize seven basic human emotions: happiness, surprise, anger, disgust, sadness, fear and indifference.
There exist commercial cognitive systems designed to recognize human facial
emotions not necessarily related to ASD. Table 1 presents a non-comprehensive
list of the available systems, where the common denominator is that all these

systems can recognize a few human basic emotions. Apart from that, several
image datasets are publicly available that can be used for model training in
order to recognize human facial emotions in pictures. As it is observed with
available commercial cognitive systems, these datasets are restricted to a few
human emotions. In [11] three potential datasets for basic emotion recognition
training can be found: JAFFE, UMD and Cohn-Kanade. JAFFE and UMD
datasets provide pictures for basic emotions (happiness, surprise, disgust, anger,
sadness and fear). Cohn-Kanade, instead, provides combinations of facial action
units [12] of those basic emotions to build some complex facial expressions.

Table 1. Commercial human emotion recognition systems.

Provider | Web site | Emotions
Eyesee | https://eyesee-research.com/facial-coding/ | Happiness, surprise, confusion, disgust, fear, sadness, neutral
Emotion Research Lab | https://emotionresearchlab.com/online-platform/ | Happiness, surprise, anger, disgust, fear, sadness, neutral
iMotions | https://imotions.com/biosensor/fea-facial-expression-analysis/ | Happiness, anger, fear, disgust, contempt, sadness, surprise
Kairos | https://www.kairos.com/ | Anger, disgust, fear, happiness, sadness, surprise
Microsoft Azure | https://azure.microsoft.com/en-us/services/cognitive-services/face/ | Anger, contempt, disgust, fear, happiness, neutral, sadness, surprise
MoodMe | https://www.mood-me.com/insights/ | Happiness, surprise, sadness, anger, fear, disgust
Noldus | https://www.noldus.com/facereader | Happiness, sadness, anger, surprise, fear, disgust
NVISO | https://www.nviso.ai/en | Happiness, surprise, sadness, disgust, fear, anger, neutral
Realeyes | https://www.realeyesit.com/technology/emotion-recognition/ | Happiness, surprise, confusion, sadness, disgust, fear
RefineAI | https://www.refineai.com/ | Happiness, surprise, sadness, fear, anger, disgust, contempt
Sightcorp | https://sightcorp.com/ | Happiness, surprise, sadness, disgust, anger, fear

Unlike datasets such as JAFFE and Cohn-Kanade, where emotions are posed by actors in laboratory-controlled environments, the AffectNet dataset has been collected from the internet and contains 1,000,000 images [13]. Again, this dataset only tags basic human emotions: happiness, sadness, surprise, fear, contempt and uncertainty. FER-2013 [14] is a very well-known dataset for facial analysis but, like the others, is limited to the basic emotions: anger, disgust, fear, happiness, sadness and surprise. Regarding video datasets, AFEW [15] has been collected from movies under near-real conditions, but only for six basic human emotions: anger, disgust, fear, happiness, sadness and surprise.
In contrast to the above datasets, the Large MPI Facial Expression dataset [16] comprises 51 facial expressions commonly used by people in conversations. This dataset consists of 510 videos (10 videos per facial expression) posed by ten non-professional actors and actresses and delivered as tagged and numbered image sequences totaling 88,823 photographs. The large array of facial expressions covered in this dataset makes it particularly representative of potential affective social situations that people with ASD may encounter regularly. Yet, as for most datasets, it is important to account for bias. The MPI dataset is posed by German males and females in their 20s. According to [17], facial expressions are subject to cultural differences, even for the most basic and universal human emotions. The MPI dataset is hence biased, and thus systems developed from these data might not be applicable in a context other than the German one. Nevertheless, this type of bias can be a good characteristic, since it may be expected that a person with ASD normally lives in a homogeneous cultural environment.
Regarding AI techniques for recognizing human facial expressions, the classic
approach is based on the facial action coding system (FACS) [12]. This
system describes the human expression in terms of action units (AUs), in which
the “geometry” of the face is represented according to the relative position
of the facial muscles. Since the 1980s, statistical analysis-based algorithms such
as Gabor filters have been used along with FACS to extract facial features from pictures,
while statistical machine learning algorithms such as support vector machines,
Bayesian networks, or hidden Markov models were used to learn how to classify
human emotions. By 2012, with the introduction of AlexNet [1],
CNNs proved to excel at extracting features from images, so much so that at
ICML 2013 [14] a workshop was dedicated to facial emotion interpretation.
The top three teams, out of 56, attained accuracy scores of 71.162%, 69.256% and
68.821% using CNNs for classification of emotions in static pictures of human
facial expressions.
Training a CNN is a costly process, so transfer learning provides
an excellent approach to transfer what a neural network has learned on a
specific problem to a new, potentially unrelated domain. Several CNNs pre-trained
on the ImageNet1 dataset are available for transfer learning in popular
frameworks such as Keras2 . Three of them (MobileNet, MobileNetV2 and
NasNetMobile) are used in this work, applying the transfer learning technique
in an attempt to solve the problem at hand. These CNNs are specifically designed
for the limited resources of mobile devices, which is the main reason for selecting
them, as we intend to deploy the neural network on such devices. Although

1 http://www.image-net.org.
2 https://keras.io/.

these CNNs have relatively few training parameters (MobileNetV2, for instance,
has only 3,504,872), MobileNetV2 can still achieve an accuracy of 90.1% in object
classification [22].
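As an illustration, the following minimal sketch (not the authors' code; all parameter values are assumptions) shows how such a pre-trained mobile CNN can be instantiated in Keras without its ImageNet classification head, frozen, and extended with a new classifier:

```python
# Minimal sketch (not the authors' code): a Keras mobile CNN pre-trained on
# ImageNet, frozen for transfer learning, with a new classification head.
import tensorflow as tf
from tensorflow.keras.applications import MobileNetV2

base = MobileNetV2(input_shape=(224, 224, 3), include_top=False,
                   weights="imagenet", pooling="avg")
base.trainable = False  # freeze the pre-trained convolutional layers

num_classes = 51  # illustrative; the MPI dataset provides 51 expressions
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```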
Although CNNs are well suited to classifying static images, this project
requires analyzing the temporal dimension of image sequences to find the pattern
that composes a conversational facial expression. RNNs are designed to solve this
type of problem; specifically, the LSTM variant of RNN can retain information
from the start to the end of a sequence. A combination of both is therefore quite practical
for facial expression recognition problems. Some work has been done in this
direction [20], where the authors achieved an accuracy of 88.02% in classifying six
basic emotions (anger, disgust, fear, happiness, sadness and surprise) from the extended
Cohn-Kanade dataset [21]. Also, in [18] the authors proposed the use of CNN-LSTM
neural networks, attaining an accuracy of 41.67% when classifying the six
basic emotions (happiness, sadness, anger, fear, surprise, disgust) available in the
AFEW dataset [15].

3 Methodology
The overall objective of this work is to develop a classifier for the 51 facial expressions
available in the MPI dataset, commonly used in conversations, as a basis
for a supportive artificial cognitive system for people with ASD. Ten different trials
are performed in order to obtain the best neural network architecture. The first
three trials are designed to find the most appropriate training mini-batch size,
video sequence size and dataset using MobileNetV2 (the Keras mobile CNN with
the fewest parameters). Trial four tries training MobileNetV2 without transfer
learning. Trial five attempts to train NasNetMobile (the Keras mobile CNN with the
most layers), while trial six attempts to train MobileNet (the Keras mobile
CNN with the fewest layers). At this point, some conclusions are drawn to guide
the last four trials. Trials seven, eight and nine test some RNN configurations
based on LSTM using the best CNN model found in the earlier trials. These first
nine trials are conducted on models trained with just three (bored, confused,
contempt) of the 51 classes available in the MPI dataset. This approach reduces
the training time required while searching for the best hyper-parameters for mini-batch
size, sequence size, neural network depth and LSTM units. Based on
the conclusions of the first nine trials, the final CNN-RNN neural network to
classify the 51 facial expressions is trained in trial 10.
Since the MPI dataset is relatively small (just 10 videos for each expression), a
sliding window technique is used to generate multiple sequences from each video.
Such a technique is shown in Fig. 1, where an 11-frame video yields
three sequences of three frames each. The sliding window algorithm receives as
parameters the video itself and the desired sequence size; in response, it generates
all possible sequences of the given size from the video. This technique is applied
across all neural networks trained in the ten trials to increase the quantity of
video sequences available.
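As an illustration only (not the authors' implementation; the stride is an assumption, since the text does not state whether consecutive windows overlap), the idea can be sketched as follows:

```python
# Minimal sketch of the sliding-window sequence generation (the stride is an
# assumption; the text does not state whether consecutive windows overlap).
import numpy as np

def sliding_window(video_frames, seq_size, stride=1):
    """Return sequences of `seq_size` consecutive frames from one video.

    video_frames: array of shape (n_frames, height, width, channels).
    """
    sequences = [video_frames[start:start + seq_size]
                 for start in range(0, len(video_frames) - seq_size + 1, stride)]
    return np.stack(sequences)

# Example: a dummy 30-frame video and a sequence size of 12.
video = np.zeros((30, 224, 224, 3), dtype=np.float32)
print(sliding_window(video, seq_size=12).shape)  # (19, 12, 224, 224, 3)
```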

Fig. 1. Sliding window algorithm.

Trial 1: A MobileNetV2 CNN is instantiated without the classification layer.
The last convolutional layer is retrained, while all remaining layers are frozen
in order to apply the transfer learning technique. One LSTM layer is added just
before the new classification layer with 3 outputs. The training data is not
preprocessed; 30 videos are simply created from the photograph frames provided
with the MPI dataset for the three selected classes. MPI dataset pictures have a
resolution of 768 × 576, so this is the final resolution of the videos. However,
Keras pretrained networks receive a 224 × 224 × 3 tensor as input, so each video
is resized at runtime to match that requirement.
Fifteen models are trained by combining a mini-batch size from (2, 4, 8, 16,
32) and a sequence size from (6, 12, 24). An early stopping technique is used
to stop the training when the validation loss increases continuously for ten epochs.
In order to establish the best model, a linear regression is calculated over the
validation accuracy and loss curves; the slope of the fitted line indicates how fast
the accuracy increases and the loss decreases. The model with the largest slope for
accuracy (0.0099) and the smallest slope for loss (0.0126) is considered the best model.
This model is trained with a mini-batch size of 8 and a sequence size of 12. The model
fails to converge consistently and is over-fitted, as it is trained with just 312 sequences
and validated with 78 sequences. Hence, the next trial applies a data augmentation
technique to increase the number of sequences available for training.
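The slope-based model selection can be sketched as follows (a minimal sketch, assuming the per-epoch validation metrics are available from the Keras training history; the use of numpy.polyfit for the linear fit is an illustrative choice):

```python
# Minimal sketch of the slope-based comparison of trained models (assumption:
# each Keras History object exposes per-epoch 'val_accuracy' and 'val_loss').
import numpy as np

def validation_slopes(history):
    epochs = np.arange(len(history.history["val_accuracy"]))
    acc_slope = np.polyfit(epochs, history.history["val_accuracy"], 1)[0]
    loss_slope = np.polyfit(epochs, history.history["val_loss"], 1)[0]
    return acc_slope, loss_slope

# The candidate with the largest accuracy slope and the smallest (ideally
# negative) loss slope is considered the best model of a trial.
```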
Trial 2: This trial focuses on data augmentation by preprocessing the same 30
videos from trial 1. The objective is to create 20 videos from each original video,
totaling 600 videos for training. No other changes are made relative to trial 1. For
augmentation, 600 random transformations are generated, making each augmented
video slightly different from its source. Each transformation is created from nine
parameters (rotation, x and y displacement, shear, x and y zoom, horizontal flip,
brightness and grayscale) with values randomly selected from given ranges. Once
the transformation is instantiated, it is applied to every frame of the original video
and the result is saved as a new video. Figure 2 shows an example of some transformations
applied to a video frame; a sketch of this per-video augmentation follows below.
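The sketch below is illustrative only: it assumes Keras' ImageDataGenerator, which covers most of the listed parameters (grayscale conversion is omitted), and samples one random transform per video that is applied identically to all frames so the sequence stays temporally consistent.

```python
# Minimal sketch of per-video augmentation (not the authors' implementation).
# One random transform is sampled per video and applied to every frame so the
# augmented sequence remains temporally consistent.
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(rotation_range=15, width_shift_range=0.1,
                               height_shift_range=0.1, shear_range=10,
                               zoom_range=0.1, horizontal_flip=True,
                               brightness_range=(0.7, 1.3))

def augment_video(frames):
    """frames: array (n_frames, height, width, 3); returns one augmented copy."""
    params = augmenter.get_random_transform(frames.shape[1:])
    return np.stack([augmenter.apply_transform(f, params) for f in frames])
```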
Fifteen models are trained by combining a mini-batch size from (2, 4, 8, 16,
32) and a sequence size from (6, 12, 24). The model trained with a mini-batch
size of 32 and a sequence size of 12 turns out to be the best model for this trial

Fig. 2. Example of image transformations applied to a video frame.

as the linear regression slope is 0.0169 for accuracy and 0.0056 for loss.
The model converges consistently on the validation accuracy chart, achieving
more than 91.73% by the ninth epoch, but it is over-fitted since the loss, starting
at 0.5, increases consistently from epoch seven. Nevertheless, the
data augmentation technique shows better training results, as this model
was trained with 12,000 sequences and validated with 3,000 sequences. The next trial
focuses on making a close-up of the face in each video.
Trial 3: This trial focuses on preprocessing the 30 videos from trial 1 to create
600 videos with a close-up of the face. The same 600 transformations
from trial 2 are applied to the videos after the face close-up has been produced; no other
changes are made relative to trial 2. Each video is scanned twice: the first scan locates
the area where the face is, and the second crops the face in all frames and
saves a new video. Figure 3 shows an example of a resulting face close-up frame,
and a sketch of the cropping step is given after the figure.

Fig. 3. Frame close-up.
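The text does not name the face detector; for illustration only, the two-pass face localization and cropping could be realized with OpenCV's bundled Haar cascade as sketched here:

```python
# Illustrative sketch of the two-pass face close-up (the actual detector used
# by the authors is not stated; OpenCV's Haar cascade is an assumption).
import cv2
import numpy as np

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def face_bounding_box(frames):
    """First pass: union of the face boxes detected over all frames."""
    boxes = []
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        for (x, y, w, h) in detector.detectMultiScale(gray, 1.1, 5):
            boxes.append((x, y, x + w, y + h))
    xs1, ys1, xs2, ys2 = zip(*boxes)
    return min(xs1), min(ys1), max(xs2), max(ys2)

def crop_video(frames):
    """Second pass: crop every frame to the common face region."""
    x1, y1, x2, y2 = face_bounding_box(frames)
    return np.stack([frame[y1:y2, x1:x2] for frame in frames])
```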

Fifteen models are trained by combining a mini-batch size from (2, 4, 8, 16,
32) and a sequence size from (6, 12, 24). The model trained with a mini-batch size
of 32 and a sequence size of 6 turns out to be the best model for this trial, as the linear
regression slope is 0.0247 for accuracy and 0.0391 for loss. This model fails
to converge consistently on the validation accuracy, although it shows an
upward trend without exceeding 80%, and it is highly over-fitted.
What these results seem to show is that a high-resolution close-up of the face
leads to over-fitting of the neural network. In this regard, Goodfellow et al. [14]
found that a 48 × 48 resolution is enough for facial expression recognition
on the FER-2013 dataset. Similarly, Cunningham et al. [19] found that
conversational facial expressions from the MPI dataset are recognizable by humans
at resolutions as low as 64 × 48. Based on these studies and the obtained results, the
close-up face dataset is abandoned in this work, as is the best model of this
trial. With trial 3 abandoned, the best model from trial 2 is by far the best model overall.
Hence, for the following trials, the mini-batch size will be 32, the sequence size will be 12,
and the preprocessed, augmented video dataset from trial 2 will be used for training.
MobileNetV2 was used in all these trials.
Trial 4: This trial focuses on fully training the CNN used in the first three
trials (MobileNetV2). The same neural network architecture from trials 1, 2, and 3
is used. The mini-batch size and sequence size hyper-parameters are 32 and
12, respectively. The same 600 preprocessed videos from trial 2 are used to fully
train the CNN, but they are resized at runtime to 48 × 48 (see [14] and [19])
since emotion recognition can be performed successfully at this lower resolution. The
model learns virtually nothing, since the validation accuracy stays constant
at around 33%. The conclusion is that transfer learning is very useful for this
project, as there is not enough data to fully train the CNN portion of the neural
network.
Trial 5: This trial focuses on training a neural network with a CNN portion
deeper than MobileNetV2, which is 157 layers deep. NasNetMobile is selected
as it is the deepest mobile CNN available in Keras. The neural network architecture is
the same as in trial 4, but MobileNetV2 is replaced by NasNetMobile. The hyper-parameters
are the same as in trial 4 and the training dataset is the same as in trial 2. It
is observed that the validation accuracy hardly reaches 67%, while the loss indicates
that the model is heavily over-fitted, as it increases rapidly from 1.0 starting at
epoch 1. The best model from trial 2 seems to be better than this one, so the
conclusion is that a deeper CNN does not improve the training results
for the problem at hand.
Trial 6: This trial focuses on trying a shallower CNN portion. MobileNet is
selected, as it is the shallowest mobile network available in Keras. This trial is
conducted with the same hyper-parameters and network architecture as trial 5,
but NasNetMobile is replaced by MobileNet. The best model by far comes from this
trial, as it is the only one with a negative slope for the loss, at −0.1491, while also having
the highest value for the accuracy slope, 0.0352. The conclusion is that a shallower
CNN is the most appropriate for the problem at hand, so MobileNet is
selected as the final CNN portion of the neural network.

Trial 7: Now that a CNN portion has been selected, this trial focuses on the
RNN portion. The RNN is made deeper by adding three additional LSTM layers
of 64 units. In order to control the over-fitting of the neural network, each LSTM
layer is placed between dropout layers at 50%. The remaining architecture is the same as in
trial 6, as are the hyper-parameters and training data. The model of this
trial starts to converge to a validation accuracy of around 90%, while the loss shows a
similar pattern of convergence below 1. The over-fitting of the network has been
controlled and the model looks appropriate for solving the problem at hand. A sketch
of this architecture is given below.
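This sketch is illustrative only: it assumes a TimeDistributed application of a frozen MobileNet base and omits the partial retraining of its last convolutional layer and other details of the authors' setup.

```python
# Minimal sketch of a trial-7-style CNN-LSTM (not the authors' exact code).
# A frozen MobileNet base is applied per frame; four LSTM layers of 64 units
# are interleaved with 50% dropout to control over-fitting.
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.applications import MobileNet

seq_len, num_classes = 12, 3   # values used in the reduced three-class trials
base = MobileNet(input_shape=(224, 224, 3), include_top=False,
                 weights="imagenet", pooling="avg")
base.trainable = False

inputs = layers.Input(shape=(seq_len, 224, 224, 3))
x = layers.TimeDistributed(base)(inputs)              # per-frame feature vectors
for i in range(4):
    x = layers.Dropout(0.5)(x)
    x = layers.LSTM(64, return_sequences=(i < 3))(x)  # last LSTM returns a vector
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(num_classes, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)
```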
Trial 8: This trial aims to evaluate whether an RNN with more LSTM units outperforms
the model from trial 7. The only change from trial 7 is that the four LSTM
layers are expanded to 128 units. Expanding the LSTM layers with more units
makes the neural network lose its ability to converge and start over-fitting,
as the accuracy varies widely between 37% and 94% over 33 training epochs. The loss
behaves similarly, varying between 0.2452 and 4.7848.
Trial 9: This trial focuses on training a bidirectional RNN. Starting from the trial 7
architecture, each 64-unit LSTM layer is embedded into a bidirectional wrapper; no
other changes are made. The neural network loses even more of its ability to
converge, as the accuracy varies widely between 50% and 95% over 43 training epochs
and the loss behaves similarly, varying between 0.2872 and 3.4150. Therefore,
trials 8 and 9 are abandoned and the architecture from trial 7 is taken for
the final trial.
Trial 10: In the first three trials it is found that the mini-batch size and sequence
size with the best results are 32 and 12, respectively, together with the augmented
and preprocessed video dataset from trial 2. From trial four,
it is observed that transfer learning is more than appropriate for this project.
From trials five and six, it is concluded that a network with fewer layers may
be more effective, so the MobileNet CNN is selected. From trials seven, eight
and nine, it is found that the neural network from trial seven presents the best
accuracy and loss curves, so the RNN portion from trial seven is selected.
This trial is intended to train the final neural network to classify the 51
classes from the MPI dataset. 51 videos (all from one actor) are held apart for
testing. From each of the remaining 459 videos, 20 videos are generated
using the augmentation procedure from trial 2. Thus, the training dataset for
this trial is composed of 9,180 preprocessed and augmented videos. The sliding
window algorithm finally produces 119,340 sequences for training. The neural
network architecture trained is the same as for trial 7.

4 Results and Discussion


The final neural network to classify the 51 classes from the MPI dataset, obtained
in trial 10, is validated and tested. The results from training this model are shown
in Fig. 4. As seen in the figure, the neural network converges consistently and by
epoch 33 reaches an accuracy of 88.6% with a loss of 0.954. Training is stopped at

Fig. 4. Evaluation charts for the generated model (trial 10).

epoch 40, since no improvement is observed in either accuracy or loss. Kaulard et al.
[16] performed a validation of the MPI dataset with three human evaluators for
each facial expression in the dataset; they obtained an accuracy of 60% when
all three evaluators assigned a video to the same class. The neural network therefore
seems to be as good as humans at classifying the MPI facial expressions.
Testing a neural network requires independent data that has not been used
to train the model. The 51 videos held apart are used to perform this test. The
sliding window algorithm is used to generate the required sequences of 12
frames; finally, 663 test sequences are generated from the 51 videos. When the
network is evaluated, an accuracy of only 7.97% is achieved, with a loss greater
than 10. These results suggest that the neural network is over-fitted despite the
efforts made to avoid such a situation; it seems to be memorizing the faces of
the actors and actresses of the MPI dataset. The results are not entirely unexpected,
as the MPI dataset provides only 10 different faces for 51 classes. However, it is
worth noting that the final neural network's accuracy of 7.97%
is slightly over four times the 1.96% accuracy of a random classifier
over 51 classes.
Either way, the low accuracy achieved may be due, in part, to the inclusion of
the neutral facial expression frames in the videos. Likewise, another issue
may be that the data augmentation preprocessing produced some frames with
the face displaced such that a small part of the face is missing. Another set
of trials addressing these issues should be run in future work to see whether greater
accuracy can be achieved. In any case, the main issue seems to be the MPI dataset
itself, as it contains a limited number of people posing in the videos. There is
not enough data for the models to start generalizing, despite the augmentation
of the videos and the sliding window technique.

5 Conclusions

The final CNN-LSTM neural network model developed in this project has the
potential to support an artificial cognitive system aimed at providing people with
ASD with a tool that can improve their lives. The main obstacle to achieving such an
artificial cognitive system is the lack of sufficient training data; there is
plenty of room in this area to develop such datasets before attempting to build
this kind of system.
Facial expression recognition is a complex cognitive task that even humans
find difficult, and that people with ASD may find impossible to perform. AI in general,
and artificial cognitive deep learning models in particular, have the potential to yield
applications that can improve the quality of life of these people. Beyond
applications for people with ASD, further research in this field may find application
in areas like human-robot interaction or facial sentiment analysis.

References
1. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep con-
volutional neural networks. Commun. ACM 60(6), 84–90 (2017). https://doi.org/
10.1145/3065386
2. Brain Power — Autism Education - Empowering Every Brain. https://brain-
power.com/. Accessed 01 Mar 2021
3. Learn to Play Emotion Charades. https://youtu.be/lGoxUd2nTDc. Accessed 01
Apr 2021
4. Glass - Glass. https://www.google.com/glass/start/. Accessed 01 Apr 2021
5. Autism Glass Project. http://autismglass.stanford.edu/. Accessed 01 Apr 2021
6. Daniels, J., et al.: Feasibility testing of a wearable behavioral aid for social learning
in children with autism. Appl. Clin. Inform. 9(1), 129–140 (2018). https://doi.org/
10.1055/s-0038-1626727
7. Daniels, J., et al.: Exploratory study examining the at-home feasibility of a wear-
able tool for social-affective learning in children with autism. NPJ Digit. Med.
1(1), 32 (2018). https://doi.org/10.1038/s41746-018-0035-3
8. Voss, C., et al.: Effect of wearable digital intervention for improving socialization in
children with autism spectrum disorder: a randomized clinical trial. JAMA Pediatr.
173(5), 446–454 (2019). https://doi.org/10.1001/jamapediatrics.2019.0285
9. Voss, C., et al.: Superpower glass: delivering unobtrusive real-time social cues in
wearable systems. In: UbiComp 2016 Adjunct - Proceedings of the 2016 ACM
International Joint Conference on Pervasive and Ubiquitous Computing, pp. 1218–
1226, September 2016. https://doi.org/10.1145/2968219.2968310
10. Washington, P., et al.: SuperpowerGlass: a wearable aid for the at-home therapy of
children with autism. Proc. ACM Interactive Mobile Wearable Ubiquitous Technol.
1(3), 1–22 (2017). https://doi.org/10.1145/3130977
11. Gross, R.: Face databases. In: Li, S.Z., Jain, A.K. (eds.) Handbook of Face Recog-
nition, pp. 301–327. Springer, New York (2005). https://doi.org/10.1007/0-387-
27257-7 14
12. Ekman, P., et al.: What the Face Reveals: Basic and Applied Studies of Sponta-
neous Expression Using the Facial Action Coding System (FACS). Oxford Univer-
sity Press, Oxford (1997)
13. Mollahosseini, A., Hasani, B., Mahoor, M.H.: AffectNet: a database for facial
expression, valence, and arousal computing in the wild. IEEE Trans. Affect. Com-
put. 10(1), 18–31 (2019). https://doi.org/10.1109/TAFFC.2017.2740923
14. Goodfellow, I.J., et al.: Challenges in representation learning: a report on three
machine learning contests. In: Lee, M., Hirose, A., Hou, Z.-G., Kil, R.M. (eds.)
ICONIP 2013. LNCS, vol. 8228, pp. 117–124. Springer, Heidelberg (2013). https://
doi.org/10.1007/978-3-642-42051-1 16
15. Dhall, A., Goecke, R., Joshi, J., Wagner, M., Gedeon, T.: Emotion recognition
in the wild challenge 2013. In: Proceedings of the 15th ACM on International
Conference on Multimodal Interaction, pp. 509–516 (2013). https://doi.org/10.
1145/2522848.2531739
16. Kaulard, K., Cunningham, D.W., Bülthoff, H.H., Wallraven, C.: The MPI facial
expression database - a validated database of emotional and conversational
facial expressions. PLoS ONE 7(3) (2012). https://doi.org/10.1371/journal.pone.
0032321
17. Elfenbein, H.A., Beaupré, M., Lévesque, M., Hess, U.: Toward a Dialect Theory:
Cultural Differences in the Expression and Recognition of Posed Facial Expressions,
psycnet.apa.org (2007). https://doi.org/10.1037/1528-3542.7.1.131
18. Li, Y.: Deep Learning of Human Emotion Recognition in Videos. https://uu.diva-
portal.org/smash/get/diva2:1174434/FULLTEXT01.pdf. Accessed 01 Mar 2021
19. Cunningham, D.W., Nusseck, M., Wallraven, C., Bülthoff, H.H.: The role of image
size in the recognition of conversational facial expressions. Comput. Animation
Virtual Worlds 15(3–4), 305–310 (2004). https://doi.org/10.1002/cav.33
20. Rajan, S., Chenniappan, P., Devaraj, S., Madian, N.: Novel deep learning model
for facial expression recognition based on maximum boosted CNN and LSTM. IET
Image Process. 14(7), 1373–1381 (2020). https://doi.org/10.1049/iet-ipr.2019.1188
21. Lucey, P., Cohn, J.F., Kanade, T., Saragih, J., Ambadar, Z., Matthews, I.: The
extended Cohn-Kanade dataset (CK+): a complete dataset for action unit and
emotion-specified expression. In: 2010 IEEE Computer Society Conference on Com-
puter Vision and Pattern Recognition - Workshops, CVPRW 2010, pp. 94–101
(2010). https://doi.org/10.1109/CVPRW.2010.5543262
22. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV2:
inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Computer
Society Conference on Computer Vision and Pattern Recognition, pp. 4510–4520,
December 2018. https://doi.org/10.1109/CVPR.2018.00474
Voxel-Based Three-Dimensional Neural Style
Transfer

Timo Friedrich1(B) , Barbara Hammer2 , and Stefan Menzel1


1 Honda Research Institute Europe, Carl-Legien-Straße 30, 63073 Offenbach/Main, Germany
{timo.friedrich,stefan.menzel}@honda-ri.de
2 Bielefeld University, Universitätsstraße 30, 33615 Bielefeld, Germany

[email protected]

Abstract. Neural Style Transfer has been successfully applied in the creative
process for generating novel artistic 2D images by transferring the style of a
painting to an existing content image. These techniques which rely on deep neural
networks have been extended to further computational creativity tasks like video,
motion and animation stylization. However, only little research has been conducted
on utilizing Neural Style Transfer in three-dimensional space. Existing
2D/3D hybrid approaches avoid the extra dimension during the stylization process
and add postprocessing or differentiable rendering to transform the results to 3D.
In this paper, we propose for the first time a complete three-dimensional Neural
Style Transfer pipeline based on a high-resolution voxel representation. Following
our previous research, our architecture includes the standardized gram matrix style
loss for noise reduction and visual improvement, the bipolar exponential activation
function for symmetric feature distributions and best practices for the underlying
classification network. In addition, we propose regularization terms for voxel-
based 3D Neural Style Transfer optimization and demonstrate their capability to
significantly reduce noise and undesired artefacts. We apply our 3D Neural Style
Transfer pipeline on a set of style targets. The style transfer results are evaluated
using 3D shape descriptors which confirm the subjective visual improvements.

Keywords: Deep neural networks · Neural Style Transfer · Voxel


representations · Computational creativity · Regularization

1 Introduction

The impressive development of powerful algorithms in the field of deep learning brought
novel and exciting perspectives to computational creativity. In our research, we extend
Neural Style Transfer [1] to 3D designs for an application in the design and engineering
domain. We envision a cooperative human-machine design system which is capable to
generate novel 3D shapes, with desired style features, e.g. for the automotive domain.
Here, a highly automated process would provide an overview over a portfolio of stylized
cars from different classes (sedan, SUV) to the designer for a fast assessment of design
options and an accelerated decision-making-process [2].


Pixel-based 2D Neural Style Transfer has been proposed by Gatys et al. [1] for trans-
ferring the style of one image to the content of another image. In our former research,
we followed the general idea of style and content separability in the 2D domain and
extended it to 3D using Eigenspace projections for triangulated surface meshes of cars
[2], which is however limited to homogeneous meshes. To overcome this limitation,
we focused on the development of deep learning-based Neural Style Transfer in com-
bination with three-dimensional voxel data [5–7] to enable designers and engineers to
generate and assess novel solutions rapidly, e.g. in the context of automotive design
development. Similar to the 2D images, the artificially created 3D shapes contain prop-
erties which stimulate human inspiration and trigger previously overseen solution paths.
In contrast to other works (Sect. 2), our research comprises a solely 3D based Neural
Style Transfer approach. As challenges, fundamental methods and principles in a sub-
optimal environment regarding geometric model resolution and data sizes caused by
computational constraints (memory and processing units) have to be tackled. For input
model resolution, we increased the voxel resolution used by our 3D Neural Style Transfer
pipeline from initially 64³ [5, 6] to a current maximum of 256³ [7] voxels (Fig. 1) which
significantly enhances the capabilities for algorithm development and visual judgement.
The main contributions in the present paper are threefold: First, we build upon our
architecture [7] and apply for the first time the complete solely 3D Neural Style Transfer
on voxel-based geometries for generating stylized designs. Second, we improve our 3D
voxel-based Neural Style Transfer pipeline with regularization terms which effectively
increase the visual quality by preventing undesired artefacts caused by the combination
of Neural Style Transfer and the voxel representation. Third, we present the results of
our proposed framework on two sets of content and style targets for visual inspection and
numerical evaluation. According to our evaluation procedure [7], we subjectively discuss
the generated shapes first. Then, we apply our previously developed complementary
3D style transfer success metric and discuss conformance as well as derivation to the
subjective human impression.

2 State of the Art

2D Neural Style Transfer [1] transfers artistic style features of a painting to another
image, for example a photograph [1, 3]. The algorithm is based on a deep neural network
and exploits its activation map responses as source for a loss function either encoding
content or style features. Based on the two-part weighted style transfer loss
 
L_total = α (C(p) − C(x))² + β (S(a) − S(x))²    (1)

for content C and style S, an optimization iteratively adapts each pixel of a to-be-
generated result image x by applying automatic derivation and backpropagation with the
style source image a and the content source image p. For further details we refer to Gatys
et al. original work [1]. Research towards 3D Neural Style Transfer comprises meth-
ods for general 3D style transfer, which is not neural network based, or 2D/3D hybrid
approaches. In the case of a solely 3D-based Style Transfer we proposed an algorithm
based on the Eigenspace decomposition of triangulated meshes representing car shapes

[2], which is limited to a homogeneous triangulated mesh dataset. Ma et al. [11] proposed
a method for shape-analogy driven style transfer inspired by cognitive science. Their
method is solely applied on repetitive and separable structures like buildings. However,
both methods are bound to specific datasets and cannot be applied to arbitrary shapes.
A different research focus relies on the combination of 2D Neural Style Transfer with
3D data by either using post-processing or connecting the 2D and 3D representations
with differentiable operators. The first variant is detailed in Ren et al. [12] who pro-
posed an approach where stylized images are stacked to create a voxel-based building
structure. The second variant is chosen by e.g. Kato et al. [13] who stylized the sur-
face of 3D meshes. Sýkora et al. [14] suggested a technique to transfer style onto 3D
animations with proper lighting and good temporal stability while Kim et al. [15] pro-
posed a method to transfer styles of images on fluid animations. All three methods use
a differentiable rendering operator which transforms the 3D shapes into the 2D domain
while the gradient is backpropagated to the 3D mesh. However, these methods are lim-
ited to operate on the non-occluded surface of three-dimensional shapes. In terms of
3D voxel-based Neural Style Transfer, we proposed the first solely three-dimensional
mid-range-resolution approach which is applicable on arbitrary voxel shapes [5]. Like
Qi et al. [16] and Kim et al. [15] previously described, we also faced a lack of deep
3D classification networks that were pre-trained on rich and dense datasets. Therefore,
we proposed voxel-based classification networks and applied them to successfully styl-
ize the first proof-of-concept voxel shape in 64³ voxels [5]. Furthermore, we proved
that Neural Style Transfer can be applied on data which incorporates fundamentally
different elements of art [6, 17]. By adapting the style loss to our standardized gram
matrix approach, we achieved reasonable stylized shape primitives conforming subjec-
tive expectations which further indicated a general applicability of Neural Style Transfer
in combination with 3D voxel data. In following style reconstruction optimizations, we
highlighted the crucial influence of activation functions on the success of Neural Style
Transfer [7]. We also solved the issue of asymmetrical artistic feature distributions and
proposed the first complementary approach to numerically quantify the transfer of style
features.

Fig. 1. Left: Triangular mesh. Middle: 64³ voxel model. Right: 256³ voxel model. Car model
taken from ModelNet40 [8]

3 3D Voxel-Based Neural Style Transfer


In this section, we provide the details of our 3D stylization pipeline based on Neural
Style Transfer and the voxel-representation1 . To the best of our knowledge, this is the
1 http://www.github.com/HRI-EU (code accessible following positive reviews).

first complete realization of Neural Style Transfer for three-dimensional voxel-based


shapes. Our pipeline is based on the original implementation by Gatys et al. [1] with
modifications and enhancements required by the different data representation and prop-
erties. We designed our high-resolution classification network (Table 1 and Table 2)
used as the activation map extractor inspired by the architecture of VoxNet [18]. Due to
the increased voxel resolution of 256³, we use a higher number of convolutional layers
(Conv) and filters per layer. Furthermore, we replaced down-sampling by striding with
max-pooling [6] and switched all non-linear activation functions in the convolutional
layers to bipolar exponential linear units (BELU) [7, 19].
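A minimal sketch of such a bipolar activation is given below (an illustrative reading of [19], not the exact implementation: the ELU is applied directly to one half of the channels and point-mirrored on the other half, which shifts the mean activation towards zero).

```python
# Minimal sketch of a bipolar ELU (BELU) activation (an illustrative reading
# of [19]: half of the channels use ELU directly, the other half use the
# point-mirrored ELU, shifting the mean activation towards zero).
import tensorflow as tf

def belu(x):
    """Bipolar ELU applied along the last (channel) axis."""
    half = x.shape[-1] // 2
    x_pos, x_neg = x[..., :half], x[..., half:]
    return tf.concat([tf.nn.elu(x_pos), -tf.nn.elu(-x_neg)], axis=-1)
```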

Table 1. 3D Voxel BELU classification network configuration

Layer | Configuration
Input | (256, 256, 256, 1) voxel model
Conv_0 | (254, 254, 254, 12 * 2) belu
Conv_1 | (125, 125, 125, 32 * 2) belu
Conv_2 | (60, 60, 60, 64 * 2) belu
Conv_3 | (28, 28, 28, 64 * 2) belu
Conv_4 | (12, 12, 12, 96 * 2) belu
Conv_5 | (4, 4, 4, 128 * 2) belu
FC | (128) elu
FC | (64) elu
FCa (Out) | (40) softmax, class labels
a Fully connected layer.

The network is trained with a batch size of 4. The voxel dataset is derived from the
aligned ModelNet40 [8], which consists of roughly 12k objects like cars and furniture in forty
categories. We use binvox2 to generate the 256³ resolution voxel models.
The Neural Style Transfer style loss is adapted to our standardized gram matrix
approach [6]. This serves two purposes: First, we achieve an automatic weighting of
each style loss contribution by the selected convolutional layers. In case of 3D voxel-
based style transfer, the per convolutional layer style loss contributions S tend to diverge
more in comparison to the classic 2D case. Hence, our standardized approach provides
an advantage in terms of automatic inherent weighting. Second, we showed the positive
effect of the standardized gram matrix approach regarding the conformity of shape
distances derived from the style loss in comparison to subjective distance evaluations [6].
Typically, we use the convolutional layers 0–4 (Table 1) as activation map sources. We
achieve feasible results with style loss weight values between [1, 100] with our pipeline.
The content loss is based on Gatys’ original loss, adapted to cope with a third dimension.
We typically use layer 4 of our BELU Network as activation map source C for the content
loss with weight values between [1, 50]. In general, our pipeline is applicable on arbitrary
voxel shapes with a resolution of 256³, e.g. taken from our voxelized ModelNet40 dataset
(Fig. 2). In addition to technical objects (cars), we chose the Stanford bunny as a source
for a more organic voxel shape. At the current development stage of the voxel-based
Neural Style Transfer pipeline, we rely on clear style target voxel shapes (Fig. 3) to be
able to unambiguously judge the success of style transfer.
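One plausible reading of the standardized gram matrix style loss for 3D activation maps is sketched below (an assumption for illustration, not the exact formulation of [6]): each layer's Gram matrix is standardized before the squared difference is taken, which automatically balances the per-layer contributions.

```python
# Minimal sketch of a standardized-gram-matrix style loss for 3D activation
# maps (an illustrative reading of [6], not the authors' exact formulation).
import tensorflow as tf

def gram_matrix_3d(features):
    """features: tensor of shape (D, H, W, C) from one convolutional layer."""
    channels = tf.shape(features)[-1]
    flat = tf.reshape(features, (-1, channels))        # (D*H*W, C)
    gram = tf.matmul(flat, flat, transpose_a=True)      # (C, C)
    return gram / tf.cast(tf.shape(flat)[0], tf.float32)

def standardized_style_loss(style_features, generated_features):
    loss = 0.0
    for fs, fg in zip(style_features, generated_features):
        gs, gg = gram_matrix_3d(fs), gram_matrix_3d(fg)
        gs = (gs - tf.reduce_mean(gs)) / (tf.math.reduce_std(gs) + 1e-8)
        gg = (gg - tf.reduce_mean(gg)) / (tf.math.reduce_std(gg) + 1e-8)
        loss += tf.reduce_mean(tf.square(gs - gg))
    return loss
```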

2 http://www.patrickmin.com/binvox.

In contrast to classic 2D Neural Style Transfer research, we do not solely rely on a


subjective evaluation of the stylized shapes. Based on voxel shape descriptors, similar
to descriptors used for 3D mesh representations [20], we evaluate our style transfer
results quantitatively. We apply a neighborhood bin grid to every surface voxel and
obtain the normalized histograms of the filling rates of the descriptor [7]. We apply
the distance measure earth-mover-distance [21] to generate a distance matrix of a set
of shapes for an evaluation of our subjective impressions. Our evaluations for style
reconstruction demonstrated that the quantification of the descriptor aligned well with
the visual impressions [9].
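A minimal sketch of the histogram comparison is given below (assuming 1D normalized histograms over common bins; the neighborhood-descriptor extraction itself follows [7] and is only represented by precomputed histograms here):

```python
# Minimal sketch of comparing two shapes via normalized descriptor histograms
# and the earth-mover distance (assumption: 1D histograms over common bins).
import numpy as np
from scipy.stats import wasserstein_distance

def emd_between_histograms(hist_a, hist_b, bin_centers):
    """hist_a, hist_b: normalized filling-rate histograms over common bins."""
    return wasserstein_distance(bin_centers, bin_centers,
                                u_weights=hist_a, v_weights=hist_b)

# Example with dummy 16-bin histograms.
bins = np.linspace(0.0, 1.0, 16)
h1 = np.random.dirichlet(np.ones(16))
h2 = np.random.dirichlet(np.ones(16))
print(emd_between_histograms(h1, h2, bins))
```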

Fig. 2. Voxelized content targets: “bunny” (http://graphics.stanford.edu/data/3Dscanrep/#
bunny), and “car” voxelized from ModelNet40 [8].

Table 2. 3D voxel classification network performances

Classifier ModelNet40 accuracy Voxel data


VoxNet [18] 83% 32³
ST_Enc_Pool_big [6] 86% 64³, aligned
Net_belu [7] 88% 256³, aligned

4 Regularization

Based on the pipeline described in the previous section, the content targets (Fig. 2) and
the cube style target (Fig. 3), we performed plain 3D Neural Style Transfer. We chose a
reasonable configuration for layer selection, weights and optimization parameters based
on our prior experience and manual tuning. Thereby, we noticed four undesired visual
artefacts on and around the generated voxel shapes (Fig. 4):

1. Features spread across the complete available voxel space instead of being located
roughly on the content shape’s surface.
2. Mass artefacts occurred dominantly at the border faces and edges probably caused
by the filter padding of the convolutional layers of the underlying neural network.
3. Structures grew out of the original shape, changing the global shape characteristics.
4. Noise appeared on the shape and in the empty area.

From a computational standpoint, the artefacts contribute to the minimization of
the style loss value, as they represent cubic features such as flat surfaces. Hence, these
structures cannot be avoided by varying the parameters available in the framework. As
additional countermeasures, we propose a distance regularization, a border regularization
and a symmetry regularization, described in the following sections. The necessity for
additional regularization terms stems from the different nature of 2D pixel and 3D
voxel data: in the current state of development, the latter has to deal with a lot of
“empty” space, which is rarely the case for pixel images, where usually the whole canvas
is filled with content.

Fig. 3. Voxel style targets: “starbox”, “cube” and “sphere”. The objects are depicted in the correct
scale to each other and to Fig. 2. The car style target uses slightly smaller style target versions due
to its smaller mass (not shown).

4.1 Distance Regularization

To avoid material artefacts outside the vicinity of the content voxel shape p, we propose
a three-dimensional “distance matrix”-based regularization term (Algorithm 1). The
calculation of the distance of each voxel to the surface of the voxel shape is based on a
KDTree with a Euclidean distance metric for efficient performance.

Figure 4 depicts a single slice of the distance matrix. The neutral area around the
content shape p is encoded in dark blue where the creation or removal of mass is not
punished. The creation of material outside the neutral area generates a punishment term
proportional to the distance to the surface of the content shape, hence the optimization
is still able to form additional geometry if the style loss benefits exceed the punishment.

We typically chose padding values between [8, 20] voxels for the distance matrix generation.
The distance regularization term (Algorithm 2) is added to L_total. A distance
regularization weight between [100, 10000] has proven to generate reasonable results in
combination with our proposed pipeline.
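Since Algorithms 1 and 2 are not reproduced here, the underlying idea can be sketched as follows (the exact penalty form is an assumption; in the actual pipeline the term would be added to L_total as a differentiable loss):

```python
# Sketch of the distance regularization idea (Algorithms 1 and 2 are not
# reproduced in the text; the exact penalty form used here is an assumption).
import numpy as np
from scipy.spatial import cKDTree

def distance_matrix(content_voxels, padding=12):
    """Distance of every voxel position to the occupied content voxels,
    minus a neutral padding zone (clipped at zero)."""
    occupied = np.argwhere(content_voxels > 0)
    tree = cKDTree(occupied)
    grid = np.indices(content_voxels.shape).reshape(3, -1).T
    dist, _ = tree.query(grid)
    dist = dist.reshape(content_voxels.shape)
    return np.clip(dist - padding, 0.0, None)

def distance_regularization(x, dist_matrix, weight=100.0):
    """Penalize material created outside the neutral zone, proportionally
    to its distance to the content shape."""
    return weight * np.sum(np.abs(x) * dist_matrix)
```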

4.2 Border Regularization

Occasionally, we observed artefacts at the border regions of the voxel space x even
though the distance regularization was active. This occurred in regions where the content
shape p extends close to the border area. In these areas, the voxel-based Neural Style
Transfer optimization tends to grow material from the shape towards the border which
then spreads along the border (Fig. 4 top of the bunny’s ears). Hence, we additionally
apply a border regularization (Algorithm 3) to prevent any artefacts at the border region
with a regularization weight of 10.

4.3 Symmetry Regularization

In case of content targets p with a strong symmetry axis like the car shape (Fig. 2),
we activate an additional symmetry loss. The enforced symmetry increases the positive
subjective evaluation of the generated style transfer results due to the strong expectations
of the viewer towards a proper car shape. This regularization has aesthetic reasons only,
in contrast to the distance and border regularization, which prevent methodology-based
artefacts. The loss is based on the mean squared error between the left and the mirrored right
voxel data of the to-be-generated shape x, as sketched below.
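The sketch assumes the symmetry plane lies in the middle of the voxel grid along the first axis:

```python
# Minimal sketch of the symmetry regularization (assumption: the symmetry
# plane lies in the middle of the voxel grid along the first axis).
import tensorflow as tf

def symmetry_loss(x, weight=1e7):
    """MSE between the left half and the mirrored right half of the
    to-be-generated voxel shape x."""
    half = x.shape[0] // 2
    left = x[:half]
    right = tf.reverse(x, axis=[0])[:half]
    return weight * tf.reduce_mean(tf.square(left - right))
```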

5 Experiments
In this section, we provide details on the performance of our proposed voxel-based
three-dimensional Neural Style Transfer pipeline with two target configurations. First,
we present the setup used to generate the transfer results. We chose the Adam optimizer
for our experiments with a learning rate of 0.01. We use layer 4 as basis for the content
loss and the layers 0–4 as sources for the style loss. The optimization has been carried
out with 500 steps per epoch and 40 epochs per shape to generate a single shape in
approx. 2–5 h. The to-be-optimized voxel shape is initialized with a copy of the content
target shape.

Fig. 4. Left: Exemplary distance matrix slice for the automotive content target. The color encodes
the distance of each voxel position to the neutral zone around the content shape. The margin was
set to 12 voxels. Right: Example of an unregularized result with artefacts (left) compared to the
distance- and border-regularized style transfer result (right). The bunny was stylized with the cube
style target. More regularized results are shown in Fig. 5.

The configurations specific to the chosen content target are provided in Table 3. The
differences arise as a consequence of the different sizes of the models and the organic
shape of the bunny versus the technical shape of the car. All used shapes are depicted
in Fig. 2 and Fig. 3 respectively. In our experiments, the style targets were scaled to
match the mass of the corresponding content target. Figure 3 shows the style targets in
the correct scale to the bunny content target. The style targets for the car content target
were slightly smaller and are not depicted.

Table 3. Content shape specific pipeline configuration

Configuration value Bunny Car


Content loss weight 4 40
Style loss weight 10 60
Distance reg. padding 12 8
Distance reg. weight 100 1e5
Border reg. weight 10 10
Symmetry reg. weight 0 1e7

5.1 3D Voxel-Based Neural Style Transfer Evaluation

In this section, we first discuss the voxel-based Neural Style Transfer results for each
content target (Fig. 2) individually and then elaborate on differences between them. In
case of the bunny shape, the starbox and cube style target (Fig. 3) generate as expected
similar features at first glance because of the resemblance of the two targets (Fig. 5). But,
especially in the rear view, we observe a clear difference. The starbox target generated
clean concave 90-degree structures. The cube result lacks these mostly. Furthermore,
the cube result shows significantly less sharp convex 90-degree edges as well as close
to no sharp corners while the starbox result contains them in several areas. In terms of
smooth flat surfaces, the cube result is clearly superior. The sphere result has lost details
compared to the original bunny voxel shape in a sense that it shows a more rounded
shape, e.g. the snout and the eye. This is in conformity with prior expectations regarding
this style target. The rounded bunny shows the most surface noise but in general has
a similar surface structure as the style target sphere. All three results contain small,
disconnected artefacts in their vicinity.

Fig. 5. Style Transfer results: Bunny + starbox (left column), bunny + cube (middle column),
bunny + sphere (right column)

The technical car shape shows the difficulty of stylizing shapes with strong semantic
meaning (Fig. 6). In the case of the transferred artistic style features, we observe the same
behavior as with the bunny content target. Compared to the bunny, the car shape is mostly
convex and smooth except for the wheel and rear mirror areas. Especially in these areas, the
optimization exploits the geometry to generate style features such as big flat surfaces in the case
of the cube target. In contrast, the car combined with the sphere style target suffers less
from these effects and retains recognizable wheels. This is explainable by the
smaller shape distance of the car model to the sphere compared to the cubic style targets
as confirmed in Sect. 5.2. The optimization starts already closer to the style target and
hence has to change less geometry to minimize its loss.

Fig. 6. Style Transfer results: Car + starbox (left column), car + cube (middle column), car +
sphere (right column)

The bunny, with its organic shape, is in general a more gratifying content target
than the car. The prior expectations towards a blocky styled bunny are less
constrained than for the stylization of an automotive shape, and the distance regularization
can be more relaxed. Furthermore, “growing” structures are perceived as less bothersome
than structures added to a car body that is expected to be streamlined. The
comparison between the starbox and cube style target results emphasizes the need for
proper style targets. Both targets would be described as cubic or blocky by a human but
produce recognizably different results. This implies that setting up a stylization optimization
is more difficult than in 2D Neural Style Transfer, but on the other hand it
provides the opportunity to tune the style transfer optimization with custom-tailored
style targets. For example, cubic style targets with features at different scales and occurrence
relations lead to results with either high-frequency features distributed over the content
shape or a global-scale geometry alteration of the content target.

5.2 Quantitative 3D Neural Style Transfer Evaluation


Following the visual inspections, we applied our “voxel shape descriptor”-based distance
metrics [7] on the stylized results (Fig. 5). The distance matrix based on the histograms
is scaled with its maximum value to the range [0, 1] (Fig. 7). A 1 (dark blue) indicates
a pair of shapes which are the most different while a 0 (light blue) indicates a pair of
shapes with identical surface features. It is important to note that the difference value of
0 does not necessarily imply two identical geometries.
The distance matrix with the earth-mover-distance [21] is able to correctly asso-
ciate the similar results from the cube and starbox to their style targets even though the
numeric difference is small; this also conforms to the visual evaluation. It is worth
noting that results stylized with the same style target show small distance values, e.g.
starbox_bunny/starbox_car, even though their global shape differs significantly. This


indicates that the metric performs as intended.
In the following we discuss the result bunny_starbox in detail (Fig. 7, marked in
red). First, bunny_starbox is closer to car (0.16) compared to bunny (0.32). This is
reasonable because the descriptor measures the distance of surface features and not the
global geometry. If we evaluate the distances of bunny and car to starbox, we observe
a smaller distance of car (0.6) to starbox than bunny to starbox (0.85), hence the small
distance between bunny_starbox to car is reasonable. Considering the car shape is also
close to the bunny shape due to their share of roundish features, it is plausible the
distance bunny/car (0.31) decreased as cubic features were added to bunny resulting
in bunny_starbox (0.16 to car). Second, bunny_starbox is closer to starbox (0.65) and
cube (0.68) as the bunny alone is to starbox (0.89) and cube (0.92) which confirms a
successful stylization. Third, the distance to sphere (0.45) is smaller than the distances
to cube and starbox which seems counterintuitive at first glance. However, the distance
values always have to be interpreted in relation. Bunny’s distance to sphere (0.14) is
lower than the distance of bunny_starbox to sphere (0.45) hence the value is plausible.
Fourth, bunny_starbox and car_starbox show a very small distance (0.03) that implies
both models share the same visual shape feature statistics on their surface which is the
primary objective of our voxel-based Neural Style Transfer method. In summary, we
confirm that the quantitative evaluation described in this section correlates well with
the subjective (visual) evaluation in the previous section. Hence, we conclude that our
three-dimensional voxel-based Neural Style Transfer successfully transfers artistic visual
features from one shape to another.

Fig. 7. Distance matrices for the shapes based on the neighborhood descriptor. The best distance
matrix is given by the earth-mover-distance in combination with the neighborhood descriptor.
Low values (light color) indicate more similar surface style features which is independent of the
global geometry. (Color figure online)

6 Conclusion
In our research, we envision a cooperative design tool which allows users to utilize com-
putational creativity in an interactive fashion. As part of this system, three-dimensional
voxel-based Neural Style Transfer promises to be a valuable extension to the designer’s
and engineer’s toolbox for shape and structure generation and alteration. The method is
capable of producing novel and inspirational design solutions in an automatic fashion.
In this work, we propose a complete voxel-based stylization pipeline and realized a
fully automated process of voxel-based 3D Neural Style Transfer for the first time on
two sets of content and style targets. We further improved the style transfer pipeline by
introducing regularization terms specialized for the voxel representation and the differ-
ences compared to the pixel format used in classic 2D Neural Style Transfer. Distance-
and border-regularization avoid artefacts induced by the methodology, especially at the
empty and edge regions. Symmetry regularization follows aesthetic reasons and is con-
tent target specific. In our experiments, we showed the successful transfer of artistic style
features and discussed remaining challenges mostly caused by the different properties of
art between 2D pixels and 3D voxels. It became clear that a proper selection of content
and style shapes is crucial for generating result shapes which conform to human
expectations. Here, structures with semantic meaning are especially challenging. Our
“voxel shape descriptor”-based distance metric, previously applied on style reconstruc-
tion, proved to be compatible with style transfer and confirmed the subjective evaluation
of the generated shapes.
In summary, we conclude that our approach successfully realized voxel-based three-
dimensional Neural Style Transfer and provides a meaningful method for generating
stylized designs.

References
1. Gatys, L.A., Ecker, A.S., Bethge, M.: Image style transfer using convolutional neural net-
works. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition,
pp. 2414–2423 (2016)
2. Friedrich, T., Schmitt, S., Menzel, S.: Rapid creation of vehicle line-ups by eigenspace pro-
jections for style transfer. In: Proceedings of the Design Society: DESIGN Conference,
pp. 867–876 (2020)
3. Jing, Y., Yang, Y., Feng, Z., Ye, J., Song, M.: Neural style transfer: a review, CoRR, vol.
abs/1705.04058 (2017)
4. Holden, D., Habibie, I., Kusajima, I., Komura, T.: Fast neural style transfer for motion data.
IEEE Comput. Graphics Appl. 37(4), 42–49 (2017)
5. Friedrich, T., Aulig, N., Menzel, S.: On the potential and challenges of neural style transfer for
three-dimensional shape data. In: Rodrigues, H.C., et al. (eds.) EngOpt 2018, pp. 581–592.
Springer, Cham (2019). https://doi.org/10.1007/978-3-319-97773-7_52
6. Friedrich, T., Menzel, S.: Standardization of gram matrix for improved 3D neural style transfer.
In: Proceedings of IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1375–
1382 (2019)
7. Friedrich, T., Wollstadt, P., Menzel, S.: The effects of non-linear operators in voxel-based
deep neural networks for 3D style reconstruction. In: Proceedings of 2020 IEEE Symposium
Series on Computational Intelligence (SSCI), pp. 1460–1468. IEEE (2020)

8. Wu, Z., et al.: 3D ShapeNets: a deep representation for volumetric shapes, In: Proceedings of
the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol.
07–12-June-2015, pp. 1912–1920 (2015)
9. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recog-
nition. In: Proceedings of 3rd International Conference on Learning Representations (ICLR)
(2015)
10. Huang, E., Gupta, S.: Style is a Distribution of Features. arXiv:2007.13010 (2020)
11. Ma, C., Huang, H., Sheffer, A., Kalogerakis, E., Wang, R.: Analogy-driven 3D style transfer.
Computer Graphics Forum, vol. 33, no. 2. Wiley Online Library, pp. 175–184 (2014)
12. Ren, Y., Zheng, H.: The Spire of AI - Voxel-based 3D neural style transfer. In: Proceedings
of the 25th International Conference on Computer-Aided Architectural Design Research in
Asia (CAADRIA) (2020)
13. Kato, H., Ushiku, Y., Harada, T.: Neural 3D mesh renderer. In: Proceedings of the IEEE
Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3907–3916
(2018)
14. Sýkora, D., Jamriška, O., Lu, J., Shechtman, E.: StyleBlit: fast example-based stylization with
local guidance. Comput. Graph. Forum 38(2), 83–91 (2018)
15. Kim, B., Azevedo, V.C., Gross, M., Solenthaler, B.: Transport-based neural style transfer for
smoke simulations. ACM Trans. Graph. 38(6) (2019)
16. Qi, C.R., Su, H., Niebner, M., Dai, A., Yan, M., Guibas, L.J.: Volumetric and multi-view CNNs
for object classification on 3D data, In: Proceedings of the IEEE Computer Society Conference
on Computer Vision and Pattern Recognition, vol. 2016-December, pp. 5648–5656 (2016)
17. Understanding Formal Analysis. http://www.getty.edu/education/teachers/building_lessons/
formal_analysis.htm. Accessed on 01 July 2020
18. Maturana, D., Scherer, S.: VoxNet: A 3D Convolutional Neural Network for real-time object
recognition, In: Proceedings of IEEE International Conference on Intelligent Robots and
Systems, vol. 2015-Dec, pp. 922–928 (2015)
19. Eidnes, L., Nøkland, A.: Shifting Mean Activation Towards Zero with Bipolar Activation
Functions. arXiv:1709.04054 (2017)
20. Lara López, G., Peña Pérez Negrón, A., De Antonio Jiménez, A., Ramírez Rodríguez, J.,
Imbert Paredes, R.: Comparative analysis of shape descriptors for 3D objects. Multimedia
Tools Appl. 76(5), 6993–7040 (2017)
21. Rubner, Y., Tomasi, C., Guibas, L. J.: A metric for distributions with applications to image
databases. In: Proceedings of International Conference on Computer Vision (ICCV), pp. 59–
66 (1998)
Rendering Scenes for Simulating Adverse
Weather Conditions

Prithwish Sen(B) , Anindita Das , and Nilkanta Sahu

Indian Institute of Information Technology Guwahati, Guwahati, India


[email protected]

Abstract. Most object detection schemes do not perform well
when the input image is captured in adverse weather. The reason is that
the available datasets for training/testing of those schemes do not contain
many images in such weather conditions. Thus, in this work, a novel approach
to render foggy and rainy datasets is proposed. The rain is generated
by estimating the area of the scene image, computing the streak
volume, and finally overlapping the streaks with the scene image. As visibility
reduces with depth due to fog, rendering of fog must take the depth map
into consideration. In the proposed scheme, the depth map is generated
from a single image. Then, the fog coefficient is generated by modifying 3D
Perlin noise with respect to the depth map. Further, the corresponding
fog density is blended with the scene image at a particular region,
based on precomputed intensities at that region. A demo dataset is available
at https://github.com/senprithwish1994/DatasetAdverse.

Keywords: Depth map · Perlin noise · Photo-realistic fog rendering · Rain rendering

1 Introduction

Physically creating data for outdoor scenes in adverse weather conditions is a costly
and cumbersome task. Synthetic data creation has been tried by many for different
computer vision applications. Realistic synthesis of foggy/rainy scenes has become
an important aspect in the fields of game development, virtual-reality simulation of
autonomous vehicles, and object detection in adverse weather conditions. A balanced
dataset is crucial for creating such applications, but most of the available datasets
either cover only normal scene conditions or contain very few images in adverse
conditions.
Degradation of visibility can be observed as the fog content in a scene image
increases [16,20]. This creates problems for computer vision algorithms interpreting
the scene. With the aim of creating dehazed images, researchers have continued to
develop new techniques [12,17,19] to address this problem. The need for foggy and
rainy datasets has become evident with the development of deep learning based
techniques. In this paper we propose a scheme for the generation of rainy and foggy
scene data. A simple but efficient algorithm for rain rendering is introduced. For
fog generation we automate the generation of the transmission map with deep learning
instead of using manual transmission map estimation. To make fog rendering more
realistic, depth information is used. Finding depth from stereo images is easy, but
as most of the existing datasets for object detection and other applications are not
stereo, a single-image depth map estimation algorithm is used.

2 Related Work

A few notable works have been done in the area of rain rendering. In 2006, Garg and
Nayar [8] built a raindrop oscillation model for photorealistic rendering of rain
streaks and then blended the streaks into a single image. In [21], the authors
introduced a number of new effects, such as image-space rainfall rendering and
GPU-based water simulation, to model a system for interactive rendering of rain
effects in complex real-time environments. Wang [6] introduced a real-time solution
to render realistic rainy scenes based on physical characteristics, modeling the
shapes, movements, and intensity of raindrops from their known physical
characteristics.
For rendering different situations, they presented a multi-particle scattering
method that exploits the coherence of the particle distribution along every view
ray. In 2013, Creus et al. [7] introduced a rain rendering method with complex
illumination effects, along with fog, halos, and light glows as hints of the
participating media. In 2018, [4] rendered raindrops by means of a continuous
nearest-neighbor search; the resulting images can be used as training data in
machine learning approaches instead of real captured pictures. To enhance the
realism of rain, Halder et al. [11] present a physics-based rain rendering scheme
that relies on a particle distribution, an estimation of the scene light intensity,
and a precise rain photometric model. In real-world scenarios, it is often difficult
to collect or render rainy images that represent every rainy condition that could
possibly occur. Therefore, Zhai et al. [26] proposed blending two separate lines of
work, namely rainy image rendering and adversarial attacks: a factor-aware rain
generation produces rain streaks in accordance with the camera exposure process and
learns a rain factor through which an adversarial attack is introduced; the
adversarial rain attack is then applied to image classification and object
detection. In 2021, [22] proposed a new rain rendering system in which three
contrasting methods, i.e., physics-based, data-driven, and a mixture of both, were
used to create synthetic data.
On the other hand, to create foggy effects, many researchers have addressed
different foggy situations, such as homogeneous and heterogeneous fog. Guo et al.
[10] proposed a new synthesis method based on transmission map estimation. They
first add noise with the density distribution texture of heterogeneous fog to the
image; the transmission map is then estimated using a Markov random field (MRF)
technique and a bilateral filter. In 2016, He et al. [13] presented a residual
learning framework for training networks that are substantially deep, reformulating
the layers as learning residual functions with reference to the layer inputs instead
of learning unreferenced functions. Another work on rendering foggy scene images is
[27], where the authors proposed a framework called Foggy and Hazy Images Simulator
(FoHIS). They also proposed an Authenticity Evaluator for Synthetic foggy/hazy
Images (AuthESI for short) to evaluate its performance compared to other works. To
address foggy scenes at various levels of depth, [24] proposed in 2018 a simple
method to generate dense relative depth annotations, together with a dataset of
images and their dense relative depth maps. They used a ranking loss to handle
imbalanced ordinal relations and to focus on a set of hard pairs. Another
contribution towards rendering adverse weather conditions was made in 2019 by von
Bernuth et al. [5], who augmented known labeled datasets by blending them with
adverse weather effects such as snow and fog; the rendered effects resemble real
images. Other researchers, such as Ranftl et al. [18], developed tools that mix
multiple datasets during training. They proposed a method that is invariant to
changes in depth range and scale, used multi-objective learning to combine different
datasets, and pretrained encoders on auxiliary tasks. At the beginning of 2020, [9]
proposed the concept of image-to-image translation to render foggy effects. They
used analogical image translation to create foggy images, learning from synthetic
clear images, synthetic foggy images, and real clear images, and produced output
without seeing real foggy images during training. Another group of researchers,
Watson et al. [23], estimated disparity maps from single images and used a scheme
based on these (possibly flawed) disparity maps to generate stereo training pairs;
training with this system converts RGB images into stereo training data, largely
removing the need to collect real depth or synthetic data. Kerim et al. [14]
generated a synthetic dataset for pedestrian tracking under adverse weather
conditions and evaluated the performance against existing works.

3 Proposed Scheme

In this section, a statistical and deep learning approach to render scenes under
adverse weather conditions is proposed.

3.1 Rendering Rainy Scene

While synthesizing photo-realistic rain streaks, one must consider different light
intensities, the angle of the shower, and how intensity varies with different types
of rain (e.g., heavy, torrential). Raindrops undergo a quick change in their shape
and get distorted when they fall, a phenomenon called oscillation. These
oscillations produce complex patterns which appear bright due to reflection and
refraction of light. In this work, a new model is proposed for rain streak rendering
which reflects those complex interactions with the incident light and with the
environmental parameters that are oscillating in nature, as shown in Fig. 1. The
rain streaks are augmented via the generation of random straight lines with
pre-calculated coordinate points for each rain streak. The streaks are drawn with a
line-drawing function as in Eq. 1.

y = mx + c (1)

where, given two points (x1, y1) and (x2, y2), the slope m of the line is user
defined and c is the y-intercept as in Eq. 2.

c = y1 − m ∗ x1 (2)

Rainy days are mostly blurry, and the intensity of light is low compared to sunny
days. Thus, in the proposed scheme, the whole image is first convolved with an
average filter to render the blurring effect. The average filter evaluates the
average of all the pixels inside the given filter area and replaces the central
pixel with this average value. This filter is used just to smooth the image and add
a visual blurring effect, but any other blur filter can be used. Brightness and
contrast of the blurred image are adjusted with the help of Eq. 3 as follows:

g(i, j) = α · f(i, j) + β (3)

where (i, j) indicates the pixel location. Changing the β value will add or subtract
a constant value to every pixel, and the α value will modify how the intensity
levels are spread. If α < 1, the color levels get compressed and the result is an
image of lower contrast. β and α are the brightness and contrast coefficients,
respectively. Further, an HSL color model is used to scale down the pixel intensity
values of the lightness channel, which makes the pipeline ready to render the rain
streaks onto this preprocessed image.
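
As an illustrative sketch (not the authors' exact implementation), the preprocessing
described above can be reproduced with OpenCV roughly as follows; the filter size,
α, β, and lightness-scaling values are assumptions chosen for illustration.

import cv2
import numpy as np

def preprocess_for_rain(img, ksize=5, alpha=0.8, beta=-20, lightness_scale=0.7):
    # Blur, darken, and reduce the lightness of a scene image before adding rain streaks.
    blurred = cv2.blur(img, (ksize, ksize))                    # average (box) filter
    # Brightness/contrast adjustment g(i, j) = alpha * f(i, j) + beta (Eq. 3).
    adjusted = cv2.convertScaleAbs(blurred, alpha=alpha, beta=beta)
    # Scale down the lightness channel (OpenCV stores the HSL model as HLS).
    hls = cv2.cvtColor(adjusted, cv2.COLOR_BGR2HLS).astype(np.float32)
    hls[..., 1] *= lightness_scale
    hls = np.clip(hls, 0, 255).astype(np.uint8)
    return cv2.cvtColor(hls, cv2.COLOR_HLS2BGR)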

Fig. 1. Rain rendering pipeline

Next, arbitrary values are assigned for the length of the streaks and the number of
drops in a given two-dimensional area of the image, according to the category of
rain (e.g., drizzle, heavy, or torrential). After obtaining these parameters, the
area of impact is found and the rain is generated using drawing functions. Drawing
functions work with matrices/images of arbitrary depth, and the streak shape
boundaries can be synthesized with anti-aliasing; anti-aliased lines are generally
drawn using a Gaussian filter. The streak-generating function takes as input
parameters the reference image, the starting-point coordinate, the ending-point
coordinate (determined by the slant angle and raindrop length), the rain color, and
the drop width. After rendering the rain streaks and blending the rain with the
original image, the blended result is converted into the RGB color space.
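
A minimal sketch of the streak-drawing and blending step, assuming OpenCV's
anti-aliased line drawing; the drop count, length, slant, color, width, and blending
weights below are illustrative assumptions, not values fixed by the paper.

import random
import cv2

def add_rain_streaks(img, num_drops=500, drop_length=20, slant=-10,
                     color=(200, 200, 200), width=1):
    # Draw anti-aliased rain streaks as straight lines (Eqs. 1 and 2) over the image.
    rain_layer = img.copy()
    h, w = img.shape[:2]
    for _ in range(num_drops):
        x1, y1 = random.randint(0, w - 1), random.randint(0, h - 1)
        x2, y2 = x1 + slant, y1 + drop_length   # end point from slant angle and drop length
        cv2.line(rain_layer, (x1, y1), (x2, y2), color, width, lineType=cv2.LINE_AA)
    # Blend the streak layer with the preprocessed scene image.
    return cv2.addWeighted(rain_layer, 0.7, img, 0.3, 0)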
3.2 Rendering Foggy Scenes


The framework (Fig. 2) is based on the atmospheric scattering model, which can
render foggy effects on a given scene image. To render heterogeneous foggy effects,
Perlin noise is generated and distributed over the scene to produce more
natural-looking results.

Fig. 2. Fog rendering pipeline

At first, the brightness of the image is adjusted to render a dull effect. Then the
transmission map is estimated with the help of the depth map.

Depth Map Estimation. The transmission map is generated from the depth map instead
of a segmented image, so that the fog density varies with depth. Depth map
estimation can be performed with or without stereo images. Stereo images can be
generated from monocular images using [23], and depth maps can then be estimated
from the disparity. The method we have used is based on [18], which is an improved
version of [24].
Monocular images are fed as input to the pretrained ResNet-based [24] multi-scale
architecture for single-image depth prediction. This method regresses per-pixel
relative depth using a residual network and generates the depth map of the given
input image. A later improvement was proposed in [18], where a new loss function was
introduced to reduce the dataset dependency. The generated output is normalized to
values in the range (0, 1]. The inverse of the output is then computed, which simply
means subtracting each pixel value from 1. The inverse depth map image is then
converted to grayscale.
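
A small sketch of the normalization and inversion steps described above, assuming a
hypothetical predict_depth function that wraps the monocular depth network of
[18]/[24]:

import numpy as np

def inverse_depth_gray(img, predict_depth):
    # Normalize the predicted relative depth, invert it, and return it as a grayscale map.
    depth = predict_depth(img).astype(np.float32)                       # relative depth, arbitrary scale
    depth = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)  # normalize towards (0, 1]
    inverse = 1.0 - depth                                               # subtract each pixel value from 1
    return (inverse * 255).astype(np.uint8)                             # grayscale inverse depth map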

Perlin Noise Generation and Distribution. Before distributing the fog, we generate
it with Perlin noise using the inverse depth map. Perlin noise creates randomness in
the generated texture while remaining smooth over the space; it is used for
rendering coherent noise over a space. Hence, for any two points in the coordinate
space, this coherent noise gives a score of the noise function that changes smoothly
as we move from one point to the other. Firstly, all the pixel points belonging to
the image are defined in the coordinate space, with whole numbers at the corners and
real numbers inside the grid. Once all the points are at their respective positions,
we define random gradient vectors for all corner points and then distance vectors
for the grid points, as shown in Fig. 3.

Fig. 3. Finding gradient vectors

Further, by taking the dot product between these vectors, we obtain the 2D Perlin
noise. For generating foggy effects of a smoother and more realistic nature, 3D
Perlin noise is used. This 3D Perlin noise [25] is computed from 2D Perlin noise
with different frequencies and noise functions (also known as octaves), as given by
Eq. 4:

perlinNoise3D(x) = Σ_{n=0}^{j−1} perlinNoise2D_n(x) / 2^n (4)

Here, j denotes the number of dimensions. In this scheme, the maximum number of
dimensions taken into account is 3. Finally, the generated noise is multiplied with
the depth coefficient τ to render a smooth distribution of the fog.
The grayscale inverse depth map is used to compute the density of distribution over
the various depths of the image (the density of noise increases as the pixel
intensity increases from 0 to 255). The Perlin noise is added to different regions
with different noise densities based upon the transmission map generated from the
depth map. The resultant image is the rendered photo-realistic foggy image. The
amount of fog in the scene image can also be controlled by a multiplicative
coefficient, termed the fog coefficient (Ω), as in Eq. 5, where the transmission map
matrix is the grayscale inverse depth map expressed in matrix form.

amountOfFog = Ω ∗ transmissionMapMatrix (5)
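
An illustrative sketch of the blending step implied by Eqs. 4 and 5; the
octave-based noise below is a simple stand-in for true 3D Perlin noise, and the fog
color and clipping are assumptions.

import cv2
import numpy as np

def render_fog(img, inverse_depth_gray, omega=1.0, octaves=3, fog_color=255):
    # Blend depth-weighted, Perlin-like noise into a scene image (amountOfFog = Ω * transmission map).
    h, w = inverse_depth_gray.shape
    rng = np.random.default_rng(0)
    noise = np.zeros((h, w), np.float32)
    for n in range(octaves):                       # octave sum in the spirit of Eq. 4
        small = (h // 2 ** (octaves - n) + 1, w // 2 ** (octaves - n) + 1)
        coarse = rng.random(small).astype(np.float32)
        noise += cv2.resize(coarse, (w, h), interpolation=cv2.INTER_CUBIC) / 2 ** n
    noise = (noise - noise.min()) / (noise.max() - noise.min() + 1e-8)
    transmission = inverse_depth_gray.astype(np.float32) / 255.0         # grayscale inverse depth map
    fog = np.clip(omega * transmission * noise, 0.0, 1.0)[..., None]     # fog coefficient Ω (Eq. 5)
    foggy = img.astype(np.float32) * (1.0 - fog) + fog_color * fog
    return foggy.astype(np.uint8)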

4 Experimental Results

The whole experiment is carried out on an NVIDIA 1050 Ti GPU. The datasets used here
are our own dataset and the IDD [1] dataset. All images are resized to 1200 × 800
pixels resolution. IDD contains 10,000 images with 34 classes, collected from 182
drive sequences on Indian roads.
The proposed adverse weather rendering is subdivided into two pipelines, namely rain
rendering and fog rendering. The rain rendering algorithm takes the amount of rain
and the slant angle as input and generates a rainy image accordingly. Similarly, the
fog rendering pipeline computes the amount of Perlin noise that has to be
distributed over the image. The distribution is based on the zonal intensities
obtained with the help of the grayscale inverse depth map: Perlin noise is added to
each zone of the original image such that the amount of noise increases with the
pixel values of the corresponding zone. Adding this Perlin noise to the input image
gives us the fog-rendered image.
Twenty users from different disciplines (arts, science, engineering, and management)
were invited to give their opinion on the results. The users were selected with a
1:1 male-to-female ratio so as to ensure reliability and visual accuracy. Based on
these subjective opinion scores, the Mean Opinion Score (MOS) is evaluated by
averaging the scores. The users were asked to rate on a scale of 1 to 5 (where 1
means poor output and 5 means the best output). The rating was done on 100 image
samples rendered with the different methods and our proposed method. The users were
given images in pairs, one from every other method and one from our result, but they
were not told which image came from which scheme. Subjective comparisons are shown
below for the different weather rendering schemes.

4.1 Rain Rendering


Rendered rainy scenes are shown in Fig. 4 with different angles of fall. This rain
generation is applied to our own, the IDD, and the KITTI [28] datasets under normal
weather conditions.

[Figure: pairs of input images (top) and the corresponding rendered rainy images (bottom)]
Fig. 4. Some results of this approach

Comparisons of the proposed method with Halder's [11], Tatarchuk's [21], and
Creus's [7] methods are given in Fig. 5. Results are shown with respect to different
slant angles.
[Figure: each row shows an input image, the result of Halder's [11] / Tatarchuk's [21] /
Creus's [7] method, and our results at slant angles 90 and −45/−65/−75]
Fig. 5. Comparison result of rain rendering

Fig. 6. User opinion plot for rain rendering

Based on the user inputs, the MOS is computed. MOS values of 3.8, 2.4, and 2.9 are
achieved for the other methods, compared with 3.95, 4.0, and 3.7 for the proposed
scheme. The graphical representation of the user ratings is plotted in Fig. 6, and
the corresponding MOS is given in Table 1.

Table 1. MOS for rain rendering

Halder’s [11] Tatarchuk’s [21] Creus’s [7]


MOS for different approaches 3.8 2.4 2.9
MOS for our result 3.95 4.0 3.7

4.2 Fog Rendering

Different foggy scenes are generated with respect to the different pixel intensities
of the segmented regions. Some results of the fog-simulated scenes are shown in
Fig. 7, where IDM is the inverse depth map, GIDM is the grayscale inverse depth map,
and results are given for fog coefficients (Ω) of 1 and 2. These results are
rendered with different values of the fog coefficient, which simply increases or
decreases the intensity of the fog via the mono-channel inverse depth map image
array. The coefficient is multiplied with each pixel value of the depth array, which
increases or decreases the intensity values; increasing the values of the
mono-channel inverse depth map intensifies the fog content in the scene image.

[Figure: each row shows the input image, its IDM, its GIDM, and fog-rendered results
with Ω = 1 and Ω = 2]
Fig. 7. Some results of this approach

Results are compared with some existing works, as tabulated below. With the
variation of Ω, different results are obtained for the given grayscale inverse depth
map, as can be seen in Fig. 8.
A rating from 1 to 5 defines the extent of photo-realism of the proposed method and
of Guo's [10], I-Haze [2], O-Haze [3], and SOTS Outdoor [15] methods, respectively.
Based on the user inputs, the MOS is computed. MOS values of 3.4, 3.7, 2.15, and 2.9
are achieved for the other methods, compared with 4.35, 3.9, 4.3, and 4.2 for the
proposed scheme. The graphical representation of the user ratings is plotted in
Fig. 9, and the corresponding MOS is given in Table 2.
[Figure: each row shows an original image, the result of Guo's [10] / I-Haze [2] /
O-Haze [3] / SOTS [15], and our result with Ω = 1.5]
Fig. 8. Comparison result of fog rendering

Fig. 9. User opinion plot for fog rendering



Table 2. MOS for fog rendering

Guo’s [10] I-Haze [2] O-Haze [3] SOTs [15]


MOS for different approaches 3.4 3.7 2.15 2.9
MOS for our result 4.35 3.9 4.3 4.2

5 Conclusion
In this paper we presented an adverse weather rendering system. We considered two
aspects of adverse weather, one being rain and the other fog. The proposed scheme
generates realistic rainy and foggy scenes. Experimental results showcase the
efficiency of the scheme, and comparative results show that it outperforms the
existing schemes.

References
1. https://idd.insaan.iiit.ac.in/
2. Ancuti, C.O., Ancuti, C., Timofte, R., Vleeschouwer, C.D.: I-haze: a dehazing
benchmark with real hazy and haze-free indoor images. arXiv:1804.05091v1 (2018)
3. Ancuti, C.O., Ancuti, C., Timofte, R., Vleeschouwer, C.D.: O-haze: a dehazing
benchmark with real hazy and haze-free outdoor images. In: IEEE Conference on
Computer Vision and Pattern Recognition, NTIRE Workshop. NTIRE CVPR 2018
(2018)
4. von Bernuth, A., Volk, G., Bringmann, O.: Rendering physically correct raindrops
on windshields for robustness verification of camera-based object recognition. In:
2018 IEEE Intelligent Vehicles Symposium (IV), pp. 922–927. IEEE (2018)
5. von Bernuth, A., Volk, G., Bringmann, O.: Simulating photo-realistic snow and
fog on existing images for enhanced CNN training and evaluation. In: 2019 IEEE
Intelligent Transportation Systems Conference (ITSC), pp. 41–46. IEEE (2019)
6. Changbo, W., Wang, Z., Zhang, X., Huang, L., Yang, Z., Peng, Q.: Real-time
modeling and rendering of raining scenes. Vis. Comput. 24(7–9), 605–616 (2008)
7. Creus, C., Patow, G.A.: R4: realistic rain rendering in realtime. Comput. Graph.
37(1–2), 33–40 (2013)
8. Garg, K., Nayar, S.K.: Photorealistic rendering of rain streaks. ACM Trans. Graph.
(TOG) 25(3), 996–1002 (2006)
9. Gong, R., Dai, D., Chen, Y., Li, W., Van Gool, L.: Analogical image translation
for fog generation. arXiv preprint arXiv:2006.15618 (2020)
10. Guo, F., Tang, J., Xiao, X.: Foggy scene rendering based on transmission map
estimation. Int. J. Comput. Games Technol. 2014 (2014)
11. Halder, S.S., Lalonde, J.F., Charette, R.D.: Physics-based rendering for improv-
ing robustness to rain. In: Proceedings of the IEEE International Conference on
Computer Vision, pp. 10203–10212 (2019)
12. He, K., Sun, J., Tang, X.: Single image haze removal using dark channel prior.
IEEE Trans. Pattern Anal. Mach. Intell. 33(12), 2341–2353 (2010)
13. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 770–778 (2016)

14. Kerim, A., Celikcan, U., Erdem, E., Erdem, A.: Using synthetic data for person
tracking under adverse weather conditions. Image Vision Comput. 104187 (2021)
15. Li, B., et al.: Benchmarking single-image dehazing and beyond. IEEE Trans. Image
Process. 28(1), 492–505 (2019). https://doi.org/10.1109/TIP.2018.2867951
16. Narasimhan, S.G., Nayar, S.K.: Contrast restoration of weather degraded images.
IEEE Trans. Pattern Anal. Mach. Intell. 25(6), 713–724 (2003)
17. Nishino, K., Kratz, L., Lombardi, S.: Bayesian defogging. Int. J. Comput. Vision
98(3), 263–278 (2012)
18. Ranftl, R., Lasinger, K., Hafner, D., Schindler, K., Koltun, V.: Towards robust
monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer.
arXiv preprint arXiv:1907.01341 (2019)
19. Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support
inference from RGBD images. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y.,
Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7576, pp. 746–760. Springer, Heidelberg
(2012). https://doi.org/10.1007/978-3-642-33715-4 54
20. Tan, R.T.: Visibility in bad weather from a single image. In: 2008 IEEE Conference
on Computer Vision and Pattern Recognition, pp. 1–8. IEEE (2008)
21. Tatarchuk, N.: Artist-directable real-time rain rendering in city environments. In:
ACM SIGGRAPH 2006 Courses, pp. 23–64 (2006)
22. Tremblay, M., Halder, S.S., de Charette, R., Lalonde, J.F.: Rain rendering for eval-
uating and improving robustness to bad weather. Int. J. Comput. Vision 129(2),
341–360 (2021)
23. Watson, J., Aodha, O.M., Turmukhambetov, D., Brostow, G.J., Firman, M.: Learn-
ing stereo from single images. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M.
(eds.) ECCV 2020. LNCS, vol. 12346, pp. 722–740. Springer, Cham (2020). https://
doi.org/10.1007/978-3-030-58452-8 42
24. Xian, K., et al.: Monocular relative depth perception with web stereo data super-
vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 311–320 (2018)
25. Zdrojewska, D.: Real time rendering of heterogeneous fog based on the graphics
hardware acceleration. Proc. CESCG 4, 95–101 (2004)
26. Zhai, L., et al.: It’s raining cats or dogs? Adversarial rain attack on DNN percep-
tion. arXiv preprint arXiv:2009.09205 (2020)
27. Zhang, N., Zhang, L., Cheng, Z.: Towards simulating foggy and hazy images and
evaluating their authenticity. In: Liu, D., Xie, S., Li, Y., Zhao, D., El-Alfy, E.S.
(eds.) International Conference on Neural Information Processing, pp. 405–415.
Springer, Cham (2017). https://doi.org/10.1007/978-3-319-70090-8 42
28. http://www.cvlibs.net/datasets/kitti/
Automatic Fall Detection Using Long
Short-Term Memory Network

Carlos Magalhães, João Ribeiro, Argentina Leite,
E. J. Solteiro Pires(B), and João Pavão

Escola de Ciências e Tecnologia, Universidade de Trás-os-Montes
e Alto Douro and INESC TEC, Vila Real, Portugal
{al66417,al66645}@utad.eu, {tinucha,epires,jpavao}@utad.pt

Abstract. Falls, especially in the elderly, are one of the main causes of
hospitalization. Delayed intervention can be fatal or cause irreversible damage
to the victims. On the other hand, there is currently a significant amount of
smart clothing equipped with various sensors, particularly gyroscopes and
accelerometers, which can be used to detect accidents. A tool that automatically
detects falls makes it possible to help the victims as soon as possible. This
work focuses on automatic fall detection from sensor signals using long
short-term memory networks. To train and test this approach, the Sisfall dataset
is used, which considers simulations performed by 23 adults and 15 older people.
These simulations are based on everyday activities and the falls that may result
from their execution. The results indicate that the procedure provides an
accuracy score of 97.1% on the test set.

Keywords: Fall detection · Long short-term memory · Sisfall dataset

1 Introduction
The number of elderly people living alone has been increasing, and consequently, so
has the number of falls. As age increases, both the number of falls and their
severity also increase. Therefore, measures have been taken to ensure that medical
care after a fall arrives more quickly, consequently reducing the severity of
injuries and making independent living less of a risk factor.
Currently, there are several studies in the area of fall detection and simulation
[4]. However, there is room for improvement, particularly in carrying out
simulations with individuals of the target age group, since they are usually carried
out by younger individuals.
This work focuses on the automatic detection of falls in the elderly. For this
purpose, the Sisfall dataset [9] was considered, in which the participants use
equipment with a gyroscope and two accelerometers. The data was collected by the
researchers who provided the dataset, together with the video recording associated
with each simulation.
Usually, the devices used to obtain data in fall simulations are the gyroscope and
the accelerometer, and they are placed on the body of the target individual.
However, in some works found in the literature, other devices such as smartwatches
and smartphones perform similar detection [1]. The ease with which the individual
can use the device, as well as its portability, are essential factors in the choice
of the device to be implemented for detection. Nevertheless, most researchers prefer
to develop their own devices due to the low production price and higher versatility.
It is also important to note that there are detection devices with a higher degree
of complexity, such as cameras and ambient sensors. These are currently not widely
used because of several disadvantages compared to devices of other types, such as
high cost, high complexity, and lack of portability. However, it is possible that
these devices, together with devices such as the gyroscope and the accelerometer,
could produce results with even higher precision.
The devices used have a high degree of accuracy. However, problems are encountered
in some cases: the system falsely detects a fall, or a fall occurs and is not
detected by the system. None of this matters, of course, if the target individual
simply forgets, or for any other reason fails, to wear the device.
In 2018, Tamarasco et al. [10] used a dataset consisting of 208 simulations, of
which 96 were records of falls and 112 of daily activities, performed by a total of
four volunteers, two males and two females, aged between 25 and 37 years. To solve
this problem, they used three types of networks: Long Short-Term Memory (LSTM),
Gated Recurrent Unit (GRU), and bidirectional LSTM (Bi-LSTM). The method that
obtained the best results was the Bi-LSTM, with an accuracy of 93.0%, compared to
91.0% for the LSTM and 87.5% for the GRU. Unlike the process performed to build the
Sisfall dataset, the data records obtained in the simulations were recorded by
sensors installed in the test room and not on each individual. The presence of
objects, which did not allow a global view of the room, as well as different
temperature conditions, were some of the difficulties encountered. Consequently, the
researchers showed interest in implementing a Bi-LSTM network, but with the data
recorded from mobile devices.
In 2018, Sucerquia et al. [8] used the Sisfall dataset. In addition to the test
performed on the dataset, the algorithm was also tested on three individuals over 60
years old, two females and one male. The method used a non-linear classification
feature and a Kalman filter with a periodicity detector to avoid false positives;
the Kalman filter was used to improve the feature extraction. The classification
consists of evaluating a single threshold feature. When considering the Sisfall
dataset, the results showed an accuracy of 99.4% on the test dataset. Despite the
high precision in the individual tests, the researchers stated that it would not be
viable or sufficient to use only an accelerometer in the context of automatic
detection of falls in the daily lives of the elderly population. False positives
were also reported when the test group's members performed the D06 and D10
activities from the Sisfall dataset (Table 3).
In 2019, Luna-Perejón et al. [4] used the Sisfall dataset. In this work, they
used two types of Recurrent Neural Networks: LSTM and GRU. For each kind
of network, two architectures were implemented with one and two layers each.

Table 1. Summary of the studies about wearable fall detection systems.

Ref. Sensors Dataset N. Users N. Records Feature extraction Algorithms Accuracy (%)
[10] Wearable-based, Work acquired 4 208 No LSTM 91.0
Camera-based, GRU 87.5
Ambience device Bi-LSTM 93.0
[8] Accelerometer, Sisfall 41 4510 Yes Threshold 99.4
Gyroscope approach
[4] Accelerometer, Sisfall 38 4510 No LSTM 96.3
Gyroscope GRU 96.7
[11] Accelerometer, Sisfall 38 4510 Yes XGB 99.4
Gyroscope KNN 98.5
SVM 98.3
DT 98.9
MLP 99.9
DAE 99.8
CAE 99.9
CNN 99.9
[6] Camera, UR FallDetection NS 70 No CNN 85.7
Accelerometer LSTM 95.7

The algorithm with the best results was the GRU with two layers, achieving an
accuracy, specificity, and sensitivity of 96.7%, 96.8%, and 87.5%, respectively. One
of the researchers' difficulties was the complexity of the algorithm used and the
high energy consumption associated with its high computational cost.
In 2019, Wang et al. [11] used the Sisfall dataset. The methods used were Support
Vector Machines (SVM), Decision Trees (DT), K-Nearest Neighbors (KNN), Extreme
Gradient Boosting (XGB), Multi-Layer Perceptron (MLP), Convolutional Neural Network
(CNN), and autoencoders. The one that obtained the best results was the CNN, with an
accuracy of 99.94%, although the high computational cost of the algorithm was
reported. At the conclusion of this work, the objective was to carry out further
studies with different types of activities and networks, as well as with different
data collection devices such as smartphones and smartwatches.
In 2019, Santos et al. [6] used CNN and LSTM neural network models applied to the UR
Fall Detection dataset [3]. They obtained an accuracy of 95.7% for the LSTM approach
and 85.7% for the CNN network. However, when using a data augmentation technique,
the results improved to 98.6% and 99.9% for the LSTM and CNN networks, respectively.
Table 1 summarizes the most recent studies based on wearable fall detection systems.
Column 1 indicates the reference of the work described, column 2 enumerates the
sensors used in the system, column 3 denotes the dataset used, columns 4 and 5
specify the number of volunteers and activity simulations (records), column 6
indicates whether the work uses previously extracted features, column 7 gives the
algorithm used, and finally, column 8 gives the accuracy reached.
In this work, fall detection is improved using an LSTM network and the Sisfall
dataset. The deep learning approach considered uses only two sensors, one
accelerometer and one gyroscope, and the signals are fed directly into the LSTM
network, avoiding prior feature extraction.
The remainder of this paper is organized as follows. Section 2 describes the dataset
used and gives a brief explanation of neural networks. Section 3 explains the signal
resizing and filtering process, the construction of the network, and the
classification used. Then, experimental results and discussion are presented in
Sect. 4. Finally, Sect. 5 draws the main conclusions.

2 Materials and Methods


This section describes the materials and methods used in this work. Section 2.1
describes the volunteers' information and the description and characteristics of the
simulations. Then, in Sect. 2.2, the methods used are explained, together with the
criteria for preparing the data to enable their use in the algorithms.

2.1 Dataset
The dataset used to develop this work is Sisfall [9]. It contains data obtained by
two accelerometers and a gyroscope every 0.005 s in three-dimensional form.
Different simulations of daily activities and falls are recorded. The data were
obtained through simulations by 38 volunteers, divided into two groups: young adults
and elderly people. The elderly group is formed by eight males and seven females,
and the younger group by eleven males and twelve females. The young group's ages
range from 19 to 30 years old, and those of the elderly between 60 and 75 years old
(Table 2). The researchers of the Sisfall project decided to record fifteen
simulations of different types of falls, trying to have considerable variation in
the individual's condition and the reason for the fall, and 19 types of daily
activity simulations (Tables 3 and 4). All records, both of daily activities and
falls, were carried out five times, except for simple locomotion (walking and
running), which was recorded only once. Due to medical recommendations, the elderly
group, in addition to not performing any fall simulations, did not perform certain
day-to-day activities (D06, D13, D18, and D19). However, an elderly Judo specialist
performed both the fall simulations and all daily activities. In total, 4510 records
were considered as independent activities to evaluate the classifier.
The dataset was recorded with embedded devices: two accelerometers and one gyroscope
fixed to the participants' waist. Each device registers the data along three axes
(x, y, z) at a sampling rate of 200 Hz.

2.2 Neural Networks


Recurrent Neural Network
A Recurrent Neural Network (RNN) [5] is a class of artificial neural networks where
connections between nodes are described, sequentially, in a temporal graph, thus
capturing the dynamic behavior of the network. Unlike

Table 2. Age, height and weight of the groups (average ± standard deviation).

Group Gender Age Height (cm) Weight (kg)


Elderly Female 66.00 ± 4.11 156.00 ± 6.94 59.70 ± 7.36
Male 65.80 ± 3.19 166.50 ± 3.08 72.60 ± 13.45
Adult Female 23.30 ± 3.84 157.40 ± 5.45 50.60 ± 6.41
Male 23.50 ± 3.31 174.10 ± 5.76 69.20 ± 6.76

Table 3. Activities of Daily Living.

Code Activity Trials Duration


D01 Walking slowly 1 100 s
D02 Walking quickly 1 100 s
D03 Jogging slowly 1 100 s
D04 Jogging quickly 1 100 s
D05 Walking upstairs and downstairs slowly 5 25 s
D06 Walking upstairs and downstairs quickly 5 25 s
D07 Slowly sit in a half height chair, wait a moment, and up 5 12 s
slowly
D08 Quickly sit in a half height chair, wait a moment, and 5 12 s
up quickly
D09 Slowly sit in a low height chair, wait a moment, and up 5 12 s
slowly
D10 Quickly sit in a low height chair, wait a moment, and 5 12 s
up quickly
D11 Sitting a moment, trying to get up, and collapse into a 5 12 s
chair
D12 Sitting a moment, lying slowly, wait a moment, and sit 5 12 s
again
D13 Sitting a moment, lying quickly, wait a moment, and sit 5 12 s
again
D14 Being on one’s back change to lateral position, wait a 5 12 s
moment, and change to one’s back
D15 Standing, slowly bending at knees, and getting up 5 12 s
D16 Standing, slowly bending without bending knees, and 5 12 s
getting up
D17 Standing, get into a car, remain seated and get out of 5 25 s
the car
D18 Stumble while walking 5 12 s
D19 Gently jump without falling (trying to reach a high 5 12 s
object)

Table 4. Falls.

Code Activity Trials Duration


F01 Fall forward while walking caused by a slip 5 15 s
F02 Fall backward while walking caused by a slip 5 15 s
F03 Lateral fall while walking caused by a slip 5 15 s
F04 Fall forward while walking caused by a trip 5 15 s
F05 Fall forward while jogging caused by a trip 5 15 s
F06 Vertical fall while walking caused by fainting 5 15 s
F07 Fall while walking, with use of hands in a table to 5 15 s
dampen fall, caused by fainting
F08 Fall forward when trying to get up 5 15 s
F09 Lateral fall when trying to get up 5 15 s
F10 Fall forward when trying to sit down 5 15 s
F11 Fall backward when trying to sit down 5 15 s
F12 Lateral fall when trying to sit down 5 15 s
F13 Fall forward while sitting, caused by fainting or falling 5 15 s
asleep
F14 Fall backward while sitting, caused by fainting or falling 5 15 s
asleep
F15 Lateral fall while sitting, caused by fainting or falling 5 15 s
asleep

Fig. 1. Units difference between RNN and LSTM.

feedforward neural networks, RNNs can use their internal state (similar to a memory)
to process the elements of a sequence of data entries. The term recurrent is used to
refer to two classes of networks with a similar structure: finite impulse and
infinite impulse. The first corresponds to a directed acyclic graph that can be
unrolled and replaced with a strictly feedforward network. The second corresponds to
a directed cyclic graph that cannot be unrolled. Both types of networks may have
additional stored states, kept under the direct control of the network itself. This
memory can also be replaced by another network or graph if it incorporates loops or
time delays. Such an architecture is called a feedback neural network.
Long Short-Term Memory
One of the problems with recurrent neural networks is the difficulty, in a long
sequence, of transporting information from a particular previous stage to a later
stage. For example, when processing a paragraph of text to make a prediction, this
type of network can lose the initial information. In an attempt to solve this
problem, in 1997, Hochreiter and Schmidhuber [2] proposed the Long Short-Term Memory
network (LSTM). This network is an artificial recurrent neural network used in deep
learning and data mining. In addition to setting several accuracy records in various
domains, this network is used by large companies such as Google and Apple in key
components of their products [7]. Unlike feedforward networks, this network has
feedback links; besides, it can process images and data streams (video, audio).
Neural networks of the LSTM type are suitable for classifying, processing, and
predicting results based on signals, since relevant information may be lost or
missing during specific periods. These networks were developed to deal with the
vanishing and exploding gradient problems that may occur when using RNN-type
networks.
As shown in Fig. 1, a common LSTM architecture may consist of a cell and three
gates: input, output, and forget gates. These gates have the function of controlling
the information entering and leaving the cell. However, other possible unit
architectures may have more gates or gates of different types, consequently with
different functions. LSTM units therefore differ in complexity, and LSTM-type
networks can be trained sequentially and in a supervised way using an optimization
algorithm. To guide training on sequences, the Connectionist Temporal Classification
criterion can be used, which returns a score according to the network weight matrix
in order to maximize the probability of the network recognizing the correct
sequence.

3 Classification Model

A deep learning technique is used for fall detection in which the signals are fed
directly to the network without the need to extract and select features. The
proposed model was implemented in Python with Keras and is explained in the sequel.

3.1 Pre-processing
To train the LSTM neural network, it is first essential to perform data
pre-processing, consisting of two steps: resizing and filtering. The data recorded
by the second accelerometer (MMA8451Q) is not considered since it is redundant with
respect to the first.

Resizing: To train the network, it is particularly useful to use signals of the same
length (15 s, i.e., 3000 samples). Therefore, signals shorter than 15 s are padded
by repeating an initial segment of the signal, and those that are longer are
truncated, ignoring the remaining samples. So, the following changes were made:

– The walking and running simulations D01 to D04 were reduced from 100 to
15 s, considering the first signal samples.
– The simulations of going up and down stairs D05, D06, and D17 were also
reduced from 25 to 15 s.
– Simulations D07 to D19 were extended from 12 to 15 s. This extension was
obtained by duplicating the first three seconds of the signal.
– The simulations of the falls remained the same, with 15 s.

Filtering. After all the samples have the same size, the data is filtered using a
fourth-order Butterworth low-pass filter with a cut-off frequency of 0.5 Hz.
Figure 2 shows the pre-processing stage for the activity of daily living D10
(quickly sit in a low height chair, wait a moment, and get up quickly) and the fall
F08 (fall forward when trying to get up). It is observed that in the fall there is a
high acceleration peak.
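
A minimal pre-processing sketch with NumPy/SciPy, assuming each record is stored as
a (length, 6) array of accelerometer and gyroscope channels; padding by repeating
the initial segment is an assumption generalizing the duplication described above.

import numpy as np
from scipy.signal import butter, filtfilt

FS = 200                  # sampling rate of the Sisfall devices (Hz)
TARGET_LEN = 15 * FS      # 15 s of signal, i.e., 3000 samples

def resize_and_filter(signal):
    # Pad or truncate a record to 15 s and apply the low-pass filter.
    if len(signal) > TARGET_LEN:                      # truncate long records
        signal = signal[:TARGET_LEN]
    elif len(signal) < TARGET_LEN:                    # pad short records by repeating the start
        signal = np.vstack([signal, signal[:TARGET_LEN - len(signal)]])
    b, a = butter(4, 0.5, btype="low", fs=FS)         # 4th-order Butterworth, 0.5 Hz cut-off
    return filtfilt(b, a, signal, axis=0)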

3.2 Network Architecture

The proposed approach for detecting falls using the accelerometer (ax, ay, az) and
gyroscope (gx, gy, gz) signals is illustrated in Fig. 3. The details of the
implemented LSTM network are shown in Fig. 4. The network is sequential, with one
LSTM layer, a dropout layer (drop rate of 0.5), a hidden layer with 100 neurons, and
a layer with two outputs. To train the network, the number of epochs used is 15,
which is the number of times the network processes the entire training set. The
batch size of 64 is the number of training samples the network processes before its
weights are updated. Both of these layers use the ReLU activation function. The
metric for determining the efficiency of the network is the accuracy as a
percentage.
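
A sketch of such a network in Keras; the number of LSTM units, the output
activation, the optimizer, and the loss are not specified in the text and are
assumptions here.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

def build_model(timesteps=3000, channels=6, lstm_units=64):
    # Sequential LSTM -> Dropout(0.5) -> Dense(100, ReLU) -> two-output layer (Fig. 4).
    model = Sequential([
        LSTM(lstm_units, input_shape=(timesteps, channels)),   # one LSTM layer over the six signals
        Dropout(0.5),                                           # dropout layer with rate 0.5
        Dense(100, activation="relu"),                          # hidden layer with 100 neurons
        Dense(2, activation="softmax"),                         # fall / no-fall outputs (assumed softmax)
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model

# Training as described in the text: model.fit(x_train, y_train, epochs=15, batch_size=64)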

Fig. 2. Filter pre-processing of (x, y, z) axis signals for a D10 and F08 activity.

Fig. 3. Proposed approach for detecting falls using Sisfall dataset.

3.3 Classification

The performance of the LSTM network is assessed by five-fold cross-validation. In
this method, the n = 4510 records of the data are split into k = 5 portions with n/5
records each. Then, in each of the k experiments, a different data portion is set
aside for testing, and the other four portions are used for training. This process
is repeated for the k experiments, and the classifier's performance is the mean of
the k runs. This procedure guarantees a variation in test and training accuracy,
thus making the results more reliable.
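
A sketch of this evaluation loop with scikit-learn's stratified k-fold splitter,
assuming x has shape (4510, 3000, 6) and y holds integer class labels:

import numpy as np
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras.utils import to_categorical

def cross_validate(x, y, build_model, k=5):
    # Train and evaluate the model on k stratified folds, returning mean and std of test accuracy.
    scores = []
    for train_idx, test_idx in StratifiedKFold(n_splits=k, shuffle=True).split(x, y):
        model = build_model()
        model.fit(x[train_idx], to_categorical(y[train_idx]),
                  epochs=15, batch_size=64, verbose=0)
        _, acc = model.evaluate(x[test_idx], to_categorical(y[test_idx]), verbose=0)
        scores.append(acc)
    return np.mean(scores), np.std(scores)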

4 Results and Discussion

To analyze the network performance, the True Positive (TP), True Negative (TN),
False Positive (FP), and False Negative (FN) values are measured. Afterward,
Accuracy, Specificity, and Sensitivity are calculated using the following formulas:

Accuracy = (TP + TN) / (TP + TN + FP + FN) (1)

Specificity = TN / (TN + FP) (2)

Fig. 4. The block diagram of the LSTM network architecture.

Table 5. Accuracy (%) of five-fold cross-validation.

Experiment Training set Testing set


1 98.4 98.0
2 97.6 96.0
3 98.0 96.5
4 98.0 96.6
5 98.6 98.4

Sensitivity = TP / (TP + FN) (3)
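
These metrics can be computed directly from a confusion matrix; a small sketch,
assuming falls are labeled as the positive class 1:

from sklearn.metrics import confusion_matrix

def fall_metrics(y_true, y_pred):
    # Accuracy, specificity, and sensitivity as in Eqs. 1-3.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    specificity = tn / (tn + fp)
    sensitivity = tp / (tp + fn)
    return accuracy, specificity, sensitivity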
After training and testing the network five times, always with different training
and test sets, the accuracy obtained in each experiment is presented in Table 5.
From the results, we can observe that experiment 5 obtained the best result, with
98.6% accuracy on the training set and 98.4% on the test set. However, all the
experiments achieved high accuracy, never below 96.0%. It should be noted that this
result is obtained without feature extraction or any particular technique, such as
data augmentation or periodicity signal analysis, to improve the result.
Table 6 summarizes the average and standard deviation of Accuracy, Specificity, and
Sensitivity for the training and testing sets with stratified five-fold
cross-validation. The proposed LSTM classifier achieves an overall classification

Table 6. Five-fold cross-validation performance (mean ± std).

Accuracy (%) Specificity (%) Sensitivity (%)


Training set 98.1 ± 0.4 98.6 ± 0.5 97.8 ± 0.5
Testing set 97.1 ± 1.0 98.1 ± 1.1 96.4 ± 1.0

accuracy of 98.1% on the training set and 97.1% on the testing set. Comparing the
results with other works that directly use the signals (without extracting features)
at the network input, the proposed approach presents better results than those found
in the literature (see Table 1).

5 Conclusions and Future Work


This paper presents a fall detection methodology using a long short-term memory
network based on the Sisfall dataset, which considers 23 adults and 15 older people.
The experiments considered are based on everyday activities and the falls that may
result from their execution. An LSTM network was used to solve the problem
considering two devices: one accelerometer and one gyroscope. The results reach an
accuracy of 97.1%, showing that the system can automatically detect the occurrence
of falls and, by doing so, allow quicker assistance after falls, assuring a safer
life for those who wear the devices.
The proposed methodology uses the sensors' signals directly, processing them and
classifying the activity signals to identify possible falls. Therefore, it is not
required to extract signal features to use them in the neural network; the model
learns by itself to identify falls directly from the activity signals.
As a future improvement to the system, a non-binary classification will be
considered, classifying not only the fall occurrence but also categorizing its type
(e.g., falling while walking or falling while standing still).

References
1. Hakim, A., Huq, M.S., Shanta, S., Ibrahim, B.: Smartphone based data min-
ing for fall detection: analysis and design. Procedia Comput. Sci. 105, 46–51
(2017). https://doi.org/10.1016/j.procs.2017.01.188. https://www.sciencedirect.
com/science/article/pii/S1877050917302065. IEEE International Symposium on
Robotics and Intelligent Sensors, IRIS 2016, 17–20 December 2016, Tokyo, Japan
(2016)
2. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8),
1735–1780 (1997)
3. Kwolek, B., Kepski, M.: Human fall detection on embedded platform using depth
maps and wireless accelerometer. Comput. Methods Programs Biomed. 117(3),
489–501 (2014). https://doi.org/10.1016/j.cmpb.2014.09.005
4. Luna-Perejón, F., Domı́nguez-Morales, M., Civit-Balcells, A.: Wearable fall detec-
tor using recurrent neural networks. Sensors 19, 4885 (2019)

5. Rassem, A., El-Beltagy, M., Saleh, M.: Cross-country skiing gears classification
using deep learning. CoRR abs/1706.08924 (2017). http://arxiv.org/abs/1706.
08924
6. Santos, G., Endo, P., Monteiro, K., Rocha, E., Silva, I., Lynn, T.: Accelerometer-
based human fall detection using convolutional neural networks. Sensors (Basel,
Switzerland) 19 (2019)
7. Smagulova, K., James, A.P.: A survey on LSTM memristive neural network archi-
tectures and applications. Eur. Phys. J. Spec. Top. 228(10), 2313–2324 (2019).
https://doi.org/10.1140/epjst/e2019-900046-x
8. Sucerquia, A., López, J.D., Vargas-Bonilla, J.F.: Real-life/real-time elderly fall
detection with a triaxial accelerometer. Sensors (Basel, Switzerland) 18 (2018).
https://doi.org/10.3390/s18041101
9. Sucerquia, A., López, J.D., Vargas-Bonilla, J.F.: SisFall: a fall and movement
dataset. Sensors 17(1) (2017). https://doi.org/10.3390/s17010198
10. Tamarasco, C., et al.: A novel monitoring system for fall detection in older people.
IEEE Access 6, 43563–43574 (2018)
11. Wang, G., Li, Q., Wang, L., Zhang, Y., Liu, Z.: Elderly fall detection with an
accelerometer using lightweight neural networks. Electronics 8, 1354 (11 2019).
https://doi.org/10.3390/electronics8111354
Deep Convolutional Neural Networks
with Residual Blocks for Wafer Map Defect
Pattern Recognition

Zemenu Endalamaw Amogne(B), Fu-Kwun Wang, and Jia-Hong Chou

Department of Industrial Management, National Taiwan University of Science and Technology,
Taipei 10607, Taiwan

Abstract. Different deep convolutional neural network (DCNN) models have been
proposed for wafer map pattern identification and classification tasks in previous
studies. However, factors such as the effect of the input image resolution on the
classification performance of the proposed models, and class imbalance in the
training set after splitting the data into training and test sets, have not been
considered in previous studies. This study proposes a DCNN model with residual
blocks, called Opt-ResDCNN, for wafer map defect pattern identification and
classification, considering 26 * 26 input image resolutions and class imbalance
issues during model training. The proposed model is compared with previously
published defect pattern recognition and classification models in terms of accuracy,
precision, recall, and F1 score for the 26 * 26 input image size. Using a publicly
available wafer map dataset (WM-811K), the proposed method obtains an average
accuracy, precision, recall, and F1 score of 99.672%, 99.664%, 99.695%, and 99.692%,
respectively, for the 26 * 26 input image resolution.

Keywords: Class imbalance · Deep convolutional neural network · Residual blocks · Wafer map

1 Introduction

Current advancements in hardware technologies result in an increased demand for
electrical products. Due to the increasing demand for integrated circuit (IC) based
electronic products worldwide, the quality, reliability, and capability of wafer
production will be a significant issue for the semiconductor industry. Even though
high-accuracy automated equipment and sensors are installed and detailed analysis
approaches are utilized, defective wafer production cannot be completely eliminated
[1].
The wafer map plays a vital role in the semiconductor manufacturing industry.
Process engineers recognize and determine defects introduced during the
manufacturing process by checking the unique spatial curves or "signatures" with
visual inspection tools [2]. Creating efficient and effective wafer analysis tools
is therefore very important for semiconductor manufacturing companies to increase
their competitiveness.


In recent years, to increase automation, sensors have been installed to collect
data, which is then analyzed through various machine learning (ML) and deep learning
algorithms. These algorithms have been applied to identifying wafer map defect
patterns. Even though many studies have researched wafer map defect pattern
identification (WMDPI), most of them focused on feature extraction from the input
data and did not focus on the recognition and analysis of image-based wafer maps as
input.
For image classification problems, a useful and powerful model structure called the
convolutional neural network (CNN) has proven successful in several studies. CNNs
can automatically learn and extract features from the input image data via
convolution layers. As the depth of the convolutional layers increases, the
vanishing/exploding gradient problem becomes an issue to be solved. The residual
block design can solve the vanishing gradient problem without complex computation
when many convolutional layers are stacked [3].
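
For illustration, a minimal residual block in Keras (a generic sketch, not the
authors' Opt-ResDCNN configuration; the filter count and kernel size are
assumptions):

from tensorflow.keras import layers

def residual_block(x, filters=64, kernel_size=3):
    # Two convolutions plus an identity shortcut so gradients can bypass the stacked layers.
    shortcut = x
    y = layers.Conv2D(filters, kernel_size, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, kernel_size, padding="same")(y)
    if shortcut.shape[-1] != filters:                 # match channel count on the skip path
        shortcut = layers.Conv2D(filters, 1, padding="same")(shortcut)
    y = layers.Add()([y, shortcut])                   # residual connection
    return layers.Activation("relu")(y)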
In this study, a deep CNN model called Opt-ResDCNN, which utilizes ideas from ResNet
[3] and DenseNet [4], is proposed for wafer map defect recognition. The deep CNN
(DCNN) model is improved by stacking more convolutional layers and residual blocks.
The proposed model is compared with previously published studies at a 26 * 26 input
image resolution using the real-world wafer map dataset (WM-811K).
The remainder of this study is organized as follows. Section 2 describes related
works on the wafer map dataset. The WM-811K dataset is described in Sect. 3. The
proposed model is discussed in Sect. 4. The experiments and results analysis are
presented in Sect. 5. Finally, the conclusion of this study is given in Sect. 6.

2 Related Works

The first CNN structure was created by LeCun et al. [5] for image classification
purposes. Wafer map defect classification is one of the vital research fields among
classification problems. Since Wu et al. [6] created the world's most extensive
wafer map data set, WM-811K, including 811,467 real-world wafer maps, many
defect-recognition methods have been proposed.
The robust deep learning (DL) structure called the CNN has recently been utilized to
recognize wafer map defect patterns. CNNs can be used in automated assembly lines to
identify wafer map defects quickly. Shawon et al. [7] proposed a DCNN architecture
with a data augmentation technique to solve the data balance problem and achieve
high accuracy. They built a convolutional autoencoder (CAE) for augmentation and
utilized a DCNN to classify the balanced dataset. The testing accuracy of the
proposed DCNN model is 99.29%. Yu et al. [8] created a hybrid deep learning model
called the stacked convolutional sparse denoising auto-encoder (SCSDAE), which
combines a CNN with a stacked denoising auto-encoder (SDAE).

The defect class imbalance is a fundamental issue for wafer map defect
identification and classification model performance. An adaptive balancing
generative adversarial network (AdaBalGAN) was proposed by Wang et al. [9]. They
used the generative adversarial network (GAN) concept to synthesize wafer maps and
compared the results with ML models, SVM and AdaBoost. Yu et al. [10] proposed a
novel CNN model, a two-dimensional principal component analysis-based convolutional
auto-encoder (PCACAE), a hybrid learning model that can solve the imbalance problem
and learn the features from the image well. Ji and Lee [11] proposed a GAN model to
improve CNN-based classifier performance.
According to the above studies, most research finds that balancing the label distribution is a significant factor in improving the classification accuracy of ML and DL models. Many researchers try to build a better classifier or increase the DL model's depth to improve performance on the WMDPI problem. Jin et al. [12] proposed a framework to extract features from high-resolution wafer map patterns using a CNN. Tsai et al. [13] proposed an efficient light-weight deep convolutional model called the defect-map classification (DMC) network, which reduces computation time and uses fewer parameters.
Other researchers develop DCNN-based models that increase classification performance by building deep neural networks without vanishing or exploding gradient problems. He et al. [3] proposed the residual neural network (ResNet). In the same year, He et al. [14] compared different residual block structures; their experiments show that the full pre-activation structure makes ResNet-1001 easier to train and improves on the original ResNet. After ResNet was published, Huang et al. [4] proposed the dense convolutional network (DenseNet). The dense block was modified from the residual block, with the combination operation changed from summation to concatenation.
Many studies add more layers based on these concepts to increase the depth of the CNN structure. Maksim et al. [15] tried to improve pattern recognition performance under conditions with only a small amount of experimental data. They obtained pre-trained weights from purely synthesized data and applied them to four previously proposed models: VGG-19 [16], MobileNetV2 [17], ResNet-34, and ResNet-50 [3]. A model based on the integration of transfer learning and DenseNet, called T-DenseNet, was proposed for wafer map defect recognition by Shen et al. [18]. As deep CNNs stack more convolutional layers, their performance generally improves with depth. However, the classification performance of the models based on ResNet and DenseNet on the WM-811K dataset is somewhat poor, since those works used the whole model structure directly and did not adapt its construction to this specific dataset.
In this study, a CNN model with residual blocks is created to increase the model's depth, and dropout layers are added to address overfitting. The proposed model's efficiency is compared with a previously published paper, and it outperforms that work on all performance metrics. A class balance function for the training set is also used in this study (Table 1).

Table 1. Related work on the WM-811K dataset.

Author | Train samples | Test samples | Resolution size | Channel | Model | Overall accuracy
Shawon et al. [7] | 12730 | 705 | [26, 26, 3] | RGB | DCNN | 99.29%
Jin et al. [12] | 18000 | 2000 | [256, 256, 1] | Grayscale | CNN-ECOC-SVM | 98.43%
Tsai et al. [13] | 103770 | 51885 | [64, 64, 3] | RGB | DMC1 | 97.01%
Maksim et al. [15] | 14000 | 10071 | [96, 96, 1] | Grayscale | VGG-19 | 84.81%
Maksim et al. [15] | 14000 | 10071 | [96, 96, 1] | Grayscale | ResNet-34 | 81.91%
Maksim et al. [15] | 14000 | 10071 | [96, 96, 1] | Grayscale | ResNet-50 | 87.84%
Maksim et al. [15] | 14000 | 10071 | [96, 96, 1] | Grayscale | MobileNetV2 | 85.39%
Yu et al. [10] | 13451 | 4483 | [96, 96, 1] | Grayscale | PCACAE | 97.27%
Shen et al. [18] | 7112 | 2000 | [224, 224, 1] | Grayscale | T-DenseNet | 87.70%
Ji and Lee [11] | 8,160 | 1,000 | [64, 64, 1] | Grayscale | CNN-based model | 98.30%

3 Dataset Description
3.1 WM-811K Dataset
The WM-811K dataset contains eight kinds of defect patterns (Center, Donut, Edge-local, Edge-ring, Local, Random, Scratch, and Near-full) and a non-failure (None) pattern class. These defect types are labeled by domain experts. As shown in Fig. 1, the WM-811K dataset has an imbalance problem among the eight defect patterns. Therefore, a suitable data balancing technique is necessary to achieve good classification performance.

Fig. 1. Defect type distribution in the WM-811K dataset.



3.2 Class Imbalance in the Dataset

Data balancing techniques generate additional images for the minority failure types and reduce the number of images in the majority failure types to balance the dataset. There are three families of data balancing methods: data-level methods, algorithm-level methods, and hybrid approaches [19].
Data-level methods are the simplest: they modify the data distribution to improve the DL model's classification performance on image data [20–22]. Instead of shifting the data distribution, algorithm-level methods modify the classifier learning process. Some studies use cost-sensitive methods [23–25], which assign a higher penalty to the minority class to increase its importance. Another study [26] created a new loss function that captures the errors on majority and minority labels unevenly during the training of deep neural networks. Other studies adjust the decision threshold to mitigate the imbalance problem and obtain high classification performance [27]. Hybrid approaches combine data-level and algorithm-level methods, solving the data imbalance problem by both sampling and cost-sensitive learning.
An unsupervised DL model called convolutional autoencoder (CAE) is used in this
study for data augmentation. Over-sampling and under-sampling data-level methods are
used to deal with the class imbalance issue.

4 Proposed Model
4.1 Data Preprocessing

From Fig. 1, we can see that 78.69% (638,507 samples) of the raw WM-811K dataset is unlabeled. The remaining 21.31% (172,950 samples) are used to train the proposed DL model after removing the unlabeled samples. After that, a one-hot-encoding approach is used to generate an RGB-like image: the channel number is transformed from one channel (grayscale) to three channels. Figure 2 presents the one-hot encoded image data for the eight different classes. The benefit of increasing the channel number is that more features can be extracted from the three-channel image by utilizing a 3D filter in the convolutional layer.
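As an illustration of this encoding step, the sketch below one-hot encodes a wafer map into a three-channel image. It assumes the usual WM-811K pixel coding (0 = background, 1 = normal die, 2 = defective die); the function name is ours, not the authors'.

```python
import numpy as np

def one_hot_encode_wafer(wafer_map, num_values=3):
    """Turn a 2-D wafer map of integer pixel codes into an (H, W, num_values)
    one-hot image, one channel per pixel code."""
    h, w = wafer_map.shape
    encoded = np.zeros((h, w, num_values), dtype=np.float32)
    for value in range(num_values):
        encoded[:, :, value] = (wafer_map == value).astype(np.float32)
    return encoded

# Toy 3x3 map: background (0), normal dies (1), one defective die (2)
toy_map = np.array([[0, 1, 1],
                    [1, 2, 1],
                    [0, 1, 0]])
print(one_hot_encode_wafer(toy_map).shape)  # (3, 3, 3)
```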

Fig. 2. Original grayscale and one-hot encoded RGB image data.



4.2 Flow Chart of the Proposed Model

This study proposes a deep convolutional neural network-based model called Opt-ResDCNN for defect identification and classification on the unbalanced WM-811K dataset. The raw WM-811K data set is preprocessed to separate labeled and unlabeled patterns, and the class distribution of the labeled patterns is then checked. The raw images are resized to the required resolutions to examine the effect of different image resolutions on the model's classification performance. The one-hot encoding method is used to change the one-channel (grayscale) raw image data to a three-channel (RGB) format. Since the raw wafer image dataset has an imbalance problem, a CAE is used to balance the labeled patterns of the dataset through augmentation. The balanced dataset is split into a training set (70%) and a test set (30%).

Fig. 3. Flowchart for the proposed model.



After splitting the data set into training and test sets, the training set is further balanced with a balance function to increase classification performance. The balanced training data are then used to train the Opt-ResDCNN model, and the model recognizes and classifies the defects into the different classes. Based on the classification results, the performance metrics are computed. To ensure the consistency of the proposed model's performance, the model is run for ten iterations, and the average of the performance metrics over these ten iterations is used to measure the model's performance. The flowchart of the proposed model is shown in Fig. 3.

4.3 Convolutional Autoencoder (CAE) for Data Balancing


A CAE model is used to deal with the data imbalance issue in this study. The objective of the CAE is to reconstruct the input data through forward and backward propagation. The encoder extracts useful features from the input data, and the decoder reconstructs the input image from the features produced by the encoder. Each convolutional layer in the encoder is followed by a pooling layer. In the decoder, transposed convolutional layers are used for up-sampling, and the transposed convolution can learn the optimal up-sampling during model training. The structure of the proposed CAE model is shown in Table 2.

Table 2. Details of different layers of the proposed convolutional autoencoder.

Layer Layer type Kernel weight size Stride Padding Activation function
1 Conv (3, 3) 1 (1, 1) ReLU
2 Max pool (2, 2) 1 – –
3 Conv (3, 3) 1 (1, 1) ReLU
4 Max pool (2, 2) 1 – –
5 Transposed Conv (3, 3) 2 – ReLU
6 Transposed Conv (2, 2) 2 – Sigmoid

The encoder extracts features from the input x through its hidden layers. The output feature maps of the l-th hidden layer in the encoder, H^l, are given by (1):

H^{l} = \sigma\left(x \ast W^{l} + b^{l}\right), \quad l = 1, 2, \ldots, n    (1)

where \sigma is the activation function (ReLU), n is the number of hidden layers in the encoder, and \ast denotes the convolution operator. W^{l} and b^{l} are the convolution kernel weights and bias of hidden layer l, respectively.
The decoder reconstructs the input x from the feature maps of the last hidden layer, and the transposed convolution process is used in the decoder to up-sample the feature maps. The restored feature maps of the k-th decoder layer, TransH^k, are given by (2):

TransH^{k} = \sigma\left(H^{n} \ast \bar{W}^{k} + b^{k}\right), \quad k = 1, 2, \ldots, m    (2)

where \sigma is the activation function; in the decoder, ReLU is used in the hidden layers and Sigmoid in the output layer. H^{n} is the output of the n-th hidden layer in the encoder, m is the number of hidden layers in the decoder, and \ast in the decoder denotes the transposed convolution operator. \bar{W}^{k} and b^{k} are the transposed convolution kernel weights and bias of hidden layer k, respectively.
To minimize the difference between the input x and the output of the last decoder layer, TransH^{m}, when training the CAE model, the Mean Square Error (MSE) is used as the loss function. The MSE loss is defined in Eq. (3), where the training batch size s is the number of samples in the mini-batch:

\mathrm{MSE\ Loss} = \frac{\sum_{i=0}^{n}\left(TransH^{m}_{i} - x_{i}\right)^{2}}{s}, \quad n = 0, 1, \ldots, s    (3)
For the data balancing process, noise values drawn from a random normal distribution are added to the output feature map of the encoder. The decoder then reconstructs the input images while removing this noise. Through this data balancing process, the CAE generates output images with the same defect pattern as the input image, as shown in Fig. 4.
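For illustration, a minimal PyTorch sketch of a CAE with the layer types of Table 2 is given below. The channel widths, the stride-2 (and ceil-mode) pooling, and the point at which noise is injected are our assumptions, chosen so that a 26 × 26 input reconstructs at the same resolution; Table 2 itself does not fix these details.

```python
import torch
import torch.nn as nn

class WaferCAE(nn.Module):
    """Convolutional autoencoder following the layer types of Table 2.
    Channel widths (16/32) and pooling strides are illustrative choices
    so that a 26x26 input is reconstructed at 26x26."""
    def __init__(self, in_channels=3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),                   # 26 -> 13
            nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True),   # 13 -> 7
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, kernel_size=3, stride=2, padding=1),  # 7 -> 13
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(16, in_channels, kernel_size=2, stride=2),    # 13 -> 26
            nn.Sigmoid(),
        )

    def forward(self, x, noise_std=0.0):
        code = self.encoder(x)
        if noise_std > 0:                       # noise injection used for data balancing
            code = code + noise_std * torch.randn_like(code)
        return self.decoder(code)

model = WaferCAE()
batch = torch.rand(8, 3, 26, 26)                # mini-batch of one-hot wafer maps
reconstruction = model(batch, noise_std=0.1)
loss = nn.MSELoss()(reconstruction, batch)      # Eq. (3)
```

Feeding the decoder a noisy copy of the encoder output, as in the forward pass above, is one way to realize the augmentation step described in this subsection.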

Fig. 4. The data balancing process in 26 × 26 input image resolution.

4.4 Proposed DL Classifier: Opt-ResDCNN Model

The proposed Opt-ResDCNN model combines concepts from DCNN models and residual blocks. This study improves wafer map defect pattern identification and classification by following the DCNN idea and adding residual blocks that connect more convolutional layers, increasing the network's depth. The Opt-ResDCNN model also includes dropout layers to reduce overfitting and optimize the model's performance (Fig. 5).

Fig. 5. The architecture of the Opt-ResDCNN model.

The proposed model includes two residual blocks, a structure that allows training very deep DL models without complex computation and mitigates the gradient vanishing problem [3]. The residual block learns the residual of the identity mapping through an additive residual function, and the identity mapping alleviates gradient vanishing during backpropagation. Different residual block structures were examined by He et al. [14]; the proposed model uses the pre-activation residual block structure to improve the Opt-ResDCNN model's performance.
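A minimal PyTorch sketch of such a pre-activation residual block is shown below; the filter count is illustrative, and the exact dropout placement inside Opt-ResDCNN is not reproduced.

```python
import torch
import torch.nn as nn

class PreActResidualBlock(nn.Module):
    """Pre-activation residual block (BN -> ReLU -> Conv, twice) with an
    identity skip connection, as in He et al. [14]. The channel count is
    kept constant so the skip path needs no projection."""
    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        out = self.conv1(torch.relu(self.bn1(x)))
        out = self.conv2(torch.relu(self.bn2(out)))
        return x + out  # identity mapping keeps gradients flowing

block = PreActResidualBlock(64)
y = block(torch.randn(2, 64, 26, 26))  # output has the same shape as the input
```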

4.5 Performance Metrics Computation


To evaluate the classification performance of the proposed model, accuracy, precision,
recall, and F1 score metrics are used in this study. Those metrics are calculated from
a confusion matrix. The measurements are computed based on true positive (TP), true
negative (TN), false positive (FP), and false negative (FN), as shown in Eq. (4).

Accuracy_{class} = \frac{TP_{class} + TN_{class}}{TP_{class} + TN_{class} + FP_{class} + FN_{class}}

Precision_{class} = \frac{TP_{class}}{TP_{class} + FP_{class}}    (4)

Recall_{class} = \frac{TP_{class}}{TP_{class} + FN_{class}}

F1\ score_{class} = \frac{2 \times \left(Precision_{class} \times Recall_{class}\right)}{Precision_{class} + Recall_{class}}
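The sketch below computes these per-class measures of Eq. (4) from a confusion matrix; the toy matrix at the end is illustrative only.

```python
import numpy as np

def per_class_metrics(confusion):
    """Per-class accuracy, precision, recall and F1 from a square confusion
    matrix (rows = true class, columns = predicted class), following Eq. (4)."""
    confusion = np.asarray(confusion, dtype=float)
    total = confusion.sum()
    metrics = {}
    for c in range(confusion.shape[0]):
        tp = confusion[c, c]
        fp = confusion[:, c].sum() - tp
        fn = confusion[c, :].sum() - tp
        tn = total - tp - fp - fn
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[c] = {"accuracy": (tp + tn) / total,
                      "precision": precision,
                      "recall": recall,
                      "f1": f1}
    return metrics

# Toy 3-class confusion matrix
print(per_class_metrics([[50, 2, 1], [3, 45, 2], [0, 4, 60]]))
```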

5 Experiment and Result Analysis


5.1 Hyperparameters Setting for CAE and Opt-ResDCNN Model
The hyperparameter setting is essential for training DL models since it will affect the
model performance by avoiding the overfitting and gradient vanishing problems. The
tuned hyperparameters in the CAE and Opt-ResDCNN models are epoch, batch size,
and learning rate with values of 1000, 1024, and 0.01, respectively.

5.2 Balancing the Training Set


Using a balanced training set is essential for training a good DL model for this classification problem, an issue not addressed in the previous related studies. The experiments in this study show that balancing the classes in the training set provides better classification performance. A balance function B_i for each class i is used to balance the classes in the training set:

B_{i}(x, c) = \frac{x}{c}    (5)

where x is the total number of training samples and c is the total number of defect classes, which is 9 in this case. Using this balance function, we ensure that each class has an equal number of training samples, improving the performance of the model.
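One simple way to realize this balance function is to re-sample every class to the target count x/c, as in the sketch below. The random over- and under-sampling shown here is an assumption for illustration; in the paper the additional minority samples come from the CAE augmentation.

```python
import numpy as np

def balance_training_set(images, labels, num_classes=9, seed=0):
    """Re-sample every class to the target size B = x / c of Eq. (5):
    classes above the target are randomly under-sampled, classes below it
    are over-sampled with replacement."""
    rng = np.random.default_rng(seed)
    target = len(labels) // num_classes
    kept = []
    for c in range(num_classes):
        idx = np.where(labels == c)[0]
        if len(idx) == 0:
            continue
        replace = len(idx) < target          # over-sample small classes
        kept.append(rng.choice(idx, size=target, replace=replace))
    kept = np.concatenate(kept)
    rng.shuffle(kept)
    return images[kept], labels[kept]

# Example with dummy data: 1000 samples, 9 unbalanced classes
X = np.random.rand(1000, 26, 26, 3)
y = np.random.choice(9, size=1000,
                     p=[0.4, 0.2, 0.1, 0.1, 0.05, 0.05, 0.04, 0.03, 0.03])
Xb, yb = balance_training_set(X, y)
print(np.bincount(yb))  # roughly equal counts per class
```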

Table 3. Comparison of Opt-ResDCNN without and with the balance function

Measurements | Without balance function | With balance function
Accuracy  | 99.565% | 99.672%
Precision | 99.585% | 99.664%
Recall    | 99.592% | 99.695%
F1        | 99.584% | 99.692%

From Table 3, we can see that balancing the training set results in better performance than training the model without balancing. Using the balancing function yields increases of 0.107%, 0.079%, 0.103%, and 0.108% in the accuracy, precision, recall, and F1 score measures, respectively.

5.3 Class Level Classification Performance

Table 4. Class level classification results of the proposed model

Failure type Accuracy Precision Recall F1 value


Center 99.436% 99.507% 99.885% 99.693%
Donut 100.000% 100.000% 100.000% 100.000%
Edge-Loc 99.411% 99.415% 99.282% 99.346%
Edge-Ring 99.301% 99.410% 99.897% 99.651%
Loc 99.478% 99.495% 99.595% 99.543%
Random 99.913% 99.918% 99.948% 99.932%
Scratch 99.838% 99.851% 99.918% 99.882%
Near-full 99.933% 99.945% 99.967% 99.955%
None 99.729% 99.702% 98.765% 99.229%
Average 99.672% 99.694% 99.695% 99.692%

Table 4 shows the class-level classification results of the proposed model for 26 × 26 input image resolution. The proposed model classifies the Donut class with 100% accuracy, precision, recall, and F1 score. The table shows the strong performance of the proposed model, with worst-case accuracy, precision, recall, and F1-score values of 99.301%, 99.410%, 99.282%, and 99.229%, respectively. The worst accuracy and precision are recorded for the Edge-Ring class, while the Edge-Loc and None classes have the worst recall and F1-score results. The overall average classification performance of the proposed model is 99.672%, 99.694%, 99.695%, and 99.692% for accuracy, precision, recall, and F1 score, respectively.
When we compare our proposed model with the previously published model using the same 26 × 26 input image size [7], our model provides much better performance than reported in [7]. The accuracy reported by [7] is 99.29%, while our proposed model's classification accuracy is 99.67%. Our model's worst class-level accuracy of 99.301%, for the Edge-Ring class, is even higher than the average accuracy of [7]. The improvement arises because we introduce the residual block just before the fully connected layers, which plays a significant role in mitigating the vanishing gradient problem.

6 Conclusion
This study proposes a DCNN model with residual blocks for wafer map pattern identification and classification. An input image size of 26 × 26 is used to verify the proposed model's performance. A convolutional autoencoder is used to solve the data imbalance problem of the WM-811K dataset and to improve the model's classification performance. In addition, the model's credibility is verified by performing ten iterations and averaging the measured performance indicators. The proposed model classifies the Donut class with perfect 100% accuracy, precision, recall, and F1 score. Compared with the results of the previously published paper using an input image resolution of 26 × 26, the proposed model shows outstanding performance on all classification metrics.
We are currently extending the experiments to different input image resolutions and comparing with previously published papers to show the effect of input image resolution on a model's classification performance.

References
1. Wang, R., Chen, N.: Defect pattern recognition on wafers using convolutional neural networks.
Qual. Reliab. Eng. Int. 36(4), 1245–1257 (2020). https://doi.org/10.1002/qre.2627
2. Gleason, S.S., Tobin Jr., K.W., Karnowski, T.P., Lakhani, F.: Rapid yield learning through
optical defect and electrical test analysis. In: 23rd Annual International Symposium on
Microlithography, pp.232–242. SPIE, Santa Clara, CA, United States (1998). https://doi.
org/10.1117/12.308731
3. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. IEEE, Las
Vegas, NV, USA (2016). https://doi.org/10.1109/CVPR.2016.90

4. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolu-
tional networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
pp. 2261–2269. IEEE, Honolulu, HI, USA (2017). https://doi.org/10.1109/CVPR.2017.243
5. Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document
recognition. Proc. IEEE 86(11), 2278–2324 (1998). https://doi.org/10.1109/5.726791
6. Wu, M.-J., Jang, R.J.-S., Chen, J.-L.: Wafer map failure pattern recognition and similarity
ranking for large-scale data sets. IEEE Trans. Semicond. Manuf. 28(1), 1–12 (2015). https://
doi.org/10.1109/TSM.2014.2364237
7. Shawon, A., Faruk, O.M., Bin Habib, M., Khan, M.A.: Silicon wafer map defect classifica-
tion using deep convolutional neural network with data augmentation. In: IEEE 5th Interna-
tional Conference on Computer and Communications (ICCC), IEEE, Chengdu, China (2019).
https://doi.org/10.1109/ICCC47050.2019.9064029
8. Yu, J., Zheng, X., Liu, J.: Stacked convolutional sparse denoising auto-encoder for identifi-
cation of defect patterns in semiconductor wafer map. Comput. Ind. 109, 121–133 (2019).
https://doi.org/10.1016/j.compind.2019.04.015
9. Wang, J., Yang, Z., Zhang, J., Zhang, Q., Chien, K.W.-T.: AdaBalGAN: An improved gener-
ative adversarial network with imbalanced learning for wafer defective pattern recognition.
IEEE Trans. Semicond. Manuf. 32(3), 310–319 (2019). https://doi.org/10.1109/TSM.2019.
2925361
10. Yu, J., Liu, J.: Two-dimensional principal component analysis-based convolutional autoen-
coder for wafer map defect detection. IEEE Trans. Ind. Electron. (Early Access) (2021).
https://doi.org/10.1109/TIE.2020.3013492
11. Ji, Y., Lee, J.-H.: Using GAN to improve CNN performance of wafer map defect type classi-
fication: yield enhancement. In: 31st Annual SEMI Advanced Semiconductor Manufacturing
Conference (ASMC), pp.1–6. IEEE, Saratoga Springs, NY, USA (2020). https://doi.org/10.
1109/ASMC49169.2020.9185193
12. Jin, C.H., Kim, H.-J., Piao, Y., Li, M., Piao, M.: Wafer map defect pattern classification based
on convolutional neural network features and error-correcting output codes. J. Intell. Manuf.
31(8), 1861–1875 (2020). https://doi.org/10.1007/s10845-020-01540-x
13. Tsai, T.-H., Lee, Y.-C.: A light-weight neural network for wafer map classification based on
data augmentation. IEEE Trans. Semicond. Manuf. 33(4), 663–672 (2020). https://doi.org/
10.1109/TSM.2020.3013004
14. He, K., Zhang, X., Ren, S., Sun, J.: Identity Mappings in Deep Residual Networks. In: Leibe,
B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 630–645.
Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_38
15. Maksim K., et al.: Classification of wafer maps defect based on deep learning methods with
small amount of data. In: International Conference on Engineering and Telecommunication
(EnT), pp. 1–5. IEEE, Dolgoprudny, Russia (2019). https://doi.org/10.1109/EnT47717.2019.
9030550
16. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: Proceedings of the International Conference on Learning Representations (ICLR), pp. 1–14 (2015). arXiv:1409.1556
17. Howard, A.G., et al.: MobileNets: efficient convolutional neural networks for mobile
vision applications. In: Computer Vision and Pattern Recognition, pp. 1–9 (2017).
arxiv.org/abs/1704.04861
18. Shen, Z., Yu, J.: Wafer map defect recognition based on deep transfer learning. In: IEEE
International Conference on Industrial Engineering and Engineering Management (IEEM),
pp. 1568–1572. IEEE, Macao, China (2019). https://doi.org/10.1109/IEEM44572.2019.897
8568
19. Johnson, J.M., Khoshgoftaar, T.M.: Survey on deep learning with class imbalance. J. Big
Data 6(1), 1–54 (2019). https://doi.org/10.1186/s40537-019-0192-5

20. Lee, H., Park, M., Kim, J.: Plankton classification on imbalanced large scale database via
convolutional neural networks with transfer learning. In: IEEE International Conference on
Image Processing (ICIP), pp. 3713–3717. IEEE, Phoenix, AZ, USA (2016). https://doi.org/
10.1109/ICIP.2016.7533053
21. Pouyanfar S., et al.: Dynamic sampling in convolutional neural networks for imbalanced data
classification. In: IEEE Conference on Multimedia Information Processing and Retrieval
(MIPR), pp. 112–117. IEEE, Miami, FL, USA (2018). https://doi.org/10.1109/MIPR.2018.
00027
22. Buda, M., Maki, A., Mazurowski, M.A.: A systematic study of the class imbalance problem in
convolutional neural networks. Neural Netw. 106, 249–259 (2018). https://doi.org/10.1016/
j.neunet.2018.07.011
23. Wang, S., Liu, W., Wu, J., Cao, L., Meng, Q., Kennedy, P.J.: Training deep neural networks
on imbalanced data sets. In: International Joint Conference on Neural Networks (IJCNN),
pp. 4368–4374. IEEE, Vancouver, BC, Canada ( 2016). https://doi.org/10.1109/IJCNN.2016.
7727770
24. Khan, S.H., Hayat, M., Bennamoun, M., Sohei, F.A., Togneri, R.: Cost-sensitive learning
of deep feature representations from imbalanced data. IEEE Trans. Neural Networks Learn.
Syst. 29(8), 3573–3587 (2018). https://doi.org/10.1109/TNNLS.2017.2732482
25. Wang, H., Cui, Z., Chen, Y., Avidan, M., Ben, A.A., Kronzer, A.: Predicting hospital readmis-
sion via cost-sensitive deep learning. IEEE/ACM Trans. Comput. Biol. Bioinforma. 15(6),
1968–1978 (2018). https://doi.org/10.1109/TCBB.2018.2827029
26. Huang, C., Li, Y., Loy, C.C., Tang, X.: Learning deep representation for imbalanced clas-
sification. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
pp.5375–5384. IEEE, Las Vegas, NV, USA (2016). https://doi.org/10.1109/CVPR.2016.580
27. Lin, T.-Y., Goyal, P., Girshick, R., He, K., Dollar, P.: Focal loss for dense object detection. In:
IEEE International Conference on Computer Vision (ICCV), pp. 2999–3007. IEEE, Venice,
Italy (2017). https://doi.org/10.1109/ICCV.2017.324
Enhanced Convolutional Neural Network
for Age Estimation

Idowu Aruleba and Serestina Viriri(B)

School of Mathematics, Statistics and Computer Science,


University of KwaZulu-Natal, Durban, South Africa
[email protected]

Abstract. The human face carries various biometric features that can be used to estimate important details such as age. Variations in facial landmarks and appearance present challenges to automated age estimation and account for the limitations of conventional approaches such as traditional hand-crafted methods, which cannot estimate age efficiently and adequately. In this study, a six-layered Convolutional Neural Network (CNN) is proposed, which extracts features from facial images taken in an uncontrolled environment and classifies them into appropriate classes. Since a large dataset is needed to obtain good accuracy from the trained model and to minimize overfitting, data augmentation was performed on the dataset to balance the number of images in each class. The UTKFace dataset was used to train the model, while validation was carried out on the FGNET dataset. With the proposed method, an accuracy of 89.75% was recorded on the UTKFace dataset, which is a significant improvement over existing state-of-the-art methods previously implemented on the UTKFace dataset.

Keywords: Age estimation · Pre-processing · Feature extraction ·


Convolutional Neural Network

1 Introduction

The use of biometric features to determine a person's age is described as age estimation and can be accomplished using (biometric) traits such as eye color, skin, and facial emotions [1]. In the last few years, numerous studies have investigated age estimation from facial images for real-world applications such as access monitoring and control and enhanced communication. These efforts have contributed significantly to emotional intelligence, security control, and entertainment, among several other areas [1–3].
Many studies have attempted to implement age estimation effectively, although several limitations and drawbacks still exist [6]. Previous approaches have real-world limitations, mainly due to variations in the conditions under which images are taken, which can in turn affect the accuracy of the


estimations. These conditions include lighting, background, and facial expres-


sion, which, cumulatively, could account for reductions in the quality of images
to be processed [4,5].
Several techniques have been employed previously, including hand-crafted methods for extracting visual features [5,6], different age modeling approaches [7] for image representation, and machine-learning approaches for solving the problem as classification [14] or regression [15]. Most of these approaches have one or two weaknesses, such as considering the distance between facial landmarks while neglecting facial appearance, being unable to encode wrinkles on the face, or failing to match boundaries in an image [6].
In this work, we propose an efficient deep learning approach that serves as an end-to-end model to increase the accuracy of deep neural networks on the age estimation task. Convolutional Neural Networks are known for their ability to learn from raw pixels and to exploit spatial correlations in data, among other properties, with proven high performance in computer vision and image processing. The proposed model is trained using a large benchmark dataset with unconstrained faces, and its accuracy is compared to other state-of-the-art approaches.
This paper is structured as follows: Sect. 2 describes the background and related work; Sect. 3 describes the proposed Convolutional Neural Network method and its application to age estimation; Sect. 4 provides a detailed analysis of the experimental results; and the conclusions are presented in Sect. 5.

2 Background and Related Work


The possibility of performing age estimation using facial images, based on biometric features extracted from a person's face, has been established previously. A survey by Angulu et al. [6] provides extensive details on recent research in aging and age estimation, covering the popular algorithms of existing models with comparative analyses of their performance.
Deep learning methods such as deep convolutional neural networks have performed excellently on different computer vision and image recognition tasks, for example age and gender classification [9] and diagnosing tuberculosis [12], because of their ability to learn features without explicit feature engineering and to handle extensive data [4,13]. In 2016, Zhang [2] studied apparent age estimation with a CNN based on the GoogleNet architecture [20]. A batch normalization layer was added to the model after each rectified linear unit operation. However, their method did not work well on older people, blurry images, and misaligned faces.
Anand et al. [1] previously modeled a CNN method that cumulatively performs feature-level fusion, dimension reduction of the feature space, and individual age estimation. The WIKI dataset [21] was employed to model the Feed Forward Neural Networks (FFNNs) for parameter estimation prior to training, and the Adience benchmark dataset was used to compare the exact accuracy with state-of-the-art techniques.

Mualla et al. [10] used PCA for feature extraction and reduction, after which a Deep Belief Network (DBN) classifier was compared with K-nearest neighbor and Support Vector Machine classifiers. Their results show that the DBN significantly outperforms the KNN and SVM classifiers in age estimation accuracy.
Niu et al. [16] presented a multiple-output CNN learning algorithm with ordinal regression to solve the classification problem for age estimation [16], whereas Yi et al. [17] proposed a multi-scale convolutional network for age estimation. The model in [17] was not very deep: it consisted of only one fully connected layer, one convolution layer, one pooling layer, and one local layer, for a total of four layers, and only a subset of MORPH2 was used to train it.
Sendik and Keller [8] applied a Convolutional Neural Network (CNN) and a Support Vector Regressor (SVR) to face-based age estimation. They trained the CNN for representation and metric learning and applied the SVR to the learned features; the work demonstrated that, by retraining the SVR layer, small datasets can be analysed.
Huang et al. [11] proposed a deep learning technique for age group classification using Deep Identification features, with an architecture optimized for face representation. They evaluated their architecture on both constrained and unconstrained databases, where it performs well.
The study by Das et al. [19] treats the classification of gender, age, and race as a multi-task problem. They proposed a multi-task Convolutional Neural Network (MTCNN) that makes use of unconnected features of the fully connected layers: the fully connected layers are split to perform multi-task learning for more efficient face attribute analysis, taking advantage of the hierarchical distribution of information present in CNN features. The experiment was carried out using two datasets, UTKFace and Bias Estimation in Face Analytics (BEFA), and their approach provided promising outcomes on both.

3 Methods and Techniques

3.1 Datasets

The following databases were used for the experiments done in this work. These
databases are publicly available at:

UTKFace
The UTKFace dataset is a large face dataset that consists of over 20,000 face
images in the wild with a long span ranging from 0 to 116 years old. Each facial
image is with annotations of age, gender, and ethnicity. Some face samples pro-
vide corresponding aligned and cropped faces. The images cover large variations
in pose, facial expression, illumination, occlusion, and resolution, to mention but

a few. The UTKFace dataset has been used for different tasks e.g. age estimation
[19], age progression/regression [22] etc. In this study, the ages of the face images
are grouped, as shown in Table 1.
The dataset can be accessed at https://www.kaggle.com/jangedoo/utkface-
new.

Table 1. Ages from the dataset as grouped

Age group Age range Number of images


Group 1 0–18 3943
Group 2 19–30 7377
Group 3 31–80 9477
Group 4 80 above 491

FGNET
The FGNET dataset contains only 1002 in-the-wild images of 82 individuals; the ages in the dataset range from 0 to 69 years, with about 12 images per person. This is a small dataset in which the constituent images of each person represent different ages.

3.2 PreProcessing

Working with a deep neural network requires a large dataset to achieve high accuracy in age estimation. We therefore carried out data augmentation on our dataset to balance the number of images in each age class: Table 1 shows, for example, the large difference in image counts between the ≥ 80 class and the 31–80 class, and such differences could adversely affect the accuracy of our model. For the augmentation process, we applied different augmentations: flip left-right (0.5), skew (0.4, 0.5), and zoom (probability = 0.2, min factor = 1.1, max factor = 1.5).
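The operation names and parameters quoted above match the interface of the Augmentor package, so the sketch below assumes that this (or an equivalent) library is used; the image folder, the sample count, and the reading of skew (0.4, 0.5) as probability and magnitude are our assumptions.

```python
import Augmentor

# Build an augmentation pipeline over a folder of face images; the folder
# path is a placeholder and the probabilities/factors mirror the settings
# quoted above.
pipeline = Augmentor.Pipeline("data/utkface/minority_class")
pipeline.flip_left_right(probability=0.5)
pipeline.skew(probability=0.4, magnitude=0.5)       # reading of (0.4, 0.5) assumed
pipeline.zoom(probability=0.2, min_factor=1.1, max_factor=1.5)

# Generate extra samples to bring the class up to the majority count
pipeline.sample(5000)
```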

3.3 Proposed Model

Convolutional Neural Networks are deep neural networks suited to classification tasks. The CNN architecture in this work consists of four convolutional layers and two fully connected layers. The input image, of shape 200 × 200 × 1, is passed to the first convolutional layer, which learns 32 filters. Apart from the second convolutional layer, which learns 64 filters with a 5 × 5 kernel, the convolutional layers use 3 × 3 kernels, 'same' padding, and the rectified linear unit activation function. Each of the four convolutional layers is followed by max pooling with a 2 × 2 pooling size, a dropout of 0.25, and a batch normalization layer. The first fully connected layer consists of 512 neurons with a Rectified Linear Unit activation and a dropout ratio of 0.5; the second fully connected layer has four (4) neurons with Softmax as the activation function. After features have been extracted from the input images by the convolutional layers, a softmax classifier determines the probability of the extracted features and returns the confidence

Fig. 1. Sample of the preprocessed images for the flip, skew, and zoom.

Fig. 2. The architecture of the proposed age estimation.



score for each class. The mathematical representation of the softmax function is as follows (Figs. 1 and 2):

S(x_{i}) = \frac{e^{x_{i}}}{\sum_{j=1}^{k} e^{x_{j}}}

where each value x_i of the input vector is exponentiated and divided by the sum of the exponentials of all the inputs.
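To make the architecture concrete, a minimal PyTorch sketch following the layer listing of Table 3 is given below. It is a sketch rather than the exact implementation: the final softmax is folded into the cross-entropy loss (the usual PyTorch idiom), and the flattened feature size 256 × 12 × 12 follows from four 2 × 2 poolings of a 200 × 200 input.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, kernel):
    # Conv -> BatchNorm -> ReLU -> MaxPool -> Dropout, as listed in Table 3
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=kernel, padding=kernel // 2),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
        nn.Dropout(0.25),
    )

class AgeGroupCNN(nn.Module):
    """Four convolutional blocks (32/64/128/256 filters) followed by two
    fully connected layers, matching the layer listing of Table 3."""
    def __init__(self, num_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(1, 32, 3),
            conv_block(32, 64, 5),
            conv_block(64, 128, 3),
            conv_block(128, 256, 3),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 12 * 12, 512),   # 200 -> 100 -> 50 -> 25 -> 12
            nn.BatchNorm1d(512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes),     # softmax is applied inside the loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = AgeGroupCNN()
logits = model(torch.randn(2, 1, 200, 200))
print(logits.shape)  # torch.Size([2, 4])
```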
Our model was trained using the Adam optimizer with an initial learning rate of 0.0001, and we used a reduce-learning-rate-on-plateau schedule monitoring the validation accuracy and loss, with the patience set to ten (10) and the minimum learning rate to 0.001. The Adam update rule is given by:

\theta_{t+1} = \theta_{t} - \frac{\eta \cdot \hat{m}_{t}}{\sqrt{\hat{v}_{t}} + \epsilon}

where

\hat{m}_{t} = \frac{m_{t}}{1 - \beta_{1}^{t}}, \qquad \hat{v}_{t} = \frac{v_{t}}{1 - \beta_{2}^{t}}

and

m_{t} = (1 - \beta_{1})\, g_{t} + \beta_{1}\, m_{t-1}, \qquad v_{t} = (1 - \beta_{2})\, g_{t}^{2} + \beta_{2}\, v_{t-1}

where \epsilon is epsilon (a small constant), \eta is the learning rate, and g_t is the gradient (Table 2).

Table 2. The training details with UTKFace dataset

Classifier Optimizer No of epoch Initial learning rate Weight decay


Age group Adam 35 0.0001 0.005
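A compact training-loop sketch reflecting these settings is given below; the data loaders and the evaluation helper are placeholders supplied by the caller, and PyTorch's ReduceLROnPlateau is used as one way to realize the reduce-on-plateau schedule described above.

```python
import torch

def train_age_classifier(model, train_loader, val_loader, evaluate_fn,
                         epochs=35, lr=1e-4, weight_decay=0.005):
    """Training loop mirroring the quoted settings: Adam at lr 1e-4,
    weight decay 0.005, 35 epochs, and a reduce-on-plateau schedule
    (patience 10) monitoring validation accuracy.
    evaluate_fn(model, loader) -> accuracy is supplied by the caller."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="max", patience=10, min_lr=1e-3)
    criterion = torch.nn.CrossEntropyLoss()

    for epoch in range(epochs):
        model.train()
        for images, targets in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), targets)
            loss.backward()
            optimizer.step()
        val_accuracy = evaluate_fn(model, val_loader)
        scheduler.step(val_accuracy)   # lowers the learning rate on plateaus
    return model
```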

In the implementation, fewer convolutional filters are used in the initial layers and the number of filters is increased in subsequent layers; conversely, a wider window size is used in the initial layers compared to the later layers. Finally, the use of dropout helps to avoid over-fitting (Table 3).

4 Result and Discussion


The overall performance of the proposed model is evaluated using the accuracy metric, and the system achieves an accuracy of 89.75% on the classifier. Figure 3 shows the accuracy achieved with this method; it can be seen that the model accuracy curve approaches its optimum before five epochs. The categorical cross-entropy used to evaluate the loss function of the classifier reaches a loss of 0.3389 on the validation dataset, as shown in Fig. 3. Our model correctly estimates the age group of faces in most instances, while misclassifications were recorded in some cases.

Table 3. The details of the architecture for the proposed method

Layers Output size Filter size


Input image 200 × 200 × 1
CONV1 200 × 200 × 32 3 × 3
BATCHNORM 200 × 200 × 32
ACT 200 × 200 × 32
MAXPOOLING 100 × 100 × 32
DROPOUT 100 × 100 × 32
CONV2 100 × 100 × 64 5 × 5
BATCHNORM 100 × 100 × 64
ACT 100 × 100 × 64
MAXPOOLING 50 × 50 × 64
DROPOUT 50 × 50 × 64
CONV3 50 × 50 × 128 3×3
BATCHNORM 50 × 50 × 128
ACT 50 × 50 × 128
MAXPOOLING 25 × 25 × 128
DROPOUT 25 × 25 × 128
CONV4 25 × 25 × 256 3×3
BATCHNORM 25 × 25 × 256
ACT 25 × 25 × 256
MAXPOOLING 12 × 12 × 256
DROPOUT 12 × 12 × 256
FULLY CONN1 512
BATCHNORM 512
ACTIVATION 512
DROPOUT 512
FULLY CONN2 4

Our model demonstrates a relatively higher accuracy on the UTKFace dataset than the state-of-the-art methods, compared with the accuracies reported by Sithungu et al. [23] and Das et al. [19] (see Table 4). Sithungu et al. [23] employed a lightweight CNN model for age estimation and used the UTKFace dataset to train and test its performance; only 7700 out of 23718 images were used, which is a limitation of their study since training a model with little data causes overfitting. In contrast, we performed data augmentation in our study, which enabled us to use a large dataset to train and test our classifier. Das et al. [19] proposed the Multi-Task Convolution Neural Network

Fig. 3. Model accuracy and loss.

Table 4. Proposed model as compared

REF Accuracy (%)


[23] 45.3
[18] 63.97
[19] 70.1
Proposed model 89.75

(MTCNN), using joint dynamic weight loss alteration to classify gender, age, and race jointly. While they achieved high gender and race precision, their age estimation was less accurate than that of our model.

5 Conclusion
The major challenge associated with age classification has been the stochastic nature of aging among individuals and the uncontrollable age progression information displayed on faces, as deduced from the models applied in previous studies. This validates the need for a more efficient model that can accurately extract features from input images, which is the major underlying process in image classification. Although our CNN approach is a custom one, with the introduction and application of a large facial image dataset we consider that our model achieves a meaningful improvement. In this study, the proposed model is implemented on the UTKFace dataset to extract distinctive features, which are then classified into

the appropriate classes. The efficiency of the model is validated on the FGNET dataset, and its performance is compared with other studies, as shown in Table 4. The proposed CNN architecture performs very well and demonstrates a state-of-the-art result on the UTKFace dataset. For future work, we will consider the effect of gender information on age estimation and a deeper CNN model for age, gender, and ethnicity. We will also explore applying pretrained deep learning models to our study.

References
1. Anand, A., Labati, R.D., Genovese, A., Muñoz, E., Piuri, V., Scotti, F.: Age estima-
tion based on face images and pre-trained convolutional neural networks. In: 2017
IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1–7. IEEE
(2017)
2. Zhang, Z.: Apparent, age estimation with CNN. In: 4th International Conference on
Machinery, Materials and Information Technology Applications, p. 2017. Atlantis
Press (2016)
3. Agbo-Ajala, O., Viriri, S.: Age estimation of real-time faces using convolutional
neural network. In: Nguyen, N.T., Chbeir, R., Exposito, E., Aniorté, P., Trawiński,
B. (eds.) ICCCI 2019. LNCS (LNAI), vol. 11683, pp. 316–327. Springer, Cham
(2019). https://doi.org/10.1007/978-3-030-28377-3 26
4. Liu, H., Lu, J., Feng, J., Zhou, J.: Ordinal deep learning for facial age estimation.
IEEE Trans. Circ. Syst. Video Technol. 29(2), 486–501 (2017)
5. Huerta, I., Fernández, C., Segura, C., Hernando, J., Prati, A.: A deep analysis on
age estimation. Pattern Recogn. Lett. 68, 239–249 (2015)
6. Angulu, R., Tapamo, J.R., Adewumi, A.O.: Age estimation via face images: a
survey. EURASIP J. Image Video Process. 2018(1), 42 (2018)
7. Kwon, Y.H., da Vitoria Lobo, N.: Age classification from facial images. Comput.
Vis. Image Underst. 74(1), 1–21 (1999)
8. Sendik, O., Keller, Y.: DeepAge: deep learning of face-based age estimation. Sig.
Process. Image Commun. 78, 368–375 (2019)
9. Agbo-Ajala, O., Viriri, S.: Deeply learned classifiers for age and gender predictions
of unfiltered faces. Sci. World J. 2020, 1289408 (2020)
10. Mualla, N., Houssein, E.H., Zayed, H.H.: Face age estimation approach based on
deep learning and principle component analysis. Int. J. Adv. Comput. Sci. Appl.
9(2), 152–157 (2018)
11. Huang, J., Li, B., Zhu, J., Chen, J.: Age classification with deep learning face
representation. Multimedia Tools Appl. 76(19), 20231–20247 (2017)
12. Oloko-Oba, M., Viriri, S.: Diagnosing tuberculosis using deep convolutional neural
network. In: El Moataz, A., Mammass, D., Mansouri, A., Nouboud, F. (eds.) ICISP
2020. LNCS, vol. 12119, pp. 151–161. Springer, Cham (2020). https://doi.org/10.
1007/978-3-030-51935-3 16
13. Qiu, J.: Convolutional neural network based age estimation from facial image and
depth prediction from single image (2016)
14. Geng, X., Yin, C., Zhou, Z.H.: Facial age estimation by learning from label distri-
butions. IEEE Trans. Pattern Anal. Mach. Intell. 35(10), 2401–2412 (2013)
15. Lanitis, A., Taylor, C.J., Cootes, T.F.: Toward automatic simulation of aging effects
on face images. IEEE Trans. Pattern Anal. Mach. Intell. 24(4), 442–455 (2002)

16. Niu, Z., Zhou, M., Wang, L., Gao, X., Hua, G.: Ordinal regression with multi-
ple output CNN for age estimation. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 4920–4928 (2016)
17. Yi, D., Lei, Z., Li, S.Z.: Age estimation by multi-scale convolutional network. In:
Cremers, D., Reid, I., Saito, H., Yang, M.-H. (eds.) ACCV 2014. LNCS, vol. 9005,
pp. 144–158. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-16811-
1 10
18. Savchenko, A.V.: Efficient facial representations for age, gender and identity recog-
nition in organizing photo albums using multi-output ConvNet. PeerJ Comput. Sci.
5, e197 (2019)
19. Das, A., Dantcheva, A., Bremond, F.: Mitigating bias in gender, age and ethnicity
classification: a multi-task convolution neural network approach. In: Proceedings
of the European Conference on Computer Vision (ECCV), pp. 4920–4928 (2018)
20. Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
21. Rothe, R., Timofte, R., Van Gool, L.: Deep expectation of real and apparent age
from a single image without facial landmarks. Int. J. Comput. Vis. 126(2), 144–157
(2016). https://doi.org/10.1007/s11263-016-0940-3
22. Zeng, J., Ma, X., Zhou, K.: CAAE++: improved CAAE for age progres-
sion/regression. IEEE Access 6, 66715–66722 (2018)
23. Sithungu, S., Van der Haar, D.: Real-time age detection using a convolutional neu-
ral network. In: Abramowicz, W., Corchuelo, R. (eds.) BIS 2019. LNBIP, vol. 354,
pp. 245–256. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20482-
2 20
Deep Interpretation with Sign Separated
and Contribution Recognized
Decomposition

Lucas Y. W. Hui(B) and De Wen Soh

Singapore University of Technology and Design, Changi, Singapore


Lucas [email protected], DeWen [email protected]

Abstract. Network interpretation in the context of explainable AI continues to gather interest, not only because of the need to explain algorithmic decisions but also because of potential improvements that can be made to network design. A large pool of research efforts has been made, including explanation by training-sample representer points, exhaustive feature occlusion methods, locally learned interpretable models, sensitivity methods using network gradients, and relevance models using layer-wise back-propagation. It is, however, a constant challenge to interpret different network architectures or even different network function layers given the multiplicity of models, tools, rules, and assumptions. In addition, there are challenges in producing good interpretation results; in particular, that of jointly improving both the sensitivity and the relevancy of each attribute's contribution within a network to the final network decision. A unified decomposition rule based on new propositions about negative features and majority contribution is proposed in this paper to address these challenges. Furthermore, quantitative measures are discussed to address the performance of both the sensitivity and the relevancy of interpretation.

Keywords: Network interpretation · Decomposition method

1 Introduction
Deep convolutional networks (DNNs), such as state-of-the-art image classifiers1, are known to yield high predictive performance given sufficiently large training datasets. For example, while Inception-ResNet-V2 [30] has already achieved near-perfect 100% training accuracy and around 80% validation accuracy on ImageNet [7], EfficientNet [31] further pushes the validation accuracy on ImageNet to 84% using efficient architecture scaling. The complexity and non-linearity of such impressive networks have drawn wide research efforts [27] on interpretation to better explain and understand prediction results.
Specifically on the training side, the representer point method [34] has been
devised to capture the importance of each training sample towards the learned
1 https://paperswithcode.com/sota/image-classification-on-imagenet.

network parameters2 . In terms of input interpretation, there are effective meth-


ods such as Occlusion analysis methods [8,28], which locate important features
by input perturbation, and sensitivity methods [2,33] which make use of gradi-
ents computed at backward passes to determine rate of change in output pre-
diction score with respect to specific input attribute change. Among sensitivity
methods is the Integrated Gradients Method [9,18,29]:

\left(x_{\mathrm{input},i} - x_{\mathrm{baseline},i}\right) \int_{\alpha=0}^{1} \frac{\partial f\left(x_{\mathrm{baseline}} + \alpha\left(x_{\mathrm{input}} - x_{\mathrm{baseline}}\right)\right)}{\partial x_{\mathrm{input},i}} \, d\alpha.    (1)
where xbaseline is typically selected as blank or zero input.
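For illustration, a minimal sketch of this integral, approximated by a Riemann sum along the straight path from the baseline to the input, is given below; the step count and the zero baseline are common choices rather than values prescribed by the cited works.

```python
import torch

def integrated_gradients(model, x_input, target_class, x_baseline=None, steps=50):
    """Riemann-sum approximation of Eq. (1): average the gradient of the
    target-class score along the path from the baseline to the input, then
    scale by (input - baseline). x_input is a single example with a leading
    batch dimension, e.g. shape (1, C, H, W)."""
    if x_baseline is None:
        x_baseline = torch.zeros_like(x_input)      # blank/zero baseline
    accumulated = torch.zeros_like(x_input)
    for step in range(1, steps + 1):
        alpha = step / steps
        point = (x_baseline + alpha * (x_input - x_baseline)).detach().requires_grad_(True)
        score = model(point)[0, target_class]
        score.backward()
        accumulated += point.grad
    return (x_input - x_baseline) * accumulated / steps
```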
From the prediction output perspective, specific interpretable models such as linear models [10,24] can be learned locally around each prediction. Meanwhile, there is a broad need and interest to interpret at the hidden-neuron level, and this is addressed mainly by relevance models [3,21] involving key techniques such as Layer-wise Relevance Propagation (LRP) [20]. The relevance method decomposes and back-propagates an initial relevance value assigned to the output of an upper layer to all inputs of a lower layer, thereby determining how relevant each input attribute is to an output result. Specifically, LRP is based on decomposition rules [21] derived using a relevance model [19] approximated by a first-order Taylor series3 to back-propagate relevance values R_i^(l), where l is the referenced layer and i indexes the outputs of that layer. The back-propagation of the relevance values follows

R_i^{(l)} = \sum_j \frac{v_i^{(j)} w_{ij}}{\sum_i v_i^{(j)} w_{ij}} \, R_j^{(l+1)},    (2)

where v_i^(j) is a search vector for the root point of the Taylor approximation.
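As an illustration of a single decomposition step, the sketch below back-propagates relevance through one bias-free dense layer with the ε-stabilised z-rule, i.e. Eq. (2) with the search vector v_i^(j) = x_i; it is a toy example, not a full LRP tool chain.

```python
import torch

def lrp_dense_epsilon(layer_input, weight, relevance_out, epsilon=1e-6):
    """One LRP back-propagation step through a dense layer using the basic
    z-rule with an epsilon stabiliser: the relevance of output j is
    redistributed to inputs i in proportion to their contributions
    z_ij = x_i * w_ij.
    Shapes: layer_input (n_in,), weight (n_out, n_in), relevance_out (n_out,)."""
    z = weight * layer_input                      # z_ij, shape (n_out, n_in)
    z_sum = z.sum(dim=1)                          # denominator per output neuron
    z_sum = z_sum + epsilon * torch.sign(z_sum)   # stabilise near-zero sums
    return (z / z_sum.unsqueeze(1) * relevance_out.unsqueeze(1)).sum(dim=0)

# Toy example: 3 inputs, 2 output neurons
x = torch.tensor([1.0, -0.5, 2.0])
W = torch.tensor([[0.3, 0.1, -0.2], [0.5, -0.4, 0.2]])
R_out = torch.tensor([1.0, 0.5])
R_in = lrp_dense_epsilon(x, W, R_out)
print(R_in, R_in.sum())   # the input relevances approximately sum to R_out.sum()
```

Summing the returned input relevances approximately recovers the relevance that entered the layer, which is the conservation property referred to throughout this paper.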

2 Motivations
Given the multiplicity of interpretation tools and algorithmic models, on top of the fast development of DNN architectures and machine learning solutions, it remains a challenge to interpret classification results and to understand how the interpretation results relate to data attributes and classification. In particular, while sensitivity and relevance methods draw strong interest and find many applications [5,32] due to their quantitative results and implementation effectiveness, this section discusses some associated observations and challenges.

2.1 Sensitivity vs Relevancy


An intuitive method of testing interpretation is by perturbation [26]. The concept
involves generating a heat-map of interpretation results with reference to each
2 https://blog.ml.cmu.edu/2019/04/19/representer-point-selection-explain-dnn/.
3 https://en.wikipedia.org/wiki/Taylor%27s theorem.

Fig. 1. Heat-map results, perturbations, and classification scores. Perturbed images (column 3) generated from the Sensitivity method generally produce a bigger impact on the classification score (rows 2 & 4), while the Relevancy method clearly yields heat-maps (column 4) that are better focused on visually important features.

data attribute, selecting a set of data attributes with highest values according
to the heat-map, perturbing these attributes by replacing their values to zero or
random numbers, re-classifying the perturbed data, and evaluating the impact
on classification score.
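A minimal sketch of this perturbation test is given below; the number of perturbed pixels and the zero-replacement strategy are illustrative choices.

```python
import torch

def perturbation_score_drop(model, image, heatmap, target_class, k=500):
    """Evaluate an interpretation by perturbation: zero out the k input
    positions ranked highest by the heat-map, re-classify, and report the
    drop in the target-class score. image: (1, C, H, W); heatmap: (H, W)."""
    with torch.no_grad():
        base_score = model(image)[0, target_class].item()

        flat = heatmap.flatten()
        top_idx = torch.topk(flat, k).indices        # most "important" pixels
        mask = torch.ones_like(flat)
        mask[top_idx] = 0.0
        mask = mask.view(1, 1, *heatmap.shape)       # broadcast over channels

        perturbed = image * mask
        new_score = model(perturbed)[0, target_class].item()
    return base_score - new_score                    # large drop => sensitive attribution
```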
Figure 1 illustrates DenseNet-121 [15] classification results for two example images. The significant drops in classification score caused by the perturbations generated from the Sensitivity method demonstrate a better correlation of interpretation results with prediction scores compared to the Relevance method. On the other hand, the normalized heat-maps created by the Relevance method are noticeably better aligned with the visual features of the objects belonging to the classification results. A main motivation for this paper is to seek an interpretation method that can jointly improve both sensitivity and relevancy. Furthermore, there is a need to devise a quantitative measurement concept for the relevancy of interpretation results to classification or prediction outcomes.

2.2 Unified Decomposition Method


While the relevance method shows good potential for subjective correlation with input objects and output predictions, some challenges remain in applying the method across the variety of DNN architectures and function layers. A first attempt is to apply the LRP method to various state-of-the-art DNN image classifiers using the basic or default z decomposition rule (aka ε-rule) [3], derived using v_i^(j) = x_i in Eq. (2); interpretation results in terms of heat-maps for an example input image are shown in Fig. 2 (first row). It can be seen that the default LRP rule fails to interpret the pre-trained MobileNet-V2 model [14], hence giving a blank (dark-blue) heat-map, and that for the other pre-trained image classifiers [13,15,30] it produces heat-maps somewhat similar to the Integrated Gradients Method (shown in Fig. 1).

Fig. 2. Heat-map results by LRP using single default or signed decomposition rule on
various DNN image classifiers (row 1 & 2), vs heat-map results by LRP using a Preset
of decomposition rules specific to functions and layers within a network model (row 3).
(Color figure online)

These heat-maps can be compared to another LRP method (Fig. 2, row 3) tuned with a Preset-A [1] of decomposition rules specific to functions and layers within a network model. These decomposition rules [21] are derived using various root-point search vectors v_i^(j) according to Eq. (2), leading to the examples shown in the table in Fig. 3. Notice that the Preset-A heat-maps are significantly tuned to focus on features relevant to the objects to be detected, hence the brighter red pixels around object contours. However, challenges can still be seen, for example the random blue patches on the DenseNet [15] heat-map signifying negation of relevance on uncorrelated image features, and the unfocused heat-map in the case of Inception-ResNet [30] suggesting low sensitivity resolution.

Fig. 3. LRP decomposition rules. Note that (z)^- = min(0, z), (z)^+ = max(0, z), z_j^(+) = Σ_i z_ij^+, and z_j^(-) = Σ_i z_ij^-. (Color figure online)

There is also a challenge in the relevance model for Batch Normalization layers [11,17], which are essential to deep network learning. In-depth analysis of such Batch Normalization layers using various possible decomposition rules [16] has identified several weaknesses in the relevance model, including weak consideration of network bias parameters, cancellation effects of large positive and negative contributions (contributing features), and lack of consideration for negative-only contributions. Therefore, it is another strong motivation of the present paper to introduce a unified, consistent, and understandable decomposition model that improves interpretation accuracy as well as robustness across different DNN architectures and network layers.

3 Proposed Decomposition

Based on the challenges and weaknesses observed, several design considerations are proposed in this paper for network interpretation using the relevance model. Firstly, negative-only contributions are possible within a network layer, and therefore the constraints placed on the relevance model for root-point searching should be reconsidered. Consequently, a negative contribution may not equal negative relevance, implying that separate handling of positive and negative contributions is necessary for relevance decomposition. Furthermore, cancellation of contributions at a network node may not equal no contribution. While there are workaround solutions to cancellation such as the ε-rule (Fig. 3) or the γ-rule [20], the decomposition remains unstable, especially when large positive contributions are cancelled or almost cancelled by large negative contributions (Σ_i (x_i w_ij)^+ ≈ |Σ_i (x_i w_ij)^-|, or Σ_i x_i w_ij = 0). In addition, solutions should not favor positive-only contributions. Accordingly, a new DNN interpretation decomposition is proposed, with the concepts described in the following subsections.

3.1 Contribution Recognized Decomposition

To explain the concept of contributions vs. relevance, a very simple network node y = x_1 + x_2 is used, as shown in Fig. 4, with four example sets of input contributions [x_1, x_2]. Relevance R_j^{(l+1)} is back-propagated from layer (l+1) to the inputs of layer (l) as R_{1←j}^{(l)} and R_{2←j}^{(l)} respectively, using the decomposition rules given in the Fig. 3 table. To be addressed are the cases highlighted in red, including (1) exploding relevance when y ≈ 0 with the z-rule, (2) vanishing relevance in the case of negative-only contributions with the z^+-rule, and (3) positive-biased relevance and reversed relevance when y < 0 in the case of both the z^+-rule and the α2β1-rule (i.e. the αβ-rule with α = 2 and β = 1).
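To make the first failure case concrete, the following small NumPy sketch (our own illustration, not code from any referenced LRP implementation) applies the basic z-rule, R_{i←j} = (z_{ij} / \sum_i z_{ij}) R_j, to the toy node of Fig. 4; when the two contributions almost cancel, the denominator approaches zero and the back-propagated relevances explode.

```python
import numpy as np

def z_rule(contributions, R_out):
    """Basic z-rule: distribute the output relevance R_out proportionally
    to each input contribution z_ij."""
    z = np.asarray(contributions, dtype=float)
    return z / z.sum() * R_out

# Toy node y = x1 + x2 (Fig. 4).
print(z_rule([1.0, 1.0], R_out=1.0))      # well-behaved: [0.5, 0.5]
print(z_rule([1.0, -0.999], R_out=1.0))   # y ~ 0: relevance explodes to ~[1000, -999]
```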
A contribution-recognized concept defines that the majority of the relevance is assigned to the majority contributor, separated by positive and negative value, as shown in Fig. 5. When a neuron output z_j is positive, the majority of the relevance R_j^{(l+1)} will be propagated to the positive contributors z_{i→j}^{+}; furthermore,

Fig. 4. Contributions vs. Relevance (R) back-propagation illustration using a simple example node of output y = x_1 + x_2 with 4 sets of example [x_1, x_2] values. (Color figure online)

the majority distribution may be made adaptive to the ratio of negative to positive contributions (i.e. always < 1) according to Eq. (3):

R^{(l)}_{i\leftarrow j} = \Bigg[ \bigg(1 + \gamma \frac{\big|\sum_i z^{-}_{ij}\big|}{\big|\sum_i z^{+}_{ij}\big|}\bigg) \frac{z^{+}_{ij}}{\sum_i z^{+}_{ij}} \;-\; \gamma \frac{\big|\sum_i z^{-}_{ij}\big|}{\big|\sum_i z^{+}_{ij}\big|} \cdot \frac{z^{-}_{ij}}{\sum_i z^{-}_{ij}} \Bigg] R^{(l+1)}_{j} \qquad (3)

Similarly, when the output z_j is negative, the majority of the relevance is back-propagated to the negative contributors according to Eq. (4):

R^{(l)}_{i\leftarrow j} = \Bigg[ \bigg(1 + \gamma \frac{\big|\sum_i z^{+}_{ij}\big|}{\big|\sum_i z^{-}_{ij}\big|}\bigg) \frac{z^{-}_{ij}}{\sum_i z^{-}_{ij}} \;-\; \gamma \frac{\big|\sum_i z^{+}_{ij}\big|}{\big|\sum_i z^{-}_{ij}\big|} \cdot \frac{z^{+}_{ij}}{\sum_i z^{+}_{ij}} \Bigg] R^{(l+1)}_{j} \qquad (4)

Fig. 5. Adaptive contribution-recognized concept for relevance decomposition. The amount of relevance R_j^{(l+1)} to be back-propagated to the positive contributors \sum_i z_{i→j}^{+} and the negative contributors \sum_i z_{i→j}^{-} depends on whether the neuron output z_j is (a) positive, or (b) negative.

3.2 Sign-Separated Relevance


A sign-separated relevance concept is also proposed according to the current propositions. Specifically, positive and negative relevance are separately captured, hence R_i^{(+)} = \sum_j R_{ij}^{(+)} and R_i^{(-)} = \sum_j R_{ij}^{(-)}, and combined finally at the input layer as R_i = α R_i^{(+)} + β R_i^{(-)}. Using the conservation property of the Deep Taylor Decomposition model [21], the sign-separated relevances are constrained according to

\sum_i R_{i\leftarrow j}^{(+)} + \sum_i R_{i\leftarrow j}^{(-)} = R_j^{(+)} + R_j^{(-)} . \qquad (5)

By integrating the proposed concepts, a final Contribution-Recognized and Sign-Separated (CRSS) decomposition method for DNN interpretation is formulated, for the case z_j ≥ 0, as:

R_i^{(+)} = \sum_j \Bigg[ \bigg(1 - \gamma \frac{z_j^{(-)}}{z_j^{(+)}}\bigg) \frac{z_{ij}^{+}}{z_j^{(+)}}\, R_j^{(+)} + \gamma \frac{z_j^{(-)}}{z_j^{(+)}} \cdot \frac{z_{ij}^{-}}{z_j^{(-)}}\, R_j^{(-)} \Bigg] \qquad (6)

R_i^{(-)} = \sum_j \Bigg[ \bigg(1 - \gamma \frac{z_j^{(-)}}{z_j^{(+)}}\bigg) \frac{z_{ij}^{+}}{z_j^{(+)}}\, R_j^{(-)} + \gamma \frac{z_j^{(-)}}{z_j^{(+)}} \cdot \frac{z_{ij}^{-}}{z_j^{(-)}}\, R_j^{(+)} \Bigg] ; \qquad (7)
and similarly for the case z_j < 0:

R_i^{(-)} = \sum_j \Bigg[ \bigg(1 - \gamma \frac{z_j^{(+)}}{z_j^{(-)}}\bigg) \frac{z_{ij}^{-}}{z_j^{(-)}}\, R_j^{(-)} + \gamma \frac{z_j^{(+)}}{z_j^{(-)}} \cdot \frac{z_{ij}^{+}}{z_j^{(+)}}\, R_j^{(+)} \Bigg] \qquad (8)

R_i^{(+)} = \sum_j \Bigg[ \bigg(1 - \gamma \frac{z_j^{(+)}}{z_j^{(-)}}\bigg) \frac{z_{ij}^{-}}{z_j^{(-)}}\, R_j^{(+)} + \gamma \frac{z_j^{(+)}}{z_j^{(-)}} \cdot \frac{z_{ij}^{+}}{z_j^{(+)}}\, R_j^{(-)} \Bigg] , \qquad (9)

where z_j^{(+)} = \sum_i z_{ij}^{+}, z_j^{(-)} = \sum_i z_{ij}^{-}, and note that \sum_i R_{i\leftarrow j}^{(+)} = R_j^{(+)}.
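The following NumPy sketch shows one CRSS back-propagation step for a single layer under our reading of Eqs. (6)–(9); it is an illustration only (the variable names and the ε safeguard are ours, and this is not the authors' reference implementation).

```python
import numpy as np

def crss_layer(z, R_plus, R_minus, gamma=0.5, eps=1e-9):
    """One CRSS step for contributions z[i, j] (input i -> neuron j), given
    sign-separated output relevances R_plus[j], R_minus[j]; returns the
    sign-separated input relevances.  Sketch of Eqs. (6)-(9) as we read them."""
    zp, zm = np.clip(z, 0, None), np.clip(z, None, 0)    # z_ij^+ and z_ij^-
    Zp = zp.sum(axis=0) + eps                            # z_j^(+)
    Zm = zm.sum(axis=0) - eps                            # z_j^(-), non-positive
    zj = z.sum(axis=0)

    ratio = np.where(zj >= 0, np.abs(Zm) / Zp, Zp / np.abs(Zm))  # minority share, < 1
    major, minor = 1.0 - gamma * ratio, gamma * ratio

    w_major = np.where(zj >= 0, zp / Zp, zm / Zm)        # weights of majority contributors
    w_minor = np.where(zj >= 0, zm / Zm, zp / Zp)        # weights of minority contributors

    R_plus_in = (w_major * major * R_plus + w_minor * minor * R_minus).sum(axis=1)
    R_minus_in = (w_major * major * R_minus + w_minor * minor * R_plus).sum(axis=1)
    return R_plus_in, R_minus_in
```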

4 Experiment Results
While much progress continues to push top-of-the-line image classifier performance, such as network architecture search [12], efficient network scaling (EfficientNets) [31], and meta pseudo labeling [23], some basic building blocks for image inference, including the Dense layer of DenseNet [15], the Bottleneck of ResNet [13], the Inverted Residual of MobileNet [14], and the Inception module of Inception-ResNet [30], remain relevant for interpretation. The proposed CRSS decomposition method can be applied consistently across all function layers within various DNN classifiers without preset tuning, starting with R_init^{(+)} = the network output classification score and R_init^{(-)} = 0, and with γ = 0.5.

4.1 Heat-Map Comparisons


Heat-maps are normalized and color-mapped interpretation outputs representing the sensitivity of specific input attributes (i.e. pixel positions) towards the classification result. They offer a visual, or subjective, way to correlate sensitivity with the classified object. Shown in Fig. 6 are results for an example input image comparing the proposed CRSS decomposition method vs. Preset-A [1] LRP, and Preset-G LRP with the specific |z|-rule [16] for Batch Normalization layers. Heat-maps by CRSS are generally sharper and more visually aligned with object outlines, suggesting better sensitivity and relevancy as well as robustness across network layers and architectures.

Fig. 6. Proposed CRSS vs Preset-A and Preset-G LRP heat-maps.

4.2 Perturbation Tests

Fig. 7. CRSS and Preset-A/E/G LRP perturbation tests based on 4 reference art image
classifiers. X-axis represents average prediction and Y-axis is perturbation steps. The
Preset-F LRP is modified Preset-A with z + -rule decomposition at Batch Normalization
layers.

To quantify, a perturbation test [26] is performed in which groups of pixels are progressively replaced according to the interpretation heat-map ranking, and the resulting perturbed images are re-classified to compare the impact on the classification scores. Tests are performed using around 10k validation images sampled equally from the 1k class labels of ImageNet, and 400 input R/G/B components are independently perturbed at each step. Results of interpretation based on 4 reference art image classifiers are shown in Fig. 7, where the proposed CRSS method demonstrates a faster drop in average prediction scores starting from the highest heat-map rankings. The results are consistent with the heat-map quality comparison, where CRSS shows better sensitivity and robustness.
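A minimal sketch of this pixel-flipping protocol is given below; the classifier is a placeholder with a Keras-style predict interface, and the replacement value and bookkeeping are illustrative rather than the exact settings of [26] or of this paper.

```python
import numpy as np

def perturbation_curve(model, image, heatmap, label, steps=20,
                       n_per_step=400, fill_value=0.0):
    """Replace the highest-ranked heat-map positions step by step and record
    how the prediction score for `label` decays (a faster drop means the
    ranking pointed at genuinely relevant inputs)."""
    img = image.copy()
    order = np.argsort(heatmap.ravel())[::-1]            # most relevant first
    scores = [float(model.predict(img[None])[0, label])]
    for s in range(steps):
        idx = order[s * n_per_step:(s + 1) * n_per_step]
        rows, cols = np.unravel_index(idx, heatmap.shape)
        img[rows, cols] = fill_value                     # perturb the selected positions
        scores.append(float(model.predict(img[None])[0, label]))
    return scores
```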
While the Integrated Gradients (IG) method [29] might show higher sensitivity than LRP methods, as illustrated in Fig. 1, this can be further evaluated using specific perturbation tests. The idea is to perturb pixels along or against the direction of the gradients to induce higher or lower prediction scores. The graph on the left of Fig. 8 shows that IG generally has high sensitivity; however, comparing CRSS and Preset-A LRP to IG, CRSS shows stronger robustness to perturbation.

Fig. 8. CRSS and Preset-A LRP vs Integrated Gradients Method using ResNet-50
as example. The IG method is significantly affected by direction of perturbation i.e.
replace by zero, mid, or min/max values, while CRSS remains robust against Preset-A
LRP.

4.3 Relevancy Check


On the topic of how to quantify relevancy, specifically of visually important features in the case of image classifiers, a new cross-correlation checking method is proposed. A good image feature extractor based on human vision algorithms is first selected; candidates include the SIFT [22], SURF [4], or SIFT-FAST [6] methods. In this paper the ORB method [25] is used, for the convenience of its open-source license as well as its simplicity, and some output examples are illustrated in Fig. 9.

Fig. 9. Example image feature extractions (in green) using basic ORB [25] method.
(Color figure online)

The heat-maps generated by the interpretation methods are sampled progressively according to ranking, normalized, and cross-correlated with the normalized feature extraction map (by ORB) to obtain a quantitative measure of relevancy. The results are averaged over around 10k ImageNet validation images and plotted in Fig. 10. Again, this quantitative result for the proposed CRSS method is consistent with the heat-map quality observations regarding visual relevancy to image objects.
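A hedged OpenCV/NumPy sketch of this cross-correlation check is shown below; the thresholding of the heat-map and the normalisation choices are our assumptions, and only standard OpenCV ORB calls are used.

```python
import cv2
import numpy as np

def orb_relevancy_score(gray_image, heatmap, top_fraction=0.1):
    """Cross-correlate the top-ranked heat-map region with a binary map of
    ORB key-point locations to obtain a simple relevancy measure."""
    keypoints = cv2.ORB_create().detect(gray_image, None)

    h, w = heatmap.shape
    feat = np.zeros_like(heatmap, dtype=float)
    for kp in keypoints:
        x, y = int(kp.pt[0]), int(kp.pt[1])
        if 0 <= y < h and 0 <= x < w:
            feat[y, x] = 1.0
    feat /= feat.sum() + 1e-9                       # normalised feature-extraction map

    k = max(1, int(top_fraction * heatmap.size))    # keep only the top-ranked positions
    thresh = np.partition(heatmap.ravel(), -k)[-k]
    hm = np.where(heatmap >= thresh, heatmap, 0.0)
    hm /= hm.sum() + 1e-9                           # normalised heat-map sample

    return float((hm * feat).sum())
```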

Fig. 10. Visual feature cross-correlation check with different interpretation methods based on reference image classifiers. The Integrated Gradients method demonstrates the lowest correlation, while the proposed CRSS method shows the strongest results across all reference network architectures.

5 Discussions

This paper discusses some limitations and challenges of current reference interpretation methods for DNN explanation using sensitivity and relevance. Based on a better understanding of how contributions behave under decomposition, new propositions are made. Following these propositions, new contribution-recognized and sign-separated design concepts for decomposition are formulated and the new CRSS decomposition method is proposed.
The CRSS method aims to jointly optimize sensitivity and relevancy and to work robustly across all network layers and network architectures without preset rules. It provides consistently sharper and more focused interpretation heat-maps, higher sensitivity in perturbation tests, and strong relevancy in image feature cross-correlation tests. While broad interest in explainable AI remains, the current CRSS method aims to have a more directed impact on the fields of object detection and segmentation, data augmentation, and adversarial attacks, with emerging applications in medical or weather analysis and autonomous driving.

References
1. Alber, M., et al.: Investigate neural networks! J. Mach. Learn. Res. 20(93), 1–8
(2019)

2. Ancona, M., Ceolini, E., Öztireli, C., Gross, M.: Towards better understanding
of gradient-based attribution methods for deep neural networks. arXiv preprint
arXiv:1711.06104 (2017)
3. Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.R., Samek, W.: On
pixel-wise explanations for non-linear classifier decisions by layer-wise relevance
propagation. PLoS ONE 10(7), e0130140 (2015)
4. Bay, H., Tuytelaars, T., Van Gool, L.: SURF: speeded up robust features. In:
Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 404–
417. Springer, Heidelberg (2006). https://doi.org/10.1007/11744023 32
5. Böhle, M., Eitel, F., Weygandt, M., Ritter, K.: Layer-wise relevance propagation
for explaining deep neural network decisions in MRI-based Alzheimer’s disease
classification. Front. Aging Neurosci. 11, 194 (2019)
6. Deak, R., Sterca, A., Bădărı̂nză, I.: Improving sift for image feature extraction.
Studia Universitatis Babes-Bolyai, Informatica 62(2) (2017)
7. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale
hierarchical image database. In: 2009 IEEE Conference on Computer Vision and
Pattern Recognition, pp. 248–255. IEEE (2009)
8. Fong, R.C., Vedaldi, A.: Interpretable explanations of black boxes by meaningful
perturbation. In: Proceedings of the IEEE International Conference on Computer
Vision, pp. 3429–3437 (2017)
9. Goh, G.S., Lapuschkin, S., Weber, L., Samek, W., Binder, A.: Understanding inte-
grated gradients with smoothtaylor for deep neural network attribution. arXiv
preprint arXiv:2004.10484 (2020)
10. Guidotti, R., Monreale, A., Ruggieri, S., Pedreschi, D., Turini, F., Giannotti,
F.: Local rule-based explanations of black box decision systems. arXiv preprint
arXiv:1805.10820 (2018)
11. Guillemot, M., Heusele, C., Korichi, R., Schnebert, S., Chen, L.: Breaking batch
normalization for better explainability of deep neural networks through layer-wise
relevance propagation. arXiv preprint arXiv:2002.11018 (2020)
12. Guo, Z., et al.: Single path one-shot neural architecture search with uniform sam-
pling. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020.
LNCS, vol. 12361, pp. 544–560. Springer, Cham (2020). https://doi.org/10.1007/
978-3-030-58517-4 32
13. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 770–778 (2016)
14. Howard, A.G., et al.: MobileNets: efficient convolutional neural networks for mobile
vision applications. arXiv preprint arXiv:1704.04861 (2017)
15. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected
convolutional networks. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 4700–4708 (2017)
16. Hui, L.Y.W., Binder, A.: BatchNorm decomposition for deep neural network inter-
pretation. In: Rojas, I., Joya, G., Catala, A. (eds.) IWANN 2019. LNCS, vol.
11507, pp. 280–291. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-
20518-8 24
17. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by
reducing internal covariate shift. In: International Conference on Machine Learning,
pp. 448–456. PMLR (2015)
18. Jha, A., Aicher, J.K., Gazzara, M.R., Singh, D., Barash, Y.: Enhanced integrated
gradients: improving interpretability of deep learning models using splicing codes
as a case study. Genome Biol. 21(1), 1–22 (2020)

19. Kauffmann, J., Esders, M., Montavon, G., Samek, W., Müller, K.R.: From cluster-
ing to cluster explanations via neural networks. arXiv preprint arXiv:1906.07633
(2019)
20. Montavon, G., Binder, A., Lapuschkin, S., Samek, W., Müller, K.-R.: Layer-wise
relevance propagation: an overview. In: Samek, W., Montavon, G., Vedaldi, A.,
Hansen, L.K., Müller, K.-R. (eds.) Explainable AI: Interpreting, Explaining and
Visualizing Deep Learning. LNCS (LNAI), vol. 11700, pp. 193–209. Springer, Cham
(2019). https://doi.org/10.1007/978-3-030-28954-6 10
21. Montavon, G., Lapuschkin, S., Binder, A., Samek, W., Müller, K.R.: Explaining
nonlinear classification decisions with deep Taylor decomposition. Pattern Recogn.
65, 211–222 (2017)
22. Ng, P.C., Henikoff, S.: SIFT: predicting amino acid changes that affect protein
function. Nucleic Acids Res. 31(13), 3812–3814 (2003)
23. Pham, H., Dai, Z., Xie, Q., Luong, M.T., Le, Q.V.: Meta pseudo labels. arXiv
preprint arXiv:2003.10580 (2020)
24. Ribeiro, M.T., Singh, S., Guestrin, C.: “Why should i trust you?” explaining the
predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD Interna-
tional Conference on Knowledge Discovery and Data Mining, pp. 1135–1144 (2016)
25. Rublee, E., Rabaud, V., Konolige, K., Bradski, G.: ORB: an efficient alternative
to SIFT or SURF. In: 2011 International Conference on Computer Vision, pp.
2564–2571. IEEE (2011)
26. Samek, W., Binder, A., Montavon, G., Lapuschkin, S., Müller, K.R.: Evaluating
the visualization of what a deep neural network has learned. IEEE Trans. Neural
Netw. Learn. Syst. 28(11), 2660–2673 (2016)
27. Samek, W., Montavon, G., Lapuschkin, S., Anders, C.J., Müller, K.R.: Explaining
deep neural networks and beyond: a review of methods and applications. Proc.
IEEE 109(3), 247–278 (2021)
28. Strumbelj, E., Kononenko, I.: An efficient explanation of individual classifications
using game theory. J. Mach. Learn. Res. 11, 1–18 (2010)
29. Sundararajan, M., Taly, A., Yan, Q.: Axiomatic attribution for deep networks. In:
Proceedings of the 34th International Conference on Machine Learning, vol. 70,
pp. 3319–3328. JMLR. org (2017)
30. Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.A.: Inception-v4, inception-ResNet
and the impact of residual connections on learning. In: Thirty-First AAAI Confer-
ence on Artificial Intelligence (2017)
31. Tan, M., Le, Q.: EfficientNet: rethinking model scaling for convolutional neural net-
works. In: International Conference on Machine Learning, pp. 6105–6114. PMLR
(2019)
32. Wang, Y., Liu, J., Chang, X., Mišić, J., Mišić, V.B.: IWA: integrated gradi-
ent based white-box attacks for fooling deep neural networks. arXiv preprint
arXiv:2102.02128 (2021)
33. Yeh, C.K., Hsieh, C.Y., Suggala, A.S., Inouye, D., Ravikumar, P.: How sensitive
are sensitivity-based explanations? arXiv preprint arXiv:1901.09392 (2019)
34. Yeh, C.K., Kim, J.S., Yen, I.E., Ravikumar, P.: Representer point selection for
explaining deep neural networks. arXiv preprint arXiv:1811.09720 (2018)
Deep Learning for Age Estimation
Using EfficientNet

Idowu Aruleba and Serestina Viriri(B)

School of Mathematics, Statistics and Computer Science


University of KwaZulu-Natal, Durban, South Africa
[email protected]

Abstract. The human face constitutes various biometric features that


could be used to estimate important details from humans, such as age.
The automation of age estimation has been further limited by variations
in facial landmarks and appearances, together with the lack of enor-
mous databases. These have also limited the efficiencies of conventional
approaches such as the handcrafted method for adequate age estimation.
More recently, Convolutional Neural Network (CNN) methods have been
applied to age estimation and image classification with recorded improve-
ments. In this work, we utilise the CNN-based EfficientNet architecture
for age estimation, which, so far, has not been employed in any current
study to the best of our knowledge. This research focused on applying the
EfficientNet architecture to classify an individual’s age in the appropriate
age group using the UTKface and Adience datasets. Seven EfficientNet
variants (B0–B6) were presented herein, which were fine-tuned and used
to evaluate age classification efficiency. Experimentation showed that the
EfficientNet-B4 variant had the best performance on both datasets with
accuracy of 73.5% and 81.1% on UTKFace and Adience, respectively.
The models showed a promising pathway in solving problems related
to learning global features, reducing training time and computational
resources.

Keywords: Age estimation · Classification · Deep learning · Transfer


learning · EfficientNet architecture

1 Introduction

Age estimation encompasses the determination of an individual's age, which can be based on the biometric features of a facial image. These include soft features (gender, skin colour, and hair colour) and traits such as marks, wrinkles, and moles, to mention a few, which are extracted to estimate a person's age [1]. In the last few years, numerous studies have been conducted on age estimation using facial images for real-world applications such as access monitoring and control, and enhanced communication. Every human seems to be equipped with the ability to predict an individual's age at a glance from the face, similar to

other biological trait detection tasks such as classifying gender and ethnicity;
achievable due to the presence of non-verbal features [1].
These intrinsic human strengths are fundamental tools and already an essential part of our social synergy. However, a human can easily be deceived by non-natural factors such as makeup, skin tan, and plastic surgery, among many others. All of these factors can easily influence an individual's perceived age and contribute to a more complex age estimation task. Numerous studies have been proposed to efficiently implement age estimation systems, each with several limitations and drawbacks [2]. Previous approaches have real-world limitations, mainly due to variations in the conditions in which some images are taken, which could, in turn, affect the accuracy of the system. These conditions include lighting, background, and facial expression, which cumulatively could account for reductions in the quality of the images to be processed [3,4].
In recent times, deep learning techniques such as the Convolutional Neural Network (CNN) have helped in learning ageing features directly from larger-scale facial data, which has led to their use in human age estimation [5]. CNNs have recently become one of the most enticing methods and have been a significant determinant in recent achievements and complex implementations related to machine learning applications such as image detection, face recognition, and age estimation. CNNs thereby address two of the most critical computer vision problems of conventional algorithms: the extensive feature design effort and the limited ability to work with big data. However, age classification remains an active research area, since existing classification techniques have not yet achieved an accuracy of 100%. Over recent years, several derivations have been developed from the state-of-the-art CNN methods to improve accuracy and more easily learn image representations from faces for age estimation [6]. These include CNN variants such as AlexNet [7], VGGNet [8], SqueezeNet [9], Xception [10], GoogleNet [11], and ResNet [12], as implemented in previous CNN-based age estimation studies. Tan and Le [13] recently designed a new baseline network by utilising neural architecture search and scaling it up to obtain a family of models. This family of models is called EfficientNet, and it has been shown to achieve better accuracy and efficiency than preceding ConvNets when trained on the ImageNet dataset. In this paper, we examine seven variants (B0–B6) of the EfficientNet architecture on the age estimation task. We investigate whether fine-tuned EfficientNets can achieve a state-of-the-art result on the age estimation task and which variant performs best compared to preceding architectures.
Section 2 covers work related to our approach. Section 3 explains the materials
and methods used in this research. The results of the experiments and discussion
are presented in Sect. 4. Section 5 covers the conclusion of the paper.

2 Related Work

Automatically extracting age-relevant attributes from facial images has led to various research efforts in recent times, and numerous approaches have been proposed. The age estimation task has been tackled both as a classification and as a regression problem. Qawaqneh et al. [18] tackled age estimation using the VGG-Face model, modified, fine-tuned, and trained on the Adience dataset. Their work showed that CNN models trained for the face recognition task can be used for age estimation, and that pre-trained models can help control overfitting and improve age estimation accuracy.
Anand et al. [15] approached the age estimation task as a classification and regression problem by applying post-processing strategies to increase the performance of pre-trained deep networks. They made use of pre-trained models like VGG-Face and AlexNet to learn and extract features from the images. Three strategies [16,17] for dimensionality reduction were analysed, and the features computed in the dimensionality reduction phase were used to train a Feed-Forward Neural Network (FFNN) both as a regressor to estimate age and as a classifier to classify age groups. Their results on public datasets show that a pre-trained model can be used to extract features for age estimation without applying computationally expensive fine-tuning techniques. The authors in [18] examined the VGG-Face pre-trained model for the age estimation task using real-world facial images. The base model has eight convolutional layers and three fully connected layers. Their experiments were done using the LFW [19] and YFT [20] databases.
Zhang et al. [21] proposed a novel CNN method that estimates age and gender, called Residual Networks of Residual Networks (RoR). It was first pre-trained on ImageNet and then fine-tuned on the IMDB-WIKI-101 dataset to further learn features in facial images. Experimentation was performed on the Adience dataset, which shows the RoR method's effectiveness for age and gender estimation in the wild. The
authors in [22] explore transfer learning using VGG19 and VGGFace pre-trained
models. They focused on testing the effects of change in various schemes and
training parameters to improve accuracy. The comparison was made based on
training techniques such as input standardisation, data augmentation and label
distribution age encoding.
Lin et al. [23] utilised VGG-16 to tackle the age estimation task. They fine-tuned their model on the FGNet dataset paired with a contrastive loss and further fine-tuned the intermediate models using AvgOut-FC on the Adience dataset; their model gave an accuracy of 56.9% on Adience. Das et al. [24] used a Multi-Task Convolutional Neural Network (MTCNN) with dynamic joint loss weight adjustment to jointly classify gender, age, and race. To enhance the study of face attributes, they used a FaceNet for face recognition with ResNet V1 Inception, and they also employed transfer learning to adapt the pre-trained model to less similar face attributes. For the objective of model comparison, Han [25] proposed a work that estimates both precise age and age group, comparing models such as InceptionResNetV2, Xception, DesNet21, ResNet152V2, VGG16 and InceptionV3; taking advantage of a soft-labelling technique to analyse the results, they reported that Xception performed best. Dagher et al. [26] proposed a stratified model that achieves high age estimation accuracy; it consists of a set of pre-trained 2-class GoogleNets coupled with an optimum age gap that better organises face images into the age groups to which they belong. Some of these proposed methods still have difficulty estimating age from images taken in the wild, which exhibit discrepancies such as non-frontal poses, illumination and occlusion. In this paper, we try to address the effect of these discrepancies on age estimation.

3 Materials and Method

3.1 Datasets

In this work, the model begins with the datasets; UTKFace and Adience are used, both of which are publicly available.

– UTKFace: This face dataset consists of over 20,000 facial images in the wild with a long age span ranging from 0 to 100 years old. Each facial image is labelled with age, gender, and ethnicity, and some samples provide corresponding aligned and cropped faces. We grouped the ages into five (5) age groups: 0–13, 14–23, 24–39, 40–55 and 56 and above.
– Adience: The Adience dataset consists of images of 2284 subjects taken in the wild. It is grouped into 8 age groups: 0–2, 4–6, 8–13, 15–20, 25–32, 38–43, 48–53, and 60+. The images have no filter but have low resolution, with different expressions and head postures.

3.2 Preprocessing

The datasets used in this paper require pre-processing to prepare the images before they are fed into the ConvNet. We performed face detection, face alignment and landmark estimation.

– Face Detection: Firstly, we detected the faces in each of the datasets using a facial detection algorithm that automatically detects the location of faces and localises them by sketching a four-sided bounding box in the image. It captures features such as the mouth, nose, and eyes by returning the top, left, bottom and right coordinates of the face using the FaceNet library. If a face is detected in an image, the face is cropped out; if not, the image is removed. We chose this algorithm because it can handle images with various orientations, detect faces across various scales, and adequately handle occlusions.
– Facial Landmark Detection and Alignment: After the faces have been detected, we perform landmark detection and face alignment using the same FaceNet. The algorithm picks the spatial points on the faces for the significant annotations. After identifying the facial keypoints, the distances between these points are measured, and the values become the facial features. For the face alignment, the facial keypoints are used to capture face images from different angles to match the extracted features.

– Data Normalisation: To further process the images, we normalised them using the Min-Max technique, since neural networks perform better on normalised data. Normalisation rescales the original image range so that all values fall within a new range of 0 and 1. Min-Max normalisation provides a linear transformation of the original range and maintains the relationships among the original image values. The Min-Max normalisation technique is defined as

N' = \frac{N - \min(N)}{\max(N) - \min(N)} \times (D - K) + K \qquad (1)

where N' is the Min-Max normalised image, [K, D] is the pre-defined boundary, and N is the original image.
– Data Augmentation: After the normalisation of the facial images, augmentation is applied. We examined different methods to observe the effect on overfitting and how augmentation could improve classification accuracy. We applied random flipping restricted to horizontal flips, because an image could have been taken facing right or left, whereas a vertical flip would probably not be suitable: the model is unlikely to encounter an upside-down face. We also applied random rotation, set at 0.2, to vary the object's angle in our dataset during training. We randomly augmented the images and fed the outcome into the model for training; a sketch of this pipeline is given after this list.
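A minimal TensorFlow/Keras sketch of this normalisation-and-augmentation pipeline follows (our own illustration; the layer names are standard Keras preprocessing layers in TF 2.x, and the 8-bit rescaling factor is an assumption about the input format):

```python
import tensorflow as tf

# Min-Max rescaling to [0, 1] followed by the random augmentations described above.
preprocess = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 255),        # Min-Max normalisation for 8-bit images
    tf.keras.layers.RandomFlip("horizontal"),    # horizontal flips only
    tf.keras.layers.RandomRotation(0.2),         # random rotation factor of 0.2
])

# Applied on the fly during training, e.g.:
# augmented_batch = preprocess(image_batch, training=True)
```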

Fig. 1. Framework of the model



3.3 EfficientNet

Tan et al. [13] studied the process of scaling up ConvNets in order to develop an efficient way of scaling network architectures, since scaling is generally used to achieve better accuracy. Most scaling of Convolutional Neural Networks is done somewhat arbitrarily until the desired outcome is found, and this requires much knowledge of manual scaling. Scaling a network means scaling the model by depth, width and image resolution.
In [13], the authors presented a novel method that uses a compound coefficient φ to scale up networks in a more methodical way. Their research shows that if any single dimension of network resolution, width or depth is scaled up, the network accuracy improves. It was also shown that when the dimensions of a network (depth, width and resolution) are scaled in a balanced way, the accuracy improves compared to scaling just one of the dimensions. The authors derived Eq. (2) to evenly scale up the depth, width and resolution with the coefficient φ:

depth: d = α^φ
width: w = β^φ
resolution: r = γ^φ \qquad (2)
s.t. \; α \cdot β^2 \cdot γ^2 ≈ 2, \quad α ≥ 1, \; β ≥ 1, \; γ ≥ 1
where α, β, and γ are constants that can be determined by a small grid search, and the coefficient φ is a user-specified coefficient that controls how many additional resources are available for model scaling; α, β, and γ specify how to assign these extra resources to network depth, width and resolution, respectively. The number of floating point operations (FLOPS) of a regular convolution op is proportional to d, w², r². Since convolution ops usually dominate the computation cost in a ConvNet, scaling a ConvNet with this equation will increase the total FLOPS by approximately (α · β² · γ²)^φ. To validate the effectiveness of the scaling method, the authors developed a mobile-size baseline called EfficientNet variant B0. They applied a technique that optimises both efficiency and accuracy with respect to FLOPS. The network's main building block is the mobile inverted bottleneck MBConv, with squeeze-and-excitation optimisation and the Swish activation added.
To further prove their approach, they first applied the scaling method to
MobileNets and ResNet. Their scaling method improves the accuracy of both
models compared to the single-dimension scaling method, which suggests the
effectiveness of their scaling method. They further used the compound scaling method to establish the EfficientNet variants B1 to B7, keeping α, β, and γ fixed while scaling the coefficient φ. These variants contain between approximately 5.3M and 66M parameters.
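As a small worked example of Eq. (2), the sketch below scales the three dimensions for a chosen compound coefficient φ and reports the approximate FLOPS growth (α · β² · γ²)^φ; the constants used here are only illustrative values satisfying the constraint, not the exact grid-search results of [13].

```python
# Compound scaling of depth, width and resolution for a coefficient phi (Eq. 2).
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    depth = alpha ** phi
    width = beta ** phi
    resolution = gamma ** phi
    flops_factor = (alpha * beta ** 2 * gamma ** 2) ** phi   # ~2**phi when the constraint holds
    return depth, width, resolution, flops_factor

# Example: phi = 1 roughly doubles FLOPS because alpha * beta^2 * gamma^2 ~ 2.
print(compound_scale(1))
```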

3.4 Model Architecture


This research experimentation is limited to the first seven (7) EfficientNet vari-
ants (B0–B6) due to computational resource constraint. To reproduce results,
training and validation of images are all set as same through the experiment.
Each variant of the EfficientNet model is used for the transfer learning pro-
cess. We freeze the top layer and apply global average pooling2D to reduce over-
fitting by performing sub-sampling. Sub-sampling helps to reduce the total num-
ber of parameters by combining clusters of neurons into a distinct neuron. In
addition to that, a dropout layer with a rate of 0.45, an inner dense layer with
RELU activation functions and a final output dense layer containing five (5)
output units in the case of this classification task with a Softmax activation
function.
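A hedged Keras sketch of this transfer-learning head is given below; EfficientNetB4 is used as an example backbone, and the size of the inner dense layer is our assumption since it is not stated above.

```python
import tensorflow as tf

base = tf.keras.applications.EfficientNetB4(include_top=False, weights="imagenet")
base.trainable = False                                  # keep the pre-trained backbone frozen

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),           # sub-sampling to reduce overfitting
    tf.keras.layers.Dropout(0.45),
    tf.keras.layers.Dense(256, activation="relu"),      # inner dense layer (size assumed)
    tf.keras.layers.Dense(5, activation="softmax"),     # five age groups (UTKFace setting)
])
```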

3.5 Training
Each dataset was split into an 85% training set and 15% for the validation and testing sets. To compile the model, we used categorical cross-entropy as the loss function and Adam as the optimiser, with an initial learning rate of 0.001 to handle weight overshooting. It was noticed that when the learning rate is too high the model does not retain information, and when it is too low the network trains too slowly, which consumes time and resources. We also monitor the accuracy using reduce-learning-rate-on-plateau, setting the patience at seven epochs, verbose at one and the minimum learning rate at 0.01. The Adam optimiser is mathematically defined as

θ_{t+1} = θ_t - \frac{η \cdot \hat{n}_t}{\sqrt{\hat{w}_t} + ε} \qquad (3)

where

\hat{n}_t = \frac{n_t}{1 - β_1^t}, \qquad \hat{w}_t = \frac{w_t}{1 - β_2^t} \qquad (4)

and where

n_t = (1 - β_1) g_t + β_1 n_{t-1}, \qquad w_t = (1 - β_2) g_t^2 + β_2 w_{t-1} \qquad (5)

where ε is epsilon, η is the learning rate, and g_t is the gradient.


The batch size was set to 128 from variant B0–B4 and set at 64 for B5–B6
because of computational power and all variants we set to train for 35 and 25
epochs for UTKFace and Adience, respectively. In this research, it was necessary
to employ a regularisation technique to reduce overfitting in the learning process.
It helped to surmount the low bias and high variance performance of the model.
L2 regularisation is employed, and it is set at 0.03 for variant B0–B4 and set at
0.07 for variant B5–B6. L2 regularisation is defined as;
λ 
Cost F unction = Loss + × || w ||2 (6)
2m
where λ is the regularisation parameter.
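A sketch of this training configuration in Keras is given below; `model`, `train_ds` and `val_ds` are hypothetical placeholders, and only the hyper-parameter values stated above are reproduced.

```python
import tensorflow as tf

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

# Reduce the learning rate on a plateau: patience of 7 epochs, verbose 1, min_lr 0.01.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor="val_accuracy",
                                                 patience=7, verbose=1, min_lr=0.01)

# L2 weight decay, e.g. 0.03 for B0-B4 (0.07 for B5-B6), attached to the dense layer:
# tf.keras.layers.Dense(256, activation="relu",
#                       kernel_regularizer=tf.keras.regularizers.l2(0.03))

model.fit(train_ds, validation_data=val_ds, epochs=35,   # 35 epochs on UTKFace, 25 on Adience
          callbacks=[reduce_lr])                          # batch size is set when batching the dataset
```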
The trainable parameters of each model are shown in Table 1.

Table 1. Trainable parameters in each variant

Models Trainable parameters


(million)
B0 ≈4.1
B1 ≈6.5
B2 ≈7.7
B3 ≈10.7
B4 ≈17.6
B5 ≈28.4
B6 ≈40.8

4 Results and Discussion

4.1 System Environment Configuration

The experiments, evaluation, and testing of the age estimation model's efficiency are done on the state-of-the-art facial datasets using the Open Computer Vision (OpenCV) and TensorFlow libraries. These tools extend algorithm performance to classification, object retrieval, and object detection tasks. We picked these libraries mainly because of their capability and strength, as demonstrated in many other computer vision studies.
The libraries were configured and run in a Google Colaboratory environment with a 12 GB GPU (Tesla K80) runtime. Training each variant required about 150 min, and predicting a single image with a variant required about 100 ms.

4.2 Evaluation Metrics

To measure the performance of each variant, we employed accuracy, precision, recall and F1-score. The metrics are given as

Accuracy = \frac{\text{Number of correct predictions}}{\text{Total number of predicted samples}} \qquad (7)

Precision = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \qquad (8)

Recall = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \qquad (9)

F1\text{-}score = 2 \times \frac{\text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}} \qquad (10)
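These metrics can be computed directly from the predicted and true labels, for example with scikit-learn as in the sketch below (the macro averaging for the multi-class case is our assumption):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def report(y_true, y_pred):
    """Compute the four evaluation metrics used in this paper."""
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall":    recall_score(y_true, y_pred, average="macro"),
        "f1-score":  f1_score(y_true, y_pred, average="macro"),
    }
```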

4.3 Experimental Results

In total, 15,043 processed facial images were used for training and 2,654 for testing on the UTKFace dataset. On the Adience dataset, 14,500 processed facial images were used for training and 2,558 for testing. The results for each variant are reported using the metrics in Sect. 4.2.
Tables 2 and 3 present the results on the two datasets. The worst-performing variants on UTKFace are B0, B1, and B3, while on Adience they are B0, B1, and B5. The EfficientNet-B4 model gave the best result on both datasets, with accuracies of 73.5% and 81.1% on UTKFace and Adience, respectively, which are better than the accuracies reported in the literature. This result shows that the B4 variant is the best at learning global features from the training set of facial images.
Several experiments were carried out to obtain these performance accuracies. Firstly, we did not apply regularisation or augmentation, which gave very high accuracy, but the model overfit badly. After adding both techniques, we obtained an accuracy of 63.3% on the B0 variant without overfitting, although the accuracy dropped. To further improve accuracy, we fine-tuned our model and stopped the training after 35 epochs, as it was noticed to start overfitting again beyond that point; we therefore adjusted the regularisation to control overfitting. Having obtained a good set of parameters from these initial experiments, we applied the same parameters to train the other variants.
While training the larger variants such as B5 and B6 we again encountered overfitting, and we had to further fine-tune the model by increasing the L2 regularisation and adjusting other parameters to address the overfitting, though this affected the accuracy.
The proposed models are compared with related approaches: we compare our result with results in the literature to demonstrate the effectiveness of EfficientNets. Table 4 lists the pre-trained architectures utilised in previous approaches and the accuracies obtained on the same dataset as ours. It can be seen that our result outperforms similar approaches by a substantial margin. Considering the number of parameters used by the other architectures listed in Table 4, the EfficientNet-B4 architecture has the lowest count at about 17.3 million, compared to over 138 million for VGG-16 and VGG-Face and approximately 24 million for the Inception module.

Table 2. EfficientNet architecture variants test accuracy on UTKFace dataset

EfficientNet Accuracy Recall Precision F1-score


B0 71.4 69.7 72.2 88.3
B1 71.6 70.7 72.5 88.6
B2 72.7 71.1 73.6 89.7
B3 71.6 70.4 72.5 89.6
B4 73.5 71.8 73.5 88.0
B5 71.9 70.5 72.7 88.5
B6 72.9 71.6 74.6 93.2

Table 3. EfficientNet architecture variants test accuracy on Adience dataset

EfficientNet Accuracy Recall Precision F1-score


B0 74.9 69.8 79.8 88.7
B1 77.7 74.5 80.9 92.6
B2 78.3 75.9 81.1 92.1
B3 80.0 78.7 82.5 94.1
B4 81.1 79.2 82.9 93.9
B5 77.4 73.6 80.7 90.5
B6 79.3 77.1 82.5 92.7

Fig. 2. EfficientNet-B4 Adience accuracy graph

Table 4. Comparison with previous approaches that utilised a pre-trained model on


Adience dataset

References Architecture Accuracy (%)


[14] Deep CNN 50.7
[23] VGG-16 56.9
[18] VGG-Face 59.9
[28] VGG-16 62.8
[27] Inception module 51.1
[21] Deep ROR 67.3
Proposed EfficientNet-B4 81.1

Fig. 3. Confusion matrix for EfficientNet-B4

5 Conclusion

This paper presents a system to estimate age in groups. The proposed model implements seven lightweight EfficientNet variants trained and tested on two datasets; the ages were grouped into five groups on UTKFace and eight groups on Adience. Transfer learning was employed to fine-tune the models. The EfficientNet architecture proved to be effective in handling global features and images taken in the wild. After experimentation on the seven variants, EfficientNet-B4 gave the best accuracies of 73.5% and 81.1% on UTKFace and Adience, respectively. This work shows that applying augmentation and regularisation techniques can help reduce overfitting and improve accuracy. It also shows that using EfficientNet for the classification task, as done in this work, saves training time and computational resources while achieving good accuracy. For future work, we will consider other pre-processing techniques that could further improve the accuracy of the model.

References
1. de Castro, P.V.: Age estimation using deep learning on 3D facial features (2018)
2. Angulu, R., Tapamo, J.R., Adewumi, A.O.: Age estimation via face images: a
survey. EURASIP J. Image Video Process. 2018(1), 1–35 (2018)
3. Huerta, I., Fernández, C., Segura, C., Hernando, J., Prati, A.: A deep analysis on
age estimation. Pattern Recogn. Lett. 68, 239–249 (2015)
4. Liu, H., Lu, J., Feng, J., Zhou, J.: Ordinal deep feature learning for facial age
estimation. In: 2017 12th IEEE International Conference on Automatic Face &
Gesture Recognition (FG 2017), pp. 157–164. IEEE, May 2017

5. Yi, D., Lei, Z., Li, S.Z.: Age estimation by multi-scale convolutional network. In:
Cremers, D., Reid, I., Saito, H., Yang, M.-H. (eds.) ACCV 2014. LNCS, vol. 9005,
pp. 144–158. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-16811-
1 10
6. Liu, X., et al.: AgeNet: deeply learned regressor and classifier for robust apparent
age estimation. In: Proceedings of the IEEE International Conference on Computer
Vision Workshops, pp. 16–24 (2015)
7. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep con-
volutional neural networks. Adv. Neural Inf. Process. Syst. 25, 1097–1105 (2012)
8. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556 (2014)
9. Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K.:.
SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model
size. arXiv preprint arXiv:1602.07360 (2016)
10. Chollet, F.: Xception: deep learning with depthwise separable convolutions. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 1251–1258 (2017)
11. Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
12. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Proceedings of the
IEEE International Conference on Computer Vision, pp. 2961–2969 (2017)
13. Tan, M., Le, Q.: EfficientNet: rethinking model scaling for convolutional neural net-
works. In: International Conference on Machine Learning, pp. 6105–6114. PMLR,
May 2019
14. Levi, G., Hassner, T.: Age and gender classification using convolutional neural
networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition Workshops, pp. 34–42 (2015)
15. Anand, A., Labati, R.D., Genovese, A., Munoz, E., Piuri, V., Scotti, F.: Age estima-
tion based on face images and pre-trained convolutional neural networks. In: 2017
IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1–7. IEEE,
November 2017
16. Pohjalainen, J., Räsänen, O., Kadioglu, S.: Feature selection methods and their
combinations in high-dimensional classification of speaker likability, intelligibility
and personality traits. Comput. Speech Lang. 29(1), 145–171 (2015)
17. Malhi, A., Gao, R.X.: PCA-based feature selection scheme for machine defect clas-
sification. IEEE Trans. Instrum. Meas. 53(6), 1517–1525 (2004)
18. Qawaqneh, Z., Mallouh, A.A., Barkana, B.D.: Deep convolutional neural network
for age estimation based on VGG-face model. arXiv preprint arXiv:1709.01664
(2017)
19. Huang, G.B., Mattar, M., Berg, T., Learned-Miller, E.: Labeled faces in the wild:
a database for studying face recognition in unconstrained environments. In: Work-
shop on Faces in ‘Real-Life’ Images: Detection, Alignment, and Recognition, Octo-
ber 2008
20. Wolf, L., Hassner, T., Maoz, I.: Face recognition in unconstrained videos with
matched background similarity. In: CVPR 2011, pp. 529–534. IEEE, June 2011
21. Zhang, K.: Age group and gender estimation in the wild with deep RoR architec-
ture. IEEE Access 5, 22492–22503 (2017)
22. Smith, P., Chen, C.: Transfer learning with deep CNNs for gender recognition and
age estimation. In: 2018 IEEE International Conference on Big Data (Big Data),
pp. 2564–2571. IEEE, December 2018

23. Lin, J., Zheng, T., Liao, Y., Deng, W.: CNN-based age classification via transfer
learning. In: Sun, Y., Lu, H., Zhang, L., Yang, J., Huang, H. (eds.) IScIDE 2017.
LNCS, vol. 10559, pp. 161–168. Springer, Cham (2017). https://doi.org/10.1007/
978-3-319-67777-4 14
24. Das, A., Dantcheva, A., Bremond, F.: Mitigating bias in gender, age and ethnicity
classification: a multi-task convolution neural network approach. In: Proceedings
of the European Conference on Computer Vision (ECCV) Workshops (2018). hal-
01892103
25. Han, S.: Age estimation from face images based on deep learning. In: 2020 Inter-
national Conference on Computing and Data Science (CDS), pp. 288–292. IEEE,
August 2020
26. Dagher, I., Barbara, D.: Facial age estimation using pre-trained CNN and transfer
learning. Multimedia Tools Appl. 80(13), 20369–20380 (2021). https://doi.org/10.
1007/s11042-021-10739-w
27. Sukh-Erdene, B., Cho, H.C.: Facial age estimation using convolutional neural net-
works based on inception modules. Trans. Korean Inst. Electr. Eng. 67(9), 1224–
1231 (2018)
28. Lapuschkin, S., Binder, A., Muller, K.R., Samek, W.: Understanding and compar-
ing deep neural networks for age and gender classification. In: Proceedings of the
IEEE International Conference on Computer Vision Workshops, pp. 1629–1638
(2017)
Towards a Deep Reinforcement Approach
for Crowd Flow Management

Wejden Abdallah1(B) , Dalel Kanzari2 , and Kurosh Madani3


1 National School of Computer Sciences, LARIA Laboratory,
University of Manouba, Manouba, Tunisia
[email protected]
2 Higher Institute of Applied Sciences and Technology,
University of Sousse, Sousse, Tunisia
3 LISSI/EA 3956 Laboratory, Senart Institute of Technology,
Paris-East University (UPEC), Lieusaint, France
[email protected]

Abstract. Reinforcement Learning has seen important success in various applications such as robotics, games, and resource management. However, it has proven insufficient for solving the crowd evacuation problem in a realistic environment, because the crowd situation is highly dynamic, with many changing variables and complex constraints that make it difficult to solve, and there is no standard reference environment that can be used to train agents for an evacuation; a realistic environment can be complex to design. In this paper, we use Deep Reinforcement Learning to train agents for evacuation planning. The environment is modeled as a grid with obstacles, and the solution is modeled using intelligent agents. It takes into account parameters such as the number of occupants, the capacity level, and the time needed to pass through the exit doors. The objective is to evacuate the occupants as quickly as possible and to help the agent decide on the optimal escape route under conditions that vary over time. Subsequently, this approach will be useful to evacuation decision makers for better placement of dynamic arrow signs. The results motivate the use of this type of learning to optimize decision support in evacuation situations.

Keywords: Intelligent agent · Reinforcement learning · Crowd


evacuation · Shortest path

1 Introduction

Reinforcement Learning (RL) is a type of Machine Learning. The learning system, called an agent, can observe the environment, select and perform actions, and get rewards in return. It must then learn by itself the best strategy, called a policy, to obtain the most reward over time. Reinforcement Learning [1], which has been successfully applied in many fields, has strong online adaptability and

the ability to self-learn for complex systems, among all types of learning methods, especially when simulations are used to teach machines to make sequential decisions.
The purpose of this paper is to study the usefulness of Deep Reinforcement Learning as a decision-making tool in crowd evacuation scenarios. RL is a method used in the Artificial Intelligence field to allow agents to learn, in the absence of a "teacher", from interactions with their environment [12]. Over the past few years, many simulation models have been proposed by researchers to simulate the behavior of pedestrian crowds in open and closed areas. The growth of the population has created more opportunities for crowds to gather in public spaces. In a crowded area, when an emergency happens, dangerous events such as crowd congestion and trampling can easily occur. Planning evacuation paths is important to reduce evacuation time in densely crowded areas, especially in complex environments with obstacles and multiple exits; it is also one of the important issues in evacuation simulation for emergency disasters. In such a situation, most pedestrians run to the nearest exit or follow the crowd to the exit, resulting in traffic delays, trampling deaths, and other safety incidents. To better manage the evacuation process, simulations and experiments are needed to better predict crowd behavior and plan evacuation strategies. It is therefore helpful to develop strategies to evacuate spaces as quickly as possible and to avoid the risk of such accidents.
Crowd evacuation concerns the exit of people from a crowded zone such as subway stations, stadiums, theaters, cinemas, large shopping malls, buildings, or an area that contains a danger [4]. Searching for the ideal escape route in an overcrowded situation is an extremely difficult task. In recent years, much attention has been devoted to research on crowd evacuation in emergencies, and crowd simulation is rapidly becoming a standard tool for evacuation planning and evaluation [9]. Crowd movement simulation is an essential basis for analyzing and researching crowd behavior characteristics, self-organization, and other crowd evacuation phenomena [14]. As crowd distributions can be estimated for some events such as concerts and New Year celebrations, evacuation direction signs can be positioned before crowds gather at a place. In this paper, our objective is to find the shortest way to the exit while avoiding danger, obstacles, and crowded paths or emergency exits. The solution is based on graph theory combined with a Deep Reinforcement Learning algorithm.
The rest of this paper is organized as follows: Sect. 2 provides the theoretical background, and Sect. 3 gives an overview of some related works. Section 4 presents the proposed approach, and Sect. 5 details and discusses the experimental results. Finally, we conclude the paper and give future perspectives.

2 Theoretical Background

Reinforcement Learning is the problem faced by an artificial agent that must learn behavior through trial-and-error interactions with a dynamic environment [2,3]. Unlike in most forms of machine learning, the learner is not told which actions to take; instead, it must discover which actions provide the most reward by trying them. At each time t the agent interacts with its environment in a certain state s_t ∈ S and takes one of the available actions a_t ∈ A(t). The action changes the state of the environment from s_t to s_{t+1}. Additionally, the agent receives a reward R_{t+1} from the environment that serves as feedback about how good it was to take action a_t in state s_t [6]. The agent then takes the best possible action for this new state s_{t+1}, thereby invoking a reward, and so on.
Through these iterations, the agent tries to improve its decision about which "right move" to take in a given state of the environment, using the rewards that it receives during the training process. Q-learning by Watkins [11] is a value-based RL algorithm. It belongs to the class of Temporal Difference learning methods, as its estimate of the maximum Q-value in the successor state is used to update the Q-value of the predecessor state, without explicitly knowing the model. Q-learning is also considered an off-policy algorithm, since the optimal Q-value function is approximated independently of the current policy. The main aim of Q-learning is to determine the optimal Q-function Q*(s_t, a_t) for the agent. The actions taken by the agent are decided by the behavior policy, while an alternative successor action, decided by the target policy, is also considered [11]. The purpose is to update Q(s_t, a_t), at each step, towards the Q-value obtained from the target policy [7].
In classical Reinforcement Learning, the agent operates in a tabular setting and starts with arbitrary values for all Q-entries. Through trial-and-error, the Q-table is updated and the policy progresses towards convergence. However, computing Q-values for all possible combinations of state-action pairs and storing them in the Q-table becomes computationally unmanageable and leads to limitations. On the one hand, the amount of memory required to save and update the table grows with the number of states. On the other hand, the amount of time required to explore each state to create the required Q-table would be unrealistic. Agents in most real-world settings do not even traverse the whole state-action space of their respective environments; instead, they traverse specific state paths more frequently. To overcome this problem, a common approach is to create a neural network that, given a state, approximates the Q-values of the different actions.
Deep Reinforcement Learning (DRL) is the combination of Reinforcement Learning (RL) and Deep Learning (DL). This field of research has been able to solve a wide range of complex decision-making tasks that were previously out of reach for a machine [2]. A deep Q-network (DQN) is a multi-layered neural network that, for a given state s, outputs a vector of action values Q(s, ·; θ), where θ are the parameters of the network. Before DQN, it was well known that DRL is unstable or even divergent when the action-value function is approximated with neural networks.

3 Related Works
In the literature, there are two approaches to crowd evacuation: the macroscopic model, which treats the group as a whole without considering the local details of individual behavior, and the microscopic model, which takes into account the interaction of each individual with the environment from an individual point of view. In this work we focus on the microscopic model, which includes cellular automata models, the social force model, and Agent-Based Models (ABM) [13,14]. ABM is a particular sub-category of microscopic modeling. A multi-agent system, which can be a system of autonomous or semi-autonomous modules, consists of many agents. Multi-agent learning is not a simple extension of single-agent learning; rather, it is a complex problem that depends on the interaction of several agents. Therefore, combining the advantages of multi-agent learning with reinforcement learning is a very challenging task in crowd evacuation [5]. For example, the authors of [15] proposed a probability-based path-finding model capable of producing a variety of behaviors that allow individuals to adapt to changes in knowledge in unfamiliar environments. Zhang et al. [16] proposed a congestion framework based on Multi-Agent Reinforcement Learning that improves the efficiency of evacuation for large-scale crowd path planning. Within this framework [16], they proposed the improved Multi-Agent Deep Deterministic Policy Gradient (IMADDPG) algorithm, adding the average network of the elders to maximize the performance of the other agents, allowing all agents to maximize the performance of a collaborative planning task. The authors of [10] proposed an Improved Multi-Agent Reinforcement Learning (IMRAL) algorithm combined with an improved social force model for crowd evacuation simulation: the path is selected using the IMRAL algorithm and the evacuation is based on the improved social force model. Their method not only solves the dimensionality problem of reinforcement learning but also improves the convergence speed.

4 Proposed Approach

The proposed approach, presented in Fig. 1, is based on three components: intelligent reasoning based on Deep Reinforcement Learning, crowd valuation, and decision-making. Using these components, we aim to guide agents to choose the shortest route while avoiding crowds. The idea is that each agent receives the positions of obstacles, goals, and other agents in the environment. There is also a monitoring agent that updates the crowd values and shares them with the other agents. Each agent dynamically computes the crowd value for each position and performs shortest-path learning based on the position matrices of agents, goals, and obstacles. The training helps the agent to choose the right path and avoid the crowd. As the DRL algorithm we use IDQN, which yields a set of state-action pairs (s, a). For each action we have a state, and for each pair (s, a) we compute a crowd value. Then, in the second decision phase, the
424 W. Abdallah et al.

Fig. 1. Proposed approach: path-finding architecture for crowd evacuation based on
DRL reasoning, crowd valuation and decision-making

agent chooses the best combination of (s, a) with a minimal crowd value. The
goal is to optimize the number of neighbors of each state and choose the best
combination: the least crowded one. IDQN helps us to choose the shortest and
safest path because, in parallel, we avoid obstacles. The IDQN is based on a
multilayer perceptron (MLP) using experience replay and a target network. In
the experiment section, we compare different numbers of hidden layers. The idea
of experience replay is to store the agent's experiences (s_t, a_t, r_{t+1}, s_{t+1})
in a buffer that can hold a fixed number of experiences. In each training step, a
batch of experiences is uniformly sampled from the buffer and used to train the
network. Experience replay removes the correlations in the data sequences and
feeds the network with independent data. It also ensures that old experiences
are replayed from time to time. Therefore, the agent can more robustly
learn to perform well in the task. The target network is used to generate the
target-Q values that will be used to compute the loss for every action during
training. The issue is that at every step of training, the Q network’s values shift,
and if we are using a constantly shifting set of values to adjust our network
values, then the value estimations can easily spiral out of control. The network
can become destabilized by falling into feedback loops between the target and
estimated Q-values. To reduce that risk, the target network’s weights are fixed,
and only periodically or slowly updated to the primary Q-networks values. In
this way, training can proceed more stably.
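These two stabilisation mechanisms can be sketched as follows (a minimal sketch with assumed names and capacities; the paper does not publish its implementation):

import random
from collections import deque

class ReplayBuffer:
    # Fixed-capacity buffer; the oldest experiences are discarded automatically.
    def __init__(self, capacity: int = 10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size: int):
        # Uniform sampling breaks the correlations of consecutive transitions.
        return random.sample(self.buffer, batch_size)

def sync_target(q_net, target_net, step: int, period: int = 1000):
    # Copy θ to θ− only every `period` steps so the TD targets stay fixed in between.
    if step % period == 0:
        target_net.load_state_dict(q_net.state_dict())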
The network is updated according to the loss function computed by taking
the squared temporal-difference error, given by the following equation:

L_i(θ_i) = Ê_t[(R_{t+1} + γ max_a Q(s_{t+1}, a; θ_i^−) − Q(s_t, a_t; θ_i))^2]    (1)
Towards a Deep Reinforcement Approach for Crowd Flow Management 425

The Q-network is regularly updated according to the loss function from
Eq. 1, while the target network [8], with parameters θ−, is the same as the
online network except that its parameters are only updated by copying the
parameters of the Q-network, θ− = θ, every T steps. This smooths oscillating
policies and leads to more stable learning. Algorithm 1 summarizes the
simulation cycle for one episode.
Algorithm 1: Complete simulation cycle for one episode
Initialize: an environment E including a set of agents N = {n1, n2, ..., nn};
Initialize: Max-step, the maximum step number, and step = 0, the step counter;
Initialize: reward-all = 0 and the termination flag done = False;
while done ≠ True and step ≤ Max-step do
    for each agent ni ∈ N do
        Pass the current state s_step to the agent ni;
        Prepare the encoded matrices and create the input layer of the MLP;
        Perform a forward propagation;
        Obtain the list of possible actions A' ⊆ A;
        if a random probability 0 ≤ p ≤ 1 satisfies p ≤ exploration-rate then
            Explore a random move a ∈ A';
        else
            Select the action a = argmax_{a∈A'} Q_i(s_step, a);
        end
        Perform the selected action a with agent ni to obtain the next state s', the reward r and the termination flag done;
        Get a priority P from the evaluation of the action;
        Store the transition (s_step, a, r, s', P) in the agent's memory;
        Sample a mini-batch from memory;
        if done = True then
            Set the target Q̂ = r;
        else
            Q̂ = r + γ Q(s', argmax_{a'} Q(s', a', θ); θ−);
        end
        Perform a learning step on (Q̂ − Q(s_step, a))²;
        reward-all = reward-all + r;
        s_step ← s';
    end
    step = step + 1;
end
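A hedged Python sketch of one agent step of Algorithm 1 follows; the environment interface (env.step), the admissible-action list and the discount factor are assumptions, and states are assumed to be tensors accepted by a Q-network such as the one sketched earlier.

import random
import torch

def agent_step(q_net, target_net, env, state, actions, eps, gamma=0.99):
    # ε-greedy selection over the admissible actions A' ⊆ A
    if random.random() <= eps:
        action = random.choice(actions)                        # explore
    else:
        with torch.no_grad():
            q_values = q_net(state)                            # Q(s, ·; θ)
        action = actions[int(torch.argmax(q_values[actions]))] # exploit
    next_state, reward, done = env.step(action)                # assumed interface

    # TD target: Q̂ = r if the episode terminated, otherwise
    # Q̂ = r + γ Q(s', argmax_a Q(s', a; θ); θ−)
    with torch.no_grad():
        if done:
            target = reward
        else:
            best_a = int(torch.argmax(q_net(next_state)))
            target = reward + gamma * target_net(next_state)[best_a]
    loss = (target - q_net(state)[action]) ** 2                # squared TD error
    return next_state, loss, done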
426 W. Abdallah et al.

5 Experimental Study
In this section, we test the performance of the model and make comparisons. The
selected indicators to explain the outcomes of the experiments are the reward
obtained by the agents and the number of steps taken over all the simulations of
an experiment. The evolution of these metrics over the episodes of the simulations
is demonstrated.

5.1 Comparison of the Number of Hidden Layers

Three MLP configurations with different numbers of hidden layers are evaluated in
an environment of the same configuration and complexity: a 10 × 10 grid, two agents,
and two goals. We trained the algorithm with the same parameters but with
different structures of the MLP: 2 hidden layers, 3 hidden layers, and 4 hidden
layers. The progression of the total cumulative reward received by the agents
over the episodes performed in each scenario is presented in Fig. 2, as well
as the number of steps in each episode in Fig. 3. In addition, other metrics
for all scenarios are shown in Table 1. The first noticeable observation is that, for the
three configurations, the agents obtain the same maximum reward with the same
minimal number of steps. There is a clear gap between the performance with two
hidden layers and with three or four hidden layers: with a deeper network, we can
achieve better results. Despite the high performance of the model with four hidden
layers shown in Table 1, we can see that in the four-hidden-layer case the agent still
took several extra steps at the end of training and obtained fewer rewards than in
the three-hidden-layer case.

Fig. 2. Reward evolution for each of the scenarios; with 2, 3 and 4 hidden layers. Note
that the y axis scales differ between plots.
Towards a Deep Reinforcement Approach for Crowd Flow Management 427

Fig. 3. Evolution of steps number for each of the scenarios; with 2, 3 and 4 hidden
layers.

Table 1. Performance according to the number of hidden layers.

                                               2 layers   3 layers   4 layers
Max-reward                                     −72        −72        −72
Average-reward                                 −80.04     −79.16     −88.37
Min-reward                                     −2028      −2092      −2123
Min-steps                                      9          9          9
Average-steps                                  10.29      10         10.93
Episodes with missed goals                     489        249        242
Episodes with goals achieved in minimal steps  30872      30644      31083

5.2 Comparison of the Number of Agents


Here, we kept the same environment and the same settings, but in each scenario the
environment included different numbers of agents and goals, as shown in Fig. 4.
Two cases should be considered in the analysis of the findings in Figs. 5 and 6 and
Table 2: the same number of agents with a different number of goals, and a different
number of agents with the same number of goals. Note that the y-axis scales differ
between plots. When there are fewer goals than agents, the agents take more steps
and obtain less reward compared to the case where the numbers of agents and goals
are equal, as reported in Table 2. This is because the goal location is closer to one
agent than to the other, so the agents must take more steps to reach the goal, which
reduces the minimum rewards. In the second case, the more agents there are, the
more contact between them increases in order to avoid collisions, which explains the
growth in the number of steps and the decline in rewards.
428 W. Abdallah et al.

Fig. 4. Environments with different number of agents and goals.

Fig. 5. Test results for reward evolution for each of the number of agent’s scenarios.

Fig. 6. Test results for evolution of steps number for each of the number of agent’s
scenarios; 2 agents 2 goals, 2 agents 1 goal, 5 agents 2 goals.
Towards a Deep Reinforcement Approach for Crowd Flow Management 429

Table 2. Performance according to the number of agents.

                                               2ag2go    1ag1go    5ag2go
Max-reward                                     −72       −189      −201
Average-reward                                 −90.34    −212.51   −333.86
Min-reward                                     −2028     −2945     −4818
Min-steps                                      9         17        12
Average-steps                                  11.91     18.49     35.16
Episodes with missed goals                     486       231       890
Episodes with goals achieved in minimal steps  30872     33520     11577

5.3 Path Finding: Case Study


Among our objectives is to ensure that the agents achieve the goals taking the
shortest path. In this context we ran several scenarios, each time changing the
number of agents and the number of goals. We present here how agents reach
the goals within a complicated scenario. It should be noted that each step is
displayed separately in the training, but here we seek to replicate all the steps in
one figure by arrows. In this scenario, the environment includes five agents who
should reach two goals. Each agent takes a path to the goal as shown in Fig. 7.
These findings indicate that our agents succeed in reaching the goals in a minimal
number of steps. Furthermore, if more than one agent heads for the same goal, only
one agent occupies the goal position and the others wait in the cell just before the
goal cell.

Fig. 7. Path-finding in an environment with 5 agents, 2 goals, and obstacles.


430 W. Abdallah et al.

6 Conclusion
In this paper, we use Deep Reinforcement Learning for crowd evacuation. The
objective is to choose the shortest path based on IDQN and information about
the crowd. The main goal is to help the agent decide on the optimal escape route
under conditions that vary over time. The reported results show that this type of
learning optimizes decision support in evacuation situations and adapts to dynamic
changes in the environment. In addition, the computation of the crowd value for
each pair (s, a) allows the agent to make a suitable decision. This work is only a
first step. We plan to add more agents, for example 50 or 100, in grids of different
sizes, and to introduce parameters such as emotional character modeled with fuzzy
logic in order to obtain a more realistic evacuation environment.

References
1. Ciaburro, G.: Keras reinforcement learning projects: 9 projects exploring popular
reinforcement learning techniques to build self-learning agents. Packt Publishing
Ltd. (2018)
2. François-Lavet, V., Henderson, P., Islam, R., Bellemare, M.G., Pineau, J.: An intro-
duction to deep reinforcement learning. arXiv preprint arXiv:1811.12560 (2018)
3. Kaelbling, L.P., Littman, M.L., Moore, A.W.: Reinforcement learning: a survey. J.
Artif. Intell. Res. 4, 237–285 (1996)
4. Li, X., Liang, Y., Zhao, M., Wang, C., Bai, H., Jiang, Y.: Simulation of evacuating
crowd based on deep learning and social force model. IEEE Access 7, 155361–
155371 (2019)
5. Martinez-Gil, F., Lozano, M., Fernández, F.: Multi-agent reinforcement learn-
ing for simulating pedestrian navigation. In: Vrancx, P., Knudson, M., Grześ, M.
(eds.) ALA 2011. LNCS (LNAI), vol. 7113, pp. 54–69. Springer, Heidelberg (2012).
https://doi.org/10.1007/978-3-642-28499-1 4
6. Ravichandiran, S.: Hands-on reinforcement learning with python: master reinforce-
ment and deep reinforcement learning using OpenAI gym and TensorFlow. Packt
Publishing Ltd. (2018)
7. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press,
Cambridge (2018)
8. Van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double
q-learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol.
30 (2016)
9. Viswanathan, V., Lee, C.E., Lees, M.H., Cheong, S.A., Sloot, P.M.: Quantitative
comparison between crowd models for evacuation planning and evaluation. Eur.
Phys. J. B 87(2), 1–11 (2014)
10. Wang, Q., Liu, H., Gao, K., Zhang, L.: Improved multi-agent reinforcement learn-
ing for path planning-based crowd simulation. IEEE Access 7, 73841–73855 (2019)
11. Watkins, C.J., Dayan, P.: Q-learning. Mach. Learn. 8(3–4), 279–292 (1992)
12. Wharton, A.: Simulation and investigation of multi-agent reinforcement learning
for building evacuation scenarios. Report, St Catherine’s College, 55 p. (2009)
13. Xu, D., Huang, X., Mango, J., Li, X., Li, Z.: Simulating multi-exit evacuation using
deep reinforcement learning. arXiv preprint arXiv:2007.05783 (2020)
14. Yang, S., Li, T., Gong, X., Peng, B., Hu, J.: A review on crowd simulation and
modeling. Graph. Models 111, 101081 (2020)
Towards a Deep Reinforcement Approach for Crowd Flow Management 431

15. Zhang, G., Lu, D., Lv, L., Yu, H., Liu, H.: Knowledge-based crowd motion for the
unfamiliar environment. IEEE Access 6, 72581–72593 (2018)
16. Zheng, S., Liu, H.: Improved multi-agent deep deterministic policy gradient for
path planning-based crowd simulation. IEEE Access 7, 147755–147770 (2019)
Classification of Images as Photographs
or Paintings by Using Convolutional
Neural Networks

José Miguel López-Rubio1 , Miguel A. Molina-Cabello1,2(B) ,


Gonzalo Ramos-Jiménez1 , and Ezequiel López-Rubio1,2
1
Department of Computer Languages and Computer Science, University of Malaga,
Bulevar Louis Pasteur, 35, 29071 Málaga, Spain
2
Instituto de Investigación Biomédica de Málaga – IBIMA, Málaga, Spain
{miguelangel,ramos,ezeqlr}@lcc.uma.es

Abstract. Determining whether an image is a photograph or a painting


is an unsolved problem, and it is not trivially or automatically performed
by humans. In previous works, humans decided which metrics should be
calculated on an image to make a prediction, achieving a maximum pre-
cision of 94.82%. In this work, we propose the use of a deep learning
convolutional neural network that processes the images directly, without
determining the most relevant properties of an image in advance. Differ-
ent modifications of the VGG network architecture are analyzed. After
training the network with 16,000 images and for 100 epochs, an AUC
ROC above 0.99 is achieved in images from ImageNet and in the Kaggle
Painters by Numbers competition, and 0.942 in the images used by the
most recent proposal in the field.

Keywords: Convolutional neural networks · Feature extraction ·


Image classification · Distinguishing photographs · Paintings

1 Introduction

Currently, determining if an image is a photograph or a painting is an unsolved


problem [10] and, furthermore, it is an activity that people do not perform
trivially or automatically [2,6]. Cutzu et al. [6] state that a photograph can be
considered as a photorealistic painting, so the problem can be reformulated as
determining the degree of photorealism of an image.
Some direct applications of this technique are automatic filtering and clas-
sification of large collections of photographs (for example, in museums) and

This work is partially supported by the following Spanish grants: TIN2016-75097-P,


RTI2018-094645-B-I00 and UMA18-FEDERJA-084. All of them include funds from the
European Regional Development Fund (ERDF). The authors acknowledge the funding
from the Instituto de Investigación Biomédica de Málaga - IBIMA and the Universidad
de Málaga.
c Springer Nature Switzerland AG 2021
I. Rojas et al. (Eds.): IWANN 2021, LNCS 12861, pp. 432–442, 2021.
https://doi.org/10.1007/978-3-030-85030-2_36
Classification of Images as Photographs or Paintings by Using CNNs 433

searching for images from the web. It is also useful when transferring the style
from one image to another [24], as it is necessary to preserve the structure of the
original image in a photograph, while in a painting, more freedom is admissible.
The first related work was carried out by Athitsos et al. [3], which aims to
differentiate between photographs and computer-generated images (cliparts) in
order to help in multimedia file searching. Based on the same idea of automat-
ically classifying photographs, we also find the works of Szummer et al. [17] to
distinguish between indoor and outdoor images or the work of Vailaya et al. [1]
to differentiate between photos of cities and landscapes.
Recent research by Gando et al. [11] uses convolutional neural networks to
distinguish between illustrations and photographs. They make use of the AlexNet
network, which is one of the networks, apart from the VGG, proposed to solve the
ImageNet Large Scale Visual Recognition Challenge [23]1 . The authors retrain
the final layer (called FC7) and initialize the FC8 layer to carry out a binary clas-
sification. They use 20,000 images as the training dataset, achieving an overall
precision of 96.8%.
Since we are essentially extracting high-level properties from images automat-
ically, it is also worth mentioning the work done by Karayev et al. [14] where the
style of the images is predicted using a multilayer neural network capable of dis-
tinguishing between 25 different styles. A more complex version of this problem
consists of discerning if two paintings belong to the same artist [13]. Likewise,
there are other works in the line of extracting the style of a photograph and
applying it to another image such as that of Gatys et al. [12] or, more recently,
an optimized version from Fujun et al. [10] to transfer the style in a faster way.
With an identical objective to our work, we found authors such as Cutzu et
al. [5,6] who consider that photographs differ from paintings in features such
as color, edges, or texture. Specifically, in [5] the authors observe that paintings
have a greater amount of pure color edges than photographs, while certain met-
rics such as the mean and variance of Gabor filters are greater in photographs.
Consequently, it is possible to determine to which class an image belongs with
an accuracy of 90%.
One of the most recent publications on the subject of this work was carried
out by Carballal et al. [2]. They start from the concept of complexity of an image,
which is related to the entropy of the image, as opposed to the concept of order.
Specifically, it refers to the fact that according to Moles [21], the predictability
of each pixel in an image is what determines its complexity. It is concluded that
visual complexity can help to discern whether we are dealing with a photograph
or a painting. Previously, different measures had been used to estimate the com-
plexity of an image [4,9,18,21], but some are not computable or very difficult
to calculate. Therefore, the authors propose the use of four types of metrics:
(i) based on estimates of the complexity of the image measuring the error of
the JPEG or Fractal compression method [18,19,22]; (ii) applying Zipf’s law
[27]; (iii) considering the Fractal Dimension of the image [26] and (iv) using the
above methods again after applying edge detection filters to the image. Finally,
1
http://www.image-net.org/challenges/LSVRC/2014/index.
434 J. M. López-Rubio et al.

after calculating the different metrics and combinations, they obtain a maximum
precision in the classification of 94.82%.
In conclusion, the most recent proposals on the subject are based on the pre-
analysis of the images by using different metrics and then applying a classical
classification system. On the other hand, some research is based on the idea of
using CNN networks to perform classification tasks in images.
In this work, we fine-tune a VGG 16-layer-version convolutional neural net-
work [15] to determine whether an image is a photograph or a painting, achieving
an AUC ROC equal to or greater than 0.99 in images from ImageNet and Kag-
gle Painters by Numbers competition, and 0.942 in the images used by the most
recent proposal in the field [2].
The paper is structured as follows. Section 2 presents the methodology of the
proposal. The dataset used for the experiments is detailed in Sect. 3. Section 4
depicts the results and, finally, the conclusions are provided in Sect. 5.

2 Methodology

Our proposal is based on fine-tuning the 16-layer version of the VGG convo-
lutional neural network [15] to obtain a binary classifier capable of deciding if
an image is a photograph or a painting. Thus, we will automatically extract the
most relevant features of the images without having to manually choose the most
appropriate metrics and filters in advance.
Originally, in the VGG neural network, each image goes through a stack of
convolutional layers that use 3 × 3 size filters. After the convolutional layers,
three fully-connected layers are included in order to classify the image in a
category out of 1000 possible. It was used to solve the Imagenet Large Scale
Visual Recognition Challenge of 2014 [23].
In our model, the first four convolutional layers have been frozen so that all the
weights calculated in the training process used to solve the original problem are
kept unchanged. The fifth layer has been left trainable so that its weights are
adjusted during the training phase on the batch of images of our new problem. Finally, the
classification layer has been replaced by two new layers: a dropout layer used
to prevent overfitting by deactivating a certain percentage of the previous layer
during the training phase (two different configurations have been tested: 25% or
50%), and a binary classification layer used to classify the image in one of the
two possible categories: photograph or painting.
There are other neural network parameters which are worth highlighting,
such as the use of the Adam optimizer with a low learning rate (0.00001) in order
to limit the magnitude of the changes on the layers which are being retrained,
the use of the binary cross entropy as loss function, and the adoption of the
accuracy to assess the performance of our model during the training phase.
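A minimal Keras sketch of this setup is shown below; the exact layer-freezing granularity and the pooling choice are assumptions, since the paper does not publish its code.

import tensorflow as tf

base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(299, 299, 3), pooling="avg")
for layer in base.layers:
    # Only the last (fifth) convolutional block is left trainable.
    layer.trainable = layer.name.startswith("block5")

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dropout(0.25),                   # 0.25 or 0.5 in the experiments
    tf.keras.layers.Dense(1, activation="sigmoid"),  # photograph vs. painting
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss="binary_crossentropy", metrics=["accuracy"])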
Classification of Images as Photographs or Paintings by Using CNNs 435

3 Dataset
The considered image dataset has been separated into three disjoint subsets:
train, validation, and test. For each subset, images from two sources have been
used: (i) Imagenet [7] for the photographs and (ii) Kaggle competition - Painter
by numbers competition [13] for the paintings. In both cases, images have been
shuffled to achieve a random distribution for the style of the paintings and pho-
tographs. Additionally, the test images used by the authors of the most recent
article on this subject (Adrián Carballal et al. [2]) have been used to compare
their results with our model. The number of images taken from each source is
detailed in Table 1.

Table 1. Number of images used in each phase

               Imagenet and Kaggle              Carballal et al. [2]
Class          Training   Validation   Tests    Tests
Photographs    8000       2000         2000     2625
Paintings      8000       2000         2000     2422
Total          16000      4000         4000     5047

After some testing, it was concluded that the VGG network performs better
if the training images have higher resolution, so images of 299 × 299 pixels were
used instead of the original 224 × 224 resolution. Their aspect ratio has been
maintained by filling the gaps with white pixels, as proposed by Baldassarre et al.
[8] in their article on grayscale image colorization. The resulting images have been left
centered and occupying the maximum possible area because this transformation
has been proven to be more effective when working with CNNs, as concluded by
Molina-Cabello et al. [20] in their article on vehicle type classification. On the
other hand, images from Carballal et al. [2] have not been modified.
For the training phase, 16,000 images (8,000 from each category) from Ima-
geNet and Kaggle have been used. The following data augmentation techniques
have been applied to them: rotation between 0 and 40◦ , zoom in and out, hor-
izontal mirror, shear effect, and horizontal and vertical slide. A total of 4000
images (2000 per category) were used for validation, and another 4000 images
(2000 per category) were used for testing. Images provided by Carballal et al.
[2] have not been distorted.
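The padding and augmentation described above can be sketched as follows; the zoom, shear and shift magnitudes are illustrative, as the paper does not specify them.

import numpy as np
from PIL import Image
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def pad_to_square(img: Image.Image, size: int = 299) -> np.ndarray:
    # Shrink while keeping the aspect ratio, then centre on a white canvas.
    img.thumbnail((size, size))
    canvas = Image.new("RGB", (size, size), (255, 255, 255))
    offset = ((size - img.width) // 2, (size - img.height) // 2)
    canvas.paste(img, offset)
    return np.asarray(canvas)

augmenter = ImageDataGenerator(rotation_range=40, zoom_range=0.2,
                               horizontal_flip=True, shear_range=0.2,
                               width_shift_range=0.1, height_shift_range=0.1)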

4 Results
During the training phase, accuracy is used to determine the quality of our
model. It is calculated by dividing the number of correctly classified images
by the total number of images tested. Thus, the closer it is to 1, the better
performance our model yields.
436 J. M. López-Rubio et al.

Fig. 1. Evolution of the accuracy of our model in training and validation by epochs
for dropout values 0.5 and 0.25

As we are implementing a binary classifier, it is common to use the area under


the ROC curve (Receiver Operating Characteristic) as a performance measure.
The ROC curve shows the relationship between the ratio of false positives and
false negatives as a function of the threshold used for classification. As with the
accuracy, the closer to 1 the area under the ROC curve (AUC ROC) is, the
better the classifier is.
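For reference, both measures can be computed with scikit-learn as in the following toy sketch (labels and scores are illustrative, not results from the paper):

from sklearn.metrics import accuracy_score, roc_auc_score

y_true = [1, 0, 1, 1, 0]             # 1 = photograph, 0 = painting
y_prob = [0.9, 0.2, 0.7, 0.4, 0.1]   # sigmoid outputs of the classifier

accuracy = accuracy_score(y_true, [p >= 0.5 for p in y_prob])  # thresholded at 0.5
auc_roc = roc_auc_score(y_true, y_prob)                        # threshold-free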
In Figs. 1 and 2 we can observe the evolution of the accuracy and loss in
the training and validation phases for the dropout values of 0.25 and 0.5. As
expected, the model fits better to the training data as the epochs progress. This
overfitting is the reason why it is important to use as many training images
as possible. After analyzing the behavior of the model with the two possible
dropout values, we can conclude that this factor does not decisively influence its
performance since the figures show a similar evolution.
Figure 3 shows the evolution of the AUC ROC of our proposed model using
the test images of the Carballal et al. article [2]. There are two maxima labeled on
the graphs. For a dropout of 0.25, the maximum is reached at epoch 79, obtaining
an AUC ROC of 0.942. With a dropout of 0.5, the maximum is achieved at epoch
81 with a value of 0.938. They are similar values and close to the maximum of
94.8% achieved by Carballal et al., although it should be mentioned that in their
article, precision is used as a measure instead of the AUC ROC.
The evolution of the AUC ROC for the test images from Imagenet and Kaggle
is not shown because our model reaches an AUC ROC above 0.99 after epoch
20, regardless of the dropout value we use. This leads us to conclude that our
model adapts very well to these datasets, even for images that have never been
previously shown to the neural network.
Classification of Images as Photographs or Paintings by Using CNNs 437

Fig. 2. Evolution of the loss of our model in training and validation by epochs for
dropout values 0.5 and 0.25

Fig. 3. Evolution of AUC ROC per epoch for the images from the article of Carballal
et al. [2]

This difference between the results in the Carballal images and those of
Imagenet and Kaggle may be motivated by the use of a higher resolution in the
images, which allows the network to distinguish more details and, therefore, to
perform a better classification.
In Fig. 4 we can observe that the best results for the ROC curve have been
obtained with a dropout of 0.25 and in the epoch 79, both for the Carballal
images and for the Imagenet and Kaggle images. Those curves show the qual-
ity of our classifier and how its precision evolves during the training process.
438 J. M. López-Rubio et al.

Fig. 4. ROC graphs with a dropout of 0.25 for the images from Carballal et al. [2] and
Imagenet-Kaggle (both for epoch 79)

Fig. 5. Samples of photographs classified as photographs (correctly)

In this sense, we must emphasize that increasing the number of epochs does not
improve the prediction capacity of our model, so it is reasonable to conclude
that increasing the duration of this phase will not imply a substantial change in
the results.
In Figs. 5, 6, 7, 8 we can qualitatively evaluate some results of our classifier.
In particular, in Figs. 5 and 6 we can see how our model correctly classifies the
images in the appropriate categories. It is interesting to mention that in some
cases it is not trivial for a person to distinguish whether it is a photograph
or a painting, as noted at the beginning of this paper. Regarding the wrongly
Classification of Images as Photographs or Paintings by Using CNNs 439

Fig. 6. Samples of paintings classified as paintings (correctly)

Fig. 7. Samples of photographs classified as paintings (wrongly)

classified images, they are shown in Figs. 7 and 8. After visually analyzing some
of them, we could say that there are some cases in which we cannot really
consider them as true classification errors. On the other hand, it seems that
increasing the level of detail (resolution) of the images used for training could
benefit the quality of the predictions (for instance, in the extreme case, we could
distinguish the strokes of the brush used in a hyper-realistic painting).
440 J. M. López-Rubio et al.

Fig. 8. Samples of paintings classified as photographs (wrongly)

5 Conclusion
The proposal has achieved a similar outcome to that of the latest research on
the problem. For the images from ImageNet and Kaggle, the AUC ROC is equal
to or greater than 0.99, so it can be considered as an excellent classifier. In the
case of the images provided by Carballal et al. [2], an AUC ROC above 0.94 is
achieved, which still corresponds to a very good classifier.
We have shown that determining if an image is a photograph or a painting can
be addressed by fine-tuning previously trained convolutional neural networks.
The use of dropout layers is useful for this problem, as it reduces the overfitting
of the network. Finally, the resolution of the images used for training is also an
important factor, as it could be the reason why our model behaves better with
Imagenet and Kaggle images compared to Carballal et al.
As future work, it would be possible to improve the performance of the net-
work by using a larger set of training images to reduce overfitting, implementing
other CNN architecture (such as Inception-v4 [25] or AlexNet [16]), using an
ensemble of networks, allowing the retraining of more layers of the VGG neural
network or increasing the resolution of the images used for training and vali-
dation. However, all these ideas will increase the execution time and memory
consumption during the training phase.

References
1. Vailaya, A., Jain, A.K., Zhang, H.-J.: On image classification: city vs. landscapes.
Int. J. Pattern Recognit. 31, 1921–1936 (1998)
2. Carballal, A., Santos, A., Romero, J., Machado, P., João, C.: Distinguishing paint-
ings from photographs by complexity estimates. Neural Comput. Appl. 30, 1957–
1969 (2016)
Classification of Images as Photographs or Paintings by Using CNNs 441

3. Athitsos, V., Swain, M.J., Frankel, C.: Distinguishing photographs and graphics
on the World Wide Web. In: 1997 Proceedings IEEE Workshop on Content-Based
Access of Image and Video Libraries, pp. 10–17 (June 1997). https://doi.org/10.
1109/IVL.1997.629715
4. Birkhoff, G.D.: Aesthetic Measure. Mass, Cambridge (1933)
5. Cutzu, F., Hammoud, R., Leykin, A.: Estimating the photorealism of images: dis-
tinguishing paintings from photographs. In: 2003 IEEE Computer Society Confer-
ence on Computer Vision and Pattern Recognition, 2003, Proceedings, vol. 2, pp.
II-305 (June 2003). https://doi.org/10.1109/CVPR.2003.1211484
6. Cutzu, F., Hammoud, R., Leykin, A.: Distinguishing paintings from photographs.
Comput. Vis. Image Underst. 100(3), 249–273 (2005). https://doi.org/10.1016/j.
cviu.2004.12.002
7. Deng, J., Dong, W., Socher, R., Li, L.: ImageNet: a large-scale hierarchical image
database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition,
pp. 248–255 (June 2009). https://doi.org/10.1109/CVPR.2009.5206848
8. Baldassarre, F., Morı́n, D.G., Rodés-Guirao, L.: Deep koalarization: image col-
orization using CNNs and inception-ResNet-v2. CoRR abs/1712.03400 (2017)
9. Forsythe, A., Nadal, M., Sheehy, N., Cela-Conde, C.J., Sawey, M.: Predicting
beauty: fractal dimension and visual complexity in art. Br. J. Psychol. 102(1),
49–70 (2011)
10. Luan, F., Paris, S., Shechtman, E., Bala, K.: Deep photo style transfer. In: 2017
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July
2017)
11. Gando, G., Yamada, T., Sato, H., Oyama, S., Kurihara, M.: Fine-tuning deep
convolutional neural networks for distinguishing illustrations from photographs.
Expert Syst. Appl. 66, 295–301 (2016)
12. Gatys, L.A., Ecker, A.S., Bethge, M.: A neural algorithm of artistic style. CoRR
abs/1508.06576 (2015). http://arxiv.org/abs/1508.06576
13. Kaggle Competition: Painter by numbers (2016). https://www.kaggle.com/c/
painter-by-numbers
14. Karayev, S., Hertzmann, A., Winnemoeller, H., Agarwala, A., Darrell, T.: Recog-
nizing image style. CoRR abs/1311.3715 (2013). http://arxiv.org/abs/1311.3715
15. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image
recognition. CoRR abs/1409.1556 (2014)
16. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep con-
volutional neural networks. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger,
K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 25, pp. 1097–
1105. Curran Associates, Inc. (2012)
17. Szummer, M., Picard, R.W.: Indoor-outdoor image classification. In: IEEE Inter-
national Workshop on Content-Based Access of Image and Video Databases, in
Conjunction with CAIVD 1998, pp. 42–51 (1998)
18. Machado, P., Cardoso, A.: Computing aesthetics. In: de Oliveira, F.M. (ed.) SBIA
1998. LNCS (LNAI), vol. 1515, pp. 219–228. Springer, Heidelberg (1998). https://
doi.org/10.1007/10692710 23
19. Machado, P., Romero, J., Manaris, B.: Experiments in computational aesthet-
ics. In: Romero, J., Machado, P. (eds.) The Art of Artificial Evolution. Natu-
ral Computing Series. Springer, Heidelberg (2008). https://doi.org/10.1007/978-
3-540-72877-1 18
442 J. M. López-Rubio et al.

20. Molina-Cabello, M.A., Luque-Baena, R.M., López-Rubio, E., Thurnhofer-Hemsi,


K.: Vehicle type detection by convolutional neural networks. In: Ferrández Vicente,
J.M., Álvarez-Sánchez, J.R., de la Paz López, F., Toledo Moreo, J., Adeli, H. (eds.)
IWINAC 2017, Part II. LNCS, vol. 10338, pp. 268–278. Springer, Cham (2017).
https://doi.org/10.1007/978-3-319-59773-7 28
21. Moles, A.: Théorie de l’information et perception esthétique. Rev. Philos de la
France et de l’Étranger 147, 233–242 (1957)
22. Rigau, J., Feixas, M., Sbert, M.: An information-theoretic framework for image
complexity. In: Proceedings of the First Eurographics Conference on Computa-
tional Aesthetics in Graphics. Visualization and Imaging, pp. 177–184. Eurograph-
ics Association (2005)
23. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J.
Comput. Vis. 115(3), 211–252 (2015)
24. Penhouët, S., Sanzenbacher, P.: Automated deep photo style transfer. CoRR
abs/1901.03915 (January 2019)
25. Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.A.: Inception-v4, inception-ResNet
and the impact of residual connections on learning. In: ICLR 2016 Workshop
(2016). https://arxiv.org/abs/1602.07261
26. Taylor, R.P., Micolich, A.P., Jonas, D.: Fractal analysis of Pollock’s drip paintings.
Nature 399(6735), 422 (1999)
27. Zipf, G.K.: Human Behavior and the Principle of Least Effort. Addison-Wesley
Press, Boston (1949)
Parallel Corpora Preparation
for English-Amharic Machine Translation

Yohanens Biadgligne1(B) and Kamel Smaı̈li2(B)


1
Bahir Dar Institute of Technology, Bahir Dar, Ethiopia
2
Loria - University of Lorraine, Nancy, France
[email protected]

Abstract. In this paper, we describe the development of an English-


Amharic parallel corpus and Machine Translation (MT) experiments
conducted on it. Two different tests have been achieved. Statistical
Machine Translation (SMT) and Neural Machine Translation (NMT)
experiments. The performance using the bilingual evaluation understudy
metric (BLEU) shows 26.47 and 32.44 respectively for SMT and NMT.
The corpus was collected from the Internet using automatic and semi-
automatic techniques. The harvested corpus covers the religious, legal,
and news domains. Finally, the corpus we built is composed of 225,304
parallel sentences and will be shared freely with the community. To our
knowledge, this is the largest parallel corpus for the Amharic language
so far.

Keywords: Amharic language · Machine translation · SMT · NMT ·


Parallel corpus · BLEU

1 Introduction
The field of machine translation (MT) is almost as old as the modern digital com-
puter. Over this time it has passed through many technological, algorithmic and
methodological milestones [1]. Since its emergence, various approaches have been
proposed by researchers in the domain [2,3]. Lexicon (dictionary) based MT relies
on the entries of a bilingual dictionary: the equivalent of each word is used to build
up the translated text. The first generation of machine translation (late 1940s to
1960s) was entirely based on machine-readable or electronic lexicons [4–6]. Rule-
based MT demands various kinds of linguistic resources such as morphological
analyzers and synthesizers, syntactic parsers, semantic analyzers and so on [7,8].
On the other hand, corpus-based approaches (as the name implies) require parallel
and monolingual corpora [8,9].

Supported by Bahir Dar Institute of Technology.


c Springer Nature Switzerland AG 2021
I. Rojas et al. (Eds.): IWANN 2021, LNCS 12861, pp. 443–455, 2021.
https://doi.org/10.1007/978-3-030-85030-2_37
444 Y. Biadgligne and K. Smaı̈li

1.1 Motivation

Obtaining accurate translations between two languages with a machine is an
open problem. Improvements are still expected in translation accuracy, translation
speed, the coverage of all the world's languages, etc. That is why many researchers
and organizations (for example Google, Yandex, Bing, Facebook, etc.) are working
hard to create systems that are robust and dependable. However, most of this
research has ignored the languages spoken in under-developed countries, and
Amharic is one of these languages. Even if the performance of MT systems still
needs improvement, they have been broadly used in the translation sector as
assistants for professional human translators in developed countries. As indicated
in [10], the MT market will reach $983.3 million by 2022. So, the main motivation
behind this work is to contribute to the advancement of robust MT systems for the
English-Amharic language pair and to bring Amharic into this market.

1.2 Machine Translation on English-Amharic Language Pairs

Globally, most MT research is done for the languages spoken in technologically
advanced countries. As a result, significant progress has been observed in the
development and use of MT systems for these languages. However, MT research
for languages like Amharic (which is considered under-resourced) has started only
very recently. According to the literature [8,11–15], many English-Amharic MT
studies have been conducted using SMT, which requires a large bilingual corpus.
Despite these efforts, there is still not a sufficient amount of digital data to train
an SMT model. This shortage of digital data affects the fluency of the translation
model and hence the quality of the translation.
Even though Amharic is an under-resourced language, its counterpart English
is among the richest languages in terms of data availability: a huge amount of
digital text can be found in different resources. In spite of this discrepancy, there
is a massive need for translation between these two languages. News agencies,
magazine producers, FM radios, schools, private translators of books, newspaper
producers and governmental gazettes that print their publications in both English
and Amharic versions need translation on a daily basis. So, an MT system is needed
to ease the delivery of courses taught in Ethiopian high schools and universities, to
make translation faster and more cost-effective, and to avoid biases in translation,
especially in the political domain [13–15].
The major and basic resource required for SMT is a huge parallel corpus
[16]. Unfortunately, such a corpus is not available for the Amharic language. The collection
and preparation of parallel corpora for this language is, therefore, an important
endeavor to facilitate future MT research and development. We have, therefore,
collected and prepared parallel corpora for English-Amharic Languages. This
paper will describe an attempt that we have made to collect and prepare English-
Amharic parallel corpora and the experiments conducted using the corpora.
Parallel Corpora Preparation for English-Amharic Machine Translation 445

1.3 Nature of English-Amharic Language Pairs

Amharic Language. Amharic is the second most-spoken Semitic language


on the planet, next to Arabic. Of all the languages being spoken in Ethiopia,
Amharic is the most widely spoken language. It is the official or working lan-
guage of the states within the federal system. Moreover, it is used in governmen-
tal administration, public media and mass communication (television, radio,
literature, entertainment, etc.), and national commerce. Estimates vary between
researchers; nevertheless, many believe that it has around 57 million speakers.
Outside Ethiopia, Amharic is the language of roughly 4.7 million emigrants (mainly
in Egypt, America, Israel, and Sweden). Recently, the Amharic-speaking population
has grown significantly in Britain and other European countries [17,19].
Amharic is written in its own script (a variant of the
Ge'ez script, known as Fidel), a semi-syllabic writing system
(depicted in Table 1). Amharic characters represent a consonant-vowel
(CV) sequence and the basic shape of each character is determined by the con-
sonant, which is modified for the vowel. It has 33 primary characters, each repre-
senting a consonant and each having 7 varieties in form to demonstrate the vowel
which follows the consonant (Amharic vowels are depicted in Table 2). These
33 sets of 7 shapes are the "common characters"; besides them there are
also various "diphthong characters", each representing a consonant and
a following vowel with a /wu/ sound (or, in one case, a /yu/ sound) interposed
between them. In writing, none of the diphthong characters is strictly necessary,
because the same sounds can always be represented by combinations of the common
characters, yet many of them are in frequent use and, in general, they cannot be
ignored [13,19]. Additionally, even if they are not used regularly, Amharic
has its own numerals, which are depicted in Table 3 and Table 4.
Both Amharic and the related languages of Ethiopia are written and read
from left to right, in contrast to the other Semitic languages like Arabic and
Hebrew.

– Syntactic and morphological nature of the language. Unlike English,
Amharic is a morphologically complex language. It makes use of the root-
and-pattern system [13,18,19]. A root (also called a radical) is a set of
consonants which bears the basic meaning of the lexical item, whereas a
pattern is a set of vowels inserted between the consonants of the root, as
in Arabic. This derivation process, which deals with word formation, can
create new words from existing ones, potentially changing the category of
the original word, and makes the language morphologically complex.
In addition to morphological information, some syntactic information is
also expressed at the word level. Furthermore, an orthographic word may attach
some syntactic words like prepositions, conjunctions, negation, etc. [20,21].
In this language, nominals are inflected for number, gender, etc. At
the sentence level, Amharic follows the Subject-Object-Verb (SOV) word order. On
446 Y. Biadgligne and K. Smaı̈li

Table 1. A list of Amharic scripts.

Table 2. A list of Amharic vowels and their pronunciation.

Table 3. A list of Gee’z/Amharic numbers.

Table 4. A list of Gee’z/Amharic numbers.


Parallel Corpora Preparation for English-Amharic Machine Translation 447

the contrary, English uses the Subject-Verb-Object (SVO) word order
(Amharic morphology alteration, Amharic syntactic structure and English
syntactic structure are depicted in Table 5, Table 6, and Table 7, respectively).

Table 5. An example of Amharic morphology alteration.

Table 6. Amharic syntactic structure

Table 7. English syntactic structure

Subject Verb Object


Ethiopia is in Africa

2 Related Works
Different attempts have been made to collect English-Amharic parallel corpus.
Below we summarize the works most relevant to our research. The
most recent attempt to collect English-Amharic parallel corpus is done by Gezmu
et al. [22]. They have managed to collect 145,364 English-Amharic parallel
448 Y. Biadgligne and K. Smaı̈li

sentences. The experimental results show that they achieved 20.2 and 26.6 in
BLEU score by using Phrase Based Statistical Machine Translation (PBSMT)
and NMT models respectively.
Abate et al. [8] collected the English Ethiopian Language (EEL) parallel cor-
pus. They made an attempt to collect parallel corpus for seven major languages
of Ethiopia. Amharic was one of them and, in total, they collected 40,726 English-
Amharic parallel sentences. The SMT approach applied to the collected corpus
produced 13.31 BLEU score.
The low resource languages for emergent incidents (LORELEI-Amharic) was
developed by the Linguistic Data Consortium and is comprised of a monolingual
and parallel Amharic text [23]. It has 60,884 English-Amharic sentences.
As we can observe from the above paragraphs, the largest parallel corpus
for the English-Amharic language pair was collected by Gezmu et al. [22].

3 Parallel Corpora Preparation for the Language Pairs

A corpus is a collection of linguistic data, either written texts or a transcription


of recorded speech, which can be used as a starting-point of linguistic description
or as a means of verifying hypotheses about a language [24]. A corpus is not just any
kind of text: it is a sample/collection of texts of a given language which should
be representative with regard to the research hypothesis [25]. In this section we
discuss, step by step, the tasks we have accomplished to collect our bilingual
parallel corpus. The workflow of this process is depicted in Fig. 1.

3.1 Selection of Data Sources

A high quality parallel corpus is crucial for creating SMT or NMT systems [26].
Although high quality parallel corpora are widely available for the official languages
of the European Union, the United Nations and other organizations, it is hard
to find a sufficient amount of open parallel data for languages like Amharic. So,
the only option we have is to create this corpus ourselves. To do that, we first
identified domains with an abundant amount of information in text format. After
identifying the domains, we collected the raw digital texts from the Internet. The
collected text data fall under the religious, legal and news domains, for which the
Amharic text has a corresponding translation in English. Even if there is no shortage
of data for English, these are the domains with a large amount of digital text data
for the Amharic language.
Parallel Corpora Preparation for English-Amharic Machine Translation 449

Fig. 1. Process of collecting parallel data.

3.2 Collection of Crude Data


In this work, we used different tools and techniques to collect the parallel cor-
pus. As the main tools, HTTrack and Heritrix were used to crawl and archive
different websites and news blogs [27,28]. Additionally, we downloaded a consid-
erable amount of legal documents from different sources. Finally, we extracted
the parallel-aligned text data from the collected raw data and merged it into a
single UTF-8 file for each language. So far, we have collected a total of 225,304
sentences for each language. Table 8 shows detailed information about
our corpus.
450 Y. Biadgligne and K. Smaı̈li

Table 8. Detailed information about the parallel corpus

Domain Number of sentences


Religion 194,023
Law 14,515
News 16,766
Total 225,304

The total corpus consists of 2,814,888 and 4,068,831 tokens (words) for the
Amharic and English languages, respectively. These figures reveal an interesting
fact: in our corpus, English uses approximately 1.44 words to express an idea
that is written in Amharic with one word.

3.3 Data Pre-processing

Data pre-processing is an important and basic step in preparing parallel corpora.


Since the collected parallel data have different formats, punctuation marks and
other unimportant content, it is very difficult and time-consuming to prepare a
usable corpus from them. As part of the pre-processing, unnecessary links, numbers,
symbols and foreign texts in each language have been removed. Additionally,
character normalization, sentence tokenization, sentence level alignment, true-
casing and cleaning are performed [16].

– Character normalization. There are characters in Amharic that play similar
roles and are redundant. To avoid words with the same meaning being treated
as different words, we replaced each such set of characters with the single most
frequently used character of the set. For example, several characters represent
the sound /hə/ and others the sound /sə/; each group was replaced by one
canonical character (a normalization sketch is given after this list). Additionally,
Amharic numerals were converted to Arabic numerals (1, 2, 3, ..., 9).
– Tokenization. Tokenization or segmentation is a wide concept that covers
simple processes such as separating punctuation from words, or more sophisti-
cated processes such as applying morphological treatments. Separating punc-
tuation and splitting tokens into words or sub-words has proven to be helpful
to reduce vocabulary and increase the number of examples of each word,
improving the translation quality for certain languages [29]. Tokenization is
more challenging when dealing with languages that have no separator between
words, but this is not the case in this work: both languages inherently use
word-level tokenization. The main task done at this stage was separating
words from punctuation marks.
– True-casing. We perform this task in order to ensure the proper capital-
ization of every sentence in the corpora. To achieve this we used the Moses
built-in truecaser script. This pre-processing is done only for the English lan-
Parallel Corpora Preparation for English-Amharic Machine Translation 451

built-in truecaser script. This pre-processing is done only for English Lan-
guage. Because, grammatically every English sentence should start with an
uppercase letter but this is not the case for Amharic. Means, there is not
uppercase and lowercase letters in Amharic character map [30].
– Cleaning. This step is performed to remove empty lines, to remove redundant
spaces between characters and words, and to discard very long sentences from
the parallel corpus [31]. At this stage we only keep sentences at most 80 words
long. After performing this task, the total number of sentences is reduced from
the collected 225,304 to 218,365 sentences.
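A hedged sketch of these pre-processing steps is given below; the normalization map shows only an illustrative subset of redundant characters, and the helper names are our own.

import re

# Illustrative subset of redundant Amharic characters mapped to one canonical
# form (e.g. the /hə/ and /sə/ series); the full table used is larger.
NORMALIZE = {"ሐ": "ሀ", "ኀ": "ሀ", "ሠ": "ሰ"}

def normalize(line: str) -> str:
    for src, dst in NORMALIZE.items():
        line = line.replace(src, dst)
    return line

def tokenize(line: str) -> str:
    # Separate punctuation (including the Ethiopic full stop "።") from words.
    return re.sub(r"([።፣፤!?.,;:])", r" \1 ", line).strip()

def keep_pair(src: str, tgt: str, max_len: int = 80) -> bool:
    # Cleaning: drop empty lines and sentence pairs longer than 80 tokens.
    return 0 < len(src.split()) <= max_len and 0 < len(tgt.split()) <= max_len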

4 Experiments and Results

4.1 Experimental Setup

We present two different methods for translation: we used Moses and OpenNMT
to train statistical and neural network based MT systems, respectively.

– SMT experimental setup. Creating an SMT system involves two main steps:
creating the language model and training the translation system. A statistical
language model is a probability distribution over sequences of words and assigns
a probability to every string in the language [32]. Our language model is built
on the target language, Amharic. We used KenLM to create a 3-gram language
model; in total, 225,304 sentences were used for this purpose. After the language
model was created, the next step was training the translation system. This
process enables the model to grasp the relationship between English and
Amharic. The model was trained with our pre-processed and cleaned parallel
corpus (218,365 parallel sentences). As part of the training, word alignment
(using GIZA++), phrase extraction and scoring are done; lexical reordering
tables were also created. Then we binarised the model for quick loading at the
testing stage. Mathematically, the translation model is described by Eqs. (1)
and (2), where a denotes the Amharic language and e the English one. Before
testing, the translation model should be tuned on another, unseen data set.
This process enables us to adjust the training parameters, which come with
default values but should be adjusted for each new corpus. To tune our
translation model we used a distinct small parallel corpus of 3,121 sentences,
tokenized and true-cased before use.

                    P(a|e) = P(e|a) P(a) / P(e)                            (1)

                    â = argmax_a P(a|e) = argmax_a P(e|a) P(a)             (2)
452 Y. Biadgligne and K. Smaı̈li

Testing is the final stage of our SMT experiment. At this stage we measure
how fluent our translation model is. For this purpose we used a distinct test
corpus of 2,500 sentences, tokenized and true-cased before use. Since our goal
is to translate from English to Amharic, we tested our translation model by
providing the source-language test corpus (the English sentences). Finally, our
translation model translates these English sentences into their Amharic version.
– NMT experimental setup. For this experiment we used OpenNMT: Neural
Machine Translation Toolkit [34]. As for the SMT experiment, the corpus was
split into three parts: training, validation and test sets. Then we perform Byte
Pair Encoding (BPE). BPE enables open-vocabulary NMT by encoding rare and
unknown words as sequences of sub-word units, based on the intuition that
various word classes are translatable via units smaller than words (a BPE sketch
is given after Table 9). The next step is preprocessing; it computes the
vocabularies from the most frequent tokens, filters out sentences that are too
long, and assigns an index to each token. Training is the main and most
time-consuming task in this experiment. To train our NMT model we used
Recurrent Neural Networks (RNN) with attention mechanisms, because attention
has been shown to produce state-of-the-art results in machine translation and
other natural language processing tasks. The attention mechanism takes two
sentences, turns them into a matrix where the words of one sentence form the
columns and the words of the other form the rows, and then makes matches,
identifying relevant context [35]. This is very useful in machine translation.
Training our RNN model takes approximately eight and a half hours on a
GPU-equipped device. The detailed parameters of the RNN model are given
in Table 9.

Table 9. Parameters and values of RNN model

Parameters Values
Hidden units 512
Layers 6
Word vec size 512
Train steps 20000
Batch size 4096
Label smoothing 0.1
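As a hedged illustration of the BPE step mentioned above, the following sketch uses SentencePiece; the toolkit, file names and vocabulary size are assumptions, since the paper does not state which BPE implementation was used.

import sentencepiece as spm

# Learn a joint BPE model on the tokenized training corpora.
spm.SentencePieceTrainer.train(input="train.en,train.am",
                               model_prefix="enam_bpe",
                               vocab_size=16000, model_type="bpe")

sp = spm.SentencePieceProcessor(model_file="enam_bpe.model")
pieces = sp.encode("rare words are split into subword units", out_type=str)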

4.2 Experimental Results

With this experiment, we created SMT and NMT models for English-Amharic
translation. These two languages are different in nature: they differ in language
family, script, morphology and syntax. Nonetheless, we build and evaluate our
SMT and NMT translation models for the language
Parallel Corpora Preparation for English-Amharic Machine Translation 453

pairs. We used the BLEU metric to evaluate the performance of our models. The
BLEU metric is an algorithm for evaluating the quality of machine translated
texts from a source text with reference translations of that text, using weighted
averages of the resulting matches. Accordingly, the obtained results are described
in Table 10.

Table 10. Comparison of our work with other similar works.

Therefore, from Table 10 we can observe that our NMT model shows better
translation accuracy than that of the SMT system. The translation accuracy is
increased by 22.55%. According to [36], the BLEU score of our NMT model falls
between 30 and 40 (it is 32.44), which means that the NMT output ranges from
understandable to good translations when compared with the reference
translations.
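As an illustration, corpus-level BLEU can be computed with sacrebleu as in the toy sketch below (the authors' exact evaluation script is not specified):

import sacrebleu

hypotheses = ["the model translates this sentence"]       # system outputs
references = [["the model translated this sentence"]]     # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(round(bleu.score, 2))   # BLEU on the 0-100 scale used in Table 10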
Different attempts have been made to create an English-Amharic parallel cor-
pus, and along with them some SMT and NMT experiments have been conducted.
For example, Abate et al. [8] collected 40,726 parallel sentences and their SMT
model achieved a BLEU score of 13.31. Additionally, Ambaye and Yared [37]
performed the same (SMT) experiment using their own corpus and registered
18.74 on the BLEU metric.
On the other hand, even if they are very limited in number, some NMT
experiments have also been done for this language pair. Most recently, Gezmu
et al. [22] used NMT models and obtained a 26.6 BLEU score. In [38], the authors
collected an English-Amharic parallel corpus and conducted an NMT experiment
on it; as indicated in their paper, the BLEU score ranges between 10 and 12 for
corpora of different sizes. Overall, by comparing our experiment with the
aforementioned attempts, we can say that our research shows an advancement
in corpus size and BLEU score for both SMT and NMT.

4.3 Conclusion

MT needs quite a large amount of parallel data, but most of the research
conducted for the Amharic language uses a small number of parallel sentences,
on the order of thousands or tens of thousands. This is due to the difficulty of
finding an abundant amount of digital text translated from the English version.
In addition to the limited size, the quality of the available translated documents
is not good (they cannot be directly used for MT purposes).
454 Y. Biadgligne and K. Smaı̈li

The main objective of this study was to alleviate the aforementioned problem.
That is to collect a sizable amount of clean parallel corpus for English-Amharic
language pairs. After a prolonged effort, so far we have manged to collect 225,304
parallel and clean sentences. In order to make sure that this parallel corpus is
usable for MT or not, we conducted two different experiments. The Results
obtained by the two models (SMT and NMT) are promising and our created
corpus could be used as a good benchmark corpus which will be proposed for
free for the community. Generally, according to the BLEU score interpretation
and the results registered by the two models; we can conclude that the prepared
parallel corpus is usable for MT researches.

References
1. Slocum, J.: A survey of machine translation: its history, current status and future
prospects. Comput. Linguist. 11(1), 1–17 (1985)
2. Antony, P.J.: Machine translation approaches and survey for Indian languages. Int.
J. Comput. Linguist. Chin. Lang. Process. 18(1), 47–78 (2013)
3. Hutchins, J.: Latest developments in machine translation technology: beginning a
new era in MT research. In: Proceedings MT Summit IV.: International Coopera-
tion for Global Communication, pp. 11–34 (1993)
4. Ashraf, N., Manzoor, A.: Machine translation techniques and their comparative
study. Int. J. Comput. Appl. 125(7), 25–31 (2015)
5. Lambert, P., Rafael, E., Núria, C.: Exploiting lexical information and discrimi-
native alignment training in statistical machine translation. Diss. Ph. D. thesis,
Universitat Politecnica de Catalunya. Spain (2008)
6. Poibeau, T.: Machine Translation. MIT Press, Cambridge (2017)
7. Antony, P.J., Soman, K.P.: Computational morphology and natural language pars-
ing for Indian languages: a literature survey. Int. J. Sci. Eng. Res. 3, 589–599 (2012)
8. Abate, S.T., et al.: Parallel corpora for bi-directional statistical machine translation
for seven Ethiopian language Pairs. In: Proceedings of the First Workshop on
Linguistic Resources for Natural Language Processing (2018)
9. Romdhane, A.B., Jamoussi, S., Hamadou, A.B., Smaı̈li, K.: Phrase-based language
model in statistical machine translation. Int. J. Comput. Linguist. Appl. 3 (2016)
10. https://www.grandviewresearch.com/press-release/global-machine-translation-
market . Accessed 03 June 2021
11. Gebreegziabher, M., Besacier, L.: English-Amharic Statistical Machine Translation
(2012)
12. Teshome, E.: Bidirectional English-Amharic machine translation: an experiment
using constrained corpus. Master’s thesis. Addis Ababa University (2013)
13. Teferra, A., Grover, H.: Essentials of Amharic. Rüdiger Köppe. Verlag, Köln (2007)
14. Daba, J.: Bi-directional English-Afaan oromo machine translation using hybrid
approach. Master’s thesis. Addis Ababa University (2013)
15. Saba, A., Sisay F.: Machine translation for Amharic: where we are. In: proceedings
of LREC, pp. 47–50 (2006)
16. Rauf, S., Holger, S.: Parallel sentence generation from comparable corpora for
improved SMT. Mach. Transl. 25(4), 341–375 (2011)
17. Abiodun, S., Asemahagn, A.: Language policy, ideologies, power and the Ethiopian
media. Communicatio 41(1), 71–89 (2015). https://doi.org/10.1080/02500167.
2015.1018288
Parallel Corpora Preparation for English-Amharic Machine Translation 455

18. Leslau, W.: Reference Grammar of Amharic. Otto Harrassowitz, Wiesbaden (1995)
19. Yimam, B.: Root reductions and extensions in Amharic. Ethiop. J. Lang. Lit. 9,
56–88 (1999)
20. Gasser, M.: A dependency grammar for Amharic. In: Workshop on Language
Resource and Human Language Technologies for Semitic Languages (2010)
21. Gasser, M.: HornMorpho: a system for morphological processing of Amharic,
Oromo, and Tigrinya. In: Conference on Human Language Technology for Devel-
opment, Alexandria, Egypt (2011)
22. Gezmu, A.M., Nürnberger, A., Bati, T.B.: Extended parallel corpus for Amharic-
English machine translation. arXiv e-prints, arXiv-2104 (2021)
23. Strassel, S., Jennifer, T.: LORELEI language packs: data, tools, and resources for
technology development in low resource languages. In: Tenth International Con-
ference on Language Resources and Evaluation, pp. 3273–3280 (2016)
24. John, S.: Corpus Concordance Collection. OUP, Oxford (1991)
25. Crystal, D.: An Encyclopedic Dictionary of Language and Languages. Blackwell,
Oxford (1992)
26. Dogru, G., Martı́n-Mor A., Aguilar-Amat, A.: Parallel corpora preparation for
machine translation of low-resource languages: Turkish to English Cardiology Cor-
pora (2018)
27. HTTrack Website Copier Homepage. https://www.httrack.com/page/2/. Accessed
10 Oct 2020
28. Heritrix Home Page. http://crawler.archive.org/index.html. Accessed 15 Sep 2020
29. Palmer, D.D.: Tokenisation and sentence segmentation. Handbook of Natural Lan-
guage Processing, pp. 11–35 (2000)
30. Lita, L.V., Ittycheriah, A., Roukos, S., Kambhatla, N.: tRuEcasIng. In: Proceed-
ings of the 41st Annual Meeting of the Association for Computational Linguistics,
Sapporo, Japan, pp. 152–159 (2003)
31. Achraf, O., Mohamed, J.: Designing high accuracy statistical machine translation
for sign language using parallel corpus-case study English and American sign lan-
guage. J. Inf. Technol. Res. 12(2), 134–158 (2019)
32. Goyal, V., Gurpreet, S.: Advances in machine translation systems. Lang. India
9(11), 138–150 (2009)
33. Daniel, J., James, H.M.: Speech and Language Processing. Handbook of Natural
Language Processing. Draft of October 2 (2019)
34. Klein, G., Kim, Y., Deng, Y., Senellart, J., Rush, A.: OpenNMT: open-source
toolkit for neural machine translation. In: Proceedings of ACL, System Demon-
strations. Association for Computational Linguistics, Vancouver (2017)
35. Young, T., Hazarika, D., Poria, S., Cambria, E.: Recent trends in deep learning
based natural language processing. IEEE Comput. Intell. Mag. 13(3), 55–75 (2018)
36. Google Cloud Home Page. https://cloud.google.com/translate/automl/docs/
evaluate. Accessed 03 Jan 2021
37. Ambaye, T., Yared, M.: English to Amharic machine translation. The Prague Bul-
letin of Mathematical Linguistics (2012)
38. Yeabsira, A., Rosa, T., Surafel, L.: Context based machine translation with recur-
rent neural network For English-Amharic translation. In: Proceedings of ICLR
2020 (2020)
Fast Depth Reconstruction Using Deep
Convolutional Neural Networks

Dmitrii Maslov1 and Ilya Makarov1,2(B)


1
HSE University, Moscow, Russia
[email protected]
2
Artificial Intelligence Research Institute, Moscow, Russia

Abstract. In this paper, we study depth reconstruction via RGB-based,


Sparse-Depth, and RGBd approaches. We showed that combination of
RGB and Sparse Depth approach in RGBd scenario provides the best
results. We also proved that the models performance can be further tuned
via proper selection of architecture blocks and number of depth points
guiding RGB-to-depth reconstruction. We also provide real-time archi-
tecture for depth estimation that is on par with state-of-the-art real-time
depth reconstruction methods.

Keywords: Depth reconstruction · Computer vision · Deep


convolutional neural networks

1 Introduction
Depth estimation and sensing play an important role in a wide range of engi-
neering applications such as self-driving, robotics, augmented reality (AR), and
video mapping. However, existing depth sensors, including LiDARs, structured
light depth sensors, and stereo cameras have their limitations. For example, the
best 3D LiDARs sensor is quite expensive (about $ 75,000 per unit), and yet they
only provide sparse data for distant objects. Depth sensors based on structured
light (for example, Kinect) are energy-intensive and sensitive to sunlight while
having a short sensor range. Finally, stereo cameras require careful calibration
for accurate triangulation, which is computationally intensive and usually fails
in unrecognized regions. Because of these limitations, there has always been a
strong interest in assessing depth using a single camera, being small, relatively
cheap, energy-efficient, and ubiquitous in the consumer electronics market.
Despite a decade of research into RGB-based depth estimation, including
recent advances in deep learning approaches, the accuracy and reliability of such
methods are far from practical. For example, State of the art RGB-based depth
estimation methods [7] have an RMS error of about 50 cm in indoor images (eg,
NYU-Depth-v2 dataset [19]). If we consider images outdoors (for example, on
the KITTI dataset [2]), then these methods show themselves even worse, having
error close to 4 m.
The limitations of full-resolution depth reconstruction inherent in RGB-based
depth estimation techniques can be overcome by using RGB data in conjunction
c Springer Nature Switzerland AG 2021
I. Rojas et al. (Eds.): IWANN 2021, LNCS 12861, pp. 456–467, 2021.
https://doi.org/10.1007/978-3-030-85030-2_38
Fast Depth Reconstruction Using Deep Convolutional Neural Networks 457

with sparse depth data. Many applications provide access to sparse data. An
example is the relatively cheap LiDARs, which are low-resolution sensors. These
sensors provide sparse data. There are also other ways to obtain sparse data:
it can be calculated using the output of the SLAM algorithm. If we consider
the ORB-SLAM algorithm [16], then it allows tracking hundreds of points (3D
landmarks) on each frame.
In this work, three methods of depth reconstruction will be compared: based
on RGB data, sparse depth data, and a combination of RGB and sparse depth
data. The impact of the number of instances with sparse data on the target
metrics will also be discussed.
The rest of the work is organized as follows. Section 2 briefly reviews the early
work on depth reconstruction and discusses the main approaches to solving the
problem of depth estimation. Section 3 describes the used neural network archi-
tectures. We present a description of the main blocks of the network, together
with the dataset, loss function choice, and training pipeline. Section 4 deals with
the system configuration with which this work was performed, and the quality
metrics used to assess the effectiveness of the experiments. It presents the results
of experiments carried out within the framework of this work related to the field
of fast real-time depth estimation.

2 Related Work
Earlier work in the field of depth reconstruction with only RGB images, usu-
ally used features created by hand (feature engineering), as well as probabilistic
graphical models, for e.g., [18], in which the absolute values of individual pieces
of images were estimated, and then the depth image was obtained using the
model of Markov random fields. Also in the early works, there were nonpara-
metric approaches [3], in which the depth was estimated by combining the depth
with similar photometric content.
As a result of the recent boom in deep learning, there have been success-
ful attempts to use neural networks in depth estimation problems. Due to the
fact that the task is close to the task of semantic labeling, most of the work
is built around architectures that have shown their best side in ILSVRC (Ima-
geNet Large Scale Visual Recognition) tasks [17]. In these works as basic network
researchers often use used AlexNet or the deeper VGG network. One of the first
successful attempts to solve the problem of depth estimation using deep convolu-
tional neural networks was presented in [1], which used a two-stage architecture:
the first stage, based on AlexNet, outputs the compressed output of the image,
and the second stage decompresses to the original resolution. In the future, there
were successful attempts to use other networks instead of AlexNet, such as VGG
and ResNet. ResNet was first presented in [3]. A prerequisite for the creation
of ResNet was the problems of training deep neural networks associated with
gradient vanishing. This problem was solved by adding “shortcut connections”.
Their idea is that the output of the layer is thrown through several layers for-
ward. Thus, the ResNet architecture has shown itself to be excellent in solving
the problem of estimating image depth.
458 D. Maslov and I. Makarov

The RGB-based depth estimation approach is not unique: the sparse data-
based depth estimation approach is also widespread. An example of an approach
to sampling depth is [8], which creates an input image of a sparse depth based
on a known depth (ground truth). During training, the input sparse depth is
sampled from ground truth on the fly. More precisely, a certain number of depth
instances are selected, and then the Bernoulli probability is calculated, relative to
the total number of available pixels from the available ground truth depth. Thus,
for each pixel, such a probability is calculated, and as a result, the real number
of non-zero depth pixels is varied for each element of the train sample, depending
on the number of initially selected depth instances. The main idea of using this
approach is to increase the train sample size and increase the robustness of
the network. Sparse Depth values can be obtained using low-resolution depth
sensors (LiDAR) or using the feature-based SLAM (Simultaneous localization
and mapping) algorithm, both efficiently helping fast dense depth reconstruction
as shown in [9–12,14] and depth super-resolution [4,13].
The Sparse Depth approach performs much better than the purely RGB
approach. This is due to the fact that methods based on using only RGB are
too unreliable since a single image itself does not contain any information about
the depth of space. There is also a third approach, which concatenates RGB
values with Sparse Depth. This paper will compare the effectiveness of these
approaches in the Results section.

3 Methodology

This section considers the architecture of the convolutional neural network used
in this work. We will consider various blocks of neural networks that can be
used to solve the problem of depth estimation. This section describes the dataset
used in the work. The applied augmentation will be described, as well as the loss
functions that can be used to solve the current problem.

3.1 Neural Network Architectures

In this work, the encoder-decoder architecture was chosen as the architecture of


the convolutional neural network, the first part of which compresses the image,
and the second expands it to the size of the original image. This architecture
has proven itself in solving other problems in the field of computer vision, for
example, in the task of image segmentation (U-Net architecture).
ResNet architecture is used as an encoder. There are various variations of
ResNet, differing from each other in the number of layers (18, 34, 50). In this
work, the ResNet-18 architecture will be used, due to the faster network learning.
A more detailed description of the learning process will be given in the Experi-
ments section. ResNet-18 weights are pre-trained on ImageNet. Moreover, Aver-
age Pooling and Fully connected layers were removed from the network. Instead,
a convolutional layer with a 1 × 1 kernel with batch normalization was added.
As a decoder, a set of upsample layers (4 in total) is used, which ends with a
Fast Depth Reconstruction Using Deep Convolutional Neural Networks 459

bilinear upsample layer, the main role of which is to increase the resolution of the
representation to the resolution of the original image using bilinear interpolation.
Figure 1 shows the architecture used.
The following blocks will be considered as the upsampling layer:

– deconv 2 × 2,
– deconv 3 × 3,
– upproj,
– upconv.

Figure 2 shows the architecture of these blocks.

Fig. 1. The architecture of our convolutional neural network

Fig. 2. Upsampling blocks used in Decoder [6]. The main idea is to use efficient upsam-
pling by retaining general-level features across decoder blocks.

3.2 Dataset

The NYU-Depth-v2 dataset consists of room images. More specifically, this


dataset contains 464 different indoor scenes that were captured using Microsoft
Kinect. For train sampling, 249 objects are used, while for test sampling, 215
objects are used. To be more precise, the test sample consists of 654 images, and
460 D. Maslov and I. Makarov

the train sample consists of 5948 images. The original resolution of the images
is 640 × 480. To improve the learning speed, the images are reduced in half and
then cropped in the center, resulting in a final image with a resolution of 304 ×
228.
In this work, data augmentation is also performed. The following random
transformations are applied to images:
– Rotations (±5◦ );
– Scaling
– Reflections(horizontal);
– RGB image normalization.
To avoid the appearance of various artifacts in the image during rotation and
scaling, interpolation based on nearest neighbor was used instead of bicubic or
bilinear interpolations.

3.3 Loss Function Choice


When solving the regression problem, it is customary to use the mean square
error (MSE, L2) as the default loss function. This loss function is sensitive to
outliers as it penalizes more in the event of a large error. Alternatively, one can
use visual perceptive loss [9,10] or the Huber loss function, which is described
in [8]. This function is defined as
l2 + c2
B(l) = {| l |, if | l |≤ c; , otherwise}.
2c
This loss function uses the batch-dependent parameter c: it is calculated as
the 20% largest mean absolute error (MAE) at all pixels in the batch. Thus, the
Huber loss function in the case when the bit error is less than the parameter
c, behaves like mean average error (MAE, L1), and behaves approximately like
MSE (L2), in the case when the error exceeds the value of the parameter c.
According to [8], the L1 loss function performed best when solving the problem
of RGB-based depth prediction. Thus, in this work, the loss function L1 will be
used.

4 Experiments
This section describes the configuration of the system on which the experiments
were carried out. We also consider the main error metrics by which the quality
of the model was assessed. It also provides information regarding the training of
neural networks, as well as the sequence of experiments carried out in this work.

4.1 Hardware Configuration


All training iterations and experiments were carried out using a GTX-1060 with
Max-Q Design video card with 6 Gb video memory, as well as an Intel (R) i7-
7700HQ central processor. The system has 16 Gb of RAM. The operating system
is Windows 10 Pro. The entire pipeline was executed using the free open source
Anaconda distribution with Python version 3.6.5.
Fast Depth Reconstruction Using Deep Convolutional Neural Networks 461

4.2 Quality Metrics

During the experiments, the following error metrics were used:


– Root mean square error (RMSE);
– Average absolute relative error (Rel);
– Percentage of pixels for which the relative error does not exceed a certain
threshold (δi ).

RMSE values can be thought of as the standard deviation of residuals (pre-


diction errors). This measure shows how scattered the resulting residuals are.
The lower the RMSE, the better. Unlike the RMSE, the ‘Rel’ error shows how
different the objects are, relative to the original object. The smaller the Rel, the
better. The third metric δi was used in [8]. It is calculated using the following
formula:
cardinality({yˆi : max( yyˆii , yyˆii ) < 1.25i })
δi =
cardinality({yi })
In this formula yˆi and yi , respectively, are the prediction and the ground
truth. The higher the δi , the better. In [8], during the experiments, RMSE, Rel,
δ1 , δ2 , δ3 were considered. In this work, the following metrics will be considered:
RMSE, Rel, δ1 .

4.3 Experiment Setting

In this work, neural networks were trained over 15 epochs. Each epoch took
approximately 45 min. Thus, the complete training took place in 11 h. The learn-
ing rate is initially 0.01, and decreases by 20% every 5 epochs. We also added
regularization: weight decay with λ = 10−4 . Adam was chosen as the optimizer.
All experiments carried out in this work will be described below.
The first experiment compares three approaches: RGB-based, Sparse-Depth,
and RGBd (which is essentially a combination of the first two approaches). For
this comparison, three iterations of training in the network were carried out, in
each of which the input data were changed.
In the case of the RGB-based approach, an RGB image (228 × 304 × 3)
is used as an input. In the case of the Sparse-Depth approach, the input is a
sparse depth image obtained as a result of applying the Depth Sampling strategy
described in Sect. 2. In this case, the input is an object of size 228 × 304 × 1. At
the same time, before starting training, it is necessary to select the number of
depth points M that will be selected from the depth map with M = 100 as the
default value. In a subsequent experiment, the influence of the number of depth
points on the metrics RMSE, Rel, δ1 will be studied. In the case of the RGBd
approach, the RGB image is concatenated with a map of depth points. Thus,
the input of the neural network receives an object of size 228 × 304 × 4. In this
approach, as in the previous one, it is necessary to select the number of depth
points (M). As a result of this experiment, a comparison table will be built for
all three approaches. The results are shown in Table 1.
462 D. Maslov and I. Makarov

Table 1. First experiment on comparing RGB, Sparse Depth and RGBd approaches

Setting Method RMSE Rel δ1


RGB Ours-RGB 0.565 0.163 0.773
Fancheng et al. [8] 0.514 0.143 0.810
Laina et al. [5] 0.573 0.127 0.811
Sparse Depth Ours-SparseDepth 0.301 0.065 0.950
(#Samples = 100)
Fancheng et al. [8] 0.259 0.054 0.963
(#Samples = 200)
RGBd Ours-RGBd 0.278 0.059 0.958
(#Samples = 100)
Fancheng et al. [8] 0.281 0.059 0.955
(#Samples = 50)
Fancheng et al. [8] 0.230 0.044 0.971
(#Samples = 200)

It can be seen from Table 1 that RGB methods are far behind the Sparse
Depth and RGBd methods. At the same time, RGBd demonstrates a two-fold
improvement in comparison with the RGB method in the RMSE metric. It
should be noted that the state-of-the-art method presented in [5] demonstrates
the best results for the metric Rel and δ1 .
As we can see, in the case of using more depth points in the RGBd and Sparse
Depth approaches, the results are significantly improved. Therefore, instead of
100 depth samples, more should be used. Figures 3 and 4 show examples of depth
reconstruction for RGB-based and RGBd approaches.
In the second experiment, the influence of the number of depth points in
the RGBd approach on the main quality metrics will be studied. We perform 4
training iterations with a different number of depth points fed to the input. The
following M values will be used: 50, 100, 150, 200. As a result of this experiment,
a comparison Table 2 shows the dependence of quality metrics on the number
of depth samples.

Table 2. Second experiment on dependence of quality on number of depth points in


RGBd model

# of Depth points RMSE Rel δ1


50 0.317 0.070 0.948
100 0.278 0.059 0.958
150 0.267 0.058 0.964
200 0.248 0.052 0.967
Fast Depth Reconstruction Using Deep Convolutional Neural Networks 463

Fig. 3. An example of RGB-based depth reconstruction on images from the NYU


dataset. The results show over-smoothed silhouettes of objects and general room depth
information while not retaining details due to small number of parameters used in real-
time model.

Fig. 4. An example of RGBd-based depth reconstruction on images from the NYU


dataset. The results show silhouettes of objects and room while not retaining details
due to small number of parameters used in real-time model.

It can be seen from Table 2 that in the case of using a larger number of
depth samples compared to the number that we used in the first experiment,
the results for all metrics improve. Figures 5, 6, 7 show the dependence of the
quality metrics RMSE, Rel, and δ1 on the number of depth points. From the
464 D. Maslov and I. Makarov

Figs. 5, 6 and 7, we see that at some point it becomes pointless to add more
depth points. This moment just happens within the number 200.

Fig. 5. Dependence of the RMSE metric on the number of Depth points in the RGBd
approach

Fig. 6. Dependence of the Rel metric on the number of Depth points in the RGBd
approach

Finally, in the third experiment, four training iterations will be carried out
using the RGBd approach, with each iteration changing the upsample blocks in
the Decoder part of the grid. Thus, the following blocks will be used in Decoder:
Deconv2, Deconv3, UpProj, UpConv as described in Sect. 3.
Fast Depth Reconstruction Using Deep Convolutional Neural Networks 465

Fig. 7. Dependence of the δ1 metric on the number of Depth points in the RGBd
approach

It should be noted that in the first and second experiments, Deconv3 was used
as the upsample block, since it will show the best results in the third experiment
on different neural network architectures as shown in Table 3. All four iterations
used the RGBd approach with default # of depth points M = 100.

Table 3. Third experiments on neural network architectures comparison

Upsampling layer RMSE Rel δ1


Deconv2 0.288 0.062 0.957
Deconv3 0.278 0.059 0.958
UpConv 0.289 0.064 0.957
UpProj 0.286 0.062 0.958

Based on Table 3, we can see that using Deconv3 as the upsampling layer
in the decoder shows the best results in all metrics. At the same time, UpProj
shows the same result in δ1 and is the second in terms of results after Deconv3.
UpConv proved to be the worst. Thus, the results obtained in this work are
slightly worse than the results presented in [8].
– RMSE: 0.248 versus 0.230;
– Rel: 0.052 versus 0.044;
– δ1 : 0.967 versus 0.971.
Most likely, the main reason is the chosen Encoder: in our work, we used
ResNet-18, and in [8] authors used ResNet-50. In our work, experiments were
carried out only on the NYU dataset, while in [8] experiments were carried out
not only with NYU, but also with the popular KITTI dataset. To improve the
current work, it is necessary to conduct research on different datasets.
466 D. Maslov and I. Makarov

5 Conclusion
In this paper, we studied three approaches to solving the problem of depth
reconstruction: RGB-based, Sparse-Depth, and RGBd. In Table 1, we showed
that the RGBd approach demonstrates the result 2.03 times better than the
RGB-based approach. As shown in Table 2, the RGBd method with 200 depth
points was found to perform 1.28 times better than the RGBd method using 50
depth points. It was found that after adding 200 points, the results practically do
not change. A comparison was made of 4 upsampling blocks: Deconv2, Deconv3,
UpConv, UpProj, with Deconv3 outperforming other blocks.
We leave research on fast monocular depth estimation for the future work
on improving our results via recurrent approaches [15]. The main problem of
existing solutions that they are highly demanding in GPU and could not perform
in real-time for RGB resolution. We showed that one can obtain sufficient quality
for basic tasks such as collision-free navigation of perceptive in 3D scene while
running in real-time on low resource hardware setting.

References
1. Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using
a multi-scale deep network. arXiv preprint arXiv:1406.2283 (2014)
2. Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? the kitti
vision benchmark suite. In: 2012 IEEE Conference on Computer Vision and Pattern
Recognition, pp. 3354–3361. IEEE (2012)
3. Karsch, K., Liu, C., Kang, S.B.: Depth extraction from video using non-parametric
sampling. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.)
ECCV 2012. LNCS, vol. 7576, pp. 775–788. Springer, Heidelberg (2012). https://
doi.org/10.1007/978-3-642-33715-4 56
4. Korinevskaya, A., Makarov, I.: Fast depth map super-resolution using deep neu-
ral network. In: 2018 IEEE International Symposium on Mixed and Augmented
Reality Adjunct (ISMAR-Adjunct), pp. 117–122. IEEE (2018)
5. Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., Navab, N.: Deeper depth
prediction with fully convolutional residual networks. In: 2016 Fourth International
Conference on 3D Vision (3DV), pp. 239–248. IEEE (2016)
6. Li, Q., et al.: Deep learning based monocular depth prediction: Datasets, methods
and applications. arXiv preprint arXiv:2011.04123 (2020)
7. Liu, F., Shen, C., Lin, G.: Deep convolutional neural fields for depth estimation
from a single image. In: Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pp. 5162–5170 (2015)
8. Ma, F., Karaman, S.: Sparse-to-dense: depth prediction from sparse depth sam-
ples and a single image. In: 2018 IEEE International Conference on Robotics and
Automation (ICRA), pp. 4796–4803. IEEE (2018)
9. Makarov, I., Aliev, V., Gerasimova, O.: Semi-dense depth interpolation using deep
convolutional neural networks. In: Proceedings of the 25th ACM International
Conference on Multimedia, pp. 1407–1415. MM ’17, Association for Computing
Machinery, New York, NY, USA (2017). https://doi.org/10.1145/3123266.3123360
10. Makarov, I., Aliev, V., Gerasimova, O., Polyakov, P.: Depth map interpolation
using perceptual loss. In: 2017 IEEE International Symposium on Mixed and Aug-
mented Reality (ISMAR-Adjunct), pp. 93–94. IEEE (2017)
Fast Depth Reconstruction Using Deep Convolutional Neural Networks 467

11. Makarov, I., Korinevskaya, A., Aliev, V.: Fast semi-dense depth map estimation.
In: Proceedings of the 2018 ACM Workshop on Multimedia for Real Estate Tech,
pp. 18–21 (2018)
12. Makarov, I., Korinevskaya, A., Aliev, V.: Sparse depth map interpolation using
deep convolutional neural networks. In: 2018 41st International Conference on
Telecommunications and Signal Processing (TSP), pp. 1–5. IEEE (2018)
13. Makarov, I., Korinevskaya, A., Aliev, V.: Super-resolution of interpolated down-
sampled semi-dense depth map. In: Proceedings of the 23rd International ACM
Conference on 3D Web Technology, pp. 1–2 (2018)
14. Makarov, I., et al.: On reproducing semi-dense depth map reconstruction using
deep convolutional neural networks with perceptual loss. In: Proceedings of the
27th ACM International Conference on Multimedia, pp. 1080–1084 (2019)
15. Maslov, D., Makarov, I.: Online supervised attention-based recurrent depth esti-
mation from monocular video. Peer J. Comput. Sci. 6, e317 (2020)
16. Mur-Artal, R., Montiel, J.M.M., Tardos, J.D.: Orb-slam: a versatile and accurate
monocular slam system. IEEE Trans. Robot. 31(5), 1147–1163 (2015)
17. Russakovsky, O., et al.: Imagenet large scale visual recognition challenge. Int. J.
Comput. Vis. 115(3), 211–252 (2015)
18. Saxena, A., Chung, S.H., Ng, A.Y., et al.: Learning depth from single monocular
images. NIPS 18, 1–8 (2005)
19. Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support
inference from RGBD images. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y.,
Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7576, pp. 746–760. Springer, Heidelberg
(2012). https://doi.org/10.1007/978-3-642-33715-4 54
Meta-Learning and Other Automatic
Learning Approaches in Intelligent
Systems
A Study of the Correlation
of Metafeatures Used for Metalearning

Adriano Rivolli1(B) , Luı́s P. F. Garcia2 , Ana C. Lorena3 ,


and André C. P. L. F. de Carvalho4
1
Computing Department, Universidade Tecnológica Federal Do Paraná,
Av. Alberto Carazzai, 1640, Cornélio Procópio, Paraná 86300-000, Brazil
[email protected]
2
Department of Computer Science, University of Brası́lia, Campus Darcy Ribeiro,
Asa Norte, Brası́lia 70910-900, Brazil
[email protected]
3
Aeronautics Institute of Technology, Praça Marechal Eduardo Gomes, 50,
São José dos Campos, São Paulo 12228-900, Brazil
[email protected]
4
Institute of Mathematical and Computer Sciences, University of São Paulo,
Av. Trabalhador São-carlense, 400, São Carlos, São Paulo 13560-970, Brazil
[email protected]

Abstract. Metalearning has been largely used over the last years to rec-
ommend machine learning algorithms for new problems based on past
experience. For such, the first step is the creation of metabase, or meta-
dataset, containing metafeatures extracted from several datasets along
with the performance of a pool of candidate algorithm(s). The next step
is the induction of machine learning metamodels using the metabase
as input. These models can recommend the most suitable algorithms
for new datasets based on their metafeatures values. An effective met-
alearning system must employ metafeatures that characterize essential
aspects of the datasets while also distinguishing different problems and
solutions. The characterization process should also show a low compu-
tational cost, otherwise, the recommendation system can be replaced by
a standard trial-and-error approach. This paper proposes the use of an
unsupervised correlation-based feature selection strategy to identify a
reduced subset of metafeatures for metalearning systems. Empirically,
the predictive performance achieved by metalearning systems using the
subset of selected metafeatures is similar or better than the performance
obtained using the whole set of metafeatures. In addition, a noteworthy
reduction in the number of metafeatures needed is observed, implying
computational cost reductions.

Keywords: Metalearning · Metafeature · Characterization measures ·


Metafeature selection

c Springer Nature Switzerland AG 2021


I. Rojas et al. (Eds.): IWANN 2021, LNCS 12861, pp. 471–483, 2021.
https://doi.org/10.1007/978-3-030-85030-2_39
472 A. Rivolli et al.

1 Introduction
A typical application of machine learning (ML) algorithms to a dataset usu-
ally includes several other tasks, such as data exploration, data transformation,
cleansing, preprocessing and hyperparameter tuning, creating a pipeline of pro-
cesses, also known as end-to-end ML [22]. For each pipeline component, a tech-
nique considered the most suitable is selected among a set of several possible
techniques. The selection of the particular technique to be used in each step
usually occurs by trial-and-error, which is subjective and time-consuming. An
alternative for overcoming these limitations, supporting the automatic selection
of an ML algorithm and other components of the ML pipeline for a new dataset,
is to use metalearning (MtL) [3]. By combining data characteristics extracted
from datasets and the performance of ML algorithms when applied to these
datasets, a predictive metamodel can be induced, which can be used to recom-
mend a suitable algorithm for a new dataset.
The first step to use MtL is to build a metabase. The metabase is composed of
meta-examples, represented by descriptive characteristics, named metafeatures,
extracted from a dataset. In addition, each meta-example is labeled according
to the performance obtained by a set of ML algorithms when applied to the
dataset. The next step is the standard approach followed when applying an ML
algorithm to a dataset: the induction of ML models from the metabase. Distinct
metamodel can be induced by the same or different ML algorithms. It can be used
as part of a recommender system to predict the best performing algorithm, the
expected performance of one or more algorithms or a ranking of best performing
algorithms for a new dataset [3].
Although some studies are interested in showing which group of metafeatures
can better characterize the problems [2,6,10], the lack of an in-depth analysis of
the metafeatures is still an open issue in most MtL studies. Moreover, most of
the studies consider only supervised scenarios and reduced subsets of candidate
metafeatures [18]. More recent papers [16,20] try to systematize the extraction
and reproducibility of metafeatures, but do not discuss how redundant different
metafeatures are.
The main goal of this paper is to investigate the correlation between the
metafeatures values and how this impacts the predictive performance of MtL
systems. From the hypothesis that it is possible to reduce the subset of metafea-
tures by removing highly correlated metafeatures’ pairs, we propose and evaluate
the use of an unsupervised correlation-based feature selection filter to validate
this claim. In order to analyze the impact of the correlation in this scenario, we
compare the predictive accuracy of a metamodel induced using different sets of
metafeatures by varying a correlation threshold. The results validate the hypoth-
esis: while the predictive performance is maintained, the computational cost is
reduced with such a reduction in the number of metafeatures. This is crucial in
MtL given that the characterization process cannot demand more time than the
trial-and-error approach.
A Study of the Correlation of Metafeatures Used for Metalearning 473

2 Metalearning
The first proposal for using a predictive model to deal with the algorithm selec-
tion problem was [19]. Smith-Miles [23] summarized the problem in a framework
with four components: the problem or instances space (P ), the features space
(F ), the algorithms space (A) and the evaluation measures space (Y ). Given
a problem instance p characterized by a set of metafeatures f , one seeks to
find a mapping to the algorithm α with maximized performance on p, as mea-
sured by y(α(p)). This mapping can be found by a metamodel S(f (p)) acting as
a recommendation system of suitable algorithms for new problems. This algo-
rithm selection framework can be used to support tasks like time series analysis,
hyperparameter tuning, among others [13,21].
In MtL, the problem instances (P ) represent the datasets used to compose
the metabase. Increasing the number of datasets can reduce the bias and provide
a reliable metamodel. To reduce the bias in this choice, datasets from several
data repositories, like UCI [8] and OpenML [24], can be used.
Another component is the metafeatures set (F ) used to describe the proper-
ties of the datasets [20]. To construct an efficient recommendation system, the
metafeatures must describe the complexity of the datasets and provide evidence
about the future performance of the algorithms in A [18]. Additionally, they
need to have a low computational cost, respecting the trade-off between running
all algorithms and extracting the metafeatures values for a new dataset. The
metafeatures used in the MtL literature can be divided into six main groups [20]:

– Simple. Metafeatures that are easily extracted from data with low compu-
tational cost.
– Statistical. Metafeatures that capture statistical properties of the data,
mainly indicators of localization and distribution.
– Information-theoretic. Metafeatures based on information theory which
capture the amount of information a dataset has.
– Model-based. Metafeatures extracted from a model induced from the data.
They are often based on properties of decision tree (DT) models.
– Landmarking. Metafeatures that compute the performance of simple and
fast learning algorithms to characterize the datasets.
– Others. Mixture of metafeatures did not include in the previous groups. In
this work, we used clustering-based metafeatures.

The algorithm space A represents a set of candidate algorithms to be rec-


ommended in the algorithm selection process. These algorithms should also be
sufficiently different from each other and represent all regions in the algorithm
space [15]. The models induced by the algorithms can be evaluated by different
measures, composing the set Y . For classification tasks, most of the studies in the
MtL use accuracy. For regression problems, Mean Squared Error (MSE) or Root
MSE (RMSE) (or normalized versions of such measures) are usually employed.
The metabase is created by characterizing the datasets and assessing the
performance of the algorithms over them. Brazdil et al. [3] summarize the three
474 A. Rivolli et al.

main approaches frequently employed to label the meta-examples in MtL: (i) the
algorithm that presented the best performance; (ii) a ranking of the algorithms
according to their performance, where the algorithm with the best performance is
top-ranked; and (iii) the performance measure value obtained by each evaluated
algorithm.

3 Methodology
This section presents the details and procedure adopted to empirically investigate
the correlation of the metafeatures and how much we can reduce the set of
metafeatures without impairing the learning predictive performance of an MtL
system. The experiments are organized into two phases: base and meta levels.
In the former, we discuss the construction of the metabase. In the latter, we
present the method adopted to filter the metafeatures based on a grid search of
the correlated measures and the evaluation process.

3.1 Base Level


The base level experiments generate a metabase, which is populated with the
metafeature values and the targets for a collection of problem instances. In this
study the meta-examples are labeled with performances of known classification
techniques.
Four hundred datasets from the OpenML repository [24], representing diverse
application contexts and domains, were used in this experiment. They were
selected considering a maximum number of 10, 000 observations, 500 features
and 10 classes to constrain the computational costs of the process. For each
dataset, metafeatures from the groups simple, statistical, information-theoretic,
model-based, landmarking and clustering were extracted using the MFE tool [1].
The mean and standard deviation were used to summarize the multi-valued
metafeatures.
This process resulted in 130 metafeatures values extracted from 400 classi-
fication datasets. After removing the metafeatures with missing values due to
some error during the extraction procedure (as reported in Rivolli et al. [20]),
the metabase contains 115 metafeatures. The errors are related to the specific
characteristics of the datasets, which makes the extraction of some metafeatures
impracticable. None of the metainstances were discarded.
Next, the averages of 10-fold cross-validated predictive accuracy (ACC), Area
under the ROC Curve (AUC), and F1 predictive performances achieved by each
classification technique for each dataset were also calculated for labeling the
meta-examples. The classification techniques used were: C4.5 decision tree [17]
with pruning based on subtree raising; k-Nearest Neighbors (kNN) model [14]
with k = 3; Multilayer Perceptron (MLP) [11] with learning rate of 0.3, momen-
tum of 0.5 and a single hidden layer; Random Forest (RF) [4] with 500 trees;
and Support Vector Machine (SVM) [7] with radial basis kernel. These hyper-
parameter values were defined following the standard configurations of the imple-
mentations used in the R environment, without any specific tuning.
A Study of the Correlation of Metafeatures Used for Metalearning 475

Algorithm 1: Correlation filter algorithm


Input: Absolute correlation matrix M of all pairs of metafeatures
Correlation threshold τ
Result: List of selected metafeatures S
1 S←{}
2 L ← names(sort(rowMeans(M )))
3 while L = { } do
4 i ← L[1]
5 S ← S ∪ {i}
6 R ← names(M [i, ] ≥ τ )
7 L←L−R
8 end

Thus, we generated 15 metabases varying only the target, which is defined


by an algorithm (C4.5, kNN, MLP, RF and SVM) and a performance evaluation
measure (ACC, AUC and F1). From this point on, a specific metabase is referred
to by its target: a pair of algorithm and evaluation measure, e.g., RF.AUC refers
to the AUC performance of the RF algorithm.

3.2 Meta Level


The metalevel comprises experiments involving learning from the metabase.
Here, the procedures adopted aim to assess how the correlation of metafeatures
affects the learning process.
There are many and distinct unsupervised feature selection algorithms [9];
however, none of them comply with our demand for this study: a subset of fea-
tures that are not correlated between themselves given a threshold value of cor-
relation. Therefore, we developed a simple multivariate filter based on a greed
search approach to cut off the highest correlated metafeatures according to a
given threshold value τ . Algorithm 1 presents the steps used to filter the corre-
lated metafeatures.
The algorithm receives two arguments: a correlation matrix, in which each cell
contains the absolute correlation between two metafeatures (row and column);
and the threshold value used to remove correlated metafeatures. The set of candi-
date metafeatures is placed in L, which is initially composed by all metafeatures,
sorted according to their average correlation to the others. The metafeature with
the highest averaged correlation value to the others is selected in lines 4 and 5.
This is the metafeature expected to represent a larger set of other metafeatures
in L. Next, the metafeatures with an absolute correlation value greater than or
equal to τ with the previous metafeature chosen are removed from L in lines 6
and 7. The process is iterated until the set L becomes empty.
This filter ensures that any selected metafeature in the final set S has an abso-
lute correlation between each other lower than the threshold value τ . Although
the algorithm does not guarantee an optimum minimum subset, by always select-
ing the most correlated metafeature, the greed approach usually results in a small
476 A. Rivolli et al.

subset of measures compared with a random strategy. Furthermore, by assum-


ing this bias, the process becomes deterministic, such that a given input always
results in the same subset of selected metafeatures.
In the experiments, a range of correlation threshold values from 0.5 to 1
increasing by .02 are used and compared, resulting in 26 different values of τ .
The Kendall’s Tau Correlation [12] is applied to compute the correlation matrix,
which is the other hyperparameter for this algorithm.
For each value of τ , the set S of less correlated metafeatures are selected and
used to induce regression metamodels for the different metabases (described
in the previous section). Five regression techniques known to have different
biases were tested for building the metamodels: Classification And Regression
Trees (CART) [5] algorithm with pruning; Distance-weighted k-Nearest Neigh-
bor (DWNN) [14] with k = 3 and Gaussian kernel; Random Forest Regressor
(RFR) [4] with 500 trees; and Support Vector Regression (SVR) [7] using radial
basis kernel. As with the classifiers, the regressor hyper-parameters were also set
as the default values of their implementations in the R environment. Therefore,
they were used without any problem-specific tuning.
To train and evaluate the regression models, a paired 10-fold cross-validation
procedure between all metalearners and targets was adopted. The models
obtained were evaluated for quality, considering a comparison to a simple base-
line that returns the averaged performance computed from the target of the
training set. Given that the correlation of metafeatures can vary subtly accord-
ing to the selected instances, the feature selection was performed for each train-
ing set. As the correlation filter algorithm ignores the target, the same subset
of features is used for all algorithms and metabases, given a training data and a
correlation threshold value.
The final step is the analysis of the metalearners’ performance compared to
the metafeature reduction, in a cross-validation setup, through direct comparison
of single-threaded experiments.

4 Results
This section presents and discusses the results obtained with the empirical exper-
iments. We analyze the reduction in the number of metafeatures, the metalearn-
ers’ performance for different reductions and conclude with the study of the most
important metafeatures and their correlations.

4.1 Correlation Filter Reduction

To evaluate the metafeatures reduction, Algorithm 1 was applied for different


threshold values in the metabase. A lower threshold represents a higher reduc-
tion, while a higher threshold represents a lower reduction in the number of
metafeatures. Figure 1 indicates the amount of metafeature selected for different
threshold values.
A Study of the Correlation of Metafeatures Used for Metalearning 477

32
34
38
40 42
46
60% 48 49
52
56
58
Reduction (%)

64
68
40% 72
75
77 79
81 83 83
86
89
20% 94
96 96
102

0%
0.5 0.6 0.7 0.8 0.9 1.0
Threshold

Fig. 1. Number of selected metafeatures for different absolute correlation threshold


values (x-axis). The y-axis also shows the percentage of metafeatures discarded.

From the 115 distinct metafeatures considered in this study, only 32 of them
were selected when τ = 0.5, representing a reduction of more than 70% of the
metafeatures. On the other hand, when only the completely correlated metafea-
tures (τ = 1) are removed, around 10% of the metafeatures are removed, com-
prising 102 metafeatures used.
Two features are correlated for the feature selection algorithm when they
have an absolute correlation value greater than the threshold value. Therefore, as
the threshold value increases, the number of selected metafeatures also increases
since fewer correlated metafeatures. A τ = 0.7 reduces by half the number of
metafeatures, which indicates that half of the metafeatures correlate at least
70% to one other metafeatures. The impact of such reductions in the predictive
performances of the MtL algorithms is discussed next.

4.2 Predictive Performance

To assess the performance of the metalearners in the regression task, we used


the RMSE evaluation measure, which indicates how much spread out are the
predicted values in relation to the expected ones. When two regression models are
compared to each other, the lowest RMSE indicates the model that can predict
values more precisely. Figure 2 presents, for different setups and metalearners,
the averaged obtained RMSE performance.
Regardless of the correlation threshold value used, the metalearners obtained
better performance than the baseline in all considered scenarios. In other words,
even with large reductions of metafeatures (e.g., τ = 0.5, see Fig. 1), the induced
models outperformed the baseline. Surprisingly, the reduction did not impair the
results so harshly. On the opposite, at some point, the RMSE can be subtly better
for all evaluation metrics.
Overall, from a threshold τ ≥ 0.6 it is possible to note stability in the per-
formance obtained by the metalearners given a specific scenario. This value cor-
responds to a reduction of almost 60% of the metafeatures. A more conservative
478 A. Rivolli et al.

Meta−learner: BASELINE CART DWNN RFR SVR

C4.5.ACC kNN.ACC MLP.ACC RF.ACC SVM.ACC

0.20

0.10

0.05

C4.5.AUC kNN.AUC MLP.AUC RF.AUC SVM.AUC


RMSE in Log scale

0.20

0.10

0.05

C4.5.F1 kNN.F1 MLP.F1 RF.F1 SVM.F1

0.20

0.10

0.05

0.5 0.6 0.7 0.8 0.9 1.00.5 0.6 0.7 0.8 0.9 1.00.5 0.6 0.7 0.8 0.9 1.00.5 0.6 0.7 0.8 0.9 1.00.5 0.6 0.7 0.8 0.9 1.0
Threshold

Fig. 2. RMSE performance averaged (y-axis) over the 10 folds obtained for the different
metabases and metalearners. The x-axis shows the threshold values of correlation.

reduction, in which 50% of the metafeatures are removed (τ = 0.7), results in


an even lower or best difference to the use of all metafeatures.
To illustrate these differences, Table 1 presents the RMSE performance for
the different metalearners considering: 50% of reduction using the correlation fil-
ter, 50% of metafeatures randomly selected and using the whole set of metafea-
tures. Due to space restrictions, we arbitrarily selected to report three differ-
ent metabases, taking into account targets from distinct pairs of classifier and
evaluation measure. Different from the others, we repeated the random-selected
procedure 10 times, carrying out 10 times 10-fold cross-validation.
From these examples, only one case, the subset of metafeatures selected by
the correlation filter, obtained a worse performance than the random selection
procedure. On the other hand, the filtered subset produced even better pre-
dictive results than using the whole set. To detail this behavior, Fig. 3 shows
the performance difference between using a subset of metafeatures filtered by
distinct threshold values against the use of all metafeatures.
Different results are observed for each algorithm used to induce the meta-
models. For the CART, an absolute correlation of 0.8 was enough not to impair
the predictive result of the metamodel. The DWNN and RFR showed the lowest
observed differences. Similar to CART, at some point, RFR stabilizes and the
results with and without filtering become very similar. This behavior can be
A Study of the Correlation of Metafeatures Used for Metalearning 479

Table 1. Mean and standard deviation (in parenthesis) of RMSE for different met-
alearners considering 50% of reduction using Algorithm 1 and 50% of metafeatures
randomly selected. The bold markup indicates the best result for each metalearner.

Metabase Metalearner 50% of reduction 50% random All metafeatures


RF.ACC CART 0.0687 (0.0130) 0.0766 (0.0206) 0.0672 (0.0127)
DWNN 0.0641 (0.0136) 0.0730 (0.0190) 0.0689 (0.0176)
RFR 0.0420 (0.0084) 0.0527 (0.0139) 0.0423 (0.0094)
SVR 0.0634 (0.0211) 0.1659 (0.0116) 0.1662 (0.0121)
C4.5.AUC CART 0.0725 (0.0145) 0.0792 (0.0172) 0.0758 (0.0139)
DWNN 0.0671 (0.0106) 0.0699 (0.0144) 0.0681 (0.0164)
RFR 0.0510 (0.0104) 0.0558 (0.0106) 0.0507 (0.0105)
SVR 0.0542 (0.0105) 0.1304 (0.0112) 0.1307 (0.0117)
kNN.F1 CART 0.1159 (0.0376) 0.1154 (0.0299) 0.1094 (0.0292)
DWNN 0.1055 (0.0275) 0.1099 (0.0264) 0.1061 (0.0262)
RFR 0.0759 (0.0301) 0.0865 (0.0291) 0.0757 (0.0281)
SVR 0.0961 (0.0249) 0.1925 (0.0175) 0.1931 (0.0181)

related to the fact that these techniques already perform an embedded feature
selection during their inductive process. However, for DWNN, there is a range
(between 0.7 and 0.8) in which the performance is slightly better. This makes
sense considering the bias of the respective algorithm that is impacted by irrele-
vant or redundant features. Finally, the results show that it was essential for the
SVR algorithm to remove the redundant metafeatures regardless the correlation
value used to filter them. In this case, the improvement is large and consistent
when the different metabases are taken into account.
These results reinforce the premise that it is possible to extract fewer metafea-
tures in an MtL study without impairing the predictive results in algorithm rec-
ommendation. Given that many metafeatures are highly correlated with each
other, along with the supposition that some of them can be irrelevant to the
MtL problem, it is reasonable to assume that discarding the correlated metafea-
tures may increase the quality of the metadata. It is also logical to assume that
by avoiding the high correlated measures, the computational costs involving the
characterization of new datasets, necessary to use such metamodels in practice,
are reduced.

4.3 Importance of the Metafeatures

Finally, we compare the frequency in which the most important metafeatures


were selected for the different threshold values. For this analysis, the ranking
with the most important metafeature was created averaging the feature impor-
tance ranking obtained from the RFR metalearner for each metabase. The RFR
achieved the best RMSE for all of them in comparison to the other algorithms.
480 A. Rivolli et al.

CART DWNN RFR SVR


0.01 0.005
0.000 0.100
All − Filtered (RMSE)

0.00 0.000
0.075
−0.01 −0.004
−0.005
0.050
−0.02 −0.008
−0.010
0.025
−0.03
−0.015 −0.012 0.000
0.5 0.6 0.7 0.8 0.9 1.0 0.5 0.6 0.7 0.8 0.9 1.0 0.5 0.6 0.7 0.8 0.9 1.0 0.5 0.6 0.7 0.8 0.9 1.0
Threshold

Fig. 3. RMSE difference between using all metafeatures and a subset filtered by the
correlation threshold values. The area above the line indicates that the filtering pro-
duced better results whereas the area below the line indicates the opposite.

Figure 4 lists the 30 most important metafeatures and shows the frequency in
which they were selected over the folds for each threshold value. The first remark
is that with a few exceptions, since a metafeature is selected by a threshold value,

[Figure 4: heatmap showing, for each correlation threshold (0.5 to 1.0), the frequency (0%–100%) with which each of the 30 most important metafeatures was selected over the folds. Ranked by importance, the metafeatures are: landmarking.oneNN.mean (1), clustering.nsil (2), statistical.canCor.mean (3), model.based.nodesPerInst (4), landmarking.naiveBayes.mean (5), landmarking.eliteNN.mean (6), clustering.nch (7), clustering.nvdb (8), clustering.nray (9), clustering.nc_index (10), landmarking.bestNode.mean (11), landmarking.oneNN.sd (12), model.based.leavesCorrob.sd (13), infotheo.classEnt (14), infotheo.eqNumAttr (15), clustering.nnre (16), infotheo.classConc.sd (17), infotheo.mutInf.sd (18), general.freqClass.sd (19), infotheo.jointEnt.mean (20), clustering.npb (21), infotheo.mutInf.mean (22), general.nrClass (23), landmarking.naiveBayes.sd (24), infotheo.nsRatio (25), model.based.leavesPerClass.mean (26), general.freqClass.mean (27), landmarking.eliteNN.sd (28), clustering.ndis (29), clustering.ntau (30).]

Fig. 4. Frequencies that the correlation filter algorithm selected the most important
metafeatures for different threshold values.

it is also selected for the following threshold values. Given that the filter selected
50% of the metafeatures when τ = 0.7, 16 of the 30 most relevant metafeatures
were selected in this case.
If we consider an absolute correlation of 0.8, only 5 of them are not selected.
Analysing their correlation with the other metafeatures within this subset, the
following values were observed: clustering.nray (9) has a correlation of 0.8 with
clustering.nvdb (8); infotheo.classEnt (14) has a correlation of 1 with cluster-
ing.nnre (16); both, general.nrClass (23) and general.freqClass.mean (27) are
completely inversely correlated and have a correlation of 0.99 (or −0.99) with
model.based.leavesPerClass.mean (26); and, infotheo.nsRatio (25) has a corre-
lation of −0.79 with infotheo.mutInf.mean (22). From these results, τ = 0.8
proved to be a reasonable correlation value, even though more than 30% of the
metafeatures are discarded.
Considering that the correlation filter does not look at the target, it is notable
that most of the selected metafeatures either are in the previous subset or are
highly correlated with others that are in this set. This can also explain why the
performance is not impaired even with high reductions.
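For readers who want to reproduce this filtering step, the snippet below is a minimal sketch of an unsupervised correlation filter in Python/pandas. It is our own illustration rather than the authors' code: the correlation method (Spearman), the greedy keep-first strategy, and the file name metabase.csv are assumptions; the paper only specifies that pairs of metafeatures whose absolute correlation exceeds a threshold τ are removed without looking at the target.

```python
import pandas as pd

def correlation_filter(metabase: pd.DataFrame, threshold: float = 0.8) -> list:
    """Greedily drop metafeatures whose absolute pairwise correlation exceeds
    `threshold`, keeping the first metafeature of each highly correlated pair."""
    corr = metabase.corr(method="spearman").abs()
    keep = []
    for col in corr.columns:
        # keep `col` only if it is not highly correlated with an already kept metafeature
        if all(corr.loc[col, kept] < threshold for kept in keep):
            keep.append(col)
    return keep

# Usage (hypothetical file): rows are datasets, columns are metafeatures, targets excluded.
# metabase = pd.read_csv("metabase.csv")
# reduced = metabase[correlation_filter(metabase, threshold=0.8)]
```

Because the filter only inspects pairwise correlations among metafeatures, the same reduced subset can be reused for any metatarget (accuracy, AUC or F1 metabases).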

5 Conclusion

This paper investigated the metafeatures correlation issue in MtL systems and
how it impacts the predictive performance of the metamodels. The results cor-
roborate the claim that we can reduce the number of metafeatures by removing
highly correlated pairs while maintaining the predictive accuracy in algorithm
recommendation.
The main remarks and insights from this study comprise:

1. There are many redundant metafeatures in the literature. When these rela-
tions are properly identified and highly correlated metafeatures are removed,
the computational costs to characterize a dataset are reduced; it also mini-
mizes the curse of dimensionality frequently observed in MtL problems.
2. The presence of highly correlated metafeatures impairs the predictive per-
formance of models induced by some algorithms. In this study, DWNN and
SVR metamodels were improved the most when redundant metafeatures were
removed.
3. An absolute correlation between 0.7 and 0.8 proved to be the best trade-off
between reducing the number of metafeatures and maintaining the predic-
tive performance of the metamodels.

Future work includes comparing other feature selection algorithms with and
without the presence of correlated metafeatures and finding a subset of non-
correlated metafeatures that properly characterize datasets at lower computa-
tional costs than the whole set. The investigation of the unsupervised feature
selection correlation filter, proposed in this paper, in other domains also repre-
sents interesting further research.

Acknowledgements. The authors would also like to thank the São Paulo Research
Foundation (FAPESP), grant 2013/07375-0 (CEPID CeMEAI).

References
1. Alcobaca, E., Siqueira, F., Rivolli, A., Garcia, L.P.F., Oliva, J.T., de Carvalho,
A.C.P.L.F.: Mfe: towards reproducible meta-feature extraction. J. Mach. Learn.
Res. 21(111), 1–5 (2020)
2. Bensusan, H., Kalousis, A.: Estimating the predictive accuracy of a classifier. In:
12th European Conference on Machine Learning (ECML), vol. 2167, pp. 25–36
(2001)
3. Brazdil, P., Giraud-Carrier, C., Soares, C., Vilalta, R.: Metalearning - Applications
to Data Mining, 1st edn. Cognitive Technologies, Springer (2009)
4. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
5. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regres-
sion Trees. Wadsworth and Brooks (1984)
6. Köpf, C., Iglezakis, I.: Combination of task description strategies and case base prop-
erties for meta-learning. In: Workshop on Integrating Aspects of Data Mining,
Decision Support and Meta-Learning (IDDM), pp. 65–76 (2002)
7. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and
Other Kernel-based Learning Methods. Cambridge University Press, Cambridge
(2000)
8. Dua, D., Graff, C.: UCI machine learning repository (2017). http://archive.ics.uci.edu/ml
9. Fernández, S.S., Ochoa, J.A.C., Martı́nez-Trinidad, J.F.: A review of unsupervised
feature selection methods. Artif. Intell. Rev. 53(2), 907–948 (2020)
10. Filchenkov, A., Pendryak, A.: Datasets meta-feature description for recommending
feature selection algorithm. Artif. Intell. Nat. Lang. Inform. Extract. Soc. Media
Web Search 7, 11–18 (2015)
11. Haykin, S.: Neural Networks - A Comprehensive Foundation. Prentice Hall, Hobo-
ken (1999)
12. Kendall, M.G.: A new measure of rank correlation. Biometrika 30, 81–93 (1938)
13. Mantovani, R., Rossi, A., Alcobaça, E., Vanschoren, J., de Carvalho, A.: A meta-
learning recommender system for hyperparameter tuning: predicting when tuning
improves svm classifiers. Inform. Sci. 501, 193–221 (2019)
14. Mitchell, T.M.: Machine Learning. McGraw Hill series in computer science,
McGraw Hill, New York (1997)
15. Muñoz, M.A., Villanova, L., Baatar, D., Smith-Miles, K.: Instance spaces for
machine learning classification. Mach. Learn. 107(1), 109–147 (2018)
16. Pinto, F., Soares, C., Mendes-Moreira, J.: Towards automatic generation of
metafeatures. In: Pacific-Asia Conference on Knowledge Discovery and Data Min-
ing (PAKDD), pp. 215–226 (2016)
17. Quinlan, J.R.: Induction of decision trees. Mach. Learn. 1(1), 81–106 (1986)
18. Reif, M., Shafait, F., Goldstein, M., Breuel, T., Dengel, A.: Automatic classifier
selection for non-experts. Pattern Anal. Appl. 17(1), 83–96 (2014)
19. Rice, J.: The algorithm selection problem. Adv. Comp. 15, 65–118 (1976)
20. Rivolli, A., Garcia, L., Soares, C., Vanschoren, J., de Carvalho, A.: Towards repro-
ducible empirical research in meta-learning. arXiv 1(1808.10406), 1–41 (2019)
21. Sá, J.D., Rossi, A., Batista, G., Garcia, L.P.: Algorithm recommendation for data
streams. In: 25th International Conference on Pattern Recognition, pp. 1–6 (2021)

22. Schelter, S., Whang, S., Stoyanovich, J. (eds.): Proceedings of the Fourth Workshop
on Data Management for End-To-End Machine Learning, In Conjunction with the
2020 ACM SIGMOD/PODS Conference (2020)
23. Smith-Miles, K.A.: Cross-disciplinary perspectives on meta-learning for algorithm
selection. ACM Comput. Surv. 41(1), 1–25 (2008)
24. Vanschoren, J., van Rijn, J.N., Bischl, B., Torgo, L.: OpenML: networked science
in machine learning. SIGKDD Explor. 15(2), 49–60 (2013)
Learning Without Forgetting for 3D
Point Cloud Objects

Townim Chowdhury1 , Mahira Jalisha1 , Ali Cheraghian2,3 ,


and Shafin Rahman1(B)
1 North South University, Dhaka, Bangladesh
{townim.faisal,mahira.jalisha,shafin.rahman}@northsouth.edu
2 Australian National University, Canberra, Australia
[email protected]
3 Data61-CSIRO, Canberra, Australia

Abstract. When we fine-tune a well-trained deep learning model for a


new set of classes, the network learns new concepts but gradually forgets
the knowledge of old training. In some real-life applications, we may be
interested in learning new classes without forgetting the capability of
previous experience. Such a learning without forgetting problem is often
investigated using 2D image recognition tasks. In this paper, considering
the growth of depth camera technology, we address the same problem for
the 3D point cloud object data. This problem becomes more challenging
in the 3D domain than in 2D because of the unavailability of large datasets
and powerful pretrained backbone models. We investigate knowledge dis-
tillation techniques on 3D data to reduce catastrophic forgetting of the
previous training. Moreover, we improve the distillation process by using
semantic word vectors of object classes. We observe that exploring the
interrelation of old and new knowledge during training helps to learn new
concepts without forgetting old ones. Experimenting on three 3D point
cloud recognition backbones (PointNet, DGCNN, and PointConv) and
synthetic (ModelNet40, ModelNet10) and real scanned (ScanObjectNN)
datasets, we establish new baseline results on learning without forgetting
for 3D data. We expect this research to instigate future work in this area.

Keywords: 3D point cloud · Knowledge distillation · Word vector

1 Introduction
The advent of deep learning models has brought impressive performance to image
recognition tasks [19,30,37]. In a real-life application, a trained system that can
classify a given object instance within a fixed number of classes may need to
readjust itself to classify a new set of classes in addition to old classes without
retraining from scratch. For example, a self-driving car already recognizes street
objects (vehicles, traffic lights, etc.). Now, the car manufacturer wants to increase
the car’s capability in recognizing roadside objects (buildings, trees, etc.) by
retraining only on instances of new classes of interest. The main issue of the

[Figure 1: panel (a) without using semantics, panel (b) with using semantic information; the legend marks the semantic representation, old class feature vectors, and new class feature vectors.]

Fig. 1. Effect of semantic representation while learning without forgetting. (a) Without class semantics, the network tries to form clusters of old and new class instances in feature space. The clusters sometimes overlap with each other because of the lack of class semantics. (b) With class semantics, old and new class features cluster around their corresponding class semantics. This keeps the clusters well separated from each other, which helps to achieve better performance.

retraining is the catastrophic forgetting of old class knowledge. Since this setup
does not allow old class instances, the model learns new classes but forgets
old ones. Researchers proposed Learning without Forgetting (LwF) methods
[9,12,23,26,41] to address this problem. Traditionally, this problem has been
investigated using 2D image data. This paper explores LwF on 3D point cloud
object data.
Modern 3D camera technology makes capturing 3D point cloud data
more accessible than ever [7]. Now, it is time to equip 3D point cloud recogni-
tion models with LwF capabilities. We identify some key difficulties to address
this problem. Firstly, in comparison to image datasets like ImageNet, very large-
scale 3D point cloud datasets are not available. 3D datasets usually contain a
handful of classes and instances [32,38]. Secondly, a typical pre-trained
model for a 3D recognition system is not as robust as 2D models because of not
being trained on a large dataset [5]. Thirdly, 3D point cloud data (especially real
scanned objects) contains more noise than 2D image data [32]. This paper inves-
tigates how far a 3D point cloud recognition model can obtain LwF capabilities
considering all difficulties mentioned above.
We first train a 3D point cloud model with instances belonging to a set
of pre-defined old classes. Then, we update the trained model using a popular
knowledge distillation technique [8] to address the forgetting problem. Because
of the difficulties of 3D data, this approach exhibits a large amount of forgetting
of old classes. To minimize forgetting, we employ semantic word vectors of classes
inside the network pipeline [4,24,42]. During both new and old task training,
the network tries to align point cloud features to their corresponding semantics.
The class semantics encodes similarities and dissimilarities of different objects
from the natural language domain. The network learns to project new instance
features around the previously obtained and fixed semantic vectors while learning

new classes. By performing feature-semantic alignment in both old and new


tasks, the network forgets less than the traditional method that does not use semantic
embeddings. For example, during the old model training, the model learns to classify
‘bed’ via its semantic (like is Furniture, is Indoor) representation. Later, during
the new model training, the model could not see ‘bed’, but it observes similar
classes (like sofa, chair, and table, with shared ‘bed’ semantics), which helps it not to forget
the ‘bed’ knowledge. Experimenting on ModelNet40 [38], ScanObjectNN [32],
MIT Scenes [22], and CUB [33] datasets, we show that our proposed method
outperforms traditional knowledge distillation methods in both 3D and 2D data
cases. The contributions of this paper are summarized below:

– To the best of our knowledge, we are the first to experiment with learning without
forgetting on 3D point cloud object data.
– Our method applies knowledge distillation to restore previously gained expe-
rience of the old model and minimize catastrophic forgetting while learning a
set of new classes. In addition, we investigate the advantage of semantic word
vectors in the network distillation process.
– We experiment on both 3D synthetic (ModelNet10, ModelNet40 [38]) and
real scanned (ScanObjectNN [32]) point cloud objects and 2D image datasets
(MIT Scenes [22], CUB [33]), establishing the robustness of the method.

2 Related Works
3D Point Cloud Architecture: There are two streams of work for 3D point
cloud classification: feature-based and end-to-end approaches. Feature-based
methods mostly use Multi-view representation and Volumetric CNNs. Multi-
view representation methods [20,31,40] convert 3D point clouds into 2D images,
which are then classified using 2D convolutional networks. Volumetric CNNs
[14,34] project point cloud objects on a volumetric grid or a set of octrees. Then,
they apply a computationally expensive 3D convolutional neural network. The
main drawback of feature-based methods is that they do not work directly on
the raw point cloud. End-to-end approaches like PointNet [19], PointNet++[21]
use raw point cloud data as input to multi-layer perceptron networks followed by
maxpooling layers. Several other works [10,11,37] apply improved convolution
operation on point cloud objects. Moreover, [29,35] use Graph neural networks
to extract features from 3D point clouds. In this paper, we build our model on
several end-to-end architectures.

Learning Without Forgetting: Many methods have been proposed to solve


the catastrophic forgetting problem [2,6,15]. Exemplar-free methods [1,12,41]
do not require any samples from the base/old task. Li et al. [12] proposed to use
Hinton’s [8] knowledge distillation loss to preserve old task’s knowledge in 2D
images, but the domain shift between tasks makes this method weak. Rehearsal
methods [3,9,26] keep a small number of exemplars from the old task. Rebuffi
et al. [26] first introduced a replay-based method with bounded memory, but it fails

to represent the main distribution of the old task when there is a lot of variation.
The Pseudo-rehearsal process used in [17,28,36] learns to produce examples
from the old task. Some methods [1,13] minimizes additional parameters to
solve the problem of model expansion. All of the approaches mentioned above
have proposed solutions to the catastrophic forgetting of 2D image data. Our
method is the first to use knowledge distillation to address LwF of 3D data.

Word Embedding for Catastrophic Forgetting: The use of semantic rep-


resentation to prevent catastrophic forgetting is relatively new [4,24,42]. Such
approaches explore the semantic relation between old and new classes to reduce
the forgetting of old classes while training new classes. Zhu et al. [42] suggested
using semantic representation to train the object detection model by projecting
the feature vector into the semantic space. Similarly, Rahman et al. [24] pro-
posed to use semantic representations of class labels as anchors in the prediction
space for not forgetting the acquired knowledge of old classes. Cheraghian et al.
[4] proposed a knowledge distillation strategy by using semantic representation
as auxiliary knowledge. Even though semantic representation has yielded
promising results, the experiments are limited to 2D data. In this paper, we use
word vectors in knowledge distillation for 3D point cloud object classification.

3 Method

Problem Formulation: Assume we have a set of old classes, Y^o, and a set of new classes, Y^n, where Y^o ∩ Y^n = ∅, |Y^o| = O and |Y^n| = N. The 3D point cloud recognition model initially observes the Y^o classes and is trained to classify only old classes. Later, the Y^n classes are added to the model to update the previous training. Suppose a 3D point cloud input set, X = {x_i}_{i=1}^{n} with x_i ∈ R^3, can get a label from either the old Y^o or the new Y^n classes. Additionally, there is a set of d-dimensional semantic class embeddings for each of the old and new classes, denoted as E^o ∈ R^{d×O} and E^n ∈ R^{d×N}, respectively. We define the old set as D^o = {X_i^o, y_i^o, e_i^o}_{i=1}^{N_o}, where X_i^o is the i-th point cloud object belonging to the old set, with class label y_i^o and class embedding e_i^o, and N_o is the number of old class instances. Similarly, there is a set for the new classes, D^n = {X_i^n, y_i^n, e_i^n}_{i=1}^{N_n}, where X_i^n is the i-th point cloud object with class label y_i^n and class embedding e_i^n, and N_n is the number of new class instances. We build a 3D point cloud object recognition model (termed the old model) using the D^o set. Then, we aim to update the same model (termed the new model) using only the newly available D^n data so that it can predict a class label for a test sample belonging to either the old or the new set, i.e., X ∈ D^o ∪ D^n. We assume the model has prior knowledge during inference of whether the test sample belongs to the old or the new classes.

Main Challenges: While updating the model with the new data, D^n, the model
gradually forgets the old training (done on D^o) because of the restriction of not
using old class instances. Previous works address this problem with 2D image

Fig. 2. Our proposed architecture. We train the old model using the cross-entropy loss L_CE. Then, we build a new model by modifying a copy of the trained old model. Both the cross-entropy (L_CE) and knowledge distillation (L_KD) losses are used to train the new model. The new model can classify both old and new classes.

data. In this paper, we address the same problem on 3D point cloud objects. Due
to the unavailability of large-scale datasets and pre-trained models, the problem
becomes more complex in the 3D domain than in the 2D domain.

3.1 Model Overview

Our proposed method is shown in Fig. 2, which includes the old and new models. The new model is the updated variant of the old model that accommodates the new classes. Both the old and new models are presented together because, during the training of the new model, we use the output of the old model. For both models, the point cloud input X is fed into the backbone M, which can be any point cloud architecture (PointNet, DGCNN, PointConv, etc.), to extract a feature vector, i.e., g ∈ R^m. Additionally, a semantic representation unit is employed to generate a class embedding, i.e., e ∈ R^d, given the class label. While training the old model using old classes, the feature vector g and the class embedding e^o are mapped into a common k-dimensional space using the projection functions F^o and H^o, respectively. The new representations of the point cloud feature and the class embedding are f^o ∈ R^k and h^o ∈ R^k, respectively. Finally, a dot product between f^o and h^o forms the output ŷ^o for the old classes. A cross-entropy loss, L_CE, is adopted to train the model for the old classes. While updating the same model with the new classes, we add a parallel pipeline from the output of the backbone M. Two projection modules, F^n and H^n, are added to map the features of the new classes, M(X^n), and the class embedding e^n into the common k-dimensional space. The new representations of the feature and the class semantics are f^n and h^n, respectively. At the end, f^n is dot-multiplied with h^n to generate the output ŷ^n for the new classes. In order to prevent forgetting of the old classes, a knowledge distillation loss function, L_KD [8], is employed between the outputs of the old and new models.
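The PyTorch sketch below is our own illustration of this pipeline, not the authors' released implementation; the backbone is abstracted as any module returning an m-dimensional feature, and the names and sizes (ProjectionHead, LwF3DModel, feat_dim, sem_dim, k) are assumptions.

```python
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Projects a vector (point cloud feature g or class embedding e) into the common k-dim space."""
    def __init__(self, in_dim, k=256):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(in_dim, k), nn.ReLU())

    def forward(self, x):
        return self.fc(x)

class LwF3DModel(nn.Module):
    def __init__(self, backbone, feat_dim=1024, sem_dim=300, k=256):
        super().__init__()
        self.backbone = backbone                 # e.g., a PointNet/DGCNN/PointConv feature extractor
        self.F_o = ProjectionHead(feat_dim, k)   # old-task feature projection F^o
        self.H_o = ProjectionHead(sem_dim, k)    # old-task semantic projection H^o
        self.F_n = ProjectionHead(feat_dim, k)   # new-task feature projection F^n
        self.H_n = ProjectionHead(sem_dim, k)    # new-task semantic projection H^n

    def forward(self, points, E_old, E_new=None):
        # points: (B, N, 3) point clouds; E_old: (O, sem_dim); E_new: (N_cls, sem_dim)
        g = self.backbone(points)                     # (B, feat_dim) backbone feature g
        y_old = self.F_o(g) @ self.H_o(E_old).t()     # (B, O) old-class scores: f^o · h^o^T
        if E_new is None:
            return y_old
        y_new = self.F_n(g) @ self.H_n(E_new).t()     # (B, N_cls) new-class scores: f^n · h^n^T
        return y_old, y_new
```

During the second training stage, a frozen copy of the model trained with only the old head would supply the distillation targets ŷ^{o*} described below.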

3.2 Training and Inference

We train the proposed architecture with two stages: old and new model training.
Unlike traditional approaches for learning without forgetting (LwF) [12], both
stages use semantic word vectors of classes to remember past knowledge.

Training Old Model: In the first stage of training, we learn the old model using the training data D^o = {X_i^o, y_i^o, e_i^o}_{i=1}^{N_o}, employing a cross-entropy loss. Unlike 2D image cases, we perform this training from scratch because no pretrained model is available to initialize the weights of the backbone M. The output of the old model for the i-th 3D point cloud instance is

ŷ_i^{o*} = F^o(g_i; θ_1) · H^o(e_i; θ_2)^T    (1)

where θ_1 and θ_2 are the learnable weights associated with the F^o and H^o layers, respectively. After finishing the training, the old model can classify the old set of classes, Y^o. This old model remains frozen during the second-stage training.

Training New Model: We build a new model during the second training stage by updating a copy of the old model trained in the previous stage. This new model gives predictions for both old and new classes. However, we are not allowed to use any old class instances while training the new model. We add the F^n and H^n layers alongside the F^o and H^o layers. Only the training data of the new classes, D^n = {X_i^n, y_i^n, e_i^n}_{i=1}^{N_n}, is used to train the new model. Similar to Eq. 1, both pipelines of the new model produce outputs for the old (ŷ_i^o) and new (ŷ_i^n) classes:

ŷ_i^o = F^o(g_i; θ_1) · H^o(e; θ_2)^T,    ŷ_i^n = F^n(g_i; θ_3) · H^n(e; θ_4)^T    (2)

where θ_3 and θ_4 are the weights associated with the F^n and H^n layers, respectively. Among all trainable weights of the new model, θ_1 and θ_2 are initialized from the old model, while θ_3 and θ_4 are trained from scratch. While forwarding an input 3D point cloud object x_i ∈ X, the old model outputs ŷ_i^{o*} for the old classes, and the new model outputs ŷ_i^o and ŷ_i^n for the old and new classes, respectively.

We calculate a traditional cross-entropy loss L_CE between ŷ_i^n and the ground truth y_i^n. This loss is used to learn the new classes. Additionally, using the old class outputs from the old (ŷ_i^{o*}) and new (ŷ_i^o) models, we calculate a knowledge distillation loss L_KD [8]. This loss is employed to prevent the forgetting of the old classes.

Table 1. Statistics of training and testing instances used in different experiments.

Dataset Settings Task # Classes # Train # Test


3D ModelNet40 → ModelNet10 old 30 5852 1560
new 10 3991 908
ModelNet40 → ScanObjectNN old 26 4999 1496
new 11 1496 475
2D Scenes → CUB old 67 5360 1340
new 200 5994 5794
CUB old 150 4495 4326
new 50 1499 1468

Unlike the traditional L_KD, we consider class semantics in the pipeline, which further helps the LwF process. The entire loss L used to train this model is

L = L_CE + λ L_KD    (3)

where the hyperparameter λ controls the contribution of L_KD. To calculate L_CE, we use the negative log-likelihood loss common in 3D backbones. To calculate L_KD, we record the output ŷ^{o*} from the old model for the 3D point clouds X^n of the new class dataset. The equations for L_CE and L_KD are:

L_CE = −(1/N) Σ_{i=1}^{N} y_i^n log(ŷ_i^n),    L_KD = −(1/N) Σ_{i=1}^{N} ω(ŷ_i^{o*}; τ) log(ω(ŷ_i^o; τ))    (4)

where τ is the temperature and ω(y_i; τ) = exp(y_i/τ) / Σ_j exp(y_j/τ) is the softmax function.
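As an illustration only (not the authors' released code), the losses in Eqs. 3–4 could be written in PyTorch roughly as follows; batch-mean reduction and the use of log-softmax for numerical stability are our assumptions.

```python
import torch.nn.functional as F

def lwf_loss(y_new_logits, y_new_target, y_old_logits_new_model,
             y_old_logits_old_model, lam=3.0, tau=3.0):
    """Total loss L = L_CE + lambda * L_KD (Eqs. 3-4).

    y_new_logits:           (B, N_new) new-class scores from the new model
    y_new_target:           (B,) integer labels of the new classes
    y_old_logits_new_model: (B, N_old) old-class scores from the new model
    y_old_logits_old_model: (B, N_old) old-class scores from the frozen old model
    """
    # Cross-entropy on the new classes (left term of Eq. 4)
    l_ce = F.cross_entropy(y_new_logits, y_new_target)

    # Knowledge distillation on the old classes with temperature tau (right term of Eq. 4)
    teacher = F.softmax(y_old_logits_old_model.detach() / tau, dim=1)
    student_log = F.log_softmax(y_old_logits_new_model / tau, dim=1)
    l_kd = -(teacher * student_log).sum(dim=1).mean()

    return l_ce + lam * l_kd
```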

Inference: For any test instance, a forward pass through the new model calculates the old (ŷ_i^o) and new (ŷ_i^n) class scores. We classify old and new classes by selecting the maximum score from ŷ_i^o and ŷ_i^n, respectively.

4 Experiment
Dataset: We evaluate our method on 3D datasets, ModelNet10, ModelNet40
[38], ScanObjectNN [32] and two 2D datasets, MIT Scenes [22], CUB [33]. For
the 3D experiment, we use two different settings related to synthetic and real
scanned point cloud data. The synthetic experiment, ModelNet40 → Model-
Net10 setting uses 30 classes of ModelNet40 as old and the 10 non-overlapping classes
of ModelNet10 as new classes. The real scanned object experiment, ModelNet40
→ ScanObjectNN, uses 26 classes of ModelNet40 as old and 11 classes of ScanOb-
jectNN as new classes. Both of these setups were previously introduced in [5].
For the 2D experiment, Scenes → CUB considers 67 classes of MIT Scenes as
old and 200 classes of CUB as new. In another setup, 150 and 50 classes of
CUB dataset are used as old and new, respectively. These setups are proposed
in [12,39]. The statistics of training and test instances are summarized in Table 1.

Table 2. 3D data experiment using PointNet. ↑ (↓) means higher (lower) is better.

ModelNet40 → ModelNet10 ModelNet40 → ScanObjectNN


Method Acc∗o ↑ Acco ↑ Accn ↑ Δ ↓ Method Acc∗o ↑ Acco ↑ Accn ↑ Δ ↓
Baseline-1 89.2 41.5 90.2 53.5 Baseline-1 89.8 51.0 76.9 43.3
LwF [12] 89.2 83.6 89.3 6.2 LwF [12] 89.8 81.3 73.7 9.5
Baseline-2 89.4 − 22.8 − Baseline-2 89.9 − 21.5 −
Ours 89.4 84.4 90.4 5.5 Ours 89.9 86.2 74.6 4.1

Table 3. Ablation study on varying 3D point cloud backbone.

Backbone ModelNet40 → ModelNet10 ModelNet40 → ScanObjectNN


Acc∗o ↑ Acco ↑ Accn ↑ Δ ↓ Acc∗o ↑ Acco ↑ Accn ↑ Δ ↓
PointNet [19] 89.4 84.4 90.4 5.5 89.9 86.2 74.6 4.1
PointConv [37] 90.5 86.2 87.8 4.8 90.2 73.4 66.6 18.6
DGCNN [35] 91.5 87.1 93.4 4.9 91.6 71.8 75.0 21.6

Semantic Embedding: For semantic representation of classes, we use 300


dimensional word2vec (w2v) [16] and GloVe (glo) [18] word vectors. The word
vector models are usually trained on unannotated text corpora. Unless explicitly
mentioned, all results in this paper use word2vec vectors.
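A minimal sketch of how such 300-dimensional class embeddings could be gathered with gensim is shown below; it is our own illustration, the file path is hypothetical, the averaging of multi-word class names is an assumption, and the paper does not state which tooling was actually used.

```python
import numpy as np
from gensim.models import KeyedVectors

# Hypothetical path to a pretrained word2vec binary (e.g., the GoogleNews vectors).
w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def class_embedding(class_name: str) -> np.ndarray:
    """Average the word vectors of a (possibly multi-word) class name, e.g. 'night stand'."""
    words = [w for w in class_name.lower().split() if w in w2v]
    return np.mean([w2v[w] for w in words], axis=0)  # 300-d semantic vector

E_old = np.stack([class_embedding(c) for c in ["bed", "chair", "sofa", "table"]])
```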

Evaluation Protocol: We evaluate our method using top-1 accuracy. We calculate the old model's accuracy as Acc_o^*. Similarly, we calculate Acc_o and Acc_n to represent the performance on the old and new classes, respectively, using the final model. To measure the extent of forgetting, we calculate Δ = ((Acc_o^* − Acc_o) / Acc_o^*) × 100%. A lower Δ indicates less forgetting by the new model.
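For clarity, the forgetting measure can be computed with a small helper such as the following (our own illustration):

```python
def forgetting_delta(acc_old_star: float, acc_old: float) -> float:
    """Delta = (Acc_o* - Acc_o) / Acc_o* * 100; a lower value means less forgetting."""
    return (acc_old_star - acc_old) / acc_old_star * 100.0

# e.g., forgetting_delta(89.4, 84.4) is roughly 5.6
```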

Validation Strategy: We further randomly divide the set of old classes into
val-old and val-new classes for the validation experiment. In the ModelNet40 →
ModelNet10 and ModelNet40 → ScanObjectNN experiments, we choose 24 and
20 classes from old classes, respectively as val-old and the rest of the classes as
val-new to find values for hyperparameters. We choose λ = 3 and τ = 3 for our
3D experiments by performing a grid search within the range (0, 3].
Implementation Details1 : We use PointNet [19], PointConv [37], DGCNN [35]
as 3D point cloud backbone and VGG16 [30] (pretrained on Imagenet [27]) as 2D
image backbone to obtain feature vector. For feature vector projection layers, we
use two fully connected layers (512, 256) and (1024, 512) with ReLU activations
in 3D and 2D experiments, respectively. For 3D and 2D experiments, we use one
fully connected layer of size 256 and 512 with ReLU in the projection layer of
semantic representation. In all experiments, we use the Adam optimizer with a
1 Codes are available at: https://github.com/townim-faisal/lwf-3D.

Table 4. Experiment with (a) varying semantic representation and (b) 2D images.

(a) Using PointNet backbone (b) Using VGG16 backbone


Settings Word Acc∗o ↑ Acco ↑ Accn ↑ Δ ↓ Settings Method Acc∗o ↑ Acco ↑ Accn ↑ Δ ↓
ModelNet40 glo 88.2 78.8 90.6 10.6 Scenes LwF [12] 71.0 69.9 52.3 1.7
→ModelNet10 w2v 89.4 84.4 90.4 5.5 → CUB Ours 70.7 69.9 53.0 1.1
ModelNet40 glo 89.7 85.2 70.9 5.0 CUB (150) LwF [12] 58.2 57.1 66.2 1.8
→ScanObjectNN w2v 89.9 86.2 74.6 4.1 → CUB (50) Ours 60.0 59.0 69.4 1.7

Fig. 3. Hyperparameter sensitivity on ModelNet40 → ModelNet10 settings. Varying


(left) λ of Eq. 3 and (right) τ of LKD loss in Eq. 4.

learning rate of 0.0001 and batch sizes of 32 during training. We implement our
work using the PyTorch framework.

Compared Methods: In this paper, we compare the results of the following


methods. (a) Baseline-1: A backbone model is trained using the instances of
old classes. Then, the trained backbone is further fine-tuned using new class
instances only. (b) LwF [12]: The backbone training is the same as in Baseline-1. Then,
the fine-tuning on new class samples uses a knowledge distillation loss [8] not to
forget the old class knowledge. (c) Baseline-2: This method is an intermediate
stage of our proposed approach. We first train the old model of Fig. 2 using
semantic word vector information inside the architecture. But, it does not have
any fine-tuning stage. This performance can be considered zero-shot learning
[5,25] results because it treats new classes as unseen. This method can classify
new (unseen) classes without having trained on new instances. (d) Ours: This is
our final recommendation as described in Sect. 3.1 and 3.2. On top of Baseline-2
training, it contains fine-tuning on new class instances.

4.1 3D Point Cloud Experiments


Overall Results: Table 2 shows the overall results using two settings, Mod-
elNet40 → ModelNet10 and ModelNet40 → ScanObjectNN. Our observations
are as follows. (1) Baseline-1 gets the worst results on the forgetting issue, showing high Δ values, because the fine-tuning of the new model does not consider the old classes. The high Acc_n and low Acc_o values indicate that this method learns the new classes but forgets the old classes significantly. (2) LwF [12] obtains better results on the forgetting issue (lower Δ values) than Baseline-1 because this method applies a knowledge distillation loss so as not to forget the old classes.

Fig. 4. tSNE visualization of features and semantics for (a) 2D image and (b) 3D point
cloud objects. Ten old and four new classes are shown for better visualization. 2D image
features are clustered better than 3D point cloud features.

(3) Baseline-2 shows the performance after old class training using our method. Without receiving training on the new classes, this model can still classify new classes by treating them as unseen classes. Although no forgetting occurs in this case, there is no balance between old and new class performance. (4) Our result strikes a good balance between old and new accuracy while maintaining minimal forgetting. (5) Although both settings achieve similar results (Acc_o) on the old classes across methods, ModelNet40 → ScanObjectNN gets lower accuracy on the new classes (Acc_n) than ModelNet40 → ModelNet10. The reason is that the ScanObjectNN (new) classes are real-scanned 3D objects with higher noise than synthetic data.

Ablation Studies: In Table 3, we perform an ablation study varying the 3D point cloud backbone. Among all backbones, PointNet performs consistently well in both 3D experiment settings. PointConv and DGCNN have some success on the forgetting issue with the synthetic data of ModelNet10 but fail to generalize to the real scanned ScanObjectNN classes. The global features extracted by
PointNet may be more helpful than local features from PointConv and DGCNN
backbones. Table 4(a) also compares two different word vector models (word2vec
and GloVe) as semantics. In most cases, word2vec achieves better accuracy and
less forgetting in comparison to GloVe.

Hyperparameter Sensitivity: We experiment with varying λ and τ in Fig. 3.


By fixing one hyperparameter and adjusting another, we observe hyperparameter
sensitivity within the range λ, τ ∈ (0, 3]. We notice that increasing λ and τ from 0
to 3 improves the old (Acc_o) and new (Acc_n) class performance. Above λ = τ = 3, the Acc_o results do not change much, but the Acc_n values decrease gradually.
We achieve best results using λ = τ = 3.

4.2 2D Experiments

In addition to 3D point cloud experiments, we conduct 2D image experiments.


We report our results in Table 4(b) using MIT Scenes [22], CUB [33]. For two
different experiment setups, Scenes → CUB and CUB (150) → CUB (50), our
method achieves better performance than LwF [12] in terms of less forgetting
(Δ). Moreover, we observe that the results of the 2D experiments are better than those of the 3D experiments (Tables 2 and 3). The amount of forgetting (Δ) is higher in the 3D cases than in the 2D cases (5–6% vs. 1–2%). The main reason is that the 2D backbone (VGG16 [30]) has been pre-trained on the large ImageNet dataset [27], which has millions of training instances and thousands of classes. In contrast, the 3D backbones (PointNet, DGCNN, PointConv) used in the 3D experiments are not pre-trained on a huge dataset. Therefore, the feature vectors obtained from the 2D backbone are richer and better clustered than the feature vectors obtained from the 3D backbone. We also notice that the features and semantics are better aligned in the 2D experiment than in the 3D experiment, as shown in Fig. 4.

5 Conclusion
In this paper, we investigate LwF on 3D point cloud objects. Because of the lack
of large-scale 3D datasets and powerful pre-trained models, popular knowledge
distillation on prediction scores performs poorly on 3D data. To improve the
performance further, we use semantic word vectors in the network pipeline. It
helps to improve the traditional knowledge distillation performance. We also
report performance on different 3D recognition backbones and word embeddings.
We notice that, in terms of forgetting, the 3D case still lags behind the 2D image
case. Future research in this area may investigate this issue further.

Acknowledgment. This work was supported by NSU CTRG 2020–2021 grant


#CTRG-20/SEPS/04.

References
1. Aljundi, R., Babiloni, F., Elhoseiny, M., Rohrbach, M., Tuytelaars, T.: Memory
aware synapses: learning what (not) to forget. In: Proceedings of the European
Conference on Computer Vision (ECCV) (2018)
2. Aljundi, R., Chakravarty, P., Tuytelaars, T.: Expert gate: lifelong learning with a
network of experts. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR) (2017)
3. Castro, F.M., Marı́n-Jiménez, M.J., Guil, N., Schmid, C., Alahari, K.: End-to-end
incremental learning. In: Proceedings of the European Conference on Computer
Vision (ECCV) (2018)

4. Cheraghian, A., Rahman, S., Fang, P., Roy, S.K., Petersson, L., Harandi, M.:
Semantic-aware knowledge distillation for few-shot class-incremental learning. In:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition (CVPR) (2021)
5. Cheraghian, A., Rahman, S., Chowdhury, T.F., Campbell, D., Petersson, L.:
Zero-shot learning on 3D point cloud objects and beyond. arXiv preprint
arXiv:2104.04980 (2021)
6. Goodfellow, I.J., Mirza, M., Xiao, D., Courville, A., Bengio, Y.: An empirical
investigation of catastrophic forgetting in gradient-based neural networks. In: 2nd
International Conference on Learning Representations, ICLR 2014 - Conference
Track Proceedings (2014)
7. Guo, Y., Wang, H., Hu, Q., Liu, H., Liu, L., Bennamoun, M.: Deep learning for
3D point clouds: a survey. IEEE Transactions on Pattern Analysis and Machine
Intelligence (2020)
8. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network.
arXiv preprint arXiv:1503.02531 (2015)
9. Hou, S., Pan, X., Loy, C.C., Wang, Z., Lin, D.: Learning a unified classifier incre-
mentally via rebalancing. In: Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR) (2019)
10. Komarichev, A., Zhong, Z., Hua, J.: A-CNN: annularly convolutional neural net-
works on point clouds. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR) (2019)
11. Li, Y., Bu, R., Sun, M., Wu, W., Di, X., Chen, B.: Pointcnn: Convolution on
x-transformed points. Advances in Neural Information Processing Systems (2018)
12. Li, Z., Hoiem, D.: Learning without forgetting. IEEE Trans. Pattern Anal. Mach.
Intell. 40(12), 2935–2947 (2018)
13. Mallya, A., Lazebnik, S.: Packnet: adding multiple tasks to a single network by iter-
ative pruning. In: Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR), June 2018
14. Maturana, D., Scherer, S.: Voxnet: a 3D convolutional neural network for real-time
object recognition. In: IROS (2015)
15. McCloskey, M., Cohen, N.J.: Catastrophic interference in connectionist networks:
the sequential learning problem. Psychol. Learn. Motiv. - Adv. Res. Theor. 24,
109–165(1989)
16. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed represen-
tations of words and phrases and their compositionality. In: Burges, C.J.C., Bot-
tou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in Neural
Information Processing Systems, vol. 26, pp. 3111–3119. Curran Associates, Inc.
(2013)
17. Ostapenko, O., Puscas, M., Klein, T., Jahnichen, P., Nabi, M.: Learning to remem-
ber: a synaptic plasticity driven framework for continual learning. In: Proceed-
ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR) (2019)
18. Pennington, J., Socher, R., Manning, C.D.: Glove: global vectors for word repre-
sentation. In: EMNLP (2014)
19. Qi, C.R., Su, H., Mo, K., Guibas, L.J.: Pointnet: deep learning on point sets for
3D classification and segmentation. In: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR) (2017)
20. Qi, C.R., Su, H., Nießner, M., Dai, A., Yan, M., Guibas, L.J.: Volumetric and multi-
view CNNs for object classification on 3D data. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR) (2016)

21. Qi, C.R., Yi, L., Su, H., Guibas, L.J.: PointNet++: deep hierarchical feature learning
on point sets in a metric space. In: Proceedings of the 31st International Conference
on Neural Information Processing Systems (2017)
22. Quattoni, A., Torralba, A.: Recognizing indoor scenes. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
(2009)
23. Rahman, S., Khan, S., Barnes, N.: Transductive learning for zero-shot object detec-
tion. In: Proceedings of the IEEE/CVF International Conference on Computer
Vision (ICCV), October 2019
24. Rahman, S., Khan, S., Barnes, N., Khan, F.S.: Any-shot object detection. In:
Proceedings of the Asian Conference on Computer Vision (ACCV) (2020)
25. Rahman, S., Khan, S.H., Porikli, F.: Zero-shot object detection: joint recognition
and localization of novel concepts. Int. J. Comput. Vis. 128(12), 2979–2999 (2020)
26. Rebuffi, S.A., Kolesnikov, A., Sperl, G., Lampert, C.H.: iCaRL: incremental clas-
sifier and representation learning. In: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR) (2017)
27. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J.
Comput. Vis. (IJCV) 115(3), 211–252 (2015)
28. Shin, H., Lee, J.K., Kim, J., Kim, J.: Continual learning with deep generative
replay. In: Advances in Neural Information Processing Systems (2017)
29. Simonovsky, M., Komodakis, N.: Dynamic edge-conditioned filters in convolutional
neural networks on graphs. In: Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR) (2017)
30. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. In: Bengio, Y., LeCun, Y. (eds.) 3rd International Conference
on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015,
Conference Track Proceedings (2015)
31. Su, H., Maji, S., Kalogerakis, E., Learned-Miller, E.: Multi-view convolutional
neural networks for 3D shape recognition. In: Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition (CVPR), pp. 945–953 (2015)
32. Uy, M.A., Pham, Q.H., Hua, B.S., Nguyen, T., Yeung, S.K.: Revisiting point cloud
classification: a new benchmark dataset and classification model on real-world data.
In: 2019 IEEE/CVF International Conference on Computer Vision ( ICCV) (2019)
33. Wah, C., Branson, S., Perona, P., Belongie, S.: Multiclass recognition and part
localization with humans in the loop. In: International Conference on Computer
Vision ( ICCV) (2011)
34. Wang, P.S., Liu, Y., Guo, Y.X., Sun, C.Y., Tong, X.: O-CNN: octree-based convo-
lutional neural networks for 3D shape analysis. ACM Trans. Graph. (TOG) 36(4),
1–11 (2017)
35. Wang, Y., Sun, Y., Liu, Z., Sarma, S.E., Bronstein, M.M., Solomon, J.M.: Dynamic
graph CNN for learning on point clouds. ACM Trans. Graph. (tog) 38(5), 1–12
(2019)
36. Wu, C., Herranz, L., Liu, X., Wang, Y., Van De Weijer, J., Raducanu, B.: Memory
replay Gans: learning to generate images from new categories without forgetting.
In: Advances in Neural Information Processing Systems (2018)
37. Wu, W., Qi, Z., Fuxin, L.: PointCONV: deep convolutional networks on 3D point
clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR) (2019)
38. Wu, Z., et al.: 3D ShapeNets: a deep representation for volumetric shapes. In: Pro-
ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni-
tion (CVPR) (2015)

39. Xian, Y., Lampert, C.H., Schiele, B., Akata, Z.: Zero-shot learning-a comprehensive
evaluation of the good, the bad and the ugly. IEEE Trans. Pattern Anal. Mach.
Intell. 41(9), 2251–2265 (2018)
40. Yang, Z., Wang, L.: Learning relationships for multi-view 3D object recognition.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR) (2019)
41. Zhang, J., et al.: Class-incremental learning via deep model consolidation. In:
Workshop on Applications of Computer Vision (WACV) (2020)
42. Zhu, C., Chen, F., Ahmed, U., Savvides, M.: Semantic relation reasoning for shot-
stable few-shot object detection. arXiv preprint arXiv:2103.01903 (2021)
Patch-Wise Semantic Segmentation
of Sedimentation from High-Resolution
Satellite Images Using Deep Learning

Tahmid Hasan Pranto, Abdulla All Noman, Asaduzzaman Noor,


Ummeh Habiba Deepty, and Rashedur M. Rahman(B)

Department of Electrical and Computer Engineering, School of Engineering


and Physical Sciences, North South University, Dhaka 1229, Bangladesh
{tahmid.pranto,abdulla.noman01,asaduzzaman.noor,ummeh.deepty,
rashedur.rahman}@northsouth.edu

Abstract. In recent times, satellite data availability has increased significantly, helping researchers worldwide to explore, analyze and approach different problems using the most recent techniques. The segmentation of sediment load in coastal areas using satellite imagery can be considered a cost-efficient process, as sediment load analysis can be costly and time-consuming if done by hand. In this work, we created a dataset of the Bangladesh marine area for segmenting sediment load and showed the applicability of a deep learning technique to segment sedimentation into 5 different classes (Land, High Sediment, Moderate Sediment, Low Sediment and No Sediment) using a deep neural network called U-Net. As our collected satellite image is enormous, we show how a patch-wise learning technique can be an effective solution in the context of batch-wise training. The highest dice coefficient of 86% and validation dice coefficient of 87% were achieved for the Dec-2019 time frame data. The highest pixel accuracy of 77% and validation pixel accuracy of 78% were achieved on the same time frame data.

Keywords: Multi-class semantic segmentation · Patch-wise learning ·


U-Net · High resolution satellite image

1 Introduction
The Bengal delta is the most populated delta in the world. According to the data of the Bangladesh Water Development Board, a total of 405 rivers flow across the country, among which 48 are inter-country and the other 357 rivers are intra-country [1]. Both intra-country and inter-country rivers finally fall into the Bay of Bengal, creating a huge load of sedimentation in the region where these rivers meet the sea. According to a CEGIS study, conducted by the Center for Environmental and Geographic Information Services of Bangladesh, an estimated amount of 1.2 BT (billion tonnes) of silt is carried and finally discharged into the Bay of Bengal by the GBM (Ganga-Brahmaputra-Meghna) river system

of Bangladesh [2]. More than 400 rivers carry sedimentation into the Bay of Bengal, which can be seen from satellite images. Remotely sensed satellite information is a great source of multi-temporal geo-spatial data, which is often huge in size but packs compact, high-level information into small pixels. These images can be used in deep learning algorithms to segment sedimentation into different classes.
Deep learning has gone through major advancements in the past few years. Convolutional Neural Networks (CNNs) have already achieved great success in segmentation tasks [3], yet for the last three decades accurate image segmentation has been one of the most difficult problems in computer vision [4]. Recent progress in semantic image segmentation using transfer learning has shown promising results and significant improvement over previous semantic segmentation approaches [3]. Segmentation of remotely sensed images has even greater implications. As satellite images contain more uniform and compact data than conventional images, many practical applications can be built on them. A satellite image can cover a large area in a single picture, making it easy to conduct studies over large areas via these pictures. The research community has shown many worthwhile applications of remotely sensed images: road extraction for automated driving [5], land cover classification [6], urbanization and slum detection [7], etc.
In this study, our intention is to show the applicability of U-Net for segment-
ing sedimentation into five different classes in the Bangladesh marine region while implementing a patch-wise learning technique. The rest of the paper is organized as follows: Sect. 2 reviews the related studies; Sect. 3 discusses our area of study, followed by the discussion of our data collection procedure in Sect. 4; a detailed description of the methodology used for the study is given in Sect. 5; the experiment setup, which includes our evaluation metrics, loss function and hyperparameters, is described in Sect. 6; the results are presented and analyzed in Sect. 7; and finally the concluding discussion is given in Sect. 8.

2 Related Works

The most important aspect of image segmentation is mapping different sections of an image into specific classes, sometimes by recognizing objects in an image and sometimes by classifying single pixels. When the model labels each pixel rather than an object, it is called semantic segmentation [3]. Recent progress in semantic image segmentation [4] has shown promising outputs in terms of accuracy and performance, and recent advances in CNNs have brought significant improvement over previous semantic segmentation techniques [8,9]. Wu et al. came up with CasFCN, a less complex but high-performing model for ultrasound maternity image segmentation [10]. Long et al. proposed a segmentation architecture that uses classification models like AlexNet, GoogLeNet or VGG as the fully convolutional part of the model together with a combination of upsampling and a patch-wise training mechanism. To fuse coarse and fine information, they added skip connections

between the layers for better feature extraction [9]. Although deep learning performs well for segmentation tasks, vanishing gradients and overfitting still become a problem while training deep neural networks [4].
Complex images are tough to segment, as localized information can be confusing for the model to interpret. For example, satellite images are collections of high-level, squeezed information which can be very difficult to segment. This is where more advanced deep learning models like U-Net [11] and DeepLabV3 [12] come in handy. U-Net is a fully convolutional neural network which uses skip connections for the best possible feature extraction [11]. So far, some of the most impactful applications of U-Net based segmentation have been seen in the medical sector [3,13]. It has been used for nuclei segmentation in histology images [11], liver and tumor segmentation [3], chronic stroke lesion segmentation [13], retina-vessel segmentation [14], as well as for detecting heart conditions from ultrasound images (echocardiography) [15].
The promising performance of U-Net-like architectures in the medical sector has encouraged many researchers to use them in other scenarios where segmentation can be beneficial. Though it was initially developed for biomedical image segmentation, several implementations of U-Net have been seen for many other use cases, such as satellite image segmentation [16], street tree segmentation [17], tomato leaf segmentation [18], real-time hair segmentation [19], hand segmentation in complex backgrounds [20], pedestrian segmentation by detecting heads and shoulders [21] and sea-land segmentation [22]. In this study, we used a modified U-Net architecture for segmenting sedimentation types in the Bangladesh marine region from high-resolution satellite images.

3 Study Area
Bangladesh's marine waters occupy an area of 165,887 km², which is greater than the total land area of the country. The geographic location of this area is at 20.99°N, 90.73°E [23]. The total coastal zone of Bangladesh consists of 19 districts, which cover an area of 47,201 km² [24]. In the Bay of Bengal, Bangladesh's coastal marine regions are split into three zones: 12 nautical miles (NM) of territorial waters, 200 NM of exclusive economic zone, and 350 NM of sea bed from the Bangladesh baseline [25]. We chose this specific location because all of the more than 400 rivers flowing across this country open up to the Bay of Bengal in our targeted area. High-resolution satellite image tiles have been clipped to this region from Google Earth Engine. These image tiles needed further pre-processing, which is described in Sect. 4.

4 Data Collection
4.1 Labelling Satellite Images in Earth Engine Code
Google Earth Engine's Code Editor provides an in-built tool for labelling satellite images into different classes and also provides a JavaScript API to conduct different pre-processing steps and upload the images directly to Google Drive. Labelled images

can be further processed to create a shapefile, which can be directly added as a band to the original satellite image via Google Earth Engine's asset tool. The process is shown in the diagram below (Fig. 1).

Fig. 1. (a) Original satellite image having 12 multi-spectral bands, with only the RGB bands selected for the purpose of this study. (b) Labelling using the in-built labelling tool of Google Earth Engine. (c) Processing the labelling file using QGIS software. (d) Adding the shapefile as a band to the original satellite image.

The JavaScript API allows direct import and saving of these data in Google Drive. The downloaded labelled images were categorized using QGIS software. The shapefile created using QGIS is used as our mask image. This mask has been added to the original image as the fourth band at the corresponding coordinates of our target study area.

4.2 Data Preparation


We used Google Earth Engine's JavaScript API to accumulate images of our study area for four different time frames: November-December 2018, January-February 2019, November-December 2019 and January-February 2020. The image for each time frame was 36141 × 28197 pixels in width and height, respectively, captured over a rectangular zone that covers the area located at (89.09, 22.91)° in the North-West, (92.34, 22.91)° in the North-East, (92.34, 20.37)° in the South-East and (89.09, 20.37)° in the South-West. For collecting the image

data, we used Copernicus Sentinel-2 satellite imagery at 10 m resolution.


Each image consists of a total of 4 spectral bands where the first three bands
represent RGB channels and the last band represents the corresponding labels.
By separating the first three bands, we got our input image and we got the mask
image by separating the last band. Image extraction and band separation have
been shown in the figure below (Fig. 2).

Fig. 2. Image retrieval and band separation. (a) Original satellite image with label
added as the 4th band. (b) Band stack of downloaded tiles. (c) Retrieved image. (d)
Retrieved mask.

This enormous 4-band image (Fig. 2b) is our final image, which we further processed with the Python GDAL library, as such enormous images cannot be used directly to train the model. Using the image translation method of the Python GDAL library, we further created 15762 patches for both the images and the masks. The size of each image and mask patch is 256 × 256 pixels, which can be used to train image segmentation models.
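A minimal sketch of this patching step using the Python GDAL bindings is shown below; it is our own illustration of the idea, and the file names, non-overlapping stride and output naming scheme are assumptions rather than the authors' exact script.

```python
from osgeo import gdal

PATCH = 256  # patch width/height in pixels

def make_patches(src_path: str, out_dir: str) -> None:
    """Cut a large GeoTIFF into non-overlapping 256x256 tiles with gdal.Translate."""
    src = gdal.Open(src_path)
    width, height = src.RasterXSize, src.RasterYSize
    idx = 0
    for yoff in range(0, height - PATCH + 1, PATCH):
        for xoff in range(0, width - PATCH + 1, PATCH):
            out_name = f"{out_dir}/patch_{idx:05d}.tif"
            # srcWin = [xoff, yoff, xsize, ysize] selects the pixel window to copy
            gdal.Translate(out_name, src, srcWin=[xoff, yoff, PATCH, PATCH])
            idx += 1

# Usage (hypothetical paths), run once for the RGB image and once for the mask band:
# make_patches("dec2019_rgb.tif", "patches/images")
# make_patches("dec2019_mask.tif", "patches/masks")
```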

5 Methodology: Modified U-Net Architecture

In Sect. 2, we vividly described the broad application domain of U-Net. We chose U-Net as it is a popular model for segmentation, and our study objective is to show the applicability of U-Net to the segmentation of satellite images. The U-Net architecture has two parts: the left part is called the contracting path and

the right side of the model is called the expansive path. We modified the original architecture for an input shape of 256 × 256 with 3 channels. The modified architecture is shown in the figure below (Fig. 3).

Fig. 3. Modified U-Net Architecture.

The 256 × 256 input image, in the contracting path, first goes through a concatenation layer and then through layers of unpadded 2D convolution with 64 (3 × 3) kernels, followed by batch normalization with a 2 × 2 kernel, stride of 2 and momentum of 0.01. The next layers are almost the same. Over the skip connections, raw values are passed to the expansive part of the model. The expansive part of the model starts from group 7. A group in the expansive part starts with a Conv2D-transpose layer with 64 kernels of size (3 × 3) and stride of 2, which does the opposite of a 2D convolution. This layer is followed by a concatenation layer, then batch normalization and a 2D convolution layer, twice. This arrangement is followed until the 11th group, and finally the output of the last layer has a shape of 256 × 256 × 5. The activation function used in between the layers is ReLU and the output activation function is SoftMax.
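The snippet below is a minimal Keras-style sketch of one contracting block and one expansive block of such a modified U-Net; it is our own illustration, so the filter counts, the use of max pooling, the position of batch normalization and the "same" padding (used here so that the output stays 256 × 256 even though the paper mentions unpadded convolutions) are assumptions and not the authors' exact layer arrangement.

```python
import tensorflow as tf
from tensorflow.keras import layers

def contracting_block(x, filters):
    """Two 3x3 convolutions followed by 2x2 downsampling; the pre-pool tensor is kept for the skip."""
    x = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
    x = layers.BatchNormalization(momentum=0.01)(x)
    x = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
    skip = x
    x = layers.MaxPooling2D((2, 2), strides=2)(x)
    return x, skip

def expansive_block(x, skip, filters):
    """Transposed convolution for upsampling, concatenation with the skip, then two convolutions."""
    x = layers.Conv2DTranspose(filters, (3, 3), strides=2, padding="same")(x)
    x = layers.concatenate([x, skip])
    x = layers.BatchNormalization(momentum=0.01)(x)
    x = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
    return x

inputs = layers.Input(shape=(256, 256, 3))
x, s1 = contracting_block(inputs, 64)
x, s2 = contracting_block(x, 128)
x = expansive_block(x, s2, 128)
x = expansive_block(x, s1, 64)
outputs = layers.Conv2D(5, (1, 1), activation="softmax")(x)  # 256 x 256 x 5 class map
model = tf.keras.Model(inputs, outputs)
```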

6 Experiment Setup

The models were trained on an Nvidia 1050 Max-Q 2 GB GPU. A batch-wise model training technique was followed, as the computational power was fairly marginal for the purpose. Regular-use machines like laptops or desktops are not often

loaded with a high-end GPU, which is why batch-wise learning is often under-
taken. Satellite provided images are not regular sized that we use to train our
model batch-wise. So, creating patches from the original satellite image was a
vital step for our study. As for the training and evaluation, Adam optimizer was
used while training the model with a learning rate of e−4 and decay rate of e−5 .
We used Categorical cross-entropy as our loss function as the task is a multi-class
segmentation having five classes. The model was trained for 30 epochs while 305
images were used in every step of the training epoch and 93 images for validation.
For evaluation metrics, we used Dice Coefficient and pixel-accuracy score. Both
of these matrix’s measure the similarity between ground truth and predicted
mask. All the hyper parameters was found out empirically to be performing well
with our model and data.
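
As an illustration of this training configuration, the snippet below compiles and fits the build_unet sketch from the previous listing with the stated optimizer settings. The dummy arrays only stand in for the real patch dataset, and the classic Keras optimizer arguments (learning_rate, decay) are assumed; the actual runs used 30 epochs with 305 training and 93 validation images per epoch.

import numpy as np
from tensorflow.keras.optimizers import Adam

# Adam with learning rate 1e-4 and decay 1e-5, categorical cross-entropy loss.
model.compile(
    optimizer=Adam(learning_rate=1e-4, decay=1e-5),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

# Tiny random stand-in for the real 256x256 patch dataset, only to show the call.
x = np.random.rand(8, 256, 256, 3).astype("float32")
y = np.eye(5, dtype="float32")[np.random.randint(0, 5, (8, 256, 256))]
model.fit(x, y, validation_split=0.25, batch_size=2, epochs=1)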

6.1 Evaluation Metrics and Loss Function


SoftMax: The modified U-Net that we use has a SoftMax activation function in the
last layer of the model. The SoftMax activation gives a vectorized probability
distribution of a pixel belonging to a particular class. As we are performing
a semantic segmentation task, pixel-wise use of SoftMax is ensured by the structure
of the U-Net model. The SoftMax activation function uses the equation below (Eq. 1).
\sigma(z_i) = \frac{e^{z_i}}{\sum_{k=1}^{n} e^{z_k}} \qquad (1)
where e represents the standard exponential function, z is the input vector, and
n is the number of classes. σ is the SoftMax function applied to the
input vector according to Eq. 1.
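
As a small illustration of Eq. 1, a numerically stable NumPy implementation is sketched below; the max-subtraction is an added stability trick and not part of the equation itself.

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))      # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # class probabilities summing to 1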

Dice Coefficient: The Dice coefficient is a performance evaluation metric. It shows
the overlap between the true image and the predicted image. The equation of the dice
coefficient is as follows (Eq. 2).
\mathrm{DiceCoefficient} = \frac{2 \cdot |A \cap B|}{|A| + |B|} \qquad (2)
where A is the true image and B is the image predicted by the model.
The intersection between the true and predicted images is taken first and
multiplied by two; this numerator is then divided by the sum of A and B in the
denominator.
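
A minimal TensorFlow implementation of Eq. 2, usable as a Keras metric, might look as follows; the smoothing constant is an added assumption to avoid division by zero and is not part of the equation.

import tensorflow as tf

def dice_coefficient(y_true, y_pred, smooth=1.0):
    # Flatten both masks and compute 2*|A ∩ B| / (|A| + |B|) as in Eq. 2.
    y_true_f = tf.reshape(y_true, [-1])
    y_pred_f = tf.reshape(y_pred, [-1])
    intersection = tf.reduce_sum(y_true_f * y_pred_f)
    return (2.0 * intersection + smooth) / (
        tf.reduce_sum(y_true_f) + tf.reduce_sum(y_pred_f) + smooth)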

Pixelwise Accuracy: As this study deals with semantic segmentation, pixel
accuracy is another important metric. Pixel accuracy determines the percentage
of pixels classified correctly by the segmentation. In other words, pixel
accuracy shows the true positive rate attained by the model. Pixel
accuracy follows the equation below (Eq. 3).

\mathrm{PixelAccuracy} = \frac{\sum_{n} TP_n}{\sum_{n} (TP_n + FP_n)} \qquad (3)

Here in Eq. 3, n indexes the classes, TP means true positive, and FP stands for
false positive.

Categorical Cross-Entropy: Categorical cross-entropy, also known as negative
log loss, is a modification of binary cross-entropy for the multi-class scenario
[26]. The equation for calculating the negative log loss is as follows (Eq. 4).


L(y, \hat{y}) = -\sum_{j=0}^{M} \sum_{i=0}^{N} y_{ij} \cdot \log(\hat{y}_{ij}) \qquad (4)

7 Result and Analysis

We used data from four different time frames; from every dataset, 70% of the data
was used for training and the remaining 30% for testing.
Images and masks were drawn in mini-batches. The results after 30 epochs
are shown in the table below (Table 1).

Table 1. Metric, accuracy and loss for the four time-frame datasets.

Metric                        Dec-2018  Jan-2019  Dec-2019  Jan-2020
Dice coefficient              0.85      0.86      0.86      0.85
Validation dice coefficient   0.85      0.86      0.87      0.86
Loss                          0.69      0.62      0.62      0.67
Validation loss               0.66      0.61      0.60      0.66
Pixel accuracy                0.74      0.77      0.77      0.75
Validation pixel accuracy     0.73      0.77      0.78      0.75

Fig. 4 shows the change of training and validation dice co-efficient and Fig. 5
shows the change of training and validation loss over 30 epochs.

Judging by the pixel accuracy, loss and dice coefficient values and graphs, the
model appears to be learning patterns in the data: the training loss decreases
smoothly in accordance with the validation loss, and the same behaviour can be
seen for the dice coefficient. As the dice coefficient is a performance evaluation
metric, the model seems to perform well, with coefficients of at least 85% for
both training and validation in every case.
The model predicts five classes, but the image contains two kinds of regions:
regions covered by a single one of the five classes, and regions where two separate
classes join. The figures below show the predictions of our model for class-join regions
(Fig. 6) as well as for single-class regions (Fig. 7).

Fig. 4. Training and validation dice co-efficient over epochs.

Fig. 5. Training and validation loss over epochs.



Fig. 6. Image and their corresponding true mask and predicted mask for class join
regions.

Fig. 7. Image and their corresponding true mask and predicted mask for single class
regions.

8 Conclusion
Bangladesh has numerous vulnerable areas where soil erosion, river erosion,
and frequent floods are well-known phenomena affecting a large part of the
population. These phenomena are directly related to the sediment load of river
water. Satellite imagery can be a cost-effective way to investigate the sediment
type through segmentation without physically reaching the spot, as the deep sea is
not hospitable to humans. This process reduces the additional effort required for the purpose
and needs minimal computational power, while saving valuable time for the
relevant authorities. In this work, we showed the usability and effectiveness of the U-Net
architecture for segmenting the sedimentation of the Bangladesh marine region into five
separate classes, with a highest dice coefficient of 87% and a pixel accuracy of 77%.
As no other reference study has been done in this context, we consider this
performance satisfactory from an application point of view. These models can
also help the relevant authorities to take early decisions, because
numerous people in Bangladesh lose their land, houses and assets every year due
to floods and river erosion. The segmentation of sediment can also contribute to
Bangladesh's sustainable development. The methods and techniques
used in this study are generic enough to be applied to similar kinds of
problems. For our future work, we are highly motivated to trace the delineation
of sediment types on a yearly basis, which is why in this study we performed
segmentation on data from different time frames. We will also investigate the effect
of different kinds of label noise on segmentation.

References
1. Bwdb. Bangladesh Water Development Board — On Going Project, January 2021
2. Cegis. Comprehensive Resource Database, February 2021
3. Garcia-Garcia, A., Orts-Escolano, S., Oprea, S., Villena-Martinez, V., Garcia-
Rodriguez, J.: A review on deep learning techniques applied to semantic segmen-
tation. arXiv preprint arXiv:1704.06857 (2017)
4. Guo, Y., Liu, Y., Georgiou, T., Lew, M.S.: A review of semantic segmentation using
deep neural networks. Int. J. Multimedia Inform. Retrieval 7(2), 87–93 (2018)
5. Xu, Z., et al.: Road extraction in mountainous regions from high-resolution images
based on DSDNET and terrain optimization. Remote Sens. 13(1), 90 (2021)
6. Sefrin, O., Riese, F.M., Keller, S.: Deep learning for land cover change detection.
Remote Sens. 13(1), 78 (2021)
7. Wurm, M., Stark, T., Zhu, X.X., Weigand, M., Taubenböck, H.: Semantic segmen-
tation of slums in satellite images using transfer learning on fully convolutional
neural networks. ISPRS J. Photogrammetry Remote Sens. 150, 59–69 (2019)
8. Wu, M., Zhang, C., Liu, J., Zhou, L., Li, X.: Towards accurate high resolution
satellite image semantic segmentation. IEEE Access 7, 55609–55619 (2019)
9. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic
segmentation. Presented at the (2015)
10. Wu, L., Xin, Y., Li, S., Wang, T., Heng, P.-A., Ni, D.: Cascaded fully convolutional
networks for automatic prenatal ultrasound image segmentation. In: 2017 IEEE
14th International Symposium on Biomedical Imaging (ISBI 2017), pp. 663–666.
IEEE (2017)

11. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomed-
ical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F.
(eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015).
https://doi.org/10.1007/978-3-319-24574-4 28
12. Chen, L., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution
for semantic image segmentation. CoRR abs/1706.05587 (2017)
13. Tajbakhsh, N., Jeyaseelan, L., Li, Q., Chiang, J.N., Wu, Z., Ding, X.: Embracing
imperfect datasets: a review of deep learning solutions for medical image segmen-
tation. Med. Image Anal. 63, 101693 (2020)
14. Haralick, R.M., Shapiro, L.G.: Image segmentation techniques. Comput. Vis.
Graph. Image Process. 29(1), 100–132 (1985)
15. Naz, S., Majeed, H., Irshad, H.: Image segmentation using fuzzy clustering: a sur-
vey, pp. 181–186. IEEE (2010)
16. Khryashchev, V., Larionov, R., Ostrovskaya, A., Semenov, A.: Modification of u-
net neural network in the task of multichannel satellite images segmentation, pp.
1–4. IEEE (2019)
17. Li, Q., Yuan, P., Liu, X., Zhou, H.: Street tree segmentation from mobile laser
scanning data. Int. J. Remote Sens. 41(18), 7145–7162 (2020)
18. Ngugi, L.C., Abelwahab, M., Abo-Zahhad, M.: Tomato leaf segmentation algo-
rithms for mobile phone applications using deep learning. Comput. Electr. Agric.
178, 105788 (2020)
19. Yoon, H., Park, S., Yoo, J.: Real-time hair segmentation using mobile-unet. Elec-
tronics 10, 99 (2021)
20. Wang, X.-Y., Wang, T., Bu, J.: Color image segmentation using pixel wise support
vector machine classification. Pattern Recogn. 44(4), 777–787 (2011)
21. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep con-
volutional neural networks. Commun. ACM 60(6), 84–90 (2017)
22. Chu, Z., Tian, T., Feng, R., Wang, L.: Sea-land segmentation with res-unet and
fully connected crf, pp. 3840–3843. IEEE (2019)
23. Marineregions.org. Marine Gazetteer Placedetails (2020). https://www.marine
regions.org/gazetteer.php?p=details&id=25431. Accessed 05 Jan 2020
24. Ahmad, H.: Bangladesh coastal zone management status and future trends. J.
Coastal Zone Manag. 22(1), 1–7 (2019)
25. Belal, A.S.M.: Maritime boundary of bangladesh: Is our sea lost. Bangladesh Insti-
tute of Peace and Security Studies (2012)
26. Ho, Y., Wookey, S.: The real-world-weight cross-entropy loss function: modeling
the costs of mislabeling. IEEE Access 8, 4806–4813 (2019)
Learning Image Segmentation from Few
Annotations
A REPTILE Application

Héctor F. Satizábal and Andres Perez-Uribe(B)

Institute for Information and Communication Technologies (IICT),


University of Applied Sciences of Western Switzerland (HEIG-VD/HES-SO),
Yverdon-les-Bains, Switzerland
[email protected]

Abstract. How to build machine learning models from few annotations is an open
research question. This article shows an application of a meta-learning algorithm
(REPTILE) to the problem of object segmentation. We evaluate how using REPTILE
during a pre-training phase accelerates the learning process without losing
performance of the resulting segmentation under poor labeling conditions, and
compare these results against training the detectors using basic transfer learning.
Two scenarios are tested: (i) how segmentation performance evolves through training
epochs with a fixed amount of labels, and (ii) how segmentation performance improves
with an increasing amount of labels after a fixed number of epochs. The results
suggest that REPTILE is useful for making learning faster in both cases.

1 Introduction
Computer vision is one of the leading fields in artificial intelligence research.
Image classification is a common task in computer vision in which a machine
learning algorithm has to classify images into some known categories (Fig. 1a).
Instead of classifying the whole image, in this paper we are focusing on detecting
(and possibly counting) specific types of objects in satellite or drone images.
There are several approaches allowing to spot objects in images.
In object detection, instead of classifying the whole image, the algorithm must
spot objects inside the image, typically with a bounding box around the object,
and classify them (Fig. 1b). Hence, there may be several objects belonging to
several categories within the image, and the location and type of those objects
must be discovered by the algorithm. A further approach is semantic segmen-
tation, in which the algorithm attributes a class label to each individual pixel
in the image, and this allows to detect the silhouette of the object within the
whole image (Fig. 1c). Moreover, there is instance segmentation which allows to
distinguish between several individual objects belonging to the same class, e.g.,
several dogs nearby. In this paper, we are focusing on the application where one
wants to detect specific types of objects on satellite or drone imagery. However,

Fig. 1. (a) Image classification: the whole image is classified as dog. (b) Object detec-
tion: bounding box surrounding the object. (c) Semantic segmentation: pixels belonging
to the object are highlighted

those objects are not necessarily part of current existing image databases and
thus require the development of specific models taking advantage of very few
examples. Figure 2 shows several examples of applications of such detectors.

Fig. 2. Some application examples of object detectors working on drone images: build-
ing detection (e.g., roofs, parkings, solar cells), object counting (e.g., cars, livestock),
land use (e.g., crop counting)

There are several neural network architectures that have been conceived
for performing object detection: R-CNN [9] and its variants [4,10,15], SPP-
net [6], YOLO [5] and its variants [20,21] and SSD [12]. Furthermore, there are
algorithms also providing accurate segmentation masks like ENet [3], Mask R-
CNN [7] and some networks based on the U-Net architecture [22] e.g., [13,16,25].
In this paper, we used a U-Net-based [22] model trained on the ImageNet [8]
database to perform our experiments. We chose U-Net over traditional instance
segmentation models such as Mask R-CNN [7] because it has far fewer parameters
and hyperparameters and is therefore much easier to train on various
tasks without the need to fine-tune the hyperparameters for each task. Section 3.1
describes the datasets we used as detection tasks. We are particularly interested
in taking a pre-trained model and fine-tuning it using as few examples as possible.
Real-life objects exist in a multitude of shapes and colors, and they may be
shot from an infinite number of angles and distances1 . Thus, to build a machine
learning algorithm that learns how to correctly detect those objects, it is impor-
tant that the training dataset contains enough samples to represent the huge
diversity present in the real world. This fact usually translates into the need for
1
In aerial images it translates into different image ground resolutions and object
scales.

a huge training dataset with thousands or millions of real world images2 . Hav-
ing huge training datasets is prohibitive for some applications not only because
shooting images can be expensive3 , but also because labeling the images may
be very time consuming. The problem of building machine learning models from
small datasets has been addressed in the machine learning literature as Few-Shot
learning [24], Low-Shot learning or even One-Shot or Zero-Shot learning.
There are two different approaches for performing low-shot learning. On the
one hand there is data-based low-shot learning in which the main idea is to
increase the volume of data that is available for training the models. This can
be achieved, for instance, by using other sources of data from related tasks (in
an unsupervised or semi-supervised manner), adding noise to the training data
(i.e., data augmentation) or generating new training samples by using Genera-
tive Adversarial Networks (i.e., data hallucination). On the other hand, there are
parameter-based approaches in which the parameters of the model are adapted
in such a way that it is capable of learning from few examples without over-
fitting. These are often transfer learning approaches in which the model is first
pre-trained and then fine-tuned on the lower amount of target training data.
Meta-learning, and more specifically REPTILE [19] (the method we used in this
contribution), belongs to this category of parameter-based approaches. For a
comprehensive survey on low-shot learning approaches please refer to [24].
Last but not least, we would like to highlight the fact that the main goal
of the approach we show in this paper is not to improve the performance, but
to allow the detector to learn from few examples as fast as possible without
overfitting.
This contribution is organized as follows. Section 1 introduces the problem
and Sect. 2 describes the strategy we applied to solve it. Section 3 describes
the experimental framework we employed for obtaining the results described in
Sect. 4. At the end, Sect. 5 gives some final thoughts on the use of REPTILE for
building object detectors in poor data scenarios.

2 Few-Shot Learning, Meta-learning, and REPTILE


Learning from few examples is a situation that we (human beings) must face all
the time. When children learn how to recognize a new object or shape, they do
not need to see hundreds or thousands of examples; instead, they can adopt the
new shape after seeing only a few of them. This is possible since when we learn
new concepts, we do not create new representations from scratch, but we do it
incrementally by reusing what we have learnt in the past.
In machine learning, the problem of learning from few examples is addressed
as Low-Shot or Few-Shot learning. Wang et al. [24] performed a survey that
deeply describes the subject and summarizes the most relevant approaches to
2
ImageNet [8]: 14197122 images, Microsoft Common Objects in Context [11]: 2500000
images, CIFAR-10 [18]: 60000 images, CINIC-10 [2]: 270000 images.
3
Not anymore, given the increasing availability of electronic devices capable of cap-
turing images (e.g., smartphones, satellites) nowadays.

deal with it. Besides increasing the size of the training dataset (which has been
characterized as data-based low-shot learning) by means of data-augmentation
strategies, there are machine learning approaches that operate over the model
parameters (or the way they are computed) to make the system learn from few
examples. One of these approaches is called meta-learning. According to [23],
meta-learning consists in systematically observing how different machine learn-
ing approaches perform on a wide range of learning tasks, and then learning from
this experience to learn new tasks much faster than otherwise possible. This def-
inition states that meta-learning approaches may be useful to make machine
learning algorithms to learn faster, which directly means less learning steps, but
also less learning examples. Finn et al. [14] proposed a meta-learning method
called MAML (Model-Agnostic Meta-Learning) which is compatible with any
model trained with gradient descent. This method can learn the parameters of
any model via meta-learning in a way as to prepare that model for fast adapta-
tion. MAML operates by training a model on a variety of learning tasks, such
that it can solve new learning tasks using only a small number of training sam-
ples. It has a meta-gradient update that involves a gradient through a gradient.
Finn et al. [14] also proposed a first order version of the MAML algorithm which
ignores the second derivative terms. This first-order version of MAML has been
shown to perform nearly as well as its original implementation [14].
The meta-learning approach we used during our experiments is called REP-
TILE [19]. The REPTILE algorithm, which is a further simplification of the first
order version of MAML is described as follows:

Algorithm 1. REPTILE (online version)


1: Initialize θ, the vector of initial parameters
2: for iteration=1,2,... do
3: Sample task T , corresponding to loss LT on weight vectors θ∗
4: Compute θ∗ = UTk (θ), denoting k steps of gradient descent
5: Update θ ← θ + ε(θ∗ − θ)
6: end for

Similar to MAML, the REPTILE algorithm takes an initial model with param-
eters θ and iterates through a set of learning tasks T doing k backpropagation
steps on each one of them to reduce the loss LT . Each backpropagation step finds
new model parameters θ∗ which are used to update the meta-learned parameters
θ by using the expression θ ← θ + ε(θ∗ − θ), ε being the meta-learning rate.
Figure 3 depicts the dynamics of the algorithm in a simplified example with 4
tasks. Similar to Multitask learning (MTL) [1], the algorithm randomly iterates
through the training tasks and temporarily adapts the model parameters accord-
ing to the gradient of the loss of each task, ∇Li . The temporary parameters are
then used to update the meta-learned parameters θ. The toy 2D representation
in Fig. 3 shows how the resulting parameters are located in the middle of the
parameters space, minimizing the distance to all the training tasks. Seen this

Fig. 3. Diagram representing the REPTILE algorithm. The model is randomly initial-
ized with parameters θ0 . There are 4 tasks in the example T1 , T2 , T3 , T4 . For the sake of
simplicity, tasks are sampled in natural order (they must be sampled randomly). Black
dotted lines show the gradient for the 4 first meta-learning steps and red lines show the
path followed by model parameters after each meta-learning step. In this example the
parameters θ are obtained after going 3 times over the four tasks. (Color figure online)

way, REPTILE uses the training tasks to find a good initialization that makes
learning possible when only few examples of the new task are available.
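
To make Algorithm 1 concrete, the following self-contained NumPy toy reproduces the REPTILE outer loop on four small least-squares tasks. The task definition, the number of inner steps k and both learning rates are illustrative assumptions; the real system applies the same update to U-Net parameters trained by backpropagation.

import numpy as np

rng = np.random.default_rng(0)

def make_task():
    # Each task is a small linear regression problem with its own target weights.
    X = rng.normal(size=(50, 5))
    return X, X @ rng.normal(size=5)

def inner_update(theta, task, k=5, lr=0.05):
    # k steps of gradient descent on the task loss L_T, starting from theta.
    X, y = task
    w = theta.copy()
    for _ in range(k):
        w -= lr * 2 * X.T @ (X @ w - y) / len(y)
    return w                                   # this is theta* = U_T^k(theta)

tasks = [make_task() for _ in range(4)]        # four tasks, as T1..T4 in Fig. 3
theta = np.zeros(5)                            # initial meta-parameters
epsilon = 0.1                                  # meta-learning rate

for _ in range(1000):
    task = tasks[rng.integers(len(tasks))]     # sample a task at random
    theta_star = inner_update(theta, task)
    theta += epsilon * (theta_star - theta)    # meta update of Algorithm 1, line 5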

3 Experiments

This section describes the experimental setup: the datasets we used and how the
different results were obtained.

3.1 Datasets

The Swiss company Picterra (https://picterra.ch/) provided a pool of datasets


whose names are shown in Table 1. The size, resolution and number of annota-
tions in the images are very diverse, generating a rich environment which should
enhance the effects of REPTILE. Each dataset has pre-defined training and vali-
dation areas making it possible to compute a measure of performance for certain
regions that were not available to the detectors during training. The datasets
shown in Table 1 were captured at different resolutions4 and contain objects of
different sizes (i.e., from birds to buildings). Some of them were collected and
annotated by real users and hence, they are not public. Moreover, the dataset
chocolate hearts 1 is not an aerial image and thus, it represents an outlier task,
which should be very different from the others.

4
There are sub-datasets with diverse images of the same type at different resolutions.

3.2 Leave-One-Dataset-Out Validation

We used the leave-one-dataset-out test policy shown in Table 2 to evaluate the


effects of using REPTILE as a pre-training step. In leave-one-dataset-out, every
dataset is used once as test task while the remaining datasets are used as training
tasks to run REPTILE. Bear in mind that every dataset has its own training
and validation partitions that allow an unbiased assessment of the performance.

Table 1. Real-world datasets provided by Picterra. Public ones are highlighted

wood 1 chocolate hearts 1 city 1 africa solar panels 1 agri 6


city 1 sa city 1 europe coconuts 1 agri 7 slum 1
agri 1 colonybirds roof objects 1 turtles 1 cows 1
cows 2 shipping containers festival 1 trees 2 sheep 1
graveyard 1 agri 2 agri 3 trees 1 solar farm 1
agri 4 agri 5 refugee camp 1 trains 1 roads 1

Table 2. Leave-one-dataset-out test schema

1 Select one dataset and put it apart for further training


2 Initialize the base detector m 0
3 Use the n − 1 remaining datasets to train detector m r using REPTILE
4 Fine-tune detector m r using the left-out dataset
5 Fine-tune detector m 0 using the left-out dataset
6 Repeat for all the available datasets

The validation strategy shown in Table 2 makes it possible to evaluate the


difference between fine-tuning a detector whose starting point is a set of raw
parameters taken from basic transfer learning (m 0), and fine-tuning the same detec-
tor starting from a pre-trained state computed with REPTILE (m r). The
experiments were run in a computer with 40 cores and 8 GPUs (4 GeForce RTX
2080 and 4 GeForce GTX 1080 Ti).

4 Results

This section shows the results of training detectors on the datasets described in
Sect. 3.1. The following subsections show two types of results: (1) with a fixed
number of annotations, to see how the performance evolves through the number
of epochs, and (2) with a fixed number of epochs, to see how the performance
evolves with an increasing number of annotations.

4.1 Pre-training with REPTILE


This section shows how the loss evolves during the execution of REPTILE. The
plot in Fig. 4 was created by leaving one dataset out and then running REPTILE
with the remaining datasets.

Fig. 4. Evolution of training loss during the execution of REPTILE (100 steps) com-
puted leaving out one of the public datasets provided by Picterra

As can be seen in Fig. 4, the loss exhibited by most of the training
datasets is very low, showing that the training algorithm is able to
find a common set of parameters satisfying all tasks. However, there are some
training datasets that do not reach low levels of loss in the 100 REPTILE
steps allowed. The datasets presenting the highest losses are wood 1 dataset1,
chocolate hearts 1 dataset1 and roads 1 dataset1. The same general behavior
is observed if other datasets are left out. Notice that the loss of task
chocolate hearts 1 dataset1 is high; as mentioned before, this dataset represents
an outlier task which is supposed to be very different from the others.

4.2 Performance vs. Number of Epochs


The results in this section show the performance of the algorithm at different
numbers of fine-tuning epochs. The performance of the detectors is assessed using
the Intersection over Union (IoU) metric, an evaluation metric used to
measure the accuracy of the segmentation. As shown in Fig. 5, this measure is
computed from the ground-truth segmentation and the predicted segmentation:
the IoU is the quotient between the area of overlap and the area
of union of those segmentations. Moreover, it is possible to compute the IoU
of the surface representing only target objects, i.e., ignoring the accuracy of
detecting the background. This measure, which we call Foreground IoU, is used to
better evaluate the performance of the models, given that there are many more
background pixels than pixels belonging to the objects.

Fig. 5. The metric used to assess the accuracy of object detectors is the Intersection
over Union (IoU). The green square represents the ground-truth and the red square
represents the prediction given by the detector (Color figure online)
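
A small NumPy sketch of the IoU and Foreground IoU measures, computed from two integer label masks (0 = background), is given below. Averaging per class and then over classes is an assumption, since the exact aggregation used in the experiments is not detailed in the text.

import numpy as np

def iou(y_true, y_pred, foreground_only=False):
    classes = np.unique(y_true)
    if foreground_only:
        classes = classes[classes != 0]        # Foreground IoU: ignore the background class
    scores = []
    for c in classes:
        inter = np.logical_and(y_true == c, y_pred == c).sum()
        union = np.logical_or(y_true == c, y_pred == c).sum()
        if union > 0:
            scores.append(inter / union)       # area of overlap / area of union
    return float(np.mean(scores))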

Figure 6 shows the IoU and Foreground IoU of 3 of the public datasets pro-
vided by Picterra. First, the testing policy described in Sect. 3.2 is applied and
thus a REPTILE initialization is computed for every task in the pool of datasets.
Then, the left-out dataset is used to fine-tune the model obtained with REP-
TILE. For the sake of comparison, we also fine-tuned the base model m 0 (before
REPTILE). Continuous lines represent the results of fine-tuning a detector after
a REPTILE pre-training and dashed lines represent the performance of a model
fine-tuned from a base detector m 0 (transfer learning). Blue and orange curves
show the IoU computed on training and validation datasets, and green curves
show the Foreground IoU computed on the validation dataset.

Fig. 6. IoU of the detectors fine-tuned on 3 public tasks provided by Picterra. Contin-
uous lines were obtained from a pre-training with REPTILE whereas dashed lines were
obtained from transfer learning on the base detector m 0. Green curves were computed
using only foreground labels (Color figure online)

As can be seen in Fig. 6, the final performance on different tasks can
vary a lot. Some detection tasks are easier than others (the accuracy
of the detector fine-tuned on the task refugee camp 1 dataset 1 is higher than
that of the one fine-tuned on coconuts 1 dataset 1). Moreover, Fig. 6 shows that, for
every dataset, there is a difference between the Foreground IoU resulting from
detectors fine-tuned from a REPTILE pre-training (m r) and those that were
fine-tuned from a base detector (m 0). This difference is larger at the beginning
of fine-tuning and then becomes smaller as epochs increase. Figure 7 shows this
difference throughout the fine-tuning for all available datasets: REPTILE allows
higher performances at the beginning of the training process, and the difference
between the performances of both approaches gets smaller as epochs increase.

4.3 Performance vs. Number of Annotations

In this experiment we used the colonybirds half birds dataset only. The image
in this dataset is a real high-resolution picture (∼1 cm resolution) of a group of
birds on a beach. The picture was taken by a flying drone and, to ensure that
the elements in the picture do not move, the “birds” were plastic toys carefully
placed by hand [17] on the sand. Figure 8 shows a modified version of the dataset
in which we provided 59 new training annotations5 .

Fig. 7. Difference in performance between a model fine-tuned from a REPTILE initial-


ization (m r) and a model fine-tuned with basic transfer learning (m 0). Green means
a higher performance using REPTILE (Color figure online)

5
Annotation may be very time consuming, and that is the reason why we only tested
one dataset.

Fig. 8. Annotations done on the colonybirds half birds dataset. Green annotations were
used for training and blue annotations were used for validation (Color figure online)

Following the algorithm described in Table 2, we had two detectors to fine-tune:
a pre-trained detector m r, for which we used REPTILE on all the available
datasets except colonybirds half birds, and a base model m 0 without pre-
training. We used the 59 training labels to fine-tune both the m r and m 0 models
during 40 epochs using n annotations, n ∈ [1, 59]. We repeated the process 10
times by randomly sampling n labels. The training area is updated every time
labels are sampled and is computed by offsetting the n sampled labels until
the resulting polygon intersects another label. The validation area is computed
beforehand in the same manner. After fine-tuning both models, they were tested
on the validation subset, computing the IoU and F1 score. The performance of
the model after fine-tuning with each amount of labels is shown in Fig. 9.

Fig. 9. Performance metrics and the amount of training examples a) with REPTILE
pre-training (m r) and b) with basic transfer learning (m 0)

As it is shown in Fig. 9, model m r (REPTILE) showed a higher performance


even with a small number of labels. On the contrary, model m 0 needs many
more labels to reach a performance plateau. Indeed, the maximum performance
reached by m 0 is lower than the performance of m r with only 1 label.

4.4 Visual Validation


This section shows the detections of models fine-tuned with one annotation.
Figure 10a shows the single label and the training area containing the positive
(label) and negative pixels we chose for fine-tuning the detectors.

Fig. 10. Visualisation of the results obtained with the colonybirds half birds dataset.
(a) Annotation and training area. (b) Detection using m 0: basic transfer learning. (c)
Detection using m r: transfer learning with REPTILE pre-training

After a fine-tuning of 40 epochs we visualized the predictions of both models


(i.e., m r and m 0). On the one hand, Fig. 10b shows the resulting detections
obtained with a detector without pre-training. As it can be seen in this figure,
after 40 epochs the detector is unable to detect the birds in the image, even the
one that was labeled in the training subset is not detected as a bird. On the other
hand, Fig. 10c shows the detections of a model fine-tuned from a pre-trained state
(i.e., obtained after applying REPTILE). As it can be seen in Fig. 10c, after 40
epochs the detector successfully finds all the birds in the image.

5 Conclusions
This article shows an application of a meta-learning method called REPTILE to
solve few-shot learning object segmentation tasks. The learning task consists in
training a semantic segmentation algorithm to detect objects from aerial images
(e.g., satellite or drone). We used an architecture based on a U-Net which allows
the definition of free segmentation masks (no bounding boxes) directly from the
output of the model, thus allowing the whole process to be fine-tuned using
gradient descent (which is a pre-requisite for REPTILE).
REPTILE was tested in 30 real-world segmentation tasks and it showed a
gain in segmentation performance in early stages of training (Fig. 7). Therefore,
we found that REPTILE does produce fast-improving detectors but does not nec-
essarily improve the final performance of the detectors. This property can be
useful to build a system in which it is important to reach good performances
in just a few training steps (e.g., to reduce energy consumption in embedded
systems) or in an active learning framework in which a user can put labels
incrementally while testing whether the model improves or not after only a few

learning epochs i.e., there is no need for waiting the whole learning process given
that REPTILE “foresees” the final performance.
Moreover, we tested REPTILE in very poor labeling conditions. For one of
the available datasets, we evaluated the segmentation performance of models
trained with an increasing number of labels. For this particular dataset, REP-
TILE (m r) performs better than the base model (m 0) for both low and high
amounts of labels. Indeed, a visual inspection of the resulting segmentation of a
REPTILE pre-trained model using only one label showed satisfactory results.
All these results suggest that REPTILE makes training faster, not only
because the resulting performance in early stages of training is higher, but also
because it makes possible to obtain good results with very few annotations.

Acknowledgments. This work was supported by the Swiss Space Center (SERI/SSO
MdP program). All the experiments shown in this paper were performed thanks to a
tight collaboration with the company Picterra (https://picterra.ch/). Picterra pro-
vided all the datasets used, and participated in constructive discussion about how to
deal with large images and how to build object detectors from few examples.

References
1. Caruana, R.: Multitask learning. In: Thrun, S., Pratt, L. (eds.) Learning to Learn,
pp. 95–133. Springer, Boston (1998). https://doi.org/10.1007/978-1-4615-5529-2 5
2. Darlow, L.N., Crowley, E.J., Antoniou, A., Storkey, A.J.: CINIC-10 is not ImageNet
or CIFAR-10 (2018)
3. Paszke, A., et al.: ENet: a deep neural network architecture for real-time semantic
segmentation. CoRR, abs/1606.02147 (2016)
4. Pang, J., et al.: Libra R-CNN: towards balanced learning for object detection.
CoRR, abs/1904.02701 (2019)
5. Redmon, J., et al.: You only look once: unified, real-time object detection. CoRR,
abs/1506.02640 (2015)
6. He, K., et al.: Spatial pyramid pooling in deep convolutional networks for visual
recognition. CoRR, abs/1406.4729 (2014)
7. He, K., et al.: Mask R-CNN. CoRR, abs/1703.06870 (2017)
8. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int.
J. Comput. Vis. (IJCV) 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-
015-0816-y
9. Girshick, R.B., et al.: Rich feature hierarchies for accurate object detection and
semantic segmentation. CoRR, abs/1311.2524 (2013)
10. Ren, S., et al.: Faster R-CNN: towards real-time object detection with region pro-
posal networks. CoRR, abs/1506.01497 (2015)
11. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D.,
Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp.
740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1 48
12. Liu, W., et al.: SSD: single shot multibox detector. CoRR, abs/1512.02325 (2015)
13. Zhou, Z., Rahman Siddiquee, M.M., Tajbakhsh, N., Liang, J.: UNet++: a nested
U-Net architecture for medical image segmentation. In: Stoyanov, D., et al. (eds.)
DLMIA/ML-CDS -2018. LNCS, vol. 11045, pp. 3–11. Springer, Cham (2018).
https://doi.org/10.1007/978-3-030-00889-5 1

14. Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation
of deep networks (2017)
15. Girshick, R.: Fast R-CNN (2015)
16. Kamrul Hasan, S.M., Linte, C.A.: U-NetPlus: a modified encoder-decoder u-net
architecture for semantic and instance segmentation of surgical instrument. CoRR,
abs/1902.08994 (2019)
17. Hodgson, J., et al.: Drones count wildlife more accurately and precisely than
humans. Methods Ecol. Evol. 9, 1160–1167 (2018)
18. Krizhevsky, A., Nair, V., Hinton, G.: CIFAR-10 (Canadian Institute for Advanced
Research)
19. Nichol, A., Achiam, J., Schulman, J.: On first-order meta-learning algorithms.
CoRR, abs/1803.02999 (2018)
20. Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger (2016)
21. Redmon, J., Farhadi, A.: Yolov3: an incremental improvement (2018)
22. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomed-
ical image segmentation. CoRR, abs/1505.04597 (2015)
23. Vanschoren, J.: Meta-learning: a survey (2018)
24. Wang, Y., Yao, Q., Kwok, J., Ni, L.M.: Generalizing from a few examples: a survey
on few-shot learning (2019)
25. Zeng, Z., Xie, W., Zhang, Y., Lu, Y.: RIC-Unet: an improved neural network based
on Unet for nuclei segmentation in histology images. IEEE Access 7, 21420–21428
(2019)
Artificial Intelligence and Biomedicine
Impacted Tooth Detection in Panoramic
Radiographs

James Faure and Andries Engelbrecht(B)

University of Stellenbosch, Cape Town, South Africa


[email protected], [email protected]

Abstract. This paper proposes an approach to analyse panoramic


radiographs in order to automate diagnosis of impacted teeth. The
panoramic radiographs go through an intensive labelling process which
demarcates impacted teeth using rectangular bounding boxes. A convo-
lutional neural network is trained on these labelled images to predict
different types of impacted teeth. The empirical results illustrate good
performance with respect to impacted teeth prediction.

Keywords: Panoramic radiographs · Automated diagnosis ·


Convolutional neural networks · Deep learning · ResNet

1 Introduction
Automated analysis of radiographs has become increasingly important in devel-
oping countries such as South Africa. Public health centres face large numbers of
patients to diagnose and treat, and are mostly understaffed, even more so with
reference to specialists such as radiologists. Shortage of radiologists is more evi-
dent in radiology specializations such as orthodontic radiology. The consequences
of the above problems are that many patients are not properly diagnosed and
treated for orthodontic defects and illness.
In order to address the above problems, a project has been launched at
Stellenbosch University, in collaboration with a public health care provider, to
develop an automated process to analyse dental panoramic radiographs. The
ultimate goal is to develop a system to detect defects and illness. A first step
in the designed pipeline is to predict if a received radiograph is a workable
panoramic radiograph, that is to predict if the radiograph is of sufficient quality
for automated analysis. Faure and Engelbrecht presented this convolutional neu-
ral network (CNN) predictive model in [3]. It was shown that the CNN predictive
model accurately identified workable from non-workable panoramic radiographs.
This paper presents the first anomaly prediction model that forms part of this
pipeline. Using the workable radiographs, a CNN predictive model is developed
to predict impacted teeth. An annotation process is first applied to demarcate
and label impacted teeth. Results show that the impacted teeth prediction model


is very accurate and stable in its predictions. The paper therefore makes a con-
tribution by proposing, to the knowledge of the authors, the first application of
CNNs to teeth anomaly detection.
The rest of the paper is organized as follows: Sect. 2 discusses the related
work, while Sect. 3 discusses the chosen CNN model library, detectron2. Section 4
then discusses the training of the CNN, while Sect. 5 discusses the empirical
results following the training process.

2 Related Work

The impacted teeth detection model makes use of a CNN to identify features.
The process of identifying these features is called object detection. Section 2.1
discusses the process of annotating images by identifying objects of interest.
Types of CNN architectures are discussed in Sect. 2.2. Finally, Sect. 2.3 discusses
two previous applications of CNNs on panoramic radiographs.

2.1 Annotations

A human labeller uses a segmentation algorithm to identify a foreground region


in the image [2]. The region of interest is demarcated with a boundary in the
form of hand-drawn lines. The algorithm then assigns pixels to objects based
on low-level priorities. Various annotating techniques can be used for labelling
data. The best option depends on the nature of what needs to be labelled on the
image. A bounding box is a rectangular box that is used to define the location
of the targeted object. A deep learning model then learns the pixelation for
each annotation so that predictions can be made on novel data. The computer
vision annotation tool (CVAT) was used to annotate the impacted teeth using
bounding boxes.

2.2 Types of CNN Architectures

CNNs are specialized neural networks that are used for processing data with a
grid-like topology [4]. The data can either be in a one-dimensional grid, such
as time series data, or a two-dimensional grid, such as the pixels of an image.
CNNs are neural networks that use convolution in at least one of the CNN layers
[4]. There are many different architectures of CNNs. This section covers some
of the best CNN architectures that have been developed upon the ImageNet1
dataset [1].
The LeNet-5 [1] architecture was one of the earliest developments of CNNs
in 1998. LeCun et al. used a CNN to recognize digits in a document [7]. At
this point, the ReLU activation function had not yet been discovered, so the
model used the sigmoid function instead. The model consisted of alternating
two convolutional layers and two pooling layers, and then three fully connected
1
http://www.image-net.org/.

layers at the end. Modern architectures are based on the LeNet-5 architecture.
Enhancements of LeNet-5 were with reference to the depth of the network, the
use of ReLU activation functions, and the use of modern hardware to accelerate
training efficiency [1]. For a more detailed discussion on LeNet-5, refer to [7].
AlexNet [1] was the winner of the 2012 ImageNet large scale visual recognition
competition (ILSVRC) and was the first significant innovation of CNNs. The
model used two parallel pipelines for processing. The input images have dimensions
of 224 × 224 × 3; 96 filters of size 11 × 11 × 3 with a stride of four are used
in the first layer [1]. The first four layers alternate between two convolutional
layers with ReLU activation functions and two max-pooling layers. All the max-
pooling layers are 3 × 3 filters with a stride of two [1]. There is no pooling
between the third, fourth and fifth convolutional layers. Finally, there are three
fully connected layers at the end, and a 1000-way softmax function performs the
classification [1].
GoogLeNet [1] was the winner of the 2014 ILSVRC and is referred to as
the inception architecture. The architecture uses three different filter sizes in
parallel, called inception modules, to identify different levels of detail within the
images. A larger filter covers a more significant area, hence limiting variation. On
the other hand, smaller filters detect more detail in closer proximity. The neural
network decides which filter will influence the output the most. The architecture
has a depth of 22 layers and uses average pooling.
VGGNet [1] was runner up in the 2014 ILSVRC, but is still a critical CNN to
discuss. The innovation was that the filter sizes were reduced and the total depth
of the network increased. The architecture has anything between 11 to 19 layers
and induces optimal performance from 16 layers onwards [1]. The smaller filter
sizes allow the layer to capture more detail, and the deeper network enhances
non-linearity due to more ReLU layers. A 2 × 2 max-pooling layer follows each
set of convolutional layers with a stride of two. The pooling never overlaps.
Therefore, the spatial footprint is always reduced by a factor of two.
The winner of the ILSVRC in 2015 was the ResNet architecture [1]. The
ResNet architecture had 152 layers, which required important innovations with
respect to the model's structure. Training the ResNet model became an issue
because the large number of operations in deep layers impeded the gradient flow of
feature maps [1]. Therefore, the needed innovation was skip con-
nections. A skip connection allows for information to be copied between layers.
The gradients are propagated backwards using the skip connections. The spatial
size of the outputs does not change between layers due to the use of a padded
filter with a stride of one. The innovations around the ResNet deep network
architecture allowed for the first model to perform at human level classification
[1].
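
As a small illustration of such a skip connection, a residual block can be sketched in Keras as follows; the filter count is illustrative, and the input is assumed to already have the same number of channels so that the addition is valid.

from tensorflow.keras import layers

def residual_block(x, filters=64):
    shortcut = x                                   # copy of the input carried over the skip
    y = layers.Conv2D(filters, 3, strides=1, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, 3, strides=1, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])                # gradients flow directly through this addition
    return layers.Activation("relu")(y)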
Table 1 compares the model architectures that have been discussed through-
out the section. The top-5 error refers to the percentage of testing images whose
true class is not among the five highest-ranked predictions. CNNs have gone
through significant innovations over the past decade with the most notable
improvements happening between 2012 and 2015. The 2015 ResNet architecture

was already being labelled as better than human-level performance [1]. The most
significant trend is that the model performance improved with an increased num-
ber of layers. Furthermore, as with the ResNet architecture, more layers require
important structural innovations.

Table 1. A comparison between the best performing CNN model architectures at the
ILSVRC [1].

Name       Year  Number of layers  Top-5 error
LeNet-5    1998  <5                >25%
AlexNet    2012  8                 15.4%
VGG        2014  19                7.3%
GoogLeNet  2014  22                6.7%
ResNet     2015  152               3.6%

Based on the evidence discussed, it was decided to use a ResNet architecture


for detection of impacted teeth.

2.3 Deep Learning in Dentistry

This section serves to analyse and discuss two previous approaches to using deep
learning and computer vision in the dentistry field.
Hiraiwa et al. conducted research to use a CNN for assessment of root mor-
phology of the mandibular first molar on panoramic radiography [5]. The objec-
tive of the study was to detect the root morphology of mandibular first molars
using bounding box annotations. The deep learning algorithm was expected to
identify whether distal roots were single or had extra roots. The dataset con-
sidered 760 mandibular first molars which revealed that 597 teeth (78.6%) had
single roots and 163 teeth (21.4%) had extra roots [5]. The experiment made
use of a five-fold cross-validation procedure. Data augmentation, i.e. alterations
in the rotation, brightness, contrast, and sharpness of the images, were used to
increase the size of each training set. This resulted in 11 472 training image anno-
tations from the single root group, and 11 004 annotations from the extra root
group [5]. Two models were constructed using AlexNet and GoogleNet architec-
tures. AlexNet achieved an accuracy of 87.4% and took 51 min to train, whereas
GoogLeNet achieved an accuracy of 85.3% and took three hours to train. These
results provided evidence that it is possible to detect the number of distal roots of
mandibular first molars on panoramic radiograms. Furthermore, AlexNet proved
to use less computation power.
A more recent study looks at the performance of a CNN in the detection and
diagnosis of maxillary sinus lesions on panoramic radiographs [6]. The objective
was to determine whether or not lesions existed in a person’s sinuses. The dataset
was made up of 1174 images over three classes. Class 0, healthy maxillary sinuses,

has 578 images. Class 1, inflamed maxillary sinuses, has 416 images. Class 2, cysts
of maxillary sinus regions, has 171 images. The learning process used a model
called DetectNet. The model could detect healthy and inflamed maxillary sinuses
in 100% of the images and cysts in the maxillary sinus regions in 89–98% of the
images [6]. One issue with this study was that the model took 15 h to train. The
authors discuss the possible use of GoogLeNet to shorten the learning time and
to improve the performance [6].

3 Detectron2 ResNet101 Model


Detectron2 [9] is a computer vision model library that utilizes a CNN. Detec-
tron2 has different models that are used for specific computer vision applications.
Within the object detection space, there are three main model categories that
house a number of different models. The model categories in the object detection
library are Faster R-CNN (region-based convolutional neural networks),
RetinaNet, and RPN (region proposal network) combined with Fast R-CNN.
The detectron2 models use a ResNet backbone structure. Each model is given
a specific naming convention which can be viewed on the detectron2 GitHub
model repository2 . For the purposes of this article, the X101 model is used,
which is a ResNet101 model trained by Facebook.
The creators of the detectron2 model tested each model on a universally
available dataset. The COCO dataset3 is a larger-scale object detection dataset
that has more than 330 000 labelled images, across 80 object categories, with
more than 1.5 million instances.
The evaluation metrics of the detectron2 models are divided into two groups
of measures. The first is the training evaluation which takes into consideration
many different metrics that assess the models’ training performance. The two
most important are total loss and classification accuracy. The total loss is cal-
culated on the training and validation datasets. The loss is not a percentage
but rather a summation of the errors made for each example in the training or
validation sets. The total loss must converge as close to zero as possible. The
classification accuracy is the percentage of correct predictions on the validation
set during training. It is ideal to have this value as close to 100% as possible.
The second set of evaluation metrics are based on the predictions made on
the validation set. In predictive analytics, these metrics are often conveyed in a
confusion matrix. For object detection, the most common measures of accuracy
are called the intersection over union (IOU) and average precision (AP). The IOU
evaluates the overlap between two bounding boxes [8]: it is the area of intersection
between the predicted bounding box and the annotated bounding box, divided by the
area of their union [8]. The IOU is computed as
\mathrm{IoU} = \frac{B_b \cap B_a}{B_b \cup B_a},
2
https://github.com/facebookresearch/detectron2/blob/master/MODEL ZOO.md.
3
The dataset is publicly available and is aimed at furthering research for object detec-
tion (https://cocodataset.org/).

where Bb is the model’s prediction of the bounding box and Ba is the initially
annotated bounding box. Figure 1 illustrates the formula. The closer the value
is to one, the better the accuracy of the predictions. Generally, if the IOU is
greater than 0.5 (50%), the predicted bounding box represents a true positive.

Fig. 1. The intersection of union, where the green box is Bb and the red box is Ba [8].
(Color figure online)

Important evaluation metrics are the average precision (AP) and mean aver-
age precision (mAP). The AP is the precision averaged across all recall values
in the range [0,1] [8]. Recall is the model’s ability to find all relevant cases [8],
i.e. the percentage of true positives detected among all possible predictions from
the annotations. The mAP is then obtained from the AP. The mAP is the mean
of the AP over all classes [8]. In the model evaluation of the impaction detec-
tion model, the AP and the AP50 are critically analysed. The AP is important
because it can be benchmarked against the selected model performance on the
COCO dataset. If the model has an AP which is greater than the benchmark,
the generalization performance is acceptable. The AP50 is the average precision
at 0.5 (50%) IOU threshold.
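
A minimal sketch of the IOU computation for two axis-aligned boxes, together with the 0.5 threshold used to call a prediction a true positive, is given below; boxes are assumed to be given as (x1, y1, x2, y2) pixel coordinates.

def box_iou(box_a, box_b):
    # Intersection rectangle between the two boxes.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

iou = box_iou((10, 10, 60, 60), (20, 20, 70, 70))
is_true_positive = iou > 0.5                     # the 50% IOU threshold used for AP50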

4 Training Detectron2 ResNet101 for Impactions


This section describes the process of taking an unstructured radiograph, anno-
tating the impacted teeth, applying data pre-processing and augmentation, and
training a detectron2 model.

4.1 Labelling for Impaction Detection


An impacted tooth falls under the category of anomalies related to the changes
in the position of teeth. The third molar, more commonly known as a wisdom
tooth, is the tooth that is affected by an impaction. An impacted tooth is one
where the tooth is inverted to a certain degree or angle.
The panoramic radiograph dataset consisted of 530 images. Each image has a
dimension of 2440×1280 pixels. Within this dataset, 178 images had one or more
impacted teeth present. In total, 450 impactions were labelled in the dataset.
Figure 2 illustrates a subset of images in the training set that have impaction
labels.

Fig. 2. A sample of the training images used for impaction detection.

4.2 Configuring the Impacted Tooth Model for Training

The performance of the model is influenced by three factors, namely the dataset
preprocessing and augmentation, the training parameters, and the selected detec-
tron2 model. In order to find the optimal training parameters, three tests were
carried out using a different parameter configuration each time. The best param-
eter configuration was decided by monitoring where the total loss converges,
computational expense is minimized and prediction accuracy is maximized. The
training set is used to train the model and is evaluated using total loss and
training classification accuracy. The validation set is used to validate the model
and is evaluated using AP and AP50 . The testing set is used to make predictions
and to evaluate generalisation performance.
The first two tests used images that had equal dimensions of 416×416, which
ensured a fast training speed. Although the smaller image dimensions reduced
the training time, the predictions were visually difficult to interpret due to a
rectangular image being compressed into a square. For the third test, images
were all resized to 610 × 320 pixels, i.e. one quarter of the original image
dimensions. The new dimensions ensured
that the model would train quickly without losing as much data due to smaller
dimensions. Decreasing the image dimensions lowers the computational expense.
Two data augmentation techniques were implemented to increase the size of the
dataset, namely adjusting the brightness and cropping the size of the images.
These two augmentations were decided upon because they are most likely to
occur in a real-world situation, where radiographs may appear different in a
different brightness or size. The dataset brightness was augmented to make the
images appear both 15% brighter and darker. By adjusting the brightness, the
model will be more resilient to changes in the lighting. The images were cropped
by 20% to help the model be more resilient to image translations and adjustments
in camera position. The augmentation process increased the dataset by a factor
of three so that the training set contained 1110 images, with 963 annotations.
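
The brightness and cropping augmentations described above could be sketched with Pillow as follows; the file name is a placeholder, the split of the 20% crop into 10% per side is an interpretation, and the corresponding shift of the bounding-box annotations is omitted.

from PIL import Image, ImageEnhance

img = Image.open("radiograph.png")                      # hypothetical input radiograph
brighter = ImageEnhance.Brightness(img).enhance(1.15)   # 15% brighter
darker = ImageEnhance.Brightness(img).enhance(0.85)     # 15% darker

w, h = img.size
dx, dy = int(0.10 * w), int(0.10 * h)                   # 20% crop, 10% from each side
cropped = img.crop((dx, dy, w - dx, h - dy))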

The model for the first test was trained for 1500 iterations. The training
loss began to converge around 500 iterations, so anything more than that is a
waste of computational power and time, and would increase the risk of the model
overfitting. Early stopping was used on all further testing. Each test was carried
out with 500 iterations, which meant that total training time was reduced. As
mentioned above, the images for the first two tests had equal dimensions and
therefore did not need padding. Padding is used in the third test due to images
not having the same dimensions. Padding adds extra pixels around the image
borders to ensure that all images enter the CNN with the same dimensions.
Training a CNN can be computationally expensive due to the graphics processing
units (GPUs) required for training. Google Colab has a free package that allows
12 GB of GPU memory and up to 12 h of run time, which meant that the computational
cost would not be an issue. Therefore, the best model from the Faster R-CNN model
library, i.e. X101-FPN, is used for training. The detectron2 GitHub repository4
provides the training and evaluation
statistics of the X101-FPN model on the COCO dataset. The model follows a learning
rate schedule three times that of the usual schedule (the 3x schedule). The training
time is expected to be 0.638 s per iteration. The inference time is around 0.098 s
per image. The model is expected to consume 6.7 GB of GPU memory. Finally, the
benchmark AP is 43%. An AP above this for any object detection model will be
satisfactory. Since Google Colab is able to run the best detectron2 model, the same
model was used throughout all three tests.
The third test resulted in a satisfactory training parameter configuration and
validation accuracy. All further tests were conducted using the same training
configuration. After configuration, the training set consisted of 1110 images,
the validation set of 109 images, and the testing set of 53 images.
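A minimal, hypothetical Detectron2 training configuration consistent with the setup described above (X101-FPN from the model zoo, 500 iterations, a single impaction class); the dataset names are placeholders, not the authors' actual values.

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-Detection/faster_rcnn_X_101_32x8d_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-Detection/faster_rcnn_X_101_32x8d_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("impaction_train",)   # placeholder dataset name
cfg.DATASETS.TEST = ("impaction_val",)      # placeholder dataset name
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1         # single class: impacted tooth
cfg.SOLVER.MAX_ITER = 500                   # early stopping point found in the first test

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```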

5 Empirical Results

Section 5 discusses the empirical results of the impaction detection model. The
experiment is conducted for 10 independent tests.

5.1 Evaluation of the Impaction Detection Model

The training time of the impaction detection model was 20 min. The total loss
converges around 0.35, as seen in Fig. 3. The classification accuracy on the training
set, as seen in Fig. 4, converges to 97.40%. Both training metrics provide
evidence that the model is well trained. As a performance benchmark, by refer-
ring to the detectron2 GitHub repository, the X101-FPN model has an AP of
43.0% on the COCO dataset. The AP for the impaction detection dataset is
50.58%. The higher AP of the impaction detection model provides evidence of

4 https://github.com/facebookresearch/detectron2/blob/master/MODEL_ZOO.md

a model that predicts well. Since there is only one class in this project, the AP and
mAP have the same value. The AP50 is the average precision at a 0.5 (50%) IOU
threshold, which corresponds to a true positive prediction, as discussed in Sect. 3.
The AP50 for the impaction detection model was 87.779%, which provides evidence
of the model's ability to precisely find all relevant cases.

Fig. 3. The training loss of the impaction detection model.

Fig. 4. A graphical representation of the training evaluation metrics.

5.2 Analysis of the Results of the Impaction Detection Model

To provide further evidence of a stable model, additional analysis was done.


Firstly, the test dataset is used to perform predictions on unlabelled images. The
trained model demarcates a predicted bounding box around an impacted tooth,
and provides a percentage of certainty, or confidence, of the prediction. A high

certainty percentage reflects accurate predictions. Secondly, 10 independent runs


were executed with each experiment conducted on a new set of initial weights
and different image combinations in the training, validation and testing sets. The
mean and standard deviation for each evaluation metric are determined over the 10
runs. The mean of each evaluation metric provides a better estimate of the model's
performance. The standard deviation of the performance over the independent
runs gives information on stability. The smaller the standard deviation, the more
stable the model.
Figure 5a shows a subset of correctly labelled impactions that all achieve a
certainty score above 96%. Figure 5b shows an image in the dataset in which a
tooth that was not impacted was incorrectly labelled as impacted. Here it can be
seen that the model’s certainty was considerably lower than the correctly labelled
impactions. In order to avoid incorrect predictions, a certainty score limit can
be set to 90%. Any prediction below this certainty limit will not be displayed
on the final images, or can be flagged for the second opinion of a professional
radiologist or dentist.
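A short sketch (an assumption about implementation, not the authors' code) of applying the 90% certainty limit: predictions below the threshold are not displayed but flagged for expert review.

```python
def filter_predictions(boxes, scores, limit=0.90):
    # keep confident detections, flag uncertain ones for a second opinion
    kept, flagged = [], []
    for box, score in zip(boxes, scores):
        (kept if score >= limit else flagged).append((box, score))
    return kept, flagged
```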

Fig. 5. A set of images from the testing set: (a) a sample of correctly predicted impactions; (b) a sample of incorrectly predicted impactions.

Table 2 displays the performance from each of the 10 independent runs, as


well as the mean and standard deviation for each evaluation metric. By considering
each metric, the following can be concluded. Over the 10 independent runs,
the mean total loss was 0.324, and the standard deviation was 0.013. These met-
rics provide evidence that the total loss was stable and successful in converging

close to zero. The classification accuracy had a mean of 97.2% and a standard
deviation of 0.002 over the 10 tests. The classification accuracy was stable and
close to 100%. Both the total loss and classification accuracy provide evidence
that the model is well trained. The mean average precision was 51.724% over the
10 tests. The result was better than the COCO dataset benchmark (43%). The
standard deviation was 1.471, which provides evidence of a stable model. The
average AP50 is the most important metric to consider when evaluating whether
the model can accurately detect impacted teeth that are present on the radio-
graph. The mean AP50 was 88.924% and the standard deviation was
1.536. These metrics provide evidence that the model has good generalization
performance and stability over the 10 independent runs.

Table 2. Results of the 10 anomaly detection tests, with the mean and standard deviation for each evaluation metric

Test                 Total loss  Classification accuracy  AP      AP50
Test 1               0.343       0.974                    50.902  87.779
Test 2               0.348       0.973                    53.852  90.454
Test 3               0.310       0.971                    53.863  90.319
Test 4               0.317       0.971                    52.916  90.405
Test 5               0.324       0.972                    50.949  88.761
Test 6               0.326       0.970                    52.280  91.097
Test 7               0.324       0.973                    49.518  88.049
Test 8               0.317       0.971                    51.31   86.649
Test 9               0.327       0.971                    50.229  87.36
Test 10              0.3075      0.969                    51.423  88.371
Mean                 0.324       0.972                    51.724  88.924
Standard deviation   0.013       0.002                    1.471   1.536

Overall, the results show that the model was successful in accurately identifying
impacted teeth. The model trained well, had good precision on the validation
set, and ultimately was able to detect impacted teeth with high certainty.

6 Conclusion
The orthodontic region is complex, and the correct diagnosis of anomalies can
be very time consuming and potentially inaccurate. Many different aspects, such
as impacted teeth, have to be considered and analysed in dental diagnosis. In
South Africa, dental screening centres are overloaded with patients and under-
staffed. Consequently, many patients leave the screening centres undiagnosed.

The work discussed in this paper took a batch of clear panoramic radiographs,
which were labelled for impacted teeth. The labelled dataset was then used
to train a detectron2 model to successfully detect impacted teeth on unlabelled
panoramic radiographs. The results from the impaction detection model provide
evidence of a model which is able to detect with great accuracy and high
certainty. The impaction detection X101-FPN model achieved an average
precision (50.902%) higher than the 43.0% benchmark reported on the COCO
dataset.
The object detection models showed how it is possible to have orthodontic
anomalies diagnosed by deep learning predictive models. Further research is
to be conducted into building numerous object detection models for a range of
different anomalies, diseases and infections. The models will be combined into
an automated pipeline that will be used on the vast number of radiographs that
enter the databases of these screening centres.

References
1. Aggarwal, C.: Neural Networks and Deep Learning. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-94463-0
2. Dutt Jain, S., Grauman, K.: Predicting sufficient annotation strength for interactive foreground segmentation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1–8 (2013)
3. Faure, J., Engelbrecht, A.: A convolution neural network used for dental panoramic radiograph classification. In: Proceedings of 5th International Conference on Intelligent Systems, Metaheuristics & Swarm Intelligence (2021)
4. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge (2016)
5. Hiraiwa, T., et al.: A deep-learning artificial intelligence system for assessment of root morphology of the mandibular first molar on panoramic radiography. Dentomaxillofac. Radiol. 48(3) (2019). https://doi.org/10.1259/dmfr.20180218
6. Kuwana, R., et al.: Performance of deep learning object detection technology in the detection and diagnosis of maxillary sinus lesions on panoramic radiographs. Dentomaxillofac. Radiol. 50(1) (2021)
7. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
8. Padilla, R., Netto, S., da Silva, E.: A survey on performance metrics for object-detection algorithms. In: Proceedings of 2020 International Conference on Systems, Signals and Image Processing (2020). https://doi.org/10.1109/IWSSIP48289.2020
9. Wu, Y., Kirillov, A., Massa, F., Lo, W., Girshick, R.: Detectron2 (2019). https://github.com/facebookresearch/detectron2
Deep Learning for Diabetic Retinopathy
Prediction

Ciro Rodriguez-Leon1,2, William Arevalo3, Oresti Banos1, and Claudia Villalonga1(B)
1 Research Center for Information and Communication Technologies, University of Granada, C/ Periodista Rafael Gómez, No 2, 18071 Granada, Spain
[email protected], {oresti,cvillalonga}@ugr.es
2 Department of Computer Science, University of Cienfuegos, Km 4, Cuatro Caminos, Cienfuegos, Cuba
[email protected]
3 School of Engineering and Technology, Universidad Internacional de La Rioja, Av. de la Paz, 137, 26006 Logroño, La Rioja, Spain
[email protected]

Abstract. Diabetic retinopathy is a complication of diabetes mellitus.


Its early diagnosis can prevent its progression and avoid the development
of other major complications such as blindness. Deep learning and trans-
fer learning appear in this context as powerful tools to aid in diagnosing
this condition. The present work proposes to experiment with different
models of pre-trained convolutional neural networks to determine which
one fits best the problem of predicting diabetic retinopathy. The Diabetic
Retinopathy Detection dataset supported by the EyePACS competition
is used for evaluation. Seven pre-trained CNN models implemented in
the Keras library developed in Python and, in this case, executed in the
Kaggle platform, are used. Results show that no architecture performs
better in all evaluation metrics. From a balanced behaviour perspective,
the MobileNetV2 model stands out, with execution times almost half
that of the slowest CNNs and without falling into overfitting with 20
learning epochs. InceptionResNetV2 stands out from the perspective of
best performance, with a Kappa coefficient of 0.7588.

Keywords: Diabetic retinopathy · Deep learning · Transfer learning

1 Introduction
Diabetes mellitus (DM) is a group of metabolic disorders characterized by a high
blood glucose level over a long time. People with DM are likely to have other
major health problems. DM patients with poor control of the disease have high
chances to develop complications. Especially chronic hyperglycemia can cause
vascular damage and lead to developing diabetic retinopathy (DR) [21]. This
complication is the ocular manifestation of DM and can cause microaneurysms,
hemorrhages, exudates, venous changes, neovascularization, retinal thickening,
© Springer Nature Switzerland AG 2021
I. Rojas et al. (Eds.): IWANN 2021, LNCS 12861, pp. 537–546, 2021.
https://doi.org/10.1007/978-3-030-85030-2_44

and blindness depending on the stage of the disease. DR is the center of attention
of many researchers due to its incidence. For example, in the United States 40%
of diabetic patients suffer some stage of DR and it is the main cause of blindness
among people of working age [17].
DR can be diagnosed in four possible stages of progression ranging from mild to
severe: 1) mild non-proliferative retinopathy, in which microaneurysms are formed;
2) moderate non-proliferative retinopathy, in which some blood vessels that nourish
the retina are blocked; 3) severe non-proliferative retinopathy, with multiple
hemorrhages in different places, venous beading, or intraretinal microvascular
abnormalities; and 4) proliferative retinopathy, with the formation of abnormal new
vessels and/or vitreous hemorrhages [7]. To minimize the complications of DR,
healthcare professionals are looking for
improvements through different strategies. One avenue of progress is the use of
artificial intelligence techniques to support decision making on the diagnosis of
the disease. There are already multiple ways to address the problem of diagnosing
DR or identifying its stages of progression; artificial neural networks (ANN) are a
good example, performing very well in these tasks [6]. Results in the 2015
competition on the Kaggle platform ranged between 0.70 and 0.80 in Kappa
coefficient [1]. The progress achieved in this field recently allows the rapid
execution and deployment of complex algorithms such as convolutional neural
networks (CNN), leaders in the detection of patterns in images [14].
With the development of the ANN and deep learning (DL), transfer learning
has emerged. The general idea of this type of learning is to build on a pre-trained
model generated using an extensive dataset from the same domain. The main
advantage of transfer learning is that the model has not to be initialized from
scratch, which saves time and provides robustness. For example, CNN’s are pre-
trained with reference datasets as large and representative as possible [16]. In
this paper, we propose to experiment with different pre-trained CNN models
to determine the most suitable according to computational cost and accuracy
in the prediction of DR in healthy or any of its four stages of progression. To
choose the best model, in addition to performance metrics such as accuracy,
we measure its efficiency in computational resource consumption at replication
and implementation. Therefore, processing times and computational resources
required are also considered as evaluation metrics.

2 Related Work
Before the increasing popularity of DL techniques, researchers tried to solve the
problem of detecting DR from images using traditional machine learning meth-
ods such as decision trees, support vector machine (SVM), or k-nearest neighbor.
These classical techniques did not achieve the desired performance and were com-
putationally expensive [25]. Over the years, the development and improvement
of different algorithms enhanced the performance of DR detection. In 2012, a
review of algorithms ranging from SVM, Bayesian optimizations, and rule-based
systems to multilayer perceptrons and deep neural networks, showed the advan-
tage of ANN over other algorithms [8]. In recent years, CNNs have become one of

the main techniques used in tasks like image recognition, natural language pro-
cessing, and time series analysis [26]. One of the first CNN architectures used
in these domains was AlexNet in 2012. Later, many more were introduced like
VGG16, VGG19, ResNet, GoogLeNet, DenseNet, and Xception [16].
As a result of the above, in recent years the most common approach to DR
recognition has been the DL techniques. The scientific community has focused
the attention on developing and applying such techniques because it has proven
to be a cost-effective tool for DR screening [18]. Some works have chosen to take
only one of these architectures and try to get the best performance. In this case
the most common CNN architecture used is Inception-V3 [15,19]; Inception-v4 is
also used by some studies [10,20]; other studies used GoogLeNet, VGGNet-19,
VGG16, VGGNet-16, DenseNet, and AlexNet [16]. In contrast, other authors
experiment with a set of CNNs and compare their results, for example AlexNet
and VGG [5], or InceptionV3, ResNet18, ResNet101, VGG19, and Inception@4
(the authors' own version of InceptionV3) [9].

3 Materials and Methods


3.1 Dataset

The dataset used for the experimentation is the Diabetic Retinopathy Detection
by EyePACS with data augmentation [1]. It is the result of a preprocessing of the
images, taken with different types and models of cameras, through a filter that
allows highlighting the main features, such as microaneurysms and hemorrhages.
The dataset has five classes, one representing healthy patients and the remaining
four represent the degrees of retinopathy (see Fig. 1). It also has a preprocessed
image magnification and therefore has a high number of observations, namely
88712 with an approximate resolution of 400 × 400 pixels. It is published by the
medical institution EyePACS and preprocessed by TensorFlow.

3.2 Pre-trained CNN Models

For this study, we choose a set of CNN architectures to compare their performance
and results and thus draw more substantial conclusions. The models used are
VGG16 and its variant VGG19 [22], ResNet50V2 variant of ResNet [11], Incep-
tionResNetV2 [23], MobileNetV2 [12], DenseNet121 variant of DenseNet [13],
and EfficientNetB2 variant of EfficientNet [24].
VGG16 is used because it achieved an accuracy of 91.90% in the 2014 ImageNet
competition, ranking it within the top five. The main difference between VGG16
and VGG19 is that VGG19 has 19 convolutional layers instead of VGG16's 16,
increasing its parameter count from 138,357,544 to 143,667,240 because of the
additional layers. The authors suggest that these additional layers make the
architecture more robust and capable of learning more complex patterns.
ResNet achieved first place in the 2015 ImageNet competition with a top-five
accuracy of 94.29%. The variant ResNet50V2 used is a modified

Fig. 1. Degrees of retinopathy (DR): (a) healthy, (b) mild DR, (c) moderate DR, (d) severe DR, (e) proliferative DR.

version of ResNet50 that performs better than ResNet50 and ResNet101 in the
ImageNet dataset. ResNet50V2 has a convolutional layer, followed by a pooling
layer and 17 convolutional blocks with 3 × 3 filters with residual connections.
InceptionResNetV2 is a highly layered and complex network as internally
it has convolutional layers with 3 × 3 filters, pooling layers, and dropout lay-
ers. Also, there are multiple configurations of convolutional blocks with residual
connections throughout the architecture.
MobileNetV2 is designed and optimized to work in mobile and embedded
computer vision applications. The standard convolutional layers are replaced by
two lighter layers designed for this purpose, allowing for faster training. In
addition, two hyperparameters are introduced to trade off latency and accuracy.
DenseNet121 is a variant of DenseNet, an architecture inspired by ResNet;
instead of residual connections, dense blocks are used, consisting of convolution
layers placed sequentially, similar to VGG, but with each layer connected to
all subsequent layers. The intent is to reduce the loss of information between
layers, especially in deep layers.
Finally, EfficientNet is claimed to be 8.4× smaller and 6.1× faster in inference
than the best existing convolutional network architectures. The EfficientNetB2
version, similar to MobileNet, has a novel optimization system added to the con-
volution blocks. All named CNN architectures are implemented in Keras, a DL
API from the TensorFlow platform.

3.3 Experimental Setup

Computing Platform Specs. Kaggle, the online platform for data science,
is selected as a platform to perform the experiments. The R and Python 3
languages can be used, as well as the ANN Keras library developed in Python
and widely supported by the scientific community. Besides, the platform provides
a free GPU and TPU (Tensor Processing Unit) quota weekly (Table 1).

Table 1. Computing platform specs [3, 4].

Resource                                     Specification
Processor                                    Intel(R) Xeon(R) CPU @ 2.30 GHz
# Physical cores                             1
# Virtual cores per physical core            2
# Processing threads per virtual core        2
Amount of RAM                                16 GB
Hard disk                                    155 GB
GPU                                          Tesla P100 - 13 GB
# of hours of continuous execution allowed   9 h
# of weekly working hours with GPU           39 h

CNN’s Configuration. The data set used is unbalanced, since the number
of examples where there is no retinopathy is much higher (65 343) than in the
rest of the classes (23 359). Hence, a class dictionary is created to compute the
weights passed to the model to give the same importance to the different classes
regardless of the existing imbalance. The weights are computed with the API
provided by the Scikit-Learn library. At the same time, the images are resized
to 400 by 400 pixels, a resolution commonly used for health problems on Kaggle
and small enough to maintain computational performance while retaining the
quality needed to recognize relevant patterns. The pixel values are then
normalized to improve performance when training the CNNs.
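A hypothetical sketch of this preprocessing step: balanced class weights computed with Scikit-Learn (the toy label array is only illustrative) and resizing/normalising the fundus images to 400 × 400.

```python
import numpy as np
import tensorflow as tf
from sklearn.utils.class_weight import compute_class_weight

labels = np.array([0, 0, 0, 1, 2, 3, 4, 0])   # toy example of per-image DR grades
classes = np.unique(labels)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
class_weight = dict(zip(classes, weights))     # later passed to model.fit(..., class_weight=...)

def preprocess(image):
    image = tf.image.resize(image, (400, 400))  # resolution used in this work
    return tf.cast(image, tf.float32) / 255.0   # normalise pixel values to [0, 1]
```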
In line with the stated objective of easy implementation, good performance, and
minimal resource cost, and using the previously mentioned pre-trained CNN models,
the following hyperparameter configuration is used (a minimal illustrative sketch
follows this list):

– Adam optimizer; it allows adjustments in the middle of epochs, giving great
flexibility to improve model performance during training.
– Learning rate: 0.0001; widely used across computer vision projects on the
Kaggle platform.
– Error metric: categorical crossentropy; since this is a single-label, multi-class
problem, it is the appropriate choice for this project.
– Performance metric: CategoricalAccuracy; it measures the average number of
hits regardless of the class.
– Number of epochs: 20; a value at which no model exceeds the execution time
allowed on Kaggle.
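A minimal transfer-learning sketch of the assumed setup (not the authors' exact script): one of the seven evaluated backbones (MobileNetV2 is used here only as an example) with a 5-class softmax head, compiled with the hyperparameters listed above; the dataset variables in the commented line are placeholders.

```python
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(include_top=False, weights="imagenet",
                                          input_shape=(400, 400, 3), pooling="avg")
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(5, activation="softmax"),  # healthy + 4 DR stages
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy",
              metrics=[tf.keras.metrics.CategoricalAccuracy()])
# model.fit(train_ds, validation_data=val_ds, epochs=20, class_weight=class_weight)
```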

Evaluation. To run the experiments, the dataset is divided into training,
validation, and test sets using 40%, 12%, and 48% of the data, respectively. After
being trained and stored at the epoch of best performance for the loss metric in
the validation set, the model is evaluated with independent data. From this last
evaluation, the loss and the categorical accuracy are obtained. Also, the Kappa
coefficient is calculated to compare the solutions of this work with those of the
competition using the same dataset on the Kaggle platform [2].
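A sketch (an implementation assumption) of the final evaluation step: the Kappa coefficient computed with Scikit-Learn from predicted and true grades, assuming the quadratic-weighted variant used by the Kaggle competition.

```python
from sklearn.metrics import cohen_kappa_score

y_true = [0, 1, 2, 3, 4, 0]   # toy ground-truth DR grades
y_pred = [0, 1, 2, 2, 4, 0]   # toy predicted grades
kappa = cohen_kappa_score(y_true, y_pred, weights="quadratic")
```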
Computational complexity is measured by taking into account two features.
First, the number of layers contained in each CNN architecture is compared,
where a higher number of layers implies higher complexity. Secondly, the overall
running time consumed by the CNNs on training, validation, and testing is used.

4 Results and Discussion


The results of the pre-trained CNN models are shown in Table 2. Analyzing the
results shown we see that, depending on the metric analyzed, the best model
changes. This is evidence that no architecture performs better on all metrics
at the same time for this problem. Therefore, it is necessary to verify which
model best approximates a balanced behavior among all the metrics taken into
account. MobileNetV2 stands out because it is a small architecture (156 layers),
fast to run (17288 s, almost half the time of the slowest CNNs), while performing
well on the other metrics, with the second-best loss (0.7897) and
third-best AUC (0.9254), for example. These scores may be due to the use of two
types of layers that replace the traditional convolutional layers. The network does
not appear to have fallen into overfitting during training and evaluation times,
even with 20 epochs, so it may still improve if more epochs are added. More
details on the performance of this network are available in Fig. 2.

Table 2. CNN models’ results

Model              # of layers  Run time  Loss    Categorical accuracy  Kappa   AUC
VGG16              23           21543 s   0.824   0.7169                0.7345  0.9161
VGG19              26           24526 s   0.8503  0.7361                0.7188  0.9179
ResNet50V2         192          16732 s   0.8076  0.7035                0.6024  0.9173
InceptionResNetV2  782          29872 s   0.7919  0.7272                0.7588  0.9233
MobileNetV2        156          17288 s   0.7897  0.726                 0.6648  0.9254
DenseNet121        429          19460 s   0.7561  0.7462                0.718   0.9347
EfficientNetB2     342          29142 s   0.8455  0.7585                0.6528  0.9306

Fig. 2. Performance of MobileNetV2.

From another perspective, if complexity and training time are set aside and the
analysis is based only on the resources that the Kaggle environment provides, the
best architecture is InceptionResNetV2: its loss is low (0.7919), its categorical
accuracy is high (0.7272), and its Kappa coefficient (0.7588) ranks in 23rd position
out of the 610 entries of the 2015 Kaggle competition organized by EyePACS [1].
This may be due to the complexity of the problem and the need for an architecture
with a large number of layers. InceptionResNetV2 mainly relies on different
configurations of convolutional blocks with residual connections, which makes it a
robust CNN. On the other hand, according to the obtained results, there is a
tendency to specialize in the predominant class even with regularization such as
dropout and normalization layers. More details on the performance of this network
are available in Fig. 3.

Fig. 3. Performance of InceptionResNetV2.

5 Conclusions

Diabetic retinopathy affects many people worldwide. It can cause severe vision
problems and even blindness. Its identification or classification into its different
types by experts is complex because the differences between the stages of evolu-
tion of the disease are very subtle. In the present work, we propose to use transfer
learning to detect the 5 different classes related to DR: healthy, or one of the 4
different stages of progression of the disease. Therefore, we have compared pre-
trained CNN to solve this classification problem using the Diabetic Retinopathy
Detection by EyePACS from Kaggle. Specifically, VGG16, VGG19, ResNet50V2,
InceptionResNetV2, MobileNetV2, DenseNet121, and EfficientNetB2 architec-
tures have been used on the Kaggle online platform using only the resources
available on it. The comparison of the results between the different models has
been performed using the metrics of loss, categorical accuracy, Kappa coefficient,
execution time, and structural complexity based on the number of layers.

All evaluated architectures performed above 70% categorical accuracy. Two


models have stood out in the experimentation. Firstly, MobileNetV2, being a
relatively small network that has achieved a good performance in all the proposed
metrics, has not fallen into overfitting and has been trained in approximately half
the time of the more complex networks. Secondly, InceptionResNetV2, although
it has taken the longest time to train, has been the best performer according to
the Kappa coefficient, placing it within the top 25 of 610 solutions for the 2015
competition on the Kaggle platform with this same dataset. The potential of
pre-trained CNNs to solve this and similar problems is demonstrated, with their
rapid prototyping, fast training, easy implementation, and good outcomes. Results
such as those of the present work are a step forward for creating reliable tools
for experts in health care areas to improve the quality of life of people with DM.

References
1. Diabetic Retinopathy Detection – Kaggle. https://www.kaggle.com/c/diabetic-retinopathy-detection/overview
2. Diabetic Retinopathy Detection – Kaggle. https://www.kaggle.com/c/diabetic-retinopathy-detection/leaderboard
3. Kaggle Machine Specification – CPU/GPU/RAM/OS. https://www.kaggle.com/lukicdarkoo/kaggle-machine-specification-cpu-gpu-ram-os
4. Notebooks Documentation – Kaggle. https://www.kaggle.com/docs/notebooks
5. Abràmoff, M.D., Lavin, P.T., Birch, M., Shah, N., Folk, J.C.: Pivotal trial of an autonomous AI-based diagnostic system for detection of diabetic retinopathy in primary care offices. npj Digit. Med. 1(1) (2018). https://doi.org/10.1038/s41746-018-0040-6
6. Abràmoff, M.D., Niemeijer, M., Suttorp-Schulten, M.S., Viergever, M.A., Russell, S.R., Van Ginneken, B.: Evaluation of a system for automatic detection of diabetic retinopathy from color fundus photographs in a large population of patients with diabetes. Diab. Care 31(2), 193–198 (2008). https://doi.org/10.2337/dc07-1312
7. Engerman, R.L.: Pathogenesis of diabetic retinopathy. Diabetes 38(10), 1203–1206 (1989). https://doi.org/10.2337/diab.38.10.1203
8. Faust, O., Acharya, R., Ng, E.Y., Ng, K.H., Suri, J.S.: Algorithms for the automated detection of diabetic retinopathy using digital fundus images: a review. J. Med. Syst. 36(1), 145–157 (2012). https://doi.org/10.1007/s10916-010-9454-7
9. Gao, Z., Li, J., Guo, J., Chen, Y., Yi, Z., Zhong, J.: Diagnosis of diabetic retinopathy using deep neural networks. IEEE Access 7, 3360–3370 (2019). https://doi.org/10.1109/ACCESS.2018.2888639
10. Gulshan, V., et al.: Performance of a deep-learning algorithm vs manual grading for detecting diabetic retinopathy in India. JAMA Ophthalmol. 137(9), 987–993 (2019). https://doi.org/10.1001/jamaophthalmol.2019.2004
11. He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 630–645. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_38
12. Howard, A.G., et al.: MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv, April 2017. http://arxiv.org/abs/1704.04861

13. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, pp. 2261–2269. Institute of Electrical and Electronics Engineers Inc., November 2017. https://doi.org/10.1109/CVPR.2017.243
14. Indolia, S., Goswami, A.K., Mishra, S.P., Asopa, P.: Conceptual understanding of convolutional neural network - a deep learning approach. Procedia Comput. Sci. 132, 679–688 (2018). https://doi.org/10.1016/j.procs.2018.05.069
15. Kanagasingam, Y., Xiao, D., Vignarajan, J., Preetham, A., Tay-Kearney, M.L., Mehrotra, A.: Evaluation of artificial intelligence-based grading of diabetic retinopathy in primary care. JAMA Netw. Open 1(5), e182665 (2018). https://doi.org/10.1001/jamanetworkopen.2018.2665
16. Kandel, I., Castelli, M.: Transfer learning with convolutional neural networks for diabetic retinopathy image classification. A review. Appl. Sci. 10(6) (2020). https://doi.org/10.3390/app10062021
17. Kanski, J.J., Bowling, B.: Clinical Ophthalmology: A Systematic Approach. Elsevier Health Sciences (2011)
18. Lim, G., Bellemo, V., Xie, Y., Lee, X.Q., Yip, M.Y.T., Ting, D.S.W.: Different fundus imaging modalities and technical factors in AI screening for diabetic retinopathy: a review. Eye Vis. 7(1), 1–13 (2020). https://doi.org/10.1186/s40662-020-00182-7
19. Natarajan, S., Jain, A., Krishnan, R., Rogye, A., Sivaprasad, S.: Diagnostic accuracy of community-based diabetic retinopathy screening with an offline artificial intelligence system on a smartphone. JAMA Ophthalmol. 137(10), 1182–1188 (2019). https://doi.org/10.1001/jamaophthalmol.2019.2923
20. Raumviboonsuk, P., et al.: Deep learning versus human graders for classifying diabetic retinopathy severity in a nationwide screening program. npj Digit. Med. 2(1) (2019). https://doi.org/10.1038/s41746-019-0099-8
21. Rodriguez-Leon, C., Villalonga, C., Munoz-Torres, M., Ruiz, J.R., Banos, O.: Mobile and wearable sensing for the monitoring of diabetes-related parameters: systematic review. JMIR mHealth uHealth (2020). https://doi.org/10.2196/25138
22. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: 3rd International Conference on Learning Representations (ICLR 2015), Conference Track Proceedings, September 2015. http://www.robots.ox.ac.uk/
23. Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.A.: Inception-v4, Inception-ResNet and the impact of residual connections on learning. Technical report, February 2017. www.aaai.org
24. Tan, M., Le, Q.V.: EfficientNet: rethinking model scaling for convolutional neural networks. In: 36th International Conference on Machine Learning (ICML 2019), pp. 10691–10700, May 2019. http://arxiv.org/abs/1905.11946
25. Williamson, T.H.: Artificial intelligence in diabetic retinopathy. Eye 35(2), 684 (2021). https://doi.org/10.1038/s41433-020-0855-7
26. Zhao, B., Lu, H., Chen, S., Liu, J., Wu, D.: Convolutional neural networks for time series classification. J. Syst. Eng. Electron. 28(1), 162–169 (2017). https://doi.org/10.21629/JSEE.2017.01.18
Facial Image Augmentation from Sparse Line
Features Using Small Training Data

Shih-Kai Hung(B) and John Q. Gan

School of Computer Science and Electronic Engineering, University of Essex, Colchester, UK


{sh19143,jqgan}@essex.ac.uk

Abstract. Data collection is expensive in many research fields. Data augmenta-


tion from a very small dataset, such as synthesising realistic images from limited
or incomplete information available from a small number of sample images, is still
an enormous challenge using deep convolutional neural networks that traditionally
require a large number of training data to achieve reasonable performance. For the
purpose of manipulating the synthetic results with diversity, line features, which
can be easily obtained through computer vision, hand-drawn lines, or customer-
designed sketches, can be utilized to provide extra details to effectively augment
a small training dataset for many applications. In this paper, a novel conditional
generative adversarial network (GAN) framework for synthesising photorealis-
tic facial images using small training data and limited line features is proposed,
where sparse line features are expected to simulate abstract and incomplete hand-
drawn sketches for introducing diversity in the augmented facial images. The
proposed GAN framework can automatically recover the lost information caused
by incomplete input features, which has been proved to efficiently reduce unex-
pected distortions but enhance data diversity with controllable sparse line fea-
tures. Experimental results have demonstrated that the proposed method with a
very small dataset, 50 training images only, can generate images of higher quality
than the traditional translation methods and preserve essential details to synthesise
diverse but realistic facial images. Compared to the state-of-the-art methods, the
proposed GAN framework can generate more photorealistic facial images using
controllable sparse line features in terms of better (lower) FID and KID scores as
well as preference evaluation by human perception.

Keywords: Data augmentation · Convolutional neural networks · Generative


adversarial networks (GANs)

1 Introduction
In many applications, it is costly to collect a large amount of training data for deep
learning to achieve robust generalization performance. In general, a critical problem in
generative adversarial networks (GANs) trained with a small training dataset is that the
discriminator will overfit the training data and make the training process hard to reach
photorealistic and diverse results [1], which often leads to meaningless distortions in
generated images. Therefore, it remains extremely challenging for GANs to synthesise

© Springer Nature Switzerland AG 2021


I. Rojas et al. (Eds.): IWANN 2021, LNCS 12861, pp. 547–558, 2021.
https://doi.org/10.1007/978-3-030-85030-2_45

images of photorealistic quality based on small training data. Another important issue
in GANs training is its relatively low efficiency: it needs to take into account not only
the issue of multi-objective optimization but also the increased computation time when
attempting to use complicated deep network structures [2].
In recent years, conditional image synthesis has become one of the most popular
research areas, which aims to generate photorealistic and diverse images based on con-
ditional inputs. The conditional inputs can be edges, lines, mark points, masks, semantic
maps, labels and so on [3]. Many popular methods using conditional GANs directly
learn the pixel relations between the assigned input data and given input images instead
of noise with neural networks [4]. GANs have been developed for photorealistic image
synthesis, but it remains a challenging mapping problem for conditional GANs to
generate realistic images from incomplete features, such as abstract sketches, sparse
lines, unspecific marks and a small number of training images [5].
To resolve the issues mentioned above, this paper proposes a novel conditional GAN
framework to generate photorealistic images as well as diverse results for data aug-
mentation purposes based on sparse line features and a very small number of training
images. Instead of deepening the original convolutional layers or increasing the num-
ber of parameters [6], the proposed GAN framework adopts a newly refined domain
corresponding to sparse lines, regional binarisation and segmentation masks, in which
new reference images are obtained from preprocessing training images. Particularly, the
proposed network structure makes it possible to obtain additive pixel correlations
between sparse inputs and ground truth images. The proposed model is expected to reduce overfitting
during training with a small dataset whilst GANs efficiently learn the critical features
from inputs, such as colour, texture, and shape. The primary contributions of this paper
are as follows:

• A data augmentation technique is proposed for generating photorealistic facial images


using limited line features and a small number of training images, where the proposed
cascaded GAN structure allows insufficient data to be augmented with diversity for
practical applications. The experimental results show that the proposed method can generate
images of higher quality than the state-of-the-art methods using very small training
data.
• For the first time, the proposed method uses the mixture of pixel values of both binary
images and segmentation masks as a new transformation domain, known as the refined
domain, to enhance the features of the input sparse lines. When adopting the refined
domain, the proposed GAN framework can efficiently integrate facial components,
such as eyes, nose, mouth, etc., and reduce distortions in the diverse images generated
through a very small training dataset.

2 Related Work

The method proposed in this paper attempts to augment image data using modifiable
sparse line features, which is built on the method of image-to-image translation with
conditional GANs. This section introduces related techniques for image synthesis.

2.1 Data Augmentation and Image-To-Image Translation


With the development of deep neural networks, many image-to-image translation
approaches have been proposed [7]. Since conditional GANs provide alternative meth-
ods for data augmentation to synthesise and edit images, many researchers have proposed
various GAN structures to generate diverse images of high-quality [3, 8]. Image syn-
thesis based on a small image dataset can boost the applications of data augmentation
because it is still time-consuming and costly to collect a large image dataset in many real
applications. However, given the large number of free parameters in deep neural networks,
training a deep neural network with small data inevitably requires measures to prevent
overfitting.
Image-to-image translation is an image synthesis method with conditional forms of
input information, such as images [9], text phrases [10], scenes [11], and segmentation
masks [12]. The conditional information can be learned from the source domain (input
domain) to a corresponding target domain (output domain or ground truth domain). The
main concept of image-to-image translation is to discover the distributive relationship
between input and output images [13]. Conditional GANs provide a prominent approach
to translate images [14], in which image-to-image translation methods can automatically
generate images to depict the target objects in different domains. Moreover, to map
input images into another domain [15], either paired or unpaired data can be
leveraged [16]. Image-to-image translation methods are powerful in learning the
dependence between pairs of images and translating features from segmentation labels,
sketches, or lines into photorealistic images, and supervised learning techniques have
been proved to efficiently improve the resolution and details in synthetic images [3].

2.2 Line-to-Image Synthesis


Compared to the remarkable advantages of image-to-image translation mentioned above,
line-to-image synthesis can achieve visually pleasing results and diverse image data aug-
mentation [17]. Line maps are one of the simple and easy-to-obtain features to represent
image information in computer vision. Since line maps usually contain important object
information, such as edges, shapes, contours, profiles, boundaries and so on, they provide
direct information of objects and positions [18]. Moreover, hand-drawn sketches can be
easily used as another type of line features to extend the application [19]. In contrast to
other image processing operations (e.g., translation, scaling, rotation, blurring, masking),
lines can be flexibly modified to acquire desirable and diverse images.
However, sparse lines often preserve insufficient details in realistic images, such as
texture, colour, brightness and so on. It is challenging to generate high-quality photo-
realistic images using line-to-image translation. This paper proposes using image pre-
processing to produce refined images for GANs to generate photorealistic images from
sparse line features based on a very small training image dataset.

3 Methods
Some operations in deep convolutional neural networks, such as convolution, normaliza-
tion and down sampling, easily lose spatial information, and it is difficult to completely

preserve information through training on a small dataset. Furthermore, with limited


conditional inputs, it is difficult for GANs to generate images with low distortion. To
tackle these drawbacks of the conventional GAN structures trained using small training
data, the proposed method aims to decrease the feature losses in sparse line features and
extract important feature information from training images to refine the limited feature
information.
The proposed model consists of three parts: 1) image pre-processing, 2) genera-
tors and 3) discriminators. An overview of the proposed model is shown in Fig. 1 and
described as follows.

Fig. 1. Overview of the proposed model for translating segmentation masks to photorealistic
images using a cascade of U-nets.

3.1 Proposed Conditional GAN Framework


The proposed model is based on two cascaded U-net structures [20], and the refined
image is essentially responsible for photorealistic outputs. Paired features need to be
specified as the inputs of the conditional GANs. During training, the first U-net creates
a new refined domain from binary images, segmentation masks and ground truth. On
the other hand, the second U-net is responsible for acquiring extra information from the
refined domain as well as the source domain to further transform refined pixel values into a
photorealistic target domain. It has been observed in our experiments that small feature
distortion in the refined images can enhance the robustness of final output images of this
cascaded structure.
The two generators consist of the same convolutional structures of the U-net, both
of which down-sample and then up-sample to the original size of input images. All con-
volutional layers use convolutional kernels of 3 × 3 [21], and normalization is applied
to all convolutional layers except for the input and output layers [22]. In the training
phase, the first generator is used to generate intermediary images, including the seg-
mentation masks and binary images, from the original sparse lines and ground truth.
The refined images formed from the intermediary images and binary images contain
dependent features in textures, colours and shapes of facial components. The second
generator is designed to generate photorealistic images by translating the refined images
to the final target images. Furthermore, in the training phase, the two discriminators

convolutionally distinguish real or fake images, one for distinguishing intermediary


images from pre-processed images and the other for distinguishing generated photore-
alistic images from the ground truth. In the inference phase, without the need of ground
truth and image pre-processing the generators with the same finetuned parameters of the
trained networks generate specific refined features from sparse lines and then generate
photorealistic images from the refined features.
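A simplified sketch of one plausible realisation of a single generator (our assumption, not the authors' exact architecture): an encoder-decoder with 3 × 3 convolutions, normalisation on hidden layers, down-sampling followed by up-sampling with skip connections, written in Keras for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def unet_generator(img_size=256, channels=3):
    inputs = layers.Input((img_size, img_size, channels))
    skips, x = [], inputs
    for filters in (64, 128, 256, 512):                      # encoder (down-sampling)
        x = layers.Conv2D(filters, 3, strides=2, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU(0.2)(x)
        skips.append(x)
    for filters, skip in zip((256, 128, 64), reversed(skips[:-1])):  # decoder (up-sampling)
        x = layers.Conv2DTranspose(filters, 3, strides=2, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
        x = layers.Concatenate()([x, skip])                  # U-net skip connection
    x = layers.Conv2DTranspose(channels, 3, strides=2, padding="same",
                               activation="tanh")(x)         # back to the input resolution
    return tf.keras.Model(inputs, x)
```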

3.2 Image Pre-processing and Refining


Image pre-processing for producing a refined domain is a combination of binarisation
and segmentation masks, which produces new reference images with integrative infor-
mation and filters parts of surplus or unidentical components when the input features
are fragmentary or uncertain lines. Therefore, the main purpose of producing the refined
domain is to transfer the input distribution into a designed refined domain, which attempts
to alleviate the distortions in the images generated by convolutional neural networks.

Fig. 2. Inference results in translating sparse lines to segmentation masks with 50 random train-
ing images. (a) The outputs can roughly integrate the facial components from incomplete input
features. (b) The red boxes indicate the non-corresponding results between the original inputs and
generated masks.

Figure 2(a) illustrates some inference results of using sparse lines to generate seg-
mentation masks from 50 random paired training images. It can be seen that the trained
network can learn from the segmentation masks to generate integrated face components,
such as nose, eyebrow, hair and mouth. Accordingly, segmentation masks are beneficial
to reconstruct facial components even with very abstract and discontinuous sparse lines
as input. However, these segmented outputs cannot exactly translate crucial regions and
contours. Figure 2(b) illustrates the difficulty in synthesising corresponding masks from
small training data. When unidentical lines are used for eye shapes, as shown in the red
boxes in Fig. 2(b), incorrect eye shapes are obtained in the segmentation masks. This
problem is due to using a very small number of training images, and it is difficult for a
deep neural network to recognize the details of input features from a small number of
training images.
To tackle the above problem, a threshold (the default threshold value =0.5) is used
to modify regional values in binarisation during image pre-processing. It is applied
to handle uncertain regions in the input line maps in the inference phase to recover
incomplete line inputs, as shown in Fig. 3. Consequently, the results demonstrate that
the GANs can be trained to better integrate information from the limited inputs. The

binary outputs can not only connect broken contours from sparse line inputs, as shown
in Fig. 3(a), but also get rid of meaningless dense inputs, as shown in Fig. 3(b).

Fig. 3. Inference results in translating sparse lines to binary images with 50 random training
images. (a) The outputs recover discontinued contours when having sparse lines as inputs. (b) The
outputs get rid of unidentical lines when having dense inputs.

3.3 Model Training and Loss Function


During the GANs training, it is difficult to find a balance between generator and discrim-
inator, especially when there are only very limited training data. Using an appropriate
loss function is critical to ensure the good quality of generated images. Firstly, to distin-
guish real images from fake ones, the following basic loss function is used for the two
convolutional networks, which is known as conditional adversarial loss.
      
$L_{adv}(D, G) = \mathbb{E}_{I,S}\left[\log D(S \mid I)\right] + \mathbb{E}_{I,I'}\left[\log\left(1 - D\left(I', G(I' \mid I)\right)\right)\right]$   (1)
where E represents expected value, G the generator, D the discriminator, S the source
image, I the conditional input image with sparse lines, and I’ the generated image. In the
first U-net, S should contain a mixture of pixels of binary image, segmentation mask and
ground truth so as to distinguish between real refined image and fake generated image.
In the second U-net, S needs to be set as the ground truth only.
Secondly, as in the pix2pix GAN model [3], the L1 pixel loss as feature matching loss
in the synthesised fake images is adopted. Since there are paired images in the training
phase, the L1 distance between the generated image (I’) and source image (S) can be
defined as follows:
 
$L_{L1}(G) = \mathbb{E}_{S,I,I'}\left[\left\| S - G(I' \mid I) \right\|_1\right]$   (2)
Finally, the main purpose of using the loss function is to help the generator to syn-
thesise photorealistic images by minimising the loss value with limited input segmented
images. The overall loss function is defined as
$\min_{G} \max_{D} \; L_{adv}(D, G) + \alpha L_{L1}(G)$   (3)
where α is a weight parameter. A larger value of α encourages the generator to synthesise
images less blurry in terms of L1 distance.
The second U-net uses the refined images and original sparse lines as inputs to
generate realistic images with the same loss function but different training parameters
and freezing weights. Another difference between these two networks is the source image
S, which should be either the refined images or the ground truth images.
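A compact sketch (an implementation assumption) of the combined objective in Eq. (3) as it might be written for a conditional GAN in TensorFlow: adversarial loss plus the weighted L1 feature-matching term, with α = 100 as used in the experiments.

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def generator_loss(disc_fake_output, generated, target, alpha=100.0):
    adv = bce(tf.ones_like(disc_fake_output), disc_fake_output)   # fool the discriminator
    l1 = tf.reduce_mean(tf.abs(target - generated))               # L1 feature matching
    return adv + alpha * l1

def discriminator_loss(disc_real_output, disc_fake_output):
    real = bce(tf.ones_like(disc_real_output), disc_real_output)  # real images scored as real
    fake = bce(tf.zeros_like(disc_fake_output), disc_fake_output) # generated images scored as fake
    return real + fake
```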

4 Experiments with the Proposed GAN Framework


In our experiments, the proposed GAN framework was used to generate photorealistic
images from sparse lines. To demonstrate the performance of the proposed framework,
state-of-the-art methods were compared in terms of Fréchet Inception Distance (FID),
Kernel Inception Distance (KID), and preference evaluation by human perception as
well.

4.1 Data Preparation


Fifty images randomly chosen from CelebA-HD [23] formed the training image dataset
for our experiments. CelebA-HD includes 30,000 high-resolution celebrity facial images.
All the images were resized to 256 × 256 in our experiments. CelebAMask-HQ [24] is
a face image dataset consisting of 30,000 high-resolution face images of size 512 × 512
and from 19 classes, including skin, nose, eyes, eyebrows, ears, mouth, lip, hair, hat,
eyeglass, earring, necklace, neck, cloth and so on. All the images in CelebAMask-HQ
were selected from the CelebA-HD dataset, and each image has segmentation masks of
facial attributes corresponding to CelebA-HD. We used 50 segmentation mask images
from CelebAMask-HQ, which corresponded to the 50 randomly chosen face images in
the training dataset and were used as the source images for image pre-processing.

4.2 Implementation Details


To train a conditional GAN, specific paired input images are usually required. However,
most of the sparse lines and hand-drawn sketches represent very abstract concepts, and
it is hard to distinctly depict objects with limited inputs and unpaired data. Hence, in
our experiments, new paired images were constructed by extracting corresponding lines
from real images to simulate the possible uncertain inputs of sketches. Canny edge
detector [25] was used for this purpose, which can obtain simple and continuous edge
lines based on intensity gradients from realistic images. The edges produced by a
Canny edge detector seem much more similar to real sketches and possible inputs than
those from other edge detectors, as shown in Fig. 4.
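A sketch (with assumed threshold values and a placeholder file name) of extracting sparse, sketch-like line maps with the Canny edge detector to build the paired training inputs described above.

```python
import cv2

img = cv2.imread("face.jpg", cv2.IMREAD_GRAYSCALE)        # placeholder file name
edges = cv2.Canny(img, threshold1=100, threshold2=200)     # intensity-gradient thresholds
cv2.imwrite("face_lines.png", 255 - edges)                 # dark lines on a white background
```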
The parameter values for the Canny edge detector were chosen so that the generated
paired image inputs had clear but sparse edges, such as the one in the red box in
Fig. 4. It is obvious that the quality of synthesised images can be improved
if the inputs in the training phase contain various line types and density. However, for
this preliminary work, only 50 original images and one corresponding line type as the
specific paired images were used to evaluate the performance of the proposed GAN
framework trained with very small training data.
In our experiments, the refined images as the training reference were the mixture
of the following three pixel values: 25% binary images, 25% segmentation masks, and
50% original images. The Adam optimiser was used to minimise the loss function with
initial learning rate =0.0002 and momentum =0.5. The weight parameter in the loss
function α was set to 100. All the experiments were conducted on a desktop computer
with NVIDIA GeForce RTX 2080 GPU, Intel Core i7–6700 (3.4 GHz) processor, and
16G RAM.
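A minimal sketch of forming the refined reference images as the stated pixel mixture; the inputs are assumed to be float arrays of identical shape with values in [0, 1].

```python
def refine(binary_img, seg_mask, original_img):
    # 25% binary image + 25% segmentation mask + 50% original image,
    # following the mixture described in the text
    return 0.25 * binary_img + 0.25 * seg_mask + 0.50 * original_img
```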

Fig. 4. Comparison of different edge detectors: (a) results of Canny. (b) results of Sobel. (c)
results of Laplace. (d) results of Gradient.

Fig. 5. Diverse inference results: refined images and final outputs with the progressive inputs.

Figure 5 shows two representative inference results where differences in line density
have led to different refined images and various output images. It can be seen that even
sparse lines without crucial information of all facial components can generate images
without dramatic distortions. With the changeable density of input lines, the output
results can be various and diverse, which is very desirable for data augmentation.
Further experiments were conducted to investigate if sparse lines and limited input
features can be translated into different outputs, which potentially generate diverse results
by modifying parts of the line distribution for data augmentation applications. Figure 6
shows the inference results with different facial images obtained when sparse lines were
modified.

Fig. 6. Diverse inference results when sparse lines are modified to generate different outputs.

5 Results and Performance Evaluation


5.1 Qualitative Comparisons

Fig. 7. Inference results with sparse lines (the first row) as inputs, in comparison with the state-of-
the-art conditional GANs. All synthesised images were generated with conditional GANs trained
using the same small dataset of 50 training images.

To evaluate the proposed GAN framework for synthesising photorealistic images, the
images generated by the proposed method were compared with those generated by
the state-of-the-art conditional GANs with line maps as conditional inputs, including
pix2pix [3] and pix2pixHD [8], in terms of quality and reality. The same small dataset
of 50 training images was used to train the conditional GANs in the comparative study.
The image size for the proposed GAN framework and pix2pix is 256 × 256 whilst the
image size is 512 × 512 for pix2pixHD. Figure 7 shows some representative images
generated by the three methods using the same sparse lines as conditional inputs, which
illustrates that our method can generate more photorealistic facial images than pix2pix
and pix2pixHD. Specifically, the images generated by our method maintain facial
integrity with less unexpected blurriness and distortion.

5.2 Quantitative Comparisons

FID is widely adopted to evaluate the visual quality of generated images and corresponds
to the Wasserstein-2 distance [26] between the feature distributions of real and generated
images. KID is similar to FID, but it has an unbiased estimator with a cubic kernel [27]
that matches the requirements of human perception more consistently. Smaller FID and
KID values represent better feature distributions in the generated images and thus indicate
they are closer to real images.
We ran five trials over 1000 randomly assigned sparse line inputs and the images synthe-
sised by the different conditional GANs: pix2pix, pix2pixHD and ours. The mean
scores of the inference images generated by the three methods at the same scaling size
of 256 × 256 are shown in Table 1. It can be seen that the proposed GAN framework
outperformed pix2pix and pix2pixHD with much smaller mean FID and KID values,
which indicates that the proposed method can synthesise more photorealistic images.
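For reference, FID can be computed from the Gaussian statistics of the feature vectors extracted by a pretrained Inception network; the sketch below (ours, not from the paper) assumes those feature matrices are already available.

import numpy as np
from scipy import linalg

def frechet_inception_distance(feats_real, feats_fake):
    # FID between two feature matrices of shape (n_samples, feature_dim).
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    if np.iscomplexobj(covmean):      # drop tiny imaginary parts caused by numerical noise
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))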

Table 1. Quantitative comparison in terms of FID and KID

Method          FID ↓      KID ↓
Pix2pix [3]     103.7779   11.8408
Pix2pixHD [8]   86.8476    11.4352
Ours            59.2275    8.3682

5.3 Evaluation by Human Perception

To further evaluate the proposed method, randomly selected sparse lines were used to
generate images in the inference phase by the three methods: pix2pix, pix2pixHD and
ours. The generated images were arranged randomly in pairs, ours vs. pix2pix, or ours
vs. pix2pixHD, and presented together with the corresponding ground truth images to
human participants. The participants were asked to choose which image in each pair
is more photorealistic. Google Forms were used for this evaluation with 100 pairs of
generated images. Students in the School of CSEE at Essex University were invited to
take part in the evaluation, and 112 effective responses were received. Based on these
received responses, the percentage of preference to the generated images in terms of
their photorealistic quality was calculated. Table 2 shows the results of the evaluation
by human perception, which indicate that 86% of the participants preferred the images
generated by our method over those generated by pix2pix, and 78% preferred images
generated by our method over those generated by pix2pixHD.

Table 2. Results from user preference study. The percentage indicates the users who favour the
results of our proposed method over the competing method.

              Ours vs Pix2pix   Ours vs Pix2pixHD
Preference    86%               78%

6 Conclusion and Future Work


In this paper, a novel conditional GAN framework is proposed for image-to-image trans-
formation based on small training data, which is desirable to synthesise photorealistic
images using very sparse lines as input. It has been demonstrated that combining seg-
mentation masks and regional binary images to form refined images as conditional
inputs to GANs can enhance the reality of the generated images with the proposed GAN
framework. The proposed GAN framework can automatically create refined features
using fine-tuned parameters and generate more photorealistic output images than the
state-of-the-art methods. It can be concluded that the proposed GAN framework
can significantly augment small image datasets in real applications such as generating
photorealistic facial images from hand-drawn sketches.
For future research, more refined domains and different training images could be
further investigated. Undesirable distortions in the synthesised images should be reduced
for more practical applications where only small training data is available. The proposed
method will also be tested by evaluating the effectiveness of image data augmentation
using the proposed GAN framework in machine learning for image classification tasks.

References
1. Arjovsky, M., Bottou, L.: Towards principled methods for training generative adversarial
networks. In: Proceedings of International Conference on Learning Representations (2017)
2. Mao, X., Li, Q., Xie, H., Lau, R.Y., Wang, Z., Smolley, S.P.: Least squares generative adver-
sarial networks. In: Proceedings of the IEEE International Conference on Computer Vision
(2017)
3. Isola, P., Zhu, J.-Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adver-
sarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 1125–1134 (2017)
4. Wang, X., Gupta, A.: Generative image modeling using style and structure adversarial
networks. In: Proceedings of European Conference on Computer Vision, pp. 318–335 (2016)
5. Zhou, F., Yang, S., Fujita, H., Chen, D., Wen, C.: Deep learning fault diagnosis method based
on global optimization GAN for unbalanced data. Knowl.-Based Syst. 187 (2020)
6. Dosovitskiy, A., Brox, T.: Generating images with perceptual similarity metrics based on
deep networks. Advances in Neural Information Processing Systems, pp. 658–666 (2016)
7. Iizuka, S., Simo-Serra, E., Ishikawa, H.: Globally and locally consistent image completion.
ACM Trans. Graph. 36(4), 107 (2017)
8. Wang, T.-C., Liu, M.-Y., Zhu, J.-Y., Tao, A., Kautz, J., Catanzaro, B.: High-resolution image
synthesis and semantic manipulation with conditional GANs. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pp. 1–13 (2018)
9. Azadi, S., Fisher, M., Kim, V., Wang, Z., Shechtman, E., Darrell, T.: Multi-content GAN for
few-shot font style transfer. In: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pp. 7564–7573 (2018)
10. Qiao, T., Zhang, J., Xu, D., Tao, D.: MirrorGAN: learning text-to-image generation by
redescription. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 1505–1514 (2019)
11. Ashual, O. Wolf, L.: Specifying object attributes and relations in interactive scene generation.
In: Proceedings of the IEEE International Conference on Computer Vision (2019)
12. Park, T., Liu, M.-Y., Wang, T.-C., Zhu J.-Y.: Semantic image synthesis with spatiallyadaptive
normalization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (2019)
13. Liu, M.-Y., Breuel, T., Kautz, J.: Unsupervised image-to-image translation networks.
Advances in Neural Information Processing Systems, pp. 700–708 (2017)
14. Mirza, M., Osindero, S.: Conditional generative adversarial nets. Advances in Neural
Information Processing Systems, Montreal, Canada (2014)
15. Zhu, J., Shen, Y., Zhao, D., Zhou, B.: In-domain gan inversion for real image editing. In:
Proceedings of European Conference on Computer Vision (2020)
16. Zhu, J.-Y., Park, T., Isola, P., Efros, A. A.: Unpaired image-to-image translation using cycle-
consistent adversarial networks. In: Proceedings of the IEEE International Conference on
Computer Vision, pp. 2223–2232 (2017)
17. Yi, Z., Zhang, H., Tan, P., Gong, M.: Dualgan: Unsupervised dual learning for image-to-
image translation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 2849–2857 (2017)
18. Eitz, M., Richter, R., Hildebrand, K., Boubekeur, T., Alexa, M.: Photosketcher: interactive
sketch-based image synthesis. IEEE Comput. Graphics Appl. 31(6), 56–66 (2011)
19. Chen, W., Hays, J.: SketchyGAN: towards diverse and realistic sketch to image synthesis. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9416–
9425 (2018)
20. Ibtehaz, N., Rahman, M.S.: MultiResUNet: Rethinking the U-Net architecture for multimodal
biomedical image segmentation. Neural Netw. 121, 74–87 (2020)
21. Yu, J., Lin, Z., Yang, J., Shen, X., Lu, X., Huang, T. S.: Free-form image inpainting with gated
convolution. In: Proceedings of the IEEE International Conference on Computer Vision (2019)
22. Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of GANs for improved
quality, stability, and variation. In: Proceedings of International Conference on Learning
Representations (2018)
23. Liu, Z., Luo, P., Wang, X., Tang, X.: Deep learning face attributes in the wild. In: Proceedings
of the IEEE International Conference on Computer Vision, pp. 3730–3738 (2015)
24. Lee, C. -H., Liu, Z., Wu, L., Luo, P.: Maskgan: towards diverse and interactive facial image
manipulation. arXiv:1907.11922 (2019)
25. Canny, J.: A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach.
Intell. 8(6), 679–698 (1986)
26. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two
time-scale update rule converge to a local nash equilibrium. Advances in Neural Information
Processing Systems, pp. 6626–6637 (2017)
27. Binkowski, M., Sutherland, D., Arbel, M., Gretton, A.: Demystifying MMD GANs. In:
Proceedings of International Conference on Learning Representations, pp. 1–36 (2018)
Ensemble Models for Covid Prediction
in X-Ray Images

Juan Carlos Morales Vega(B), Francisco Carrillo-Perez, Jesús Toledano Pavón,
Luis Javier Herrera Maldonado, and Ignacio Rojas Ruiz

University of Granada, Granada, Spain

Abstract. Due to the urgency of the COVID pandemic, it is necessary
to develop new and quick methods to detect the infection and stop the
spread of the disease. In this work we compare a simple Deep Learning
(DL) model with an ensemble model in the task of COVID detection
in X-Ray images. For the simple model, we have used only frontal DX
X-Ray images while, for the ensemble model, we have used frontal DX
and CR X-Ray images, as well as lateral DX and CR X-Ray images.
In the ensemble model, the features of the four images are combined to
make a final prediction and, since not every patient possess all types of
images, the model is also robust against missing information, which is
crucial in these types of models. Although the dataset used is very noisy,
the presented system has shown the desired robustness and offers rele-
vant results, showing that ensemble models can generalize better over
the data, which leads to a higher accuracy. Finally, we share our con-
clusions and discuss future work where we want to try using a similar
methodology.

Keywords: COVID · Deep learning · CNN · Ensemble

1 Introduction
The COVID-19 pandemic is nowadays one of the biggest causes of concern in the
entire world, having caused more than 167 million cases and 3.4 million deaths.
The virus, which took only a few months to propagate from Wuhan to the rest
of the world, has demonstrated a extremely high infectivity, which has even led
to the collapse of the sanitary system in numerous countries, causing deaths
that could have been avoided by an early detection. To confront this global
problem, it is then imperative to develop quick and precise techniques to detect
the infection in the early stages, so that preemptive measures (i.e. two weeks
isolation) can be taken to stop the spread of the disease. To that end, many
methods to provide an early detection have been either developed or applied to
this specific virus, such as Next-Generation Sequencing (NGS), CRISPR-based
methods or the widely known Reverse Transcriptase Polymerase Chain Reaction
(RT-PCR) [9]. However, while these are some of the most popular methods, they
can be complemented by other non-invasive techniques, such as medical imaging,
which come in the form of X-Rays or Computed Tomography (CT) images.


While an expert radiologist is able to detect with a high degree of accuracy the
presence of COVID inside the lungs using these images, the amount of patients
is so large with respect to the qualified staff that the detection process can be
slowed. Hence, and even though an expert radiologist should always be present
for the final diagnosis, the use of automatic detection algorithms can greatly
improve the detection time.
There are many Machine Learning (ML) techniques that have been tradi-
tionally used to automatically classify images, such as K-Nearest Neighbours
(KNN) [10], Support Vector Machines (SVM) [7] or Random Forest [1] among
others. However, in the case of very high dimensional data like images, many
of these algorithms are either extremely slow or completely non-viable, so that
it is necessary to use algorithms able to reduce the dimensionality of the data
and extract features that are suitable to be used for the classification. These
dimensionality reduction algorithms, such as PCA [18], LDA [27] or t-SNE [28],
are often costly and they are not able to capture spatial local correlations within
the images. However, the increase of computer power and parallelization in the
last decades gives us another option: Neural Networks (NN). Neural Networks
are the basis of Deep Learning (DL) and have the advantage of being able to
automatically extract the features and perform the classification in the same
network, without the need of a dimensionality reduction algorithm.
In the particular case of images, there is one type of NN that presents an
important advantage over other: Convolutional Neural Networks (CNN) [17].
CNNs apply the convolution operation between a series of small filters and the
image, which allows these networks to keep a relatively small number of trainable
parameters while, at the same time, being able to learn local correlations. CNNs
have been widely used for many applications, such as image recognition [2],
object segmentation [21] or even language modeling [4]. However, CNNs still
have many problems that are inherent to ML techniques, such as overfitting (the
network memorising the training set and failing to generalise to different data),
small dataset size or missing information. Moreover, a single CNN cannot
deal with data that contains multiple sources of information (multiple images,
image + text, text + audio, etc.). To solve those issues, we need a model able
to combine different types of data while, at the same time, being resistant to
missing data in some of the modalities. These are called ensemble models.
In healthcare, ensemble models can be especially relevant since a single patient
does not usually have a single source of data. They can have images, past clinical
history, genomic data, numerical values for tests such as electrocardiograms, etc.
Moreover, it can be the case that there might be multiple instances of each one
or even missing ones. In the case of a real life diagnosis, doctors usually decide
the final outcome based on the results of all tests (or at least the most relevant
ones). Ensemble models are able to mimic this behavior and thus reach a higher
performance. In this paper, we will compare the performance of two models
capable of predicting COVID-19 in X-rays. The first one is a simple CNN model
trained with only frontal DX images. The second one is an ensemble model
trained with frontal and lateral images, both being either DX or CR. This second
model also needs to deal with large amounts of missing information, since usually
only one or two of the possible images are present.
In this section, an introduction to the COVID-19 disease and to ensemble models
has been presented. In Sect. 2, related work will be discussed. In Sect. 3 the
Dataset and our models will be presented. In Sect. 4 we will show and discuss
our results. Finally, in Sect. 5 we will share our conclusions and talk about future
work.

2 Related Work

In healthcare, CNNs have been used for a variety of tasks, such as automatic
detection and classification of lesions or segmentation of areas of interest, among
many others. For example, Esteva et al. used Inception v3 [25] to classify skin
lesions [6]. First, they tested the network in a three-class problem (benign,
malignant and non-neoplastic) and then in a 9-class problem, obtaining 72.1% and
55.4% accuracy respectively, slightly outperforming two expert dermatologists.
Cruz-Roa et al. predicted and segmented breast cancer in Whole Slide Images
(WSI) [3]. To achieve that, the authors first divided the WSI into many small
patches that were passed through the network independently. Then, they com-
bined the predictions of all patches to create the segmentation mask, achieving
a dice coefficient of 0.7586. Another interesting case is [19], where the authors
present an end-to-end training of 3D CNNs over volumetric brain data to predict
Alzheimer and Mild Cognitive Impairment. To achieve that, they first built a
3D autoencoder over the MRI images and then used the encoder part to make a
prediction and segmentation in the case of Alzheimer vs Normal tissue. Finally,
they applied transfer learning over this network to detect two types of Mild
Cognitive Impairment.
In the particular case of chest X-rays, there are numerous works using deep
networks that are worth mentioning. For example, Varshni et al. used
DenseNet [14] as a feature extractor in chest X-rays images [29]. These features
were classified later using a SVM to predict pneumonia cases. In [26], Tang et
al. make a comparison of the performance of multiple known deep networks over
the task of abnormality detection in chest X-rays. In [16], Islam et al. built a net-
work based on U-Net to perform an automatic segmentation over the lung area
and extract the lung masks. The authors obtained a Dice Coefficient of 0.986,
outperforming the previous state of the art.
In COVID related tasks, there is also a huge amount of papers, as it can
be seen in [30], with more than 140K papers almost a year ago and more than
500K at the point of writing. As such, it is also difficult to find relevant informa-
tion within this vast sea of papers. To mention some examples: Roy et al. seg-
mented and predicted COVID infected regions in ultrasound videos [22], which
is particularly interesting due to being a non-invasive method of detection.
The prediction was done frame by frame and later combined to get the full
video prediction. The segmentation was also done frame by frame. Hussain et al.
developed a model to predict COVID-19 in chest X-ray images, as well as viral
and bacterial pneumonia [15]. The authors obtain up to 91.2% accuracy for 4
classes (COVID, Normal, Viral Pneumonia and Bacterial Pneumonia) using a
22-layer CNN over a dataset created by the combination of eight of the most well
known COVID datasets. Finally, in [13] the authors use a VGG-based network
to predict COVID-19 in CT images, as well as performing a weakly supervised
segmentation by combining the saliency maps of several layers of the network.
Ensemble learning has also been applied to covid detection. Usually, ensem-
ble learning is applied by using multiple networks over the same input data and
combining the results, which avoids the case of missing information. For exam-
ple, in [31], the authors apply three different networks over the same CT image
and combine the predictions afterwards via majority voting, reaching over 98%
accuracy. Gianchandani et al. took a similar approach using X-Ray images [8].
They compared the performance of four different networks and created an ensem-
ble using the best two (VGG16 and DenseNet), achieving over 99% accuracy
and outperforming both networks separately.

3 Proposed Model

3.1 Dataset

We have used the BIMCV-COVID19 dataset [5]. Although the dataset contains
X-ray images, CT images and a bit of clinical data, we have only used the X-ray
part of the dataset in this study. These X-ray images are of four different types:
frontal DX, lateral DX, frontal CR and lateral CR. The main difference between
CR and DX images is the technique used to take the X-rays. The images are
stored as 16 bit grayscale 2D arrays with varying resolution. After filtering a few
almost black images, we are left with 5392 training images (3229 positives and
2163 negatives) and 1155 validation images (692 positives and 563 negatives).

3.2 Preprocessing

For the images to fit correctly into the network, we first need to preprocess them.
The first thing to consider is that we need to read the images in 16-bit format.
Otherwise, they will have a very poor quality with an incorrect range of values.
Once the images are correctly read, we need to scale them to the size of the
network (512 × 512). While the convolutional layers have no problem running
over variable-sized images, the same is not true for the dense layers at the end of
the network, which will cause an error when trying to run the model. Although
there are methods to prevent this problem, such as Spatial Pyramid Pooling [12],
we have opted for the more classical approach by resizing them. However, just
resizing the image will cause deformations, so we first take the central square
crop.
The resized images are then scaled using a custom algorithm to use the same
range of values and then passed through a contrast enhancement filter to improve
the color difference between the lungs and the background. We have tried two
different algorithms for this purpose: BCET [11] and CLAHE [32]. The first one
performs a histogram normalization over the whole input image, rescaling it to
have the chosen min, max and mean values. The second one performs a local
histogram normalization in small patches of the image, which are later combined
via interpolation. Both algorithms have also been tested over the raw image and
the already scaled values. An example of the normalized images can be seen in
Fig. 1.

(a) Original  (b) BCET  (c) Custom + BCET  (d) CLAHE  (e) Custom + CLAHE

Fig. 1. Normalization techniques implemented. The custom normalization followed by
CLAHE is the one that improves the quality of the original image the most.

Finally, all images have been augmented using random rotations, random
translations and random zooms to reduce overfitting. The rotation range is 15°,
the translation range is 10% of the image size in both dimensions and the zoom
range is 10% in and out.
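The preprocessing and augmentation described above could look roughly like the following sketch (ours, using OpenCV and Keras; the authors' custom scaling algorithm is not specified, so a simple rescaling to [0, 1] is assumed).

import cv2
import numpy as np

def preprocess_xray(path, size=512, clip_limit=2.0, tiles=(8, 8)):
    # Read the 16-bit X-ray, take the central square crop, resize and apply CLAHE.
    img = cv2.imread(path, cv2.IMREAD_UNCHANGED)          # keep the 16-bit range
    h, w = img.shape[:2]
    side = min(h, w)
    y0, x0 = (h - side) // 2, (w - side) // 2
    img = img[y0:y0 + side, x0:x0 + side]                 # central square crop
    img = cv2.resize(img, (size, size), interpolation=cv2.INTER_AREA)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tiles)
    img = clahe.apply(img)                                # local histogram equalisation
    return img.astype(np.float32) / max(img.max(), 1)     # simple [0, 1] rescaling

# Augmentation ranges used in this work, expressed with Keras as one option:
#   from tensorflow.keras.preprocessing.image import ImageDataGenerator
#   augmenter = ImageDataGenerator(rotation_range=15, width_shift_range=0.1,
#                                  height_shift_range=0.1, zoom_range=0.1)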

3.3 Single-Image Model


The first model we have tested is a fairly simple network that only accepts as
input the frontal DX images of the dataset. Since not every patient in the dataset
possesses a frontal DX image, we have only used those that have one. In this case,
the number of training samples is 3393 (2179 positives and 1214 negatives) and
the number of validation samples is 668 (395 positives and 273 negatives).
For the backbone of our model, we have used a VGG16 [24] pretrained on
ImageNet, from which we have removed the classification layers. The archi-
tecture of the VGG can be seen in Fig. 2. The extracted feature maps are then
passed through two more convolutional layers and finally through two dense lay-
ers before getting the final classification. The whole structure of the network can
be seen in Fig. 3.
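A minimal Keras sketch of this single-image model is given below (ours); the filter and unit counts and the three-channel input (the grayscale X-ray replicated to reuse the ImageNet weights) are assumptions, while the dropout of 0.25, the max-norm constraint of 1 and the Adam learning rate of 1e-5 follow the training setup described in Sect. 4.

import tensorflow as tf
from tensorflow.keras import constraints, layers, models

def build_single_image_model(input_shape=(512, 512, 3)):
    # VGG16 backbone without its classifier, followed by two convolutional and two dense layers.
    backbone = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                           input_shape=input_shape)
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(backbone.output)
    x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
    x = layers.Flatten()(x)
    x = layers.Dropout(0.25)(x)
    x = layers.Dense(128, activation="relu",
                     kernel_constraint=constraints.MaxNorm(1))(x)
    out = layers.Dense(1, activation="sigmoid",
                       kernel_constraint=constraints.MaxNorm(1))(x)
    model = models.Model(backbone.input, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model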

Fig. 2. VGG16 architecture

Fig. 3. Single image model

3.4 Multi-image Model

For the second model, we have built a network that is able to take 4 images
as input (frontal DX, side DX, frontal CR, side CR) and is robust to missing
information. The information fusion process works as follows: each of the differ-
ent data types can have its own "branch network". However, missing entries can
exist within a batch, so each branch is only executed over the existing entries,
ignoring the missing ones. Once all branches have been executed and we have
obtained the feature vectors, we apply a fusion layer, which can be built in many
ways. We have tested two methods. The first one is using a direct concatenation
and filling the missing entries with 0s. The second one is applying max pooling
between the feature vectors, ignoring the missing ones. Although for our pur-
poses we only need one fusion layer, this process is very general and could be
extended to create a much larger fusion hierarchy over more complex and diverse
data. A graphical overview of the fusion process can be seen in Fig. 4.

Fig. 4. Fusion framework

Particularizing for the network we have built, we have also used a pretrained
VGG16 as the backbone, as in the single image model. The VGG16 is part of the
branch networks, but it is shared by all of them at the same time. This helps
reduce the number of trainable parameters in the case of end-to-end
training. After the VGG and still within the branches, two more convolutional
layers are applied in each branch. These are branch-specific and not shared
between branches. Next, the feature maps are flattened and the fusion over the
feature vectors is performed as previously explained. Finally, two more dense
layers are added to the end of the network to get the final classification. The
structure of the network can be seen in Fig. 5.

Fig. 5. Multi-image model

4 Results

We have executed all experiments on an NVIDIA RTX 2080 Super GPU. We have
trained the models for 25 epochs using a batch size of 6, Adam optimizer and
a learning rate of 1e−5. To reduce overfitting, we have included dropout with
0.25 probability before the classification layers and also used weight constraints,
with a max norm of 1. For CLAHE, we have used (8 × 8) tiles with a clipping
value of 2. For BCET, we have used 0, 1 and 0.5 as min, max and mean values
respectively.
For the single-image model, we tried both end-to-end training and fine-tuning
of the last layers. For the multi-image model, we have only tried end-to-end
training, mainly due to the time it was taking to train the fine-tuning approach
in the single-model case. The results can be observed in Table 1.
As a side note, we have also tried replacing the VGG16 with ResNet and a
pretrained CheXNet [20], but there were no noticeable improvements.

Table 1. Model accuracy

Model                                          Train   Val     Test
BCET + Single-image + end-to-end               83.43   70.51   76.50
BCET + Single-image + fine-tuning              73.58   65.27   65.62
CLAHE + Single-image + end-to-end              87.59   69.31   72.34
CLAHE + Single-image + fine-tuning             73.77   67.96   69.34
BCET + Multi-image (Concat) + end-to-end       84.52   72.01   75.07
BCET + Multi-image (Max) + end-to-end          82.95   72.31   73.35
CLAHE + Multi-image (Concat) + end-to-end      81.54   71.71   70.63
CLAHE + Multi-image (Max) + end-to-end         80.56   71.56   76.07

For better insight into the model, we have also visualized the activations
in the last convolutional layer using GradCam [23]. For correct predictions in
COVID images, the model looks at specific regions inside the lung, presumably
detecting the COVID zone. An example of one of these images can be seen in
Fig. 6.

Fig. 6. GradCam activations for a COVID image
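For completeness, a minimal Grad-CAM sketch for a Keras model such as the one above is shown below (ours); the layer name passed as conv_layer_name and the use of the single sigmoid output as the class score are assumptions.

import numpy as np
import tensorflow as tf

def grad_cam(model, image, conv_layer_name):
    # Grad-CAM heat map for one preprocessed image of shape (H, W, C).
    conv_layer = model.get_layer(conv_layer_name)
    grad_model = tf.keras.models.Model(model.input, [conv_layer.output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        score = preds[:, 0]                                   # COVID-positive score
    grads = tape.gradient(score, conv_out)
    weights = tf.reduce_mean(grads, axis=(1, 2))              # global-average-pooled gradients
    cam = tf.reduce_sum(weights[:, tf.newaxis, tf.newaxis, :] * conv_out, axis=-1)
    cam = tf.nn.relu(cam)[0]
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()        # normalised heat map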

5 Conclusion and Future Work

Even after trying several normalization algorithms, several networks and several
regularization algorithms, the obtained results are still poor, never reaching 80%
accuracy in test. We think that this could be due to our limited understanding of the
data or to an intrinsic problem of the dataset. One explanation could be that some
of the images have wrong labels. For example, patients could be marked as
COVID-positive, which can be true, but they might not present symptoms in the
lungs.

Even if our results are not great, the multi-image model seems promising
since, for a lower training accuracy, it reaches similar validation and test accu-
racies, suggesting potential improvements for longer training times. The training
process, although much slower, has also been smoother, presenting a less noisy
loss and accuracy.
Regarding the methodology presented, the information fusion system is intu-
itive and can be used in many different problems. Thus, as future work, we
want to test this fusion framework in a set of diverse multimodal problems and
compare the results with the ones that a non-ensemble model would reach.

References
1. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
2. Chauhan, R., Ghanshala, K., Joshi, R.: Convolutional neural network (CNN) for
image detection and recognition, pp. 278–282, 2018
3. Cruz-Roa, A., et al.: Accurate and reproducible invasive breast cancer detection
in whole-slide images: a deep learning approach for quantifying tumor extent. Sci.
Rep. 7, 46450 (2017)
4. Dauphin, Y.N., Fan, A., Auli, M., Grangier, D.: Language modeling with gated
convolutional networks (2017)
5. Vayá, M.I., et al.: BIMCV COVID-19+: a large annotated dataset of RX and CT
images from COVID-19 patients (2020)
6. Esteva, A., Kuprel, B., Novoa, R., Ko, J., Swetter, S., Blau, H.: Dermatologist-level
classification of skin cancer with deep neural networks. Nature 542, 01 (2017)
7. Evgeniou, T., Pontil, M.: Support vector machines: theory and applications. In:
Paliouras, G., Karkaletsis, V., Spyropoulos, C.D. (eds.) ACAI 1999. LNCS (LNAI),
vol. 2049, pp. 249–257. Springer, Heidelberg (2001). https://doi.org/10.1007/3-
540-44673-7 12
8. Gianchandani, N., Jaiswal, A., Singh, D., Kumar, V., Kaur, M.: Rapid COVID-19
diagnosis using ensemble deep transfer learning models from chest radiographic
images. J. Ambient Intell. Hum. Comput., 1–13 (2020). https://doi.org/10.1007/
s12652-020-02669-6
9. Giri, B., Pandey, S., Shrestha, R., Pokharel, K., Ligler, F., Neupane, B.: Review
of analytical performance of COVID-19 detection methods. Anal. Bioanal. Chem.
413, 09 (2020)
10. Guo, G., Wang, H., Bell, D., Bi, Y., Greer, K.: KNN model-based approach in
classification. In: Meersman, R., Tari, Z., Schmidt, D.C. (eds.) OTM 2003. LNCS,
vol. 2888, pp. 986–996. Springer, Heidelberg (2003). https://doi.org/10.1007/978-
3-540-39964-3 62
11. Guo, L.J.: Balance contrast enhancement technique and its application in image
colour composition. Int. J. Remote Sens. 12(10), 2133–2151 (1991)
12. He, K., Zhang, X., Ren, S., Sun, J.: Spatial pyramid pooling in deep convolutional
networks for visual recognition. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T.
(eds.) ECCV 2014. LNCS, vol. 8691, pp. 346–361. Springer, Cham (2014). https://
doi.org/10.1007/978-3-319-10578-9 23
13. Hu, S., et al.: Weakly supervised deep learning for COVID-19 infection detection
and classification from CT images. IEEE Access 8, 118869–118883 (2020)
14. Huang, G., Liu, Z., Weinberger, K.Q.: Densely connected convolutional networks.
CoRR arXiv:1608.06993 (2016)
15. Hussain, E., Hasan, M., Rahman, M.A., Lee, I., Tamanna, T., Parvez, M.Z.:
CoroDet: a deep learning based classification for COVID-19 detection using chest
x-ray images. Chaos, Solitons Fractals 142, 110495 (2021)
16. Islam, J., Zhang, Y.: Towards robust lung segmentation in chest radiographs with
deep learning (2018)
17. Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep con-
volutional neural networks. Neural Inf. Process. Syst. 25, 01 (2012)
18. Mishra, S., et al.: Principal component analysis. Int. J. Livestock Res. 1, 01 (2017)
19. Oh, K., Chung, Y.C., Kim, K.W., Kim, W.S., Oh, I.S.: Classification and visual-
ization of Alzheimer’s disease using volumetric convolutional neural network and
transfer learning. Sci. Rep. 9, 1–16 (2019)
20. Rajpurkar, P., et al.: CheXNet: radiologist-level pneumonia detection on chest x-
rays with deep learning. CoRR arXiv:1711.05225 (2017)
21. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomed-
ical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F.
(eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015).
https://doi.org/10.1007/978-3-319-24574-4 28
22. Roy, S., et al.: Deep learning for classification and localization of COVID-19 mark-
ers in point-of-care lung ultrasound. IEEE Trans. Med. Imaging 39(8), 2676–2687
(2020)
23. Selvaraju, R.R., Das, A., Vedantam, R., Cogswell, M., Parikh, D., Batra, D.: Grad-
CAM: why did you say that? Visual explanations from deep networks via gradient-
based localization. CoRR arXiv:1610.02391 (2016)
24. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition (2015)
25. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.B.: Rethinking the incep-
tion architecture for computer vision (2016)
26. Tang, Y.X., et al.: Automated abnormality classification of chest radiographs using
deep convolutional neural networks. NPJ Digit. Med. 3, 1–8 (2020)
27. Tharwat, A., Gaber, T., Ibrahim, A., Hassanien, A.E.: Linear discriminant analy-
sis: a detailed tutorial. AI Commun. 30, 169–190 (2017)
28. van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn.
Res. 9, 2579–2605 (2008)
29. Varshni, D., Thakral, K., Agarwal, L., Nijhawan, R., Mittal, A.: Pneumonia detec-
tion using CNN based feature extraction. In: 2019 IEEE International Conference
on Electrical, Computer and Communication Technologies (ICECCT), pp. 1–7
(2019)
30. Wang, L.L., et al.: CORD-19: the COVID-19 open research dataset (2020)
31. Zhou, T., Lu, H., Yang, Z., Qiu, S., Huo, B., Dong, Y.: The ensemble deep learning
model for novel COVID-19 on CT images. Appl. Soft Comput. 98, 106885 (2021)
32. Zuiderveld, K.: Contrast limited adaptive histogram equalization, pp. 474–485.
Academic Press Professional Inc., Cambridge (1994)
Validation of a Nonintrusive Wearable Device
for Distress Estimation During Robotic Roller
Assisted Gait

Marta Díaz-Boladeras1(B) , Xavier Llanas1 , Marta Musté1 , Elsa Pérez1 , Carlos Pérez2 ,
Alex Barco1 , and Andreu Català1
1 CETpD, Technical University of Catalonia, Vilanova i la Geltrú, Spain
[email protected]
2 Consorci Sanitari de l’Alt Penedès i Garraf, Vilafranca del Penedes, Spain

Abstract. Successful robot rollators work under the shared control paradigm as
the best way to adjust dynamically to users’ needs and preferences in rehabilita-
tion and daily living activities. Deciding how much weight users should have in the
emerging motion commands requires assessing their condition and needs, usually from
on-board sensors. Unfortunately, some relevant parameters for safe and comfort-
able gait assistance (i.e. balance or stress) are extremely difficult to measure using
only on-board sensors. Therefore, wearable devices that offer real-time physio-
logical data acquisition are meant to be a valuable source of relevant information
of users’ psychological states such as stress. However, detecting stress in real life
with an unobtrusive wearable device is a challenging task. The objective of this
study is to develop a method for real-time stress detection based in the wrist band
Empatica E4 that can accurately, continuously and unobtrusively monitor psy-
chological stress in real life to feed the system to provide smart gait-assistance.
In this preliminary study we explore the feasibility, accuracy and reliability of
the wrist-band with machine learning and signal processing techniques applied to
electrodermal activity from 6 healthy participants in laboratory conditions. Specif-
ically, the participants’ electrodermal activity (EDA) gathered by the Empatica E4
under a standardized stress induction test (Affective Picture System) is analysed to
evaluate the sensitivity, validity and robustness of the measure. The present study
will be followed by a pilot in the lab with 20 participants fulfilling trajectories
of different level of difficulty with the roller, previously to the clinical trials with
rehabilitation patients.

Keywords: Stress detection · Robotic walker · Wrist device · Machine learning

1 Introduction
This study explores the identification of users’ distress through the analysis of psy-
chophysiological signals, as a necessary step to optimize the shared control system in a
robotic roller for rehabilitation purposes. The present study belongs to a RETOS coor-
dinated project on Adaptable Autonomy System for Assisted Mobility, which encompasses
two subprojects. The Project AMPARO (Architecture for Monitorization and Personal-
ization of Assistance based on a RObotized Rollator) focuses on the development of a
robotized rollator with a shared control paradigm for inpatient rehabilitation. The other
subproject is MESURAR (Wearable based Monitorization of Robotized Rollator Users
with Mobility Disabilities), which addresses the monitoring of relevant variables from wearable
devices to feed the system with data on the user’s state. The study presented in
this contribution develops one of these objectives, namely identifying users’ distress using
wearable sensors.
Specifically, this study explores the feasibility, accuracy and reliability of the wrist-
band (EMPATICA 4) with machine learning and signal processing techniques applied to
participants’ electrodermal activity (EDA). The data are gathered under a standardized
stress induction test (Affective Picture System) and is analyzed to evaluate the sensitivity,
validity and robustness to classify the participants’ subjective experience into two states
distressed and not distressed.
The gold standard for the Empatica E4 measures will be the Self-Assessment
Manikin (SAM) scored by each participant after the visualization of each image of
the Affective Picture System, a non-verbal and self-administered scale used in the literature
to assess stress and other psychological and affective states in usability and performance
studies (Schmidt et al. 2018).

1.1 Assisted Gait with a Robotic Walker and Stress


Successful robot rollators work under the shared control paradigm as the best way
to adjust dynamically to users’ needs and preferences in rehabilitation and daily living
activities. To decide how much weight users should have in emerging motion commands,
it is necessary to assess their condition and needs, usually from on-board sensors.
Unfortunately, some parameters, like balance or stress, are extremely difficult to
measure using only on-board sensors. Consequently, most existing solutions for shared
control rely on: i) manually pre-fixing (by the user, caregiver or doctor) the amount of help
to provide (Cortés et al. 2008); or ii) adapting help to external factors like obstacles or
slopes, but disregarding the user’s specifics (Glover et al. 2004; Kulyukin et al. 2008;
Hirata, Hara and Kosuge).
In order to adapt to these specifics, we need new sensor systems and/or processing
algorithms for the monitoring of relevant rollator user parameters. Rather than working
in fixed environments, we plan to combine nonintrusive on-board and wearable sensors
to continuously monitor their state and condition and use this information for better
assistance adaptation.
This project consequently proposes to attach a modular on-board sensor system and
wearables to a conventional rollator in order to provide continuous monitoring of the
user’s condition and needs.
An important parameter regarding the usability, acceptability and performance of the rol-
lator is the users’ subjective experience associated with the modality and flow of the assisted
gait. Anxiety and sudden emotions during gait may interfere with the mechanisms for keep-
ing balance among the elderly and may even have the potential to trigger falls in this
population (Möller et al. 2009). According to our observations, in shared control mode
the user occasionally exerts persistent strong forces to “counteract” the roller trajectory
in an attempt to regain the initiative, which can result in discomfort and ineffective perfor-
mance and eventually in distress and risk of falling. In order to include users’ psychological
state in the shared-control loop, automated real-time continuous estimations should be
provided.

1.2 Stress Estimates in Daily Activities


Concept. Though stress is a widely used term, it is elusive in its conceptualization. Stress
can be defined as a physical, chemical, or emotional factor that causes bodily or mental
tension or this tension itself. Stress is or results in a nonspecific response of the body to
a change request or to any demand upon it. Stress is a complex pattern of responses that
usually has psychological, cognitive and behavioral expressions (Schmidt et al. 2019)
(Campbell and Ehlert 2012). Following this definition, ‘nonspecific’ refers to a shared set
of responses (primarily a physiological response) triggered by an external and/or internal
stimulus regardless of the nature of the stressor (i.e. physical or psychological) (Schmidt
et al. 2019).
Stress can be understood as the response to any demanding or hazardous situation,
defined by an activation of the individual’s cognitive and physiological resources that
prepares the organism for an immediate fight-or-flight response and provides, in the
long term, either a healthy adaptation or an allostatic load (Campbell and Ehlert 2012).
In addition, subjective stress experience concomitant to the increase in activation
appears from the appraisal of a stimulus as harmful and threatening. During acute stress
or other demanding situations, these adaptive physiological responses are likely to be
labeled with feelings of being “stressed” and to be associated with central neuronal
activations (Campbell and Ehlert 2012). Accordingly, the quality of emotional experience
(e.g. hopelessness, helplessness, anger, anxiety) is tightly coupled with such cognitive
processes as well as with response outcome expectancies (Campbell and Ehlert 2012).
Moreover, while stress always implies an increase of arousal or activation, the quality
of the subjective experience of this activation can be pleasant (such as euphoria or
enthusiasm) or, on the contrary, the tension can be experienced as unpleasant, such as
anxiety, fear or frustration. In this work, we explore “negative stress” or distress,
defined in the two-axis model by a high activation (high level of arousal) and a negative
experience (negative valence) (see Fig. 1). Moreover, we focus on acute (non-chronic)
stress as the temporary state of activation triggered by a particular activity or event.
Measures. Stress level can be inferred through its expressions in three dimensions: i)
the subjective experience, ii) the hormonal and psychophysiological changes and finally
iii) in the behavioral response. The subjective experience of stress can be defined by an
increased arousal or activation along with a sense of unpleasantness or threat (negative
valence) of a demanding situation. Likewise, in typical responses to stressors, changes in
cortisol and prolactin can be detected in blood and urine. The parameters that typically
prepare the individual to respond to stressor factors at the psychophysiological level are
brain activity, electrodermal activity, cardiac activity and muscle activity. There are
several unobtrusive devices (e.g. jackets, belts, wristbands) that can capture changes in
these parameters allowing a certain inference of stress level or stress changes.

Fig. 1. A circumplex model of affect (Larsen and Diener 1992). The shaded area represents the
target of the present study.

Monitoring Stress in User-Technology Interaction. Researchers use psychophysio-
logical feedback devices such as skin conductance (SC), electroencephalography (EEG)
and electrocardiography (ECG) to detect the affective states of the users during task
performance. Psycho-physiological feedback has been successful in detecting the
cognitive states of users in human-computer interaction (HCI) and, more recently, in
game studies it has been used to capture the user experience and the effect of interaction
on human psychology (Baig and Kavakli 2019).

1.3 Stress Estimation with Empatica E4

According to Schmidt et al. (2018), to label affective states “in the wild” there are two
main guidelines: first, data should be collected in a minimally intrusive way,
interfering as little as possible with the subjects’ everyday life; and secondly, data quality
can be assessed and increased using multiple data sources (physiological, structured
interviews, context information and questionnaires).
While the problem of finding a model that accurately identifies stress using high-
performance laboratory sensors with high sampling rate is becoming closer to being
solved, the solution of the same problem using wearable sensors is less explored. Wear-
able sensors allow for continuous monitoring regardless of the activity of the user, making
them an interesting choice for real-life applications (Ollander et al. 2017).
In this study we choose the E4 device as a portable, non-invasive, and user-friendly
alternative as it can be worn as simply as a wrist watch. The main advantage of the device
comes from its usability. The device requires no electrodes, wires, or other peripheral
components for proper monitoring. The device has a battery life of 36+ hours (lasted for
48 h in all tests) and fully recharges in 2 h.

1.4 Validation

The aim of this study is to describe and analyze the key psycho-physiological parameters
that can be measured by the unobtrusive device that relate reliably to different mental
and emotional states.
According to the literature, the different steps followed are: selecting the physiological
parameters provided by the Empatica to measure stress; testing the EDA signal obtained
by the Empatica as a reliable measure of stress in this context; determining the conditions for
obtaining a good EDA signal for processing in the experimental condition and in the
natural setting; identifying which signal properties of EDA (features) are most relevant for
detecting psychological states (Ollander 2015; Perugia et al. 2017); designing and testing a
system to classify between Distress/No Distress states (Jain et al. 2017; Ollander 2015;
Ollander et al. 2017); and comparing EDA stress assessment with self-reported estimates
(SAM) (Lutscher 2016).

2 Methods
2.1 Design

The quasi-experimental study followed a repeated-measures design with one indepen-
dent (experimental) variable, the exposure to images from the IAPS repository, and two
dependent (outcome) variables: the self-reported subjective experience and the
galvanic skin response. All the sessions were conducted by a facilitator helped by an
assistant.
The protocol was approved by the “Comité de Ética de la Universitat Politècnica de
Catalunya” (Ref. 28-01-2021).

2.2 Participants

Six healthy participants (50% women, 50% men) aged 41 to 65 years (M = 50) were
recruited among faculty and staff from the campus. Active consent was obtained by
each of the participants signing a consent form that included agreement to participate
and to be video-recorded.

2.3 Materials

Affective Picture System (IAPS). Participants visualized 33 images from the IAPS
repository for stress induction under three conditions: high stress potential images, low stress
potential images and stress neutral images. The IAPS is an image database created for
providing a standardized set of images for emotion and attention research. It has been
commonly applied in psychological studies. The IAPS was created by the National
Institute of Mental Health Center for Emotion and Attention at the University of Florida
(Lang Bradley and Cuthbert 2008). It consists of pictures varying from simple household
objects to extreme pictures which cause arousal in individuals (mutilated corpses, erotic
and violent scenes). Stress induction by showing a series of pictures from the IAPS is
another test that is widely applied among the literature (Dautenhahn et al. 2002; Can
et al. 2019).
Sensor. The E4 wristband is a wearable multi-sensor device for real-time computerized
biofeedback and data acquisition. It has four sensors embedded in its case: (1) a photo-
plethysmography sensor (PPG) to measure blood volume pulse and derive HR, HRV,
and inter-beat interval, (2) a triaxial accelerometer to capture motion-based activity and
detect movement patterns, (3) an infrared thermopile sensor to read peripheral skin
temperature, and (4) an EDA response sensor to measure the electrical conductance
of the skin. The E4 was selected among the available wearable sensors for its light
weight and unobtrusiveness. It was also the only device measuring EDA that did not
entail the positioning of electrodes on the medial or distal phalanxes. This was of crucial
importance as it left participants free to grasp and handle the roller without affecting
neither the assistance to gait nor the data collection. In the context of this study, the
E4 was employed to collect EDA. There are interesting antecedents in the literature of the
use of the E4’s EDA measurements to infer participants’ affective states while performing
diverse activities (Perugia et al. 2020).
Self-Assessment Manikin (SAM). To evaluate the subjective experience, the partici-
pants responded to a pencil-and-paper 9-point version of the questionnaire, covering the affective
dimensions of pleasure (Valence) and activation (Arousal), graphically represented by
iconic characters (Lang, Bradley and Cuthbert 2008) (Fig. 2).

Fig. 2. Iconic nine-point scales of SAM in Valence (top) and Arousal (bottom) dimensions

Video Recorder. The session is videotaped with one camera placed in a corner of the
meeting room out of sight of the participant, to offer a general overview (Fig. 4).

2.4 Procedure

Each participant was welcomed by the conductor and briefed about the session. The infor-
mation sheet and the consent form were delivered to the participant, and the conductor
offered to address any doubt or concern. After reading the forms, the participant was
requested to sign the consent to participate and to be video recorded. Once signed, the
sociodemographic data were gathered and the participant was walked to the test room.
Once sat down at a table in front of a wall screen, the wristband was adjusted
to the participant’s wrist and checked. The participant was delivered with the SAM
questionnaire booklet and a pen (Fig. 4).

Every participant rated the same set of 30 pictures in the same order.
The sequence of image presentation and scoring was as follows: first, an open-
ing slide was displayed indicating the number of the picture to be scored; secondly,
a red arrow was briefly displayed to fix the attention in the center of the screen;
immediately afterwards the target image was displayed for 6 s; and finally a slide with the
representation of the two scales indicated the moment when the participant had to mark
the scores on the booklet. This cycle was followed by a white screen during 20 s for
recovery. The cycle was repeated 30 times, plus 3 practice trials that were not evaluated
(Figs. 3 and 4).

Fig. 3. The sequence and timing of one circle of the stress elicitation test. This cycle repeats for
the 30 images selected

Fig. 4. Overview of the setting in the stress elicitation test.

2.5 Data

The following data are obtained: (1) Physiological signals from the wrist band (heart
rate, skin temperature, galvanic response of the skin and movement) (2) Self-reported
experience from participants’ scores on the SAM questionnaires (3) observational data
from the video recording of the entire session.

3 Results
3.1 Subjective Experience

In Fig. 5 the scatter plot shows the distribution of the average scores on Valence and
Arousal for each picture. With respect to the standard scores, a Pearson correlation
between the average scores on SAM and the reference values of the pictures shows a
significant correlation of 0.903 in Arousal and of 0.927 in Valence.

Fig. 5. Scatter Plot of mean scores for each picture

3.2 Electro Dermal Activity

Preprocessing. We took the Electrodermal Activity signal for each participant. That
parameter depends on each participant’s specific skin conductance and presents
significant variability between participants.
In order to limit the effect of this variability between the signals of the different
participants, a first preprocessing step was performed, which consisted of normalizing the
EDA signal. A low-pass filter was then applied in order to remove the higher-frequency
components from the signal (a 2nd-order Butterworth low-pass filter with a cut-off
frequency of 0.05 Hz).
As a result, the normalized and filtered EDA signal was obtained for each participant
and for each of the 30 images used in the experiment. As a baseline signal we used the
EDA signal associated with a neutral condition obtained during the instruction phase
previous to viewing the images. That EDA baseline signal will be used during the feature
extraction process.
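A compact sketch of this preprocessing step is given below (ours); min-max normalisation is assumed, since the normalisation method is not further specified, and fs = 4 Hz corresponds to the EDA sampling rate of the Empatica E4.

import numpy as np
from scipy.signal import butter, filtfilt

def preprocess_eda(eda, fs=4.0, cutoff=0.05, order=2):
    # Per-participant normalisation followed by a 2nd-order Butterworth low-pass filter.
    eda = np.asarray(eda, dtype=float)
    eda = (eda - eda.min()) / (eda.max() - eda.min() + 1e-12)   # min-max normalisation
    b, a = butter(order, cutoff, btype="low", fs=fs)            # 0.05 Hz cut-off frequency
    return filtfilt(b, a, eda)                                  # zero-phase filtering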
Feature Extraction. The feature extraction process was performed on the EDA signal
captured during the observation period of each image. We used a sampling window equal
to the time period the experiment lasted for each image.

We used a set of eleven features directly measured from the EDA signal of each
window. Additionally, we also used two sets of six differential features: one set computed
between each feature of an image and the same feature of the previous image, and a second
set between each feature of an image and the same feature of the baseline signal.
In total, we thus extracted twenty-three features for each window of the EDA signal,
as listed in Fig. 6.

Fig. 6. Significant features
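As an illustration only (the full 23-feature set is listed in Fig. 6), a window-level extractor with a few typical EDA features and the two kinds of differential features could be sketched as follows; the individual feature names used here are ours.

import numpy as np

def eda_window_features(window, baseline_feats=None, prev_feats=None):
    # A handful of direct features computed over one image-viewing window.
    feats = {
        "mean": float(np.mean(window)),
        "std": float(np.std(window)),
        "max": float(np.max(window)),
        "slope": float(window[-1] - window[0]),
        "area": float(np.sum(window)),
    }
    # Differential features against the baseline window and the previous image window.
    for name, ref in (("baseline", baseline_feats), ("prev", prev_feats)):
        if ref is not None:
            for k in ("mean", "std", "max", "slope", "area"):
                feats["d_" + name + "_" + k] = feats[k] - ref[k]
    return feats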

According to the distribution of participants’ responses on the SAM scale, combined
with the values of the variables associated with each image, thresholds have been set to
categorize the output variables: “arousal” and “valence”.
Thus, a high “arousal” has been considered to be associated with a value above 6 on
the SAM rating scale, while a low “arousal” is associated with other values below this
threshold.
Similarly, a low “valence” is associated with a value lower than 3 on the SAM rating
scale, while a high “valence” is associated with other values above this threshold.
A third class was generated from the variables arousal and valence: distress. Accord-
ing to the circumplex model of affect (Larsen and Diener 1992) (see Fig. 7), distress will
occur as a combination of a high arousal (high activation) with a low valence (unpleasant).
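These labelling rules can be written directly as a small helper (our sketch):

def label_from_sam(arousal, valence, arousal_thr=6, valence_thr=3):
    # High arousal above 6 on the SAM scale, low valence below 3; distress is their conjunction.
    high_arousal = arousal > arousal_thr
    low_valence = valence < valence_thr
    return {"arousal_high": high_arousal,
            "valence_low": low_valence,
            "distress": high_arousal and low_valence}

label_from_sam(arousal=8, valence=2)    # -> distress is True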

Fig. 7. Valence and arousal thresholds



Training. The database for training contains 180 patterns obtained from the observation
of the 30 images by 6 participants. The distribution of the patterns for each of the
classes is the one shown in Table 1. We have carried out the training applying
different classifier models: support vector machines, radial basis functions, multilayer
perceptron, random forest, logistic model trees.

Table 1. Distribution of the patterns for each of the classes

Performance. To classify the Arousal class, we have chosen a random forest
model, trained with 80% of the data set and reserving 20% for testing. The correctly
classified instances rise above 72%. The Valence class was trained with the same setup as the
Arousal class; for that variable, the correctly classified instances are 75%. The class
“distress” was trained using an RBF classifier. Again, we used 80% of the data set for
training and 20% for testing. In that case, the level of correctly classified instances rises
above 83% (Tables 2, 3 and 4).
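The evaluation protocol for the classes trained with random forests can be sketched with scikit-learn as follows (ours; the toolkit actually used is not stated, and the 2000 estimators mirror the 2000 iterations reported in the table captions).

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def train_and_evaluate(X, y, seed=0):
    # 80/20 split followed by a random-forest classifier, returning the test accuracy.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    clf = RandomForestClassifier(n_estimators=2000, random_state=seed).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

# X: 180 x 23 feature matrix (one row per image viewing); y: binary class labels.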

Table 2. Arousal Classifier: Random Forest with 80% training and 20% test, and 2000 iterations.

Table 3. Valence Classifier: Random Forest with 80% training and 20% test, and 2000 iterations.

Table 4. Distress Classifier: Random Forest with 80% training and 20% test, and 2000 iterations.

4 Conclusions and Discussion


4.1 Conclusions
The present study explores the identification of distress/discomfort through EDA signals
recorded by an unobtrusive wrist wearable, with the focus on the design and methods to
gather quality data to build up a reliable database.
Although the number of participants (n = 6) and cases are not enough to draw
categorical conclusions, the analysis of the data allows us to envisage a feasible technique
for the main objective of the MESURAR project: to provide distress signals from
the smart walker users to optimise the shared control strategies in natural settings.
We can conclude that users’ subjective experience can and must be added to the
shared control loop, taking psychological states like distress into account. Available
sensor technologies and machine learning techniques allow relevant psychological
states like distress to be automatically estimated.

4.2 Limitations
The Empatica E4 wristband sensor is very convenient in daily monitoring for its size,
lightness and simplicity, but Galvanic Skin Response (GSR) measurements from two
points at the wrist are not as accurate as the ones measured at the fingers in terms of
signal resolution. Also, the calculated heart rate is considerably less exact than the one
derived from an ECG at Fs = 1 kHz (Ollander 2015).
The Empatica E4 wristband had a significant loss in terms of detected interbeat inter-
vals, but time-domain features such as the mean heart rate and standard deviation of
the heart rate were still well estimated, with good stress discrimination power. Further-
more, the skin conductivity signals measured at different locations (wrist versus finger)
show no visual resemblance and it appears that the Empatica E4 wristband yielded higher
stress discrimination power than the signal measured at the fingers (Ollander et al. 2017).

5 Further Work
We are now working on the extension of the database, the comparison and correlation
of data gathered simultaneously from other certified wearables like the EQUIVITAL
jacket, the enlargement of the ML procedures, and the design of real-time strategies to feed
the loop of the smart walker system on the fly, specifically with the data gathered by the
Empatica E4. Additionally, the Empatica's galvanic skin response measures will be compared
with the corresponding ones measured by the medically certified two-electrode sensor in the
Equivital LifeMonitor, equipment used as a gold standard in other studies (Yan Liu et al. 2013).

Acknowledgements. This work was partially supported by the Spanish Ministry of Ciencia,
Innovación y Universidades under project RTI2018–096701-B-C22, and by the Catalonia FEDER
program, resolution GAH/815/2018, under the project PECT Garraf: Envelliment actiu i saludable
i dependència.

References
Baig, M.Z., Kavakli, M.: A survey on psycho-physiological analysis & measurement methods in
multimodal systems. Multimodal Technol. Interact. 3(2), 37 (2019). https://doi.org/10.3390/
mti3020037
Campbell, J., Ehlert, U.: Acute psychosocial stress: does the emotional stress response correspond
with physiological responses? Psychoneuroendocrinology 37(8), 1111–1134 (2012). https://
doi.org/10.1016/j.psyneuen.2011.12.010
Can, Y.S., Arnrich, B., Ersoy, C.: Stress detection in daily life scenarios using smart phones and
wearable sensors: a survey. J. Biomed. Inf. 92, 103139 (2019). https://doi.org/10.1016/j.jbi.
2019.103139
Cortés, U., et al.: A SHARE-it service to elders’ mobility using the i-walker. Gerontechnology
7(2) (2008). https://doi.org/10.4017/gt.2008.07.02.032.00
Dautenhahn, K., Werry, I., Rae, J., Dickerson, P., Stribling, P., Ogden, B.: Robotic playmates:
analysing interactive competencies of children with autism playing with a mobile robot. In:
Dautenhahn, K., Bond, A., Canamero, L., Edmonds, B. (eds.), Socially Intelligent Agents-
Creating Relationships with Computers and Robots, pp. 117–124. Kluwer Academic Publishers.
Kluwer (2002)
Glover, J., Thrun, S., Matthews, J.T.: Learning user models of mobility-related activities through
instrumented walking aids. Proc. - IEEE Int. Conf. Robot. Autom. 2004(4), 3306–3312 (2004).
https://doi.org/10.1109/robot.2004.1308764
Jain, S., Member, S., Oswal, U., Xu, K.S., Eriksson, B., Haupt, J.: A compressed sensing based
decomposition of electrodermal activity signals. IEEE Trans. Biomed. Eng. 64(9), 2142–2151
(2017)
Kulyukin, V., Kutiyanawala, A., LoPresti, E., Matthews, J., Simpson, R.: IWalker: Toward a
rollator-mounted wayfinding system for the elderly. In: 2008 IEEE International Conference
on RFID (Frequency Identification), IEEE RFID 2008, pp. 303–311 (2008). https://doi.org/10.
1109/RFID.2008.4519363
Lang, P.J., Bradley, M.M., Cuthbert, B.N.: International affective picture system (IAPS):
affective ratings of pictures and instruction manual (2008)
Larsen, R.J., Diener, E.: Promises and problems with the circumplex model of emotion. In: Clark,
M.S. (ed.) Emotion: Review of Personality and Social Psychology, pp. 25–59. Sage, Newbury
Park (1992)
Liu, Y., Zhu, S.H., Wang, G.H., Ye, F., Li, P.Z.: Validity and reliability of multiparameter phys-
iological measurements recorded by the Equivital LifeMonitor during activities of various
intensities. J. Occup. Environ. Hyg. 10(2), 78–85 (2013). https://doi.org/10.1080/15459624.
2012.747404. PMID: 23259751
Lutscher, D.: The relationship between skin conductance and self-reported stress (2016). https://
essay.utwente.nl/69969/1/Lutscher_BA_BMS.pdf
Möller, J., et al.: Emotional stress as a trigger of falls leading to hip or pelvic fracture. Results
from the ToFa study - A case-crossover study among elderly people in Stockholm, Sweden.
BMC Geriatrics, Vol. 9, p. 7 (2009). https://doi.org/10.1186/1471-2318-9-7
Ollander, S.: Wearable Sensor Data Fusion for Human Stress Estimation (Linköping) (2015).
http://www.diva-portal.org/smash/get/diva2:865706/FULLTEXT01.pdf

Ollander, S., Godin, C., Campagne, A., Charbonnier, S.: A comparison of wearable and stationary
sensors for stress detection. In: 2016 IEEE International Conference on Systems, Man, and
Cybernetics, SMC 2016 - Conference Proceedings, pp. 4362–4366 (2017). https://doi.org/10.
1109/SMC.2016.7844917
Perugia, G., Diaz-Boladeras, M., Catala-Mallofre, A., Barakova, E. I., Rauterberg, M.: ENGAGE-
DEM: a model of engagement of people with dementia. IEEE Trans. Affect. Comput. (2020).
https://doi.org/10.1109/TAFFC.2020.2980275
Perugia, G., Rodriguez-Martin, D., Diaz Boladeras, M., Mallofre, A.C., Barakova, E., Rauter-
berg, M.: Electrodermal activity: explorations in the psychophysiology of engagement with
social robots in dementia. In: 2017 26th IEEE International Symposium on Robot and Human
Interactive Communication (RO-MAN) (2017). https://doi.org/10.1109/ROMAN.2017.817
2464
Schmidt, P., Reiss, A., Dürichen, R., Van Laerhoven, K.: Labelling affective states “in the wild”:
Practical guidelines and lessons learned. UbiComp/ISWC 2018 - Adjunct Proceedings of the
2018 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Pro-
ceedings of the 2018 ACM International Symposium on Wearable Computers, pp. 654–659
(2018). https://doi.org/10.1145/3267305.3267551
Schmidt, P., Reiss, A., Dürichen, R., Laerhoven, K.V.: Wearable-based affect recognition—a
review. Sensors 19(19), 4079 (2019). https://doi.org/10.3390/s19194079
Deep Learning for Heart Sounds
Classification Using Scalograms
and Automatic Segmentation
of PCG Signals

John Gelpud1(B), Silvia Castillo1, Mario Jojoa2, Begonya Garcia-Zapirain2,
Wilson Achicanoy1, and David Rodrigo3
1 University of Nariño, Pasto, Colombia
{johnjairog,gabrielacast,wilachic}@udenar.edu.co
2 Deusto University, Bilbao, Spain
{mariojojoa,mbgarciazapi}@deusto.es
3 Basque Country University, Leioa, Spain
[email protected]

Abstract. This paper proposes a set of Deep Learning algorithms for classifying
Phonocardiogram (PCG) scalograms. PCG signals contain valuable information about
the heart's health status, and they could help us in the early detection and diagnosis
of potential abnormalities. The system classifies sounds into normal or abnormal
categories, supported by reliable signal processing algorithms that automatically
denoise and segment the sounds to improve the Deep Learning detection task. At the
first stage, we denoise the PCG signal using a multi-resolution analysis based on the
Discrete Wavelet Transform (DWT). At the second stage, we automatically segment the
sounds using an algorithm based on the Teager Energy Operator (TEO) and the
autocorrelation. This is very important, because it is needed to select the S1
component related to the systole and the S2 component related to the diastole.
Finally, scalogram images are obtained using the Continuous Wavelet Transform (CWT).
The classification task was executed using the heart sounds from the 2016
PhysioNet/CinC Challenge database and the pretrained Convolutional Neural Networks
(CNNs) ResNet152 and VGG16, achieving accuracies of 91.19% and 90.75%, respectively.
The results of our proposed model present a good contribution to the heart sounds
classification area, in comparison with the state-of-the-art accuracy of 87%.

Keywords: Phonocardiogram · Automatic segmentation · Deep learning · Scalograms ·
Classification · Heart sounds

1 Introduction
The heart is the most important organ of the cardiovascular system, and its task is
to pump blood throughout the body by contracting and relaxing

the cardiac muscles. It has four chambers, two atria and two ventricles (left and
right), and four valves, namely tricuspid, pulmonary, mitral, and aortic. The
right atrium receives deoxygenated blood from the cava veins, to be sent to the
right ventricle through the tricuspid valve. Then, the right ventricle pumps the
blood through the pulmonary valve to the pulmonary arteries, which transport
the blood to the lungs, to be oxygenated and sent to the left atrium. The oxy-
genated blood passes from the left atrium to the left ventricle through the mitral
valve, and finally is sent to the aorta through the aortic valve to be transported
to the rest of the body [10].
The cardiac cycle comprises a series of periodic mechanical and electrical
events that occur during cardiac activity. Each cycle comprises the alternating
contraction (systole) and relaxation (diastole) of the heart chambers, also known
as the heartbeat, and is generally composed of two fundamental heart sounds
(FHS): S1 (representing the closure of the mitral and tricuspid valves when
ventricular pressures exceed atrial pressures at the beginning of systole), and
S2 (representing the closure of the aortic and pulmonary valves). Besides S1
and S2, two additional sounds may be present in the cardiac cycle, S3 and S4,
which can appear in both normal and pathological conditions [5]. When the
heart valves do not open and close properly, the FHS produced by the cardiac
cycle are different, indicating a cardiac abnormality [10].
According to the World Health Organization, cardiovascular diseases have
remained the leading cause of death worldwide for the last 20 years [17]. The
vast majority of these deaths occur in low- and middle-income countries, where
most of the population does not have access to primary health care programs,
making difficult both early detection and adequate treatment, leading to late
diagnosis and death, often in their most productive years [16]. On the other hand,
at the macroeconomic level, cardiovascular diseases place a heavy burden on the
economy of low- and middle-income countries, with a projected burden for the
period 2011–2025 of $ 3.76 trillion [15]. At the household level, sufficient evidence
suggests that cardiovascular and other noncommunicable diseases contribute to
increased poverty due to huge health expenditures [16].
On the other hand, the auscultation process consists of listening to and interpreting
the heart sounds using a stethoscope; it is the most widely used technique
for detecting cardiac abnormalities, since it is noninvasive, painless, and simple.
In addition, it is possible to obtain a graphic representation of the sounds. This
graph is called a phonocardiogram (PCG) and can be analyzed to assess a
patient’s cardiac condition. However, diagnosing a cardiac abnormality can be
challenging because it requires a trained health care professional with advanced
skills to correctly interpret heart sounds and images. For this reason, the use
of artificial intelligence algorithms seeks to overcome this challenge and support
medical professionals in their diagnosis.
To train artificial intelligence algorithms with the capacity to detect anoma-
lies from a PCG signal, it is very important to select those features in the signal
that provide sufficient information about the cardiac activity for the proper anal-
ysis. In this way, Deep Learning offers very useful and powerful algorithms to

carry out this task, performing feature extraction by themselves as a result of their
own structure, since they include a feature extraction phase inspired by the human
visual system. These algorithms are known as convolutional neural networks
(CNNs).
There are several types of CNNs that have been developed over the last
years, used in numerous fields and for different applications [8]. CNNs need a
large number of labeled images to be trained and also require high computational
resources, depending on their architecture and depth.
The main objectives of this work are the segmentation of PCG signals and the
comparison of the performance of two pre-trained CNNs, Resnet152 and VGG16,
in the classification task, using the transfer learning technique. To achieve the
objectives, the PCG signals are filtered and segmented automatically using the
discrete wavelet transform (DWT), the Teager energy operator (TEO), and auto-
correlation. Finally, the segmented signals are converted to scalogram images
using the continuous wavelet transform (CWT).

2 Literature Review
The detection of cardiac abnormalities using PCG signals has been an active
area of research in recent decades. Researchers from all over the world have been
actively involved in this area and have proposed different methods, based on
Machine Learning techniques and signal processing, that have enabled clinicians
to speed up and improve the auscultation process in recent years.
Currently PCG signal segmentation and classification are the most widely
used methods to study heart sound signals. The most common methods for seg-
mentation mainly include methods based on electrocardiographic (ECG) signals,
signal envelope, and Machine Learning. On the other hand, Deep Learning has
transformed the field of Machine Learning, and has been used in image catego-
rization, object recognition, and speech recognition. Moreover, Deep Learning
plays an important role in the biomedical field for automated disease analy-
sis, and can replace the traditional feature engineering based approach [1]. A
commonly used Deep Learning algorithm in images is the CNN, not only for
classification but also for segmentation tasks.
Yi He et al. [6] study a method to analyze heart sounds through the use of
CNNs. First, the original sounds are preprocessed, and then different methods
are used for feature extraction, including the Hilbert envelope, the homomorphic
envelope, the Wavelet envelope, and the power spectral density envelope. Based
on the extracted features, the signal is segmented into four periods (S1, systole,
S2, and diastole) using the CNN U-Net, which is a commonly used algorithm for
image segmentation. Finally, classification is performed using a CNN,
achieving an accuracy of 96.4%, sensitivity of 78.1%, and specificity of 87.3%.
In [1], the performance of a CNN known as AlexNet is studied, making use of the
Transfer Learning technique. First, the PCG signals are filtered and segmented
to ensure that they are properly aligned before the classification task. A high-
pass filter and the Springer segmentation algorithm [20] were used for these two
586 J. Gelpud et al.

tasks. Finally, the scalogram images of each segmented signal are obtained using
CWT. In this step the one-dimensional PCG signals are transformed into two-
dimensional time-frequency representations. An accuracy of 87% was obtained
using AlexNet as a feature extractor and a support vector machine (SVM) as a
classifier, while using AlexNet as a feature extractor and classifier the accuracy
was 85%.
Shannon energy envelope is one of the most common methods in the identifi-
cation of heart sounds S1, S2, S3, and S4. For example, in [5] different signal pro-
cessing techniques and Deep Learning methods are combined to filter, compress,
segment, and classify PCG signals. The PCG signal is filtered and compressed
using DWT-based multi-resolution analysis. Then, a segmentation algorithm
based on Shannon energy envelope and zero crossing is applied to the signals.
Finally, Mel-scale power spectrogram and cepstral coefficients (MFCC) are used
to extract features from the segmented signals. The extracted features are used
as input to a deep neural network (DNN) to classify each PCG signal as normal
or abnormal, achieving an accuracy of 97.10%. In [13], the empirical Wavelet
transform and the normalized Shannon energy are used for preprocessing and
automatic segmentation of heart sounds, and for each recording the systolic and
diastolic intervals are identified. Then, six power features are extracted (three
for the systole and three for the diastole). Finally, different Machine Learning
models (SVM, K-nearest neighbor (KNN), random forest, and multilayer per-
ceptron) are used for classification. For automatic classification, 805 recordings
from different databases were used. The best accuracy achieved was 99.26% with
the KNN classifier, with a specificity of 100% and sensitivity of 98.57%.
Khan et al. [9] study automatic heart sound classification using segmented
and unsegmented phonocardiogram signals. In addition, they analyze different
features in the time and frequency domain in order to determine the most effec-
tive subset of features to improve classification performance. Sound segmentation
is carried out using Springer’s method. Different classification algorithms includ-
ing SVMs, KNNs, decision trees, Artificial Neural Networks (ANNs), and long
short-term memory networks (LSTM) were used to evaluate the performance of
the feature subsets. The results show that features extracted from unsegmented
sounds do not show consistent performance with some classification algorithms.
Furthermore, when using features from segmented sounds the performance of
all classifiers was higher than the performance obtained using features from
unsegmented sounds. These results highlight the importance of using time and
frequency domain features of segmented signals.

3 Methodology
The flow chart of the proposed method to analyse and classify the PCG signals
is shown in Fig. 1. Each step will be detailed in the following subsections.

Fig. 1. Diagram of the proposed method

3.1 Databases
The heart sounds used in this work, for the stages of validation of the segmenta-
tion and classification algorithms, were obtained from the Pascal Challenge [4]
and 2016 Physionet/Cinc Challenge [11] databases, respectively.
Physionet is currently the largest heart sound dataset in the world and is
divided into two sets, a training set and a test set. The training set contains
a total of 3,153 heart sound recordings from 764 patients. All recordings have
a sampling rate of 2 kHz and their duration is between 5 s and 120 s. Normal
sounds are labeled −1, while abnormal sounds are labeled 1.
The Pascal database is composed of two datasets collected using the iStethoscope
Pro iPhone application (Dataset A) and the DigiScope digital stethoscope
from a clinical trial (Dataset B). Unlike Physionet, this database provides man-
ual segmentation of 113 recordings.

3.2 Signal Pre-processing

The sounds obtained from the databases are resampled at a frequency of 2 kHz
and normalized in amplitude, by dividing them by their absolute maximum
value. In general, when heart sounds are recorded, they are exposed to different
types of noise and unwanted information, that affect the subsequent stages and
therefore hinder the diagnosis. For this reason, before performing the segmenta-
tion process, it is necessary to filter the signals to attenuate the noise.

Discrete Wavelet Transform (DWT): With the DWT it is possible to analyze
signals with different resolutions at different frequencies. At high frequencies,
good time resolution and poor frequency resolution can be achieved. Similarly,
at low frequencies, good frequency resolution and poor time resolution can be
achieved [5].
In this work, the DWT-based adaptive thresholding method for denoising
described in [7] was used. DWT decomposes the signal into approximation coef-
ficients (A) and detail coefficients (D). This decomposition is done at five levels,
using the Coif-5 wavelet as the mother wavelet, in order to apply adaptive thresh-
olding by a nonlinear mean thresholding function to the detail coefficients (D4

and D5), whose frequencies comprise most of the FHS frequency range (25–120
Hz). The remaining approximation and detail coefficients are made zero, and
finally the signal is reconstructed. An example of a signal filtered by this method
is shown in Fig. 2.
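A PyWavelets sketch of this denoising step is given below. The decomposition level, the wavelet, and the choice of D4/D5 follow the text; the threshold rule itself is a stand-in (a universal soft threshold), since the adaptive nonlinear mean thresholding function of [7] is not reproduced here.

```python
import numpy as np
import pywt

def denoise_pcg(signal):
    """5-level 'coif5' DWT denoising: keep thresholded D4/D5, zero the other bands."""
    coeffs = pywt.wavedec(signal, "coif5", level=5)   # [A5, D5, D4, D3, D2, D1]
    kept = [np.zeros_like(c) for c in coeffs]
    for idx in (1, 2):                                # D5 and D4 cover most of 25-120 Hz
        c = coeffs[idx]
        sigma = np.median(np.abs(c)) / 0.6745
        thr = sigma * np.sqrt(2.0 * np.log(len(c)))   # universal threshold (assumed rule)
        kept[idx] = pywt.threshold(c, thr, mode="soft")
    return pywt.waverec(kept, "coif5")
```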


Fig. 2. (a) Original PCG signal. (b) Denoised PCG signal

3.3 Segmentation
In this work, a segmentation algorithm based on the signal envelope and the
autocorrelation function is proposed. The process is shown in the flow chart in
Fig. 3.

Fig. 3. Diagram of the segmentation algorithm

Teager Energy Operator (TEO): The TEO calculates the signal’s energy
from its amplitude and instantaneous frequency, which, in turn, improves the
signal-to-noise ratio and facilitates the detection of the beginning of each
FHS [18]. The TEO is defined for the discrete case as [19]

TEO = x^2[n] − x[n − 1] x[n + 1] ,   (1)



where n ∈ Z and x[n] is the filtered PCG signal.


In this work TEO was used to calculate the filtered signal envelope. To
smooth the envelope, a moving average and the square root were calculated
to reduce the difference between the large amplitude and small amplitude com-
ponents. The signal was then standardized by subtracting the average value and
dividing by the standard deviation, and all negative values were converted to
zero to determine where each component begins and ends. The energy of each
component is calculated, and those whose energy is less than the average of all
energy values are eliminated (see Fig. 4).
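A compact NumPy version of this envelope computation is sketched below; the moving-average window length is an assumption, as the text does not specify it.

```python
import numpy as np

def teo_envelope(x, win=100):
    """TEO envelope of a PCG signal per Eq. (1), smoothed, square-rooted and standardized."""
    x = np.asarray(x, dtype=float)
    teo = x[1:-1] ** 2 - x[:-2] * x[2:]                          # Eq. (1)
    smooth = np.convolve(teo, np.ones(win) / win, mode="same")   # moving average
    env = np.sqrt(np.clip(smooth, 0.0, None))                    # compress amplitude range
    env = (env - env.mean()) / env.std()                         # standardize
    return np.clip(env, 0.0, None)                               # negative values set to zero
```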


Fig. 4. (a) Teager envelope. (b) Envelope after applying the energy threshold

Finally, the resulting envelope signal is divided into windows, whose duration
is at least two seconds as shown in Fig. 5, to ensure that there are at least two
cardiac cycles in each window [19].

Autocorrelation Function: The autocorrelation function for a signal x is defined as [12]

r[n] = (1/N) · Σ_{m=0}^{N−n−1} x[m] x[m + n] ,   (2)

where n = 0, 1, . . . , N − 1, and N is the total number of samples of the signal.
It allows taking advantage of the quasi-periodic nature of the cardiac cycle to


Fig. 5. Envelope windows

estimate the periodicity of the signal, and in particular, the duration of the
cardiac cycle and the minimum distance between the S1 and S2 components.
The autocorrelation function r[n] was applied to each window in order to
estimate the duration of the cardiac cycle and the minimum distance between
the FHS (see Fig. 6). Once these values were obtained, those components with
separation less than the minimum distance between S1 and S2 were identified,
and the component with the lowest energy was eliminated, since it was considered
as a false component that passed the energy threshold.
Finally, a square wave signal of amplitude 1 with a period equal to the calcu-
lated cardiac cycle is created. The synthetic signal is temporally aligned with the
envelope of the PCG signal based on the overall maximum of the cross-correlation
between these two signals, as shown in Fig. 7. This process is then repeated, discarding
the positions of the already identified components, to determine the onsets
of the remaining FHS.
Once the locations of the FHS of the signal have been determined, the S1
and S2 sounds are identified, taking into account the minimum distance between
S1 and S2 previously estimated, and knowing that the interval between S1 and
S2 is smaller than the interval between S2 and S1 [13]. After identifying the
components, we proceed to segment the signal, obtaining a cardiac cycle in each
segmentation.
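The sketch below illustrates, under simplifying assumptions, the two envelope-based steps described above: estimating the cardiac-cycle length from the autocorrelation of Eq. (2), and aligning a unit-amplitude square wave with the envelope through the maximum of their cross-correlation. The minimum admissible cycle length and the square-wave duty cycle are assumptions not given in the text.

```python
import numpy as np

def estimate_cycle_length(envelope, fs, min_cycle_s=0.4):
    """Cardiac-cycle length (in samples) from the autocorrelation r[n] of Eq. (2)."""
    e = np.asarray(envelope, dtype=float)
    n = len(e)
    r = np.correlate(e, e, mode="full")[n - 1:] / n   # r[k] for lags k = 0..n-1
    start = int(min_cycle_s * fs)                     # skip the zero-lag peak (assumption)
    return start + int(np.argmax(r[start:]))

def align_square_wave(envelope, cycle):
    """Unit square wave of period 'cycle' aligned via the cross-correlation maximum."""
    e = np.asarray(envelope, dtype=float)
    t = np.arange(len(e))
    square = ((t % cycle) < cycle // 2).astype(float)  # 50% duty cycle (assumption)
    lag = int(np.argmax(np.correlate(e, square, mode="full"))) - (len(e) - 1)
    return np.roll(square, lag)
```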

Continuous Wavelet Transform (CWT): The CWT is an excellent tool for mapping
the changing characteristics of non-stationary signals, and it is used to
decompose a signal into wavelets. Wavelets are small oscillations that are highly
localized in time. Unlike the Fourier Transform, the CWT basis functions are
scaled and shifted versions of the time-localized mother Wavelet [21].


Fig. 6. Cardiac cycle (number of samples between blue points) and minimum distance
from S1 and S2 (minimum distance in samples between red point and one of the blue
points) estimation.


Fig. 7. (a) PCG signal envelope. (b) Original square wave signal (blue) and aligned
square wave signal (orange). (c) Cross-correlation between the envelope and square
wave signal.

To obtain a time-frequency representation (image) of each segmented signal,
the CWT was used, which offers very good feature localization in time and frequency.
Figure 8 shows the segmentation of a cardiac cycle with its respective scalogram.
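A possible implementation of this step with PyWavelets and Matplotlib is sketched below; the Morlet mother wavelet and the scale range are assumptions, since the text only states that CWT scalogram images were generated.

```python
import numpy as np
import pywt
import matplotlib.pyplot as plt

def save_scalogram(segment, fs, out_path, scales=np.arange(1, 128)):
    """Save the CWT scalogram of one segmented cardiac cycle as an image file."""
    coef, _ = pywt.cwt(np.asarray(segment, dtype=float), scales, "morl",
                       sampling_period=1.0 / fs)
    plt.imshow(np.abs(coef), aspect="auto", origin="lower")
    plt.axis("off")
    plt.savefig(out_path, bbox_inches="tight", pad_inches=0)
    plt.close()
```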

Fig. 8. (a) Segmented signal. (b) Segmented signal scalogram

3.4 Classification

The Transfer Learning technique was applied for the classification of the scalograms;
it consists of taking a pre-trained convolutional neural network model
as the basis for classification, with the advantage of being able to achieve good
accuracy using few training images compared to the number of images needed
to train a model from scratch [2]. The CNN models used in this work were
ResNet152 and VGG16.
The VGG16 CNN is composed of 13 convolutional and 3 fully connected
layers with the ReLU activation function, and uses 2 × 2 and 3 × 3 filters. CNNs
can get deeper and deeper with the addition of layers; however, the accuracy
may decrease as the number of layers increases. For this reason, ResNet CNNs are
based on residual (skip) connections, use batch normalization, and are mainly
composed of convolutional and identity blocks. ResNet152 is up to 9 times
deeper than VGG16, with 152 layers [3].
In this step, the VGG16 and ResNet152 models were created using their pre-
trained weights, modifying the output layer to obtain a binary classification.
Then, the scalogram images were passed to the CNNs to adjust the network
weights.
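In PyTorch this model preparation can be sketched as follows: the pretrained weights are loaded and the output layer is replaced by a two-class head. The use of torchvision and the exact layer names are assumptions about the implementation, which the text does not spell out.

```python
import torch.nn as nn
from torchvision import models

def build_pretrained_cnn(name="resnet152", num_classes=2):
    """Pre-trained backbone with its output layer replaced for normal/abnormal PCG."""
    if name == "resnet152":
        model = models.resnet152(pretrained=True)
        model.fc = nn.Linear(model.fc.in_features, num_classes)
    elif name == "vgg16":
        model = models.vgg16(pretrained=True)
        model.classifier[6] = nn.Linear(model.classifier[6].in_features, num_classes)
    else:
        raise ValueError(name)
    return model
```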

4 Discussion

A total of 2,302 scalograms were created and 58.69% of the images correspond
to normal sounds. 60% of the data was used for training, 20% for validation,
and 20% for testing, ensuring that the same proportion of classes was kept in each
subset. For training, 25 epochs were defined, using an optimizer based on
stochastic gradient descent with a learning rate of 0.001, reduced by
a factor of 0.1 every 7 epochs, and with a momentum of 0.9. The mini-batch size
used in each iteration was 4 images.
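Translated into a PyTorch training sketch, those hyperparameters look roughly as follows; the `model` argument and `train_loader` (a DataLoader over the scalogram images, with batch size 4) are assumed to be defined elsewhere, for example with the helper sketched in Sect. 3.4.

```python
from torch import nn, optim

def train(model, train_loader, device="cpu", epochs=25):
    """SGD (lr 0.001, momentum 0.9), lr x0.1 every 7 epochs, mini-batches of 4 images."""
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)
    for _ in range(epochs):
        for images, labels in train_loader:          # batch_size=4 set on the DataLoader
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()
    return model
```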

The validation subset was used to avoid over-fitting. At the end of training,
the model that obtained the best accuracy in the validation subset is returned.
The best model was applied to the test data in order to obtain the performance
measures, including accuracy, sensitivity, specificity, and precision [14].
The experiments were carried out on a computer with a 6-core AMD Ryzen
5-3600 processor, frequency of 3.59 GHz, 16 GB of RAM memory, and a GTX
1660 SUPER graphics card. The MATLAB R2020b software was used for the
segmentation stage, and the Python programming language and the PyTorch
machine learning library were used for the classification stage.
Table 1 summarizes the results obtained after applying the fine-tuned
ResNet152 and VGG16 networks to the test images.

Table 1. Test images results.

Measures ResNet152 VGG16


Accuracy 91.19% 90.75%
Sensitivity 86.3% 81.9%
Specificity 94.5% 96.7%
Precision 91.3% 94.3%
Training Time 16 min 8 min

The results presented in Table 1 show a higher accuracy and sensitivity for
ResNet152. Therefore, it is able to correctly classify a higher number of
images compared to the VGG16 network, and it classifies abnormal sounds more
reliably than VGG16. However, VGG16 achieved higher specificity and precision
percentages, so its ability to detect normal sounds is higher compared to
ResNet152, and the number of normal sounds diagnosed as abnormal (false
positives) is lower than for ResNet152. Additionally,
VGG16 requires less training time than ResNet152 because it has fewer layers.

Table 2. Comparison of results.

Literature source Accuracy Sensitivity Specificity


[6] 96.4% 78.1% 87.3%
[1] 87.65% 83.71% 89.99%
ResNet152 (Our method) 91.19% 86.3% 94.5%
VGG16 (Our method) 90.75% 81.9% 96.7%

According to Table 2, where other methods based on CNNs are compared
with the method proposed in this work, it is evident that the adjustment of

the pre-trained CNNs yields results comparable with those obtained in [6] and
surpasses the results obtained in [1].
The methodology proposed in [1] and [6] is very similar to the one pro-
posed in this work and they are also based on Deep Learning algorithms. Unlike
our method, the preprocessing applied to the signals in [1] and [6] is based on
classical low-pass and high-pass filters. On the other hand, in [1] the Springer
segmentation algorithm is used, while in [6] a segmentation algorithm based on
the deep CNN U-net is proposed. As in this work, in [1] a pre-trained CNN is
used for the classification task, however, the AlexNet is combined with the SVM
algorithm to obtain the results shown in Table 2. For classification, in [6] a CNN
was implemented without relying on another pretrained architecture. The hyper
parameters of the CNN were modified to evaluate its performance, this hyper
parameter adjustment may be the reason for achieving higher accuracy.

5 Conclusion

Because PCG signals are highly exposed to noise, the preprocessing stage is necessary
for proper segmentation and classification. Thus, the DWT-based adaptive
thresholding method for denoising the signals provides efficient results, highlighting
the FHS and attenuating the noise in different frequency ranges.
The results of this work demonstrate that CNNs have a good diagnostic
capability, with the advantage of automatic feature extraction. This information
is used to classify the signals as normal or abnormal, avoiding the difficult task
of choosing features from the signals manually.
The Deep Learning approach applied in this work showed outstanding per-
formance and appears to be a promising tool in the area of PCG signal analysis.
More research is needed on the tuning of the hyper parameters of neural net-
works and their influence on the classification stage. A good combination of Deep
Learning methods with traditional Machine Learning algorithms, such as sup-
port vector machines (SVMs) and decision trees, could improve the performance
of the proposed approach.
One limitation of this work is the fact that we used an open dataset to train
the model. In a real hospital environment, the signals could be disturbed
by noise sources not considered in this study, so it may be necessary to
implement the proposed model with data collected in real hospital environments.
This situation could affect not only the performance of the Deep Learning
classification model but also the segmentation algorithm.
Another, second-order limitation of our study is the processing of the scalograms,
which belong to the complex number domain, whereas we used real-valued
Deep Learning models, losing information that could help us to improve the
performance of the classification algorithm. We therefore consider it necessary to use
complex-valued Deep Learning models, to take advantage of this information
implicit in the input signals, and to compare the results with those
obtained in this work for the classification task.

Acknowledgement. This work was partially funded by the University of Nariño and
eVIDA group IT905-16 of the University of Deusto.

References
1. Alaskar, H., Alzhrani, N., Hussain, A., Almarshed, F.: The implementation of pre-
trained AlexNet on PCG classification. In: Huang, D.-S., Huang, Z.-K., Hussain, A.
(eds.) ICIC 2019. LNCS (LNAI), vol. 11645, pp. 784–794. Springer, Cham (2019).
https://doi.org/10.1007/978-3-030-26766-7_71
2. Géron, A.: Hands-on machine learning with Scikit-Learn, Keras and TensorFlow:
concepts, tools, and techniques to build intelligent systems (2019). https://www.
oreilly.com/library/view/hands-on-machine-learning/9781492032632/
3. Benali Amjoud, A., Amrouch, M.: Convolutional neural networks backbones for
object detection. In: El Moataz, A., Mammass, D., Mansouri, A., Nouboud, F.
(eds.) ICISP 2020. LNCS, vol. 12119, pp. 282–289. Springer, Cham (2020). https://
doi.org/10.1007/978-3-030-51935-3_30
4. Bentley, P., Nordehn, G., Coimbra, M., Mannor, S.: The PASCAL Classify-
ing Heart Sounds Challenge 2011 (CHSC2011) Results (2011). http://www.
peterjbentley.com/heartchallenge/index.html
5. Chowdhury, T.H., Poudel, K.N., Hu, Y.: Time-frequency analysis, denoising, com-
pression, segmentation, and classification of PCG signals. IEEE Access 8, 160882–
160890 (2020). https://doi.org/10.1109/ACCESS.2020.3020806
6. He, Y., Li, W., Zhang, W., Zhang, S., Pi, X., Liu, H.: Research on segmentation
and classification of heart sound signals based on deep learning. Appl. Sci. 11(2),
1–15 (2021). https://doi.org/10.3390/app11020651
7. Jain, P.K., Tiwari, A.K.: An adaptive thresholding method for the wavelet based
denoising of phonocardiogram signal. Biomed. Signal Process. Control 38, 388–399
(2017). https://doi.org/10.1016/j.bspc.2017.07.002
8. Khan, A., Sohail, A., Zahoora, U., Qureshi, A.S.: A survey of the recent architec-
tures of deep convolutional neural networks. Artif. Intell. Rev. 53(8), 5455–5516
(2020). https://doi.org/10.1007/s10462-020-09825-6
9. Khan, F.A., Abid, A., Khan, M.S.: Automatic heart sound classification from seg-
mented/unsegmented phonocardiogram signals using time and frequency features.
Physiol. Meas. 41(5), 55006 (2020). https://doi.org/10.1088/1361-6579/ab8770
10. Khan, N.M., Khan, M.S., Khan, G.M.: Automated Heart Sound Classification from
Unsegmented Phonocardiogram Signals Using Time Frequency Features (2018).
https://doi.org/10.5281/ZENODO.1340418. https://zenodo.org/record/1340418
11. Liu, C., et al.: An open access database for the evaluation of heart sound algo-
rithms. Physiol. Meas. 37(12), 2181–2213 (2016). https://doi.org/10.1088/0967-
3334/37/12/2181
12. Liu, Q., Wu, X., Ma, X.: An automatic segmentation method for heart sounds.
BioMedical Eng. OnLine 17(1), 106 (2018). https://doi.org/10.1186/s12938-018-
0538-9
13. Narváez, P., Gutierrez, S., Percybrooks, W.S.: Automatic segmentation and clas-
sification of heart sounds using modified empirical wavelet transform and power
features. Appl. Sci. 10(14) (2020). https://doi.org/10.3390/app10144791
14. Japkowicz, N., Shah, M.: Evaluating Learning Algorithms: A Classification Perspective.
Cambridge University Press (2011)
15. World Health Organization: A global brief on hypertension (2013). https://www.
who.int/cardiovascular diseases/publications/global brief hypertension/en/

16. World Health Organization: Cardiovascular diseases (CVDs) (2017). https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)
17. World Health Organization: WHO reveals leading causes of death and disability
worldwide: 2000–2019 (2020). https://www.who.int/news/item/09-12-2020-who-
reveals-leading-causes-of-death-and-disability-worldwide-2000-2019
18. Ramović, A., Bandić, L., Kevrić, J., Germović, E., Subasi, A.: Wavelet and teager
energy operator (TEO) for heart sound processing and identification. In: Badnjevic,
A. (ed.) CMBEBIH 2017, pp. 495–502. Springer, Singapore (2017). https://doi.
org/10.1007/978-981-10-4166-2_76
19. Sabir, M.K.: PCG signal analysis using Teager energy operator & autocorrelation
function. In: International Conference on Computer Medical Applications, ICCMA
2013 (2013). https://doi.org/10.1109/ICCMA.2013.6506176
20. Springer, D., Tarassenko, L., Clifford, G.: Logistic regression-HSMM-based heart
sound segmentation. IEEE Trans. Biomed. Eng. 1 (2015). https://doi.org/10.1109/
TBME.2015.2475278. http://ieeexplore.ieee.org/document/7234876/
21. WEISANG: Continuous Wavelet Transform (CWT) (2019). https://www.weisang.
com/en/documentation/timefreqspectrumalgorithmscwt en/
Skin Disease Classification Using Machine
Learning Techniques

Mohammad Ashraful Haque Abir, Golam Kibria Anik, Shazid Hasan Riam,
Mohammed Ariful Karim, Azizul Hakim Tareq, and Rashedur M. Rahman(B)

Department of Electrical and Computer Engineering, North South University,
Dhaka, Bangladesh
{ashraful.haque,golam.anik,shazid.riam,mohammad.karim01,
azizul.tareq,rashedur.rahman}@northsouth.edu

Abstract. According to the Global Burden of Disease project, skin diseases are
the fourth leading cause of nonfatal illness throughout the world. Diagnosis of
dermatological diseases presents a challenge, compounded by the shortage of trained
dermatologists and limited access to formal medical care. This is a critical problem,
especially in countries with a large rural population and minimal development. The
aim of this paper is to study machine learning based classifiers for predicting skin
infections in three classes from a clinical dataset. Convolutional neural networks
(CNNs) have been proven to perform well in image classification. The performance
of the neural network is compared with a benchmark multiclass SVM classifier.
The result analysis and possible future works are also discussed in this paper.

Keywords: CNN · Image classification · Neural network · SVM

1 Introduction

Annually, the Global Burden of Disease project [1] reports that skin infections remain
the 4th leading cause of nonfatal sickness around the world. Nevertheless, research
efforts and funding do not match the overall disability caused by skin illnesses.
This, in addition to the dearth of skilled dermatologists, has created an epidemic
of skin infections [2]. This is especially prevalent in economically developing
countries such as Bangladesh, where the number of patients receiving medical care for
dermatological infections is low [2]. Moreover, a large portion of the population lives
in rural settings where the presence of medical care is minimal. Furthermore, due to
low literacy rates, patients are reluctant to seek medical care until compelled by the
severity of the infection [3].
A common skin infection among the rural population is Chickenpox. Chickenpox is
a highly transmissible disease caused by the primary infection with the varicella-zoster
virus [8]. Signs of the disease are skin breakouts that form small, itchy blisters, which
progress to form scabs. Each blister originates as a small red breakout on the skin, which,
over a period of time, grows into a small blister [8]. Though vaccine campaigns helped


reducing the infection rates, there are large portions of the population left unvaccinated
due to poor reach, which results in outbreaks [4]. Though Chickenpox is rarely fatal,
there is social stigmatization of the patient due to the visible outbreak of lesions on the
skin [4].
Most of the rural population are employed in agricultural work where they are largely
exposed to water sources which are contaminated with Arsenic [5], a human carcinogen.
Moreover, a large portion of the population relies on groundwater as a source of drinking
water [5]. An early indicator of arsenic poisoning is a skin condition known as Arsenical
Keratosis. It is a precancerous condition caused by intensive exposure to arsenic.
The patients are often identified as having yellowish lesions that primarily affect the
palms and soles. The lesions are graded into three grades based on thickness: mild,
moderate, and severe [6]. If left untreated, this can progress to Bowen's disease, which
is an early type of skin cancer. The lesions of Bowen's disease are usually pigmented
red and occur in the neck and torso region. Left untreated, this develops into Squamous
cell carcinoma, an invasive form of skin cancer [7].
In this paper, the aim is to study different kinds of classifiers, along with a convolutional
neural network (CNN), for skin infections, and to integrate them into a mobile application
for the marginalized and rural areas.
The paper is arranged as follows. Section 2 presents previous works related to
the topic. Section 3 introduces the methodology, and Sect. 4 describes the results and
discussion. Section 5 contains the conclusion and potential future works.

2 Related Works

In a study [11], researchers tried to predict skin disease using a number of machine
intelligence techniques. In this study, a comparative analysis of 5 different algorithms
- logistic regression, naive Bayes, random forest, kernel SVM, and CNN were tested
on three types of skin diseases. They were - lichen planus, sjs ten, and acne. Then
classification-based detection was performed. In this study, a dataset of 3000 photos
(samples) were used. The dataset was preprocessed for classification. The dataset was
divided into 20% as testing and 80% as the training set. The training accuracy of all the
mentioned algorithms was compared and analyzed. Then the skin disease was detected
using the five machine learning algorithms mentioned previously.
In this paper [12], SVM and GLCM were proposed to identify three skin diseases:
Herpes, Paederus Dermatitis, and Psoriasis. The full process was divided into three steps.
In the first step, the sample images were preprocessed. In the second step, the geometric
transformation was used to segment the vertical image. These two steps helped in finding
features of different skin diseases as well as the features that were correlated. Pixels of lesion
areas were found by image segmentation. SVM was finally used to identify symptoms.
Another study [13] was done previously that aimed to detect skin disease by using a
dataset collected from the department of a skin hospital in Mumbai, India, after rounds
of dialogue with the doctors concerned. 10 different diseases were used in this study,
and these are: Atopic Dermatitis, Folliculitis, Leprosy, Lichen Planus, Warts, Herpes
Zoster, Pediculosis Capitis, Pityriasis Versicolor, Seborrheic Dermatitis, and Vitiligo. There
were three different types of variables - categorical, Boolean, and range. Categorical

and Boolean attributes were converted to numerical integers before feeding into the
learning pipeline. The dataset was divided into two parts: validation and training dataset.
The authors applied different classification algorithms to the training dataset. In this
study, Artificial Neural Networks (ANN), Decision Trees, K Nearest Neighbors (KNN),
Support Vector Machine (SVM), Random Forest techniques were used to detect different
types of skin diseases.
The system in [14] was used for detecting skin disease using color images. This system
was evaluated on six different skin diseases, and the average accuracy for the first and
second stages was 95.99% and 94.016%, respectively. Due to the variation of image
sizes in the database, the selected images were resized. The authors extracted features
using a pre-trained convolutional neural network. After extraction, they classified the
images with an SVM using the features extracted from the training set [15].

3 Methodology

This section presents the dataset and the implementation of our design using CNN and
SVM in detail.

3.1 Dataset

We collected 64 images of arsenic-infected body parts from various sources, including
research papers and websites [19–22]; we had to find images of arsenic-infected body
parts from several different sources. For the other 2 types of disease, Chickenpox and
Bowen's disease, we took images from the IEEE DataPort [18]: 124 images of body parts
affected by Bowen's disease from the IEEE DataPort dataset and 227 Chickenpox-infected
images from Google and the IEEE DataPort. The dataset was submitted by Qingguo Wang and
is licensed under a Creative Commons Attribution license. The dataset has 11 types of skin
diseases. These datasets are very useful for image classification with a CNN architecture.
Some of the data are shown in Fig. 1.
We collected the arsenic-infection body part data from various websites and papers. Some
data were collected from “Comparison of health effects between individuals with and without
skin lesions in the population exposed to arsenic through drinking water in West Bengal,
India”; these images were uploaded by Mayukh Banarjee [19]. Some data were collected from
SOS-arsenic.net, from the article “Arsenic Exposure – Carcinogen” [20]. Some data were
collected from “The Arsenic Contamination of Drinking and Ground waters in Bangladesh:
Featuring Biogeochemical Aspects and Implications on Public Health” by Michael Raessler,
available on SpringerLink [21]. Some images were taken from the article “Fate of over 480
million inhabitants living in arsenic and fluoride endemic Indian districts: Magnitude,
health, socio-economic effects and mitigation approaches”; these images were uploaded by
B. Das [22]. We also collected some images of arsenic-infected skin from the VisualDx
website [22].
We used these 3 types of skin diseases because they are common in South Asian
countries. In other research papers on skin disease classification,

Fig. 1. Sample images from the dataset

we did not find any work on arsenic infection disease. In South Asia, many people
lacking awareness use arsenic-contaminated water and become infected with various
skin diseases. We used these three skin diseases, arsenic infection, Bowen's disease, and
Chickenpox, so that we can find the differences between them. Sometimes doctors cannot
be certain about the disease a patient is suffering from. Bowen's disease is an early
stage of skin cancer, so it is important to identify it as soon as possible so that the
patient can get proper treatment at an early stage. Through this research we tried to
address this issue. Many skin diseases look almost the same, and the algorithms have
difficulty classifying them correctly due to similar patterns of skin rashes and the
presence of a wide variety of skin colors among people of this region.

3.2 Project Design

For the classification of the different diseases, we used a Convolutional Neural Network
(ConvNet/CNN).
A CNN is a particular type of neural network and Deep Learning algorithm that takes
images as its input, measures the importance of different kinds of objects in the image,
and then differentiates one image from another. CNNs are beneficial in the fields of
pattern, sign, face, and general object recognition. CNNs have become popular because they
require less preprocessing compared to other classification algorithms. A standard CNN
architecture is divided into convolutional, pooling, and fully connected layers. Disease
classification accuracy can be improved by using different features of the CNN architecture:
batch normalization, dropout, and shortcut connections are used in CNN architectures
to increase the classification accuracy.
Figure 2 describes how we train the CNN model. It is a basic architecture: we give the
images to its input layer, the data are then processed in the different layers of the CNN
model, and the algorithm classifies the diseases in the images and shows the output. As we
used 3 different diseases, arsenic infection, Bowen's disease, and Chickenpox, the model
classifies between these diseases.

[Flowchart: Image Dataset → Image preprocessing → Neural Classifier → Training and Classification → Accuracy Assessment]
Fig. 2. Flowchart of the CNN classification

The first step of our methodology is to preprocess our data. Some images are noisy,
some are not clear, and some are blurred. These problems are removed by preprocessing
the images. We highlight the different affected areas, and our CNN architecture is trained
to detect the various diseases. The steps we followed are shown in Fig. 2.

3.3 Implementation Using CNN

3.3.1 Image Pre-processing


Pre-processing is the step where we prepare the collected images. The image quality is
enhanced and noise is removed from the images. We used median filters to remove noise,
which makes the images better prepared for classification. We also improved the contrast
to get a better result. This is the initial stage of image classification in the CNN
architecture. We input image data of skin disease or rashes on different body parts. As we
work with skin disease images, we have to identify the differences between the 3 diseases.
Feature extraction is basically the transformation of the input image; the CNN model has
many filters and layers. We reshaped each image to 150 × 150 pixels. We discuss this part
in detail in Sect. 3.3.4.

3.3.2 Classification
We used the CNN architecture to classify the images into our disease classes. We used
the TensorFlow and Keras libraries for image classification, which made our work easier.
TensorFlow provides both high-level and low-level APIs, while Keras provides only a
high-level API. With the help of these two frameworks, we can build

and train models efficiently. Keras is built in Python, which is more user-friendly than
TensorFlow [17].

3.3.3 Loading Training, Testing and Validating Data


In Sect. 3.1, we discussed how many images we used for each class. A CNN basically
observes the data and learns from each training sample. We keep our dataset in two parts:
one part is the training dataset and the other is the testing dataset. The testing data are
used to assess the prediction capability of the model, and the training data are used to
build the model. The validation data were split off as 20% of all the training data.
The validation data are used for validating the model during training. We used 83 images
from the 3 classes for validation and a total of 415 skin disease images for training, as
discussed in the dataset part (Sect. 3.1). We took 202 images as testing data.

3.3.4 Training Model


The model was trained for skin disease classification using the skin disease image
dataset. In this part, we briefly discuss how we trained the model; this part is important
because the result depends on the way the model was trained.
Training a convolutional neural network basically requires learning weights that map
inputs to outputs. Every convolutional layer has its own parameters, including its
input volume, kernel size, depth of the feature-map stack, zero padding, and stride.
Output size is measured using (1) and (2)
M_x = (I_x − K_x) / S_x ,   (1)
M_y = (I_y − K_y) / S_y .   (2)

Here (M_x, M_y), (I_x, I_y), and (K_x, K_y) refer to the output map size, the input size,
and the kernel size, respectively, while (S_x, S_y) represents the stride along rows and columns.
We used the Sequential() model in our project, with the ‘ReLU’ and ‘Softmax’ activation
functions. With ReLU, if the input value is positive, the output is the same positive value;
if it is negative, the output is set to zero. We reshaped each image to 150 × 150 pixels.
Our neural network has a fully connected layer: the full connection of input neurons
combines and reweights all the higher-order features. A fully connected layer is connected
to the output of the previous layer and has no spatial arrangement; the last fully connected
layer is connected to the output layer. After applying the Convolutional Neural Network to
our skin disease dataset, we got 7,982,403 trainable parameters and 0 non-trainable
parameters. In fully connected layers, the total number of trainable parameters is
calculated as (n + 1) × m, where n is the number of input units and m is the number of
output units. In our CNN classification, we used 2 MaxPooling2D

layers; a MaxPooling2D layer takes the output of the previous layer. In our model, we used
2 Conv2D layers, 1 Flatten layer, and 2 Dense layers. We used a dropout layer, which helps
to reduce model overfitting. In Fig. 3, we show the CNN architecture we used to build
the model.

[Fig. 3 layer sequence: input 150×150×3 → Conv2D (32 filters, 3×3) → MaxPooling 2×2 → Dropout 25% → Conv2D (64 filters, 3×3) → MaxPooling 2×2 → Dropout 25% → Flatten → Dense (96 units) → Dropout 50% → Dense (3 units) → output layer (3 units)]

Fig. 3. Architecture of our CNN model
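A Keras sketch that reproduces this architecture is shown below. The layer sequence and sizes follow Fig. 3; the padding (the default 'valid'), optimizer and loss are assumptions, since the text does not state them. With these settings the parameter count works out to the 7,982,403 trainable parameters reported above.

```python
from tensorflow.keras import layers, models

def build_skin_cnn(input_shape=(150, 150, 3), num_classes=3):
    """Sequential CNN following Fig. 3 (two Conv2D/MaxPooling/Dropout stages,
    Flatten, Dense-96, Dropout 50%, softmax output over the 3 disease classes)."""
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu", input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),
        layers.Flatten(),
        layers.Dense(96, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
    # Optimizer and loss are assumed; the text only mentions ReLU/softmax activations.
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```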

Figure 4 presents the training and validation loss, and Fig. 5 the training and validation accuracy, over the different epochs; together they show how the loss and accuracy rise and fall as training proceeds. In each epoch the model sees all of the training data. The loss function measures the performance of the training, so training is considered successful when the loss reaches a minimum, and the quality of training can be judged from the loss values. From the curves in Fig. 4 and Fig. 5, the model trained well: the average training loss was around 57.56% and the average training accuracy 75.84%. High accuracy means the model made few errors, whereas high loss means the model made large errors. If the validation loss stays above the training loss, the model is overfitting; if it stays below, the model is underfitting.
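Continuing the sketch above, the curves of Figs. 4 and 5 can be obtained from the history returned by model.fit when the validation data are passed in; the 19 epochs match the figures, while the batch size is an assumption.

import numpy as np

# Train the model sketched earlier on the split from Sect. 3.3 (batch size assumed).
history = model.fit(X_train, y_train, validation_data=(X_val, y_val),
                    epochs=19, batch_size=32)

print("mean training loss:", np.mean(history.history["loss"]))
print("mean training accuracy:", np.mean(history.history["accuracy"]))

# A validation loss persistently above the training loss indicates over-fitting;
# a validation loss persistently below it indicates under-fitting.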

3.4 Implementation Using SVM


The Support Vector Machine (SVM) follows a supervised learning approach. We applied a supervised SVM for multiclass classification on our dataset, using the same data and the same three classes as for the CNN model: arsenic-related skin disease (Arsenic Keratosis), Bowen's Disease, and Chickenpox. We used the Scikit-learn and NumPy libraries to solve this problem.
The SVM reached a total accuracy of 69%, with a precision of 100% for Arsenic Keratosis, 72% for Bowen's Disease, and 66% for Chickenpox.
We report the per-class classification results (precision, recall, F1-score, support) together with the macro and weighted averages. To compute these metrics we converted the testing data to NumPy arrays.

[Figure: training loss and validation loss (y-axis 0–0.9) plotted over epochs 1–19.]

Fig. 4. Training loss versus validation loss

[Figure: training accuracy and validation accuracy (y-axis 0–0.9) plotted over epochs 1–19.]

Fig. 5. Training accuracy versus validation accuracy

The precision results we obtained were satisfactory (Table 1).


Precision = True Positive / (True Positive + False Positive)   (3)

Recall = True Positive / (True Positive + False Negative)   (4)

F1-Score = 2 × (Precision × Recall) / (Precision + Recall)   (5)

Table 1. Result of SVM classification

Class Precision Recall F1-score Support


Arsenic Keratosis 1.00 0.20 0.33 20
Bowen’s Disease 0.72 0.54 0.62 39
Chickenpox 0.66 0.92 0.77 66
Macro average 0.80 0.55 0.57 125
Weighted average 0.74 0.69 0.65 125

The elementary SVM cannot perform classification over multiple classes directly. Generally, binary classification is carried out by separating the data points into two classes; for classification into more than two classes, the problem is broken into numerous binary classification modules [16]. In this case, data points are mapped to a higher-dimensional space to achieve linear separation between every two classes. This is known as the one-to-one method, which forms a binary classifier for each pair of classes [16]. A second method is one-to-rest, where a lone SVM performs binary classification between one class and all the others. In order to classify data points from an m-class dataset:

1. With the one-to-rest method, the classifier uses m SVMs. Each SVM predicts membership in one of the m classes.
2. With the one-to-one technique, the classifier uses m(m − 1)/2 SVMs.

Applying the two methods to this dataset produces the following (a code sketch of both schemes is given below):

For the one-to-one method, a hyperplane is used to differentiate between every two classes, ignoring the points of the third class. This means that each partition considers only the points of the two classes in the current split [16].
For the one-to-rest method, a hyperplane is used to differentiate between a class and all other classes simultaneously. This means the partition considers all points, separating them into two sets: one set for the class points and one set for all other points [16].
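A hedged scikit-learn sketch of both schemes, continuing the earlier variable names; SVC implements the one-to-one scheme internally, while OneVsRestClassifier gives the one-to-rest alternative. The RBF kernel, the flattening of each image into a feature vector, and the class ordering are assumptions, since the paper does not specify them.

from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import classification_report

# Flatten each 150x150x3 image into a 1-D feature vector (an assumption about
# the representation fed to the SVM).
Xtr = X_train.reshape(len(X_train), -1)
Xte = X_test.reshape(len(X_test), -1)

# One-to-one: scikit-learn's SVC builds m(m - 1)/2 binary classifiers internally.
ovo_clf = SVC(kernel="rbf").fit(Xtr, y_train)

# One-to-rest: m binary classifiers, one per class.
ovr_clf = OneVsRestClassifier(SVC(kernel="rbf")).fit(Xtr, y_train)

# Class indices 0/1/2 are assumed to correspond to the names below.
y_pred = ovo_clf.predict(Xte)
print(classification_report(
    y_test, y_pred,
    target_names=["Arsenic Keratosis", "Bowen's Disease", "Chickenpox"]))

classification_report prints the per-class precision, recall, F1-score, and support together with the macro and weighted averages, i.e., the quantities reported in Table 1.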

4 Results and Discussion


We used an image dataset collected from both formal and informal sources, including open-source repositories and local medical centers. Using these images, we built and trained a CNN and compared its results with those of a multiclass SVM.
Table 2 below shows the accuracy, precision, recall, and F1 score for the CNN
classifier.
Figure 6 shows the confusion matrix for the CNN classifier. The accuracy is calculated using (6):

CP = NP / TN   (6)

Table 2. Result for CNN classifier

Class N(Truth) N(Classified) Accuracy Precision Recall F1 Score


Arsenic Keratosis 72 64 87.13% 0.86 0.76 0.81
Bowen’s Disease 65 75 89.11% 0.79 0.91 0.84
Chickenpox 65 63 84.16% 0.76 0.74 0.75

where CP is the number of correct predictions, NP is the number of predicted examples found, and TN is the total number of examples per class.

Fig. 6. Confusion matrix for CNN classifier

The total accuracy obtained from the confusion matrix is 80.20%, and the total misclassification (confusion) is about 19.80%. The confusion matrix was computed on the testing data and shows the correct predictions and confusions for each class.
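One possible way to derive the confusion matrix of Fig. 6 and per-class figures in the spirit of Eq. (6) from the CNN predictions; the variable names continue the earlier sketches, and the paper's exact per-class accuracy computation may differ (for example, it may also count true negatives).

from sklearn.metrics import confusion_matrix

y_pred_cnn = model.predict(X_test).argmax(axis=1)
cm = confusion_matrix(y_test, y_pred_cnn)
print(cm)

# Correct predictions per class divided by the number of test examples of that
# class (one reading of Eq. (6)).
per_class = cm.diagonal() / cm.sum(axis=1)
print(per_class)

# Overall accuracy and misclassification rate from the confusion matrix.
total_acc = cm.trace() / cm.sum()
print(total_acc, 1.0 - total_acc)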
Figure 7 shows the F1 scores of both the CNN and the SVM; the CNN outperformed the SVM in all classes except one, where the SVM was marginally better. Part of the difference between the two methods can be attributed to the class imbalance present in the dataset, and the neural network's behaviour can be attributed to its bias towards classes with a larger number of images [10]. Overall, the CNN model produces better results than the SVM. However, the model has shortcomings, especially for certain classes. The patients in the dataset are of similar skin shade (melanin concentration), so if the variation in patient ethnicity increases, the accuracy of the model will decrease [9]. The accuracy for each class can be improved by including more patients of multiple ethnicities.

Fig. 7. F1 score of CNN and SVM

5 Conclusion and Future Work


This paper primarily focuses on the classification of three common skin diseases: Arsenic Keratosis, Bowen's Disease, and Chickenpox. A large portion of the dataset was collected from a combination of formal and informal sources. The performance of the neural classifier was estimated, and its accuracy was determined using the evaluation metrics described in the paper; it was also compared against a classical machine learning technique using the same metrics. A substantial limiting factor of the study is the quality and availability of clinical image datasets from the least developed regions, e.g., Bangladesh and India. Future work includes adding other transmissible and precancerous skin diseases such as Scabies and Melanoma, and building an app and website for skin disease detection and classification to help doctors and patients identify skin diseases faster and more accurately.

References
1. Seth, D., Cheldize, K., Brown, D., Freeman, E.E.: Global burden of skin disease: inequities
and innovations. Curr. Dermatol. Rep. 6(3), 204–210 (2017). https://doi.org/10.1007/s13671-
017-0192-7
2. Barua, P.: Skin health in Bangladesh: an overview. Indian J. Dermatol. Venereology Leprology
78(2), 133 (2012). https://doi.org/10.4103/0378-6323.93627
3. Ahmed, N., Islam, M., Farjana, S.: Pattern of skin diseases: experience from a rural community
of Bangladesh. Bangladesh Med. J. 41(1), 50–52 (2014). https://doi.org/10.3329/bmj.v41i1.
18784
4. Rahman, A., Kuddus, A.: Effects of some sociological factors on the outbreak of chickenpox
disease. JP J. Biostat. 11(1), 37–53 (2014)
5. Hossain, M.: Arsenic contamination in Bangladesh—an overview. Agric. Ecosyst. Environ.
113(1–4), 1–16 (2006)
6. Shajil, C., Mahabal, G.: Arsenical Keratosis. [Updated 2020 Aug 14]. In: StatPearls [Internet]. StatPearls Publishing, Treasure Island, FL, January 2021. https://www.ncbi.nlm.nih.gov/books/NBK560570/

7. The Australian College of Dermatologists - Bowen's Disease. https://www.dermcoll.edu.au/atoz/bowens-disease
8. Chickenpox (Varicella) Signs & Symptoms. Centers for Disease Control and Prevention (cdc.gov). https://www.cdc.gov/Chickenpox/about/symptoms.html
9. Zhao, X., et al.: The application of deep learning in the risk grading of skin tumors for patients
using clinical images. J. Med. Syst. 43(8), 283 (2019)
10. Lee, Y.C., Jung, S.-H., Won, H.-H.: WonDerM: skin lesion classification with fine-tuned
neural networks. arXiv:1808.03426v3 (2019)
11. Bhadula, S., Sharma, S., Juyal, P., Kulshrestha, C.: Machine learning algorithms based skin
disease detection. Int. J. Innov. Technol. Exploring Eng. 9(2), 4044–4049 (2019)
12. Wei, L., Gan, Q., Ji, T.: Skin disease recognition method based on image color and texture
features. Comput. Math. Methods Med. 2018, 1–10 (2018)
13. Kolkur, S., Kalbande, D.R., Kharkar, V.: Machine learning approaches to multi-class human
skin disease detection. Int. J. Comput. Intell. Res. 14(1), 29–39 (2018)
14. ALKolifiALEnezi, N.: A method of skin disease detection using image processing and
machine learning. Procedia Comput. Sci. 163, 85–92 (2019)
15. Cristianini, N., Shawe-Taylor, J.: Support Vector Machines (2000)
16. Multiclass classification using SVM by Baeldung. https://www.baeldung.com/cs/svm-multiclass-classification
17. Classify image classification using transfer learning. https://www.tensorflow.org/hub/tutorials/image_feature_vector
18. Wang, Q.: An image dataset of various skin conditions and rashes submitted by and licensed by creative commons attribution. https://ieee-dataport.org/documents/image-dataset-various-skin-conditions-and-rashes#files
19. Ghosh, P., et al.: Comparison of health effects between individuals with and without skin
lesions in the population exposed to arsenic through drinking water in West Bengal, India. J.
Expo. Sci. Environ. Epidemio. 17(3), 215–223 (2007)
20. Raessler, M.: The arsenic contamination of drinking and groundwaters in Bangladesh: fea-
turing biogeochemical aspects and implications on public health. Arch. Environ. Contam.
Toxicol. 75, 1–7 (2018)
21. Chakraborti, D., et al.: Fate of over 480 million inhabitants living in arsenic and fluo-
ride endemic Indian districts: magnitude, health, socio-economic effects and mitigation
approaches. J. Trace Elem. Med. Biol. 38, 33–45 (2016)
22. Different Diagnosis Image. https://www.visualdx.com/visualdx/differential/arsenic+trioxide/multiple+skin+lesions?moduleId=100&findingId=22537,25221&reqFId=22537#view=photos&tab=drug
Construction of Suitable DNN-HMM
for Classification Between Normal
and Abnormal Respiration

Masaru Yamashita(B)

Nagasaki University, Nagasaki 8528521, Japan


[email protected]

Abstract. In many situations, abnormal sounds termed adventitious sounds are


part of the lung sound of a subject suffering from a pulmonary disease. In this study,
we aim to achieve the automatic detection of abnormal sounds from auscultatory
sound. For this purpose, we expressed the acoustic features of normal lung sound
for healthy subjects and abnormal lung sound for patients by using the Gaussian
mixture model (GMM)-hidden Markov model (HMM) and distinguished between
normal and abnormal lung sounds. However, the classification rate between normal
and abnormal respiration was low (86.53%). In speech recognition, the accuracy
was improved using a deep neural network (DNN)-HMM. However, in the case of lung sounds, we cannot use a DNN of that size because the amount of training data is small.
In this paper, we present the construction of a DNN-HMM with high accuracy
by selecting a suitable acoustic feature and setting the number of hidden layers
and units for the DNN-HMM. By selecting a suitable number of hidden layers
and units for the DNN-HMM, the classification rate was increased (91.26%). The
results proved the effectiveness of the proposed method.

Keywords: Deep neural network · Hidden Markov model · Lung sound ·


Classification · Abnormal respiration

1 Introduction

Auscultation of the lungs is one of the methods of detecting pulmonary diseases in


patients. Although there are other non-invasive inexpensive methods, auscultation using
a stethoscope can obtain valuable information regarding health status. In many cases,
abnormal sounds (called adventitious sounds [1]) are included in the lung sound of a
subject suffering from pulmonary disease, and even today, auscultation is an effective
method of diagnosing pulmonary diseases. However, it requires expert knowledge and
experience to perceive the difference between healthy subjects and patients, making it
difficult for non-medical personnel. This may be the reason why auscultation does not
penetrate common households. It is difficult for the elderly or persons in depopulated
areas to visit the hospital. Thus, the ability to discriminate between healthy subjects and
patients at home will facilitate the early detection of pulmonary diseases.

© Springer Nature Switzerland AG 2021


I. Rojas et al. (Eds.): IWANN 2021, LNCS 12861, pp. 609–619, 2021.
https://doi.org/10.1007/978-3-030-85030-2_50

Several studies have been conducted with the aim of automatically detecting adventitious sounds from lung sounds [2–4]. In these studies, a specific adventitious sound was detected using a wavelet transform, or individual frames of adventitious sound were discriminated using the short-time spectrum. However, the time of occurrence and
duration of adventitious sounds vary. Therefore, it is desirable to discriminate between
sounds using the whole respiration features and its inflection. Furthermore, the fea-
tures of adventitious and respiratory sounds depend on the individual and the degree of
progress of the disease. Therefore, we consider that the features should be expressed
statistically. In a previous study, we expressed the time-series of acoustic features of
the lung sound by constructing the GMM-HMM, and discriminated between normal
and abnormal respiratory sounds [5–7]. Furthermore, we proved the effectiveness of
selecting the parameters of the GMM-HMM for the classification [8].
In speech recognition, accuracy was improved by using a DNN-HMM [9]. Approximately 40-dimensional acoustic features are extracted from each frame, and the features from several consecutive frames, for example 440 dimensions in total, are input into the DNN. As for the structure, the DNN is typically constructed with 5 hidden layers of 1024 units each and an output layer of approximately 4000 units. However, such a configuration cannot be used for lung sounds, because the amount of training data is small: if we construct the DNN in a similar manner, the model will overfit.
Therefore, we focus on analyzing the parameters and structure of the DNN. In this
study, we proposed the construction of a DNN-HMM with high accuracy by selecting a
suitable acoustic feature and setting the number of hidden layers and units for the DNN.
The validity of the proposed method is confirmed through a classification experiment
between normal and abnormal respiratory sounds. Finally, the performance of the clas-
sification between healthy subjects and patients carried out using the proposed method
is described.

2 Lung Sound Database

2.1 Hand Labeling

We recorded lung sounds by using an electronic stethoscope. Subsequently, we manually


performed segmentation based on recorded sounds, waveform, spectrogram, and power.
Figures 1 and 2 are examples of respiratory sounds including two types of adventitious
sounds. We can see the continuous adventitious sound called wheeze in the expiration
and the discontinuous adventitious sound called fine crackles. First, we divided the lung
sound into the inspiration and expiration sound segments (respiratory sound segment).
Next, we divided the respiratory sound segment into adventitious sound segments and
other breathing sound segments. Furthermore, we divided the adventitious sound seg-
ments into continuous adventitious sound and discontinuous sound. If the interval of
occurrence between adventitious sounds was shorter than 100 ms, we regarded them as
one segment.

Fig. 1. Example of respiratory sounds including adventitious sound called wheeze.

Fig. 2. Example of respiratory sounds including adventitious sound called fine crackle.

2.2 Definition of Normal and Abnormal Respiration


The acoustic features of some noises are similar to those of adventitious sounds, and some respiratory sounds from healthy subjects include adventitious sounds; this is difficult for non-medical personnel to diagnose. Conversely, some respiratory sounds from patients do not include adventitious sounds, yet we cannot call them normal respiratory sounds. We therefore defined normal and abnormal respiration and grouped the respiratory sounds into four categories.

- Abnormal respirations from patients (AP): respirations from patients that include adventitious sounds.
- Abnormal respirations from healthy subjects (AH): respirations from healthy subjects that include noises resembling adventitious sounds.
- Normal respirations from patients (NP): respirations from patients that include neither adventitious sounds nor noises resembling them.
- Normal respirations from healthy subjects (NH): respirations from healthy subjects that include neither adventitious sounds nor noises resembling them.

In our discrimination experiment between normal and abnormal respiration, we used


only NH as normal respiration and AP as abnormal respiration. That is, we did not use
AH and NP. However, we considered all respirations in the discrimination experiment
between healthy subjects and patients.

3 Detection of Abnormal Respiration and Patient

In speech recognition, the acoustic models of the phoneme (smallest unit of speech)
and the occurrence probability of words are used to construct stochastic models. Subse-
quently, we applied the technique to the lung sound. In previous studies [5–7], we used
HMM for classification between normal and abnormal respiration. In this study, we use
a DNN-HMM rather than HMM. Fig. 3 shows the architecture of the classification sys-
tem between normal and abnormal respiration using the DNN-HMM, and Fig. 4 shows
the structure of the DNN. The classification procedure comprises the training and test
processes. In the training process, the HMM used as the acoustic model and the segment
sequence model [6] that defines the occurrence probability of the divided segments are
trained. In the test process, input respiration is discriminated between normal and abnor-
mal respiration based on the maximum likelihood approach. If we assume that the sample
respiration W comprises N segments, it can be expressed as W = w1 w2 · · · wi · · · wN
where wi is the i-th segment of W .

[Figure: in the training phase, lung sound recordings and their labels are used to train the acoustic model (DNN-HMM), with abnormal respiration modelled by adventitious-sound and breathing-sound segments and normal respiration by breathing-sound segments, together with a segment bigram giving the occurrence probability of segment sequences; in the test phase, features are extracted from an input respiration, passed through the DNN, and a likelihood calculation yields the result (normal/abnormal).]

Fig. 3. Architecture of the classification system between normal and abnormal respiration using DNN-HMM.

The training process can be explained as follows. We extract acoustic features and train the DNN to classify each segment. In speech recognition, for example, 440-dimensional features extracted from 11 consecutive frames are generally used as the input of the DNN; mel-frequency cepstrum coefficients (MFCCs) or filter banks are used as acoustic features, and the DNN comprises more than 5 hidden layers of approximately 2000 units each. In our previous studies [5–8] on the classification between normal and abnormal respiration, we used 6 MFCCs and power, which is fewer features than in speech recognition. In this study, we extract 6 MFCCs plus power and 6 filter-bank coefficients plus power as candidate features for the classification between normal and abnormal respiration. Then, to construct a highly accurate DNN-HMM for this classification, we selected four suitable parameters: (1) the number of input frame(s) for the DNN, (2) the acoustic feature, (3) the number of hidden layers, and (4) the number of units.

[Figure: feed-forward DNN whose input layer receives (1) one or more input frames of (2) acoustic features ((MFCC or filter bank) + power), followed by (3) a number of hidden layers with (4) a number of units each, and an output layer giving the posterior probability for each segment.]

Fig. 4. Structure of DNN.
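A minimal Keras sketch of the best configuration found in Sect. 4.2 (feature type III: 3 frames of 6 MFCCs plus power, i.e. 21 input dimensions, with 4 hidden layers of 128 units); the number of output targets, the activation functions, and the training settings are assumptions, since they depend on the segment/HMM-state inventory, which is not enumerated in the paper.

from tensorflow import keras
from tensorflow.keras import layers

N_INPUT = 21          # 3 frames x (6 MFCCs + power), feature type III
N_HIDDEN_LAYERS = 4
N_UNITS = 128
N_OUTPUT = 8          # assumed number of segment/state targets (not given in the paper)

hidden = [layers.Dense(N_UNITS, activation="relu")       # activation is an assumption
          for _ in range(N_HIDDEN_LAYERS)]

# The softmax output gives the posterior probability for each segment class,
# which the DNN-HMM turns into scaled likelihoods for decoding.
dnn = keras.Sequential([keras.Input(shape=(N_INPUT,))] + hidden +
                       [layers.Dense(N_OUTPUT, activation="softmax")])

dnn.compile(optimizer="adam", loss="categorical_crossentropy")
dnn.summary()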

In the case of normal respiration, we assume it comprises one segment (N = 1)


whereas in the case of abnormal respiration including adventitious sound, it comprises
at least two segments (N ≥ 2). For example, in the case of expiration in Fig. 3, it
comprises one wheeze segment and two breathing segments (N = 3). In the case of
inspiration in Fig. 3 that does not include adventitious sound, it comprises one breathing
sound segment (N = 1). The training of the segment sequence model can be explained
as follows: We calculate the occurrence probability of the segments P(W ) by using
the segment bigram. P(W) can be written as

P(W) = P(w1) × ∏_{i=2}^{N} P(wi | wi−1)   (1)

Let P(wi | wi−1) be defined as

P(wi | wi−1) = C(wi−1, wi) / C(wi−1),   (2)

where C(wi−1) is the count of wi−1 and C(wi−1, wi) is the count of segment wi occurring after wi−1 in the training database.
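A small sketch of the bigram estimation in Eqs. (1) and (2): counts of consecutive segment labels in the training label sequences are turned into conditional probabilities. The label strings below are illustrative only and are not taken from the database.

from collections import Counter

# Illustrative training label sequences: "B" = breathing sound segment,
# "W" = wheeze segment, "C" = crackle segment.
sequences = [["B"], ["B", "W", "B"], ["B", "C", "B"], ["B", "W", "B"]]

prev_counts = Counter()     # C(w_{i-1})
pair_counts = Counter()     # C(w_{i-1}, w_i)
for seq in sequences:
    prev_counts.update(seq[:-1])
    pair_counts.update(zip(seq[:-1], seq[1:]))

def p_bigram(prev, cur):
    # P(w_i | w_{i-1}) = C(w_{i-1}, w_i) / C(w_{i-1}), as in Eq. (2).
    return pair_counts[(prev, cur)] / prev_counts[prev] if prev_counts[prev] else 0.0

print(p_bigram("B", "W"))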
The test process can be explained as follows. The maximum likelihood among the calculated likelihoods is found, and the corresponding segment sequence Ŵ is selected to recognize the sample respiration sound. If the sequence includes at least one adventitious sound, we identify the sample respiration as an abnormal sound; otherwise, we identify it as a normal sound. Ŵ can be written as

Ŵ = argmax_W (log P(X | W) + α log P(W))   (3)

where X is a sample respiration and log P(X | W) is the acoustic likelihood. The weight factor α was obtained experimentally.
Furthermore, we describe the detection of patients. Noises from outside of the body
occur irregularly, whereas adventitious sounds occur periodically. Therefore, in the case
of healthy subjects, most of the respirations are classified into normal respiration even if
one or a few respirations are classified into abnormal respiration. In other words, in the
case of healthy subjects, most of the likelihoods for normal respiration are higher than
the likelihoods for abnormal respiration even if one or a few respirations are classified
as abnormal respiration. To detect a patient, we calculate, for each respiration, the likelihood L(Ŵno) for the segment sequence Ŵno that does not include adventitious sounds and the maximum likelihood L(Ŵab) for the segment sequence Ŵab that includes adventitious sound segments. If the total of L(Ŵab) is larger than or equal to the total of L(Ŵno), the subject is classified as a patient. That is,

Σ_j L(Ŵj,ab) ≥ Σ_j L(Ŵj,no)   (4)

where L(Ŵj,ab) is the likelihood of the segment sequence that includes adventitious sound segments for the j-th respiration of the subject, and L(Ŵj,no) is the likelihood of the segment sequence that does not include adventitious sound segments for the j-th respiration.
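The subject-level decision of Eq. (4), sketched as a small function over per-respiration log-likelihoods; the numeric values are placeholders and are not taken from the experiments.

def is_patient(loglik_abnormal, loglik_normal):
    # Eq. (4): classify the subject as a patient when the summed likelihood of
    # the abnormal segment sequences is at least that of the normal sequences.
    return sum(loglik_abnormal) >= sum(loglik_normal)

# Placeholder per-respiration log-likelihoods for one subject (illustrative).
L_ab = [-310.2, -295.7, -301.4]
L_no = [-312.9, -290.1, -305.0]
print(is_patient(L_ab, L_no))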

4 Classification Experiments
4.1 Experimental Conditions
Every 10 ms, 6 MFCCs and the power were extracted as acoustic features using a 25 ms Hamming window. The lung sound data were sampled at 5 kHz. Figure 5 shows the auscultation points. Table 1 shows the number of abnormal respiratory sounds that included adventitious sounds and the number of patients with at least one adventitious sound. An equal number of normal respirations or healthy subjects were selected randomly for the detection experiments on abnormal respirations and on patient subjects, respectively. We performed leave-one-out cross-validation to construct subject-independent models. In this study, we input several types of acoustic features into the DNN and compare the classification rates. The acoustic features are listed in Table 2. We set the number of units per hidden layer to 16, 64, 128, 256, 512, or 2048, and the number of hidden layers from two to seven, and compare the classification rates over all combinations of acoustic features, number of units, and number of hidden layers.
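A hedged sketch of the feature extraction described above (6 MFCCs plus log power every 10 ms from a 25 ms Hamming window at 5 kHz); librosa and the number of mel bands are assumptions, since the paper does not name its analysis toolkit, and the random signal merely stands in for a recorded lung sound.

import numpy as np
import librosa

sr = 5000                                         # lung sounds sampled at 5 kHz
y = np.random.randn(2 * sr).astype(np.float32)    # stand-in for a 2 s recording

frame = int(0.025 * sr)    # 25 ms Hamming window = 125 samples
hop = int(0.010 * sr)      # 10 ms frame shift = 50 samples

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=6, n_mels=40,    # n_mels assumed
                            n_fft=frame, hop_length=hop, window="hamming")
power = np.log(librosa.feature.rms(y=y, frame_length=frame, hop_length=hop) ** 2 + 1e-10)

features = np.vstack([mfcc, power])               # 7 dimensions per frame (Type I)
print(features.shape)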

Table 1. Number of abnormal respiratory sounds that included adventitious sounds, and number of patients with at least one adventitious sound.

Points No. of abnormal respiration No. of patients


A 219 44
B 161 89
C 254 53
D 217 47
E 312 62
F 206 52
G 182 46
H 324 62
I 260 62
Total 2135 517

Table 2. Acoustic features of input for DNN.

Type Acoustic features No. dimensions


I MFCCs + power 7
II filter-banks + power 7
III 3 frames of MFCCs + power 21
IV 3 frames of filter-banks + power 21
V 5 frames of MFCCs + power 35
VI 5 frames of filter-banks + power 35

Fig. 5. Auscultation points.

4.2 Classification Experiments

We carried out the classification experiment using the several types of acoustic features listed in Table 2 as inputs for the DNN-HMM. With acoustic feature type III and 4 hidden layers of 128 units each, the classification rate between normal and abnormal respiration was the highest of all combinations. Figure 6 shows the classification rate for each number of units when the number of hidden layers was 4, and Fig. 7 shows the classification rate for each number of hidden layers when the number of units was 128. These values are smaller than those used for speech recognition; when the network was larger, the amount of training data appears to have been insufficient, that is, the model overfit. The difference between the classification rates for type I and type III features was small, and the rate for type V decreased, so increasing the dimensionality of the input features is considered to have little or no effect when the training data are limited. When we used filter banks, the classification rate was lower than with MFCCs; the amount of training data was considered insufficient for the DNN to extract useful acoustic features itself, so MFCCs, i.e., compact frequency information, were more suitable as DNN inputs. These results show that a DNN-HMM can be used effectively for the classification between normal and abnormal respiration when suitable acoustic features, a suitable number of hidden layers, and a suitable number of units are selected.
Finally, we describe the classification experiment between healthy subjects and patients, whose results are presented in Table 3. The conventional method using the GMM-HMM uses MFCCs and power as acoustic features; the proposed method uses the number of hidden layers and number of units that performed best in the classification experiments between normal and abnormal respiration described above. The results show that the classification rate between healthy subjects and patients can be improved, although the improvement is not significant because the number of test subjects is small.

[Figure: classification rate [%] (50–100) versus number of units (16, 64, 128, 256, 512, 2048) for the GMM-HMM baseline (86.53%) and DNN-HMM feature types I–VI; the best configuration reaches 91.26%.]

Fig. 6. Classification rate between normal and abnormal respiration for each number of units of hidden layer.

[Figure: classification rate [%] (50–100) versus number of hidden layers (2–7) for the GMM-HMM baseline (86.53%) and DNN-HMM feature types I–VI; the best configuration reaches 91.26%.]

Fig. 7. Classification rate between normal and abnormal respiration for each number of hidden layers.

Table 3. Classification rate between healthy subjects and patients [%].

                        Healthy subject   Patient   Average
GMM-HMM                      88.8           84.0      86.4
DNN-HMM (Type III)           89.9           85.6      87.8

5 Conclusions

In this study, we proposed the construction of a DNN-HMM with high accuracy by


selecting suitable acoustic features and setting the number of hidden layers and units
for the DNN. The results of the classification experiment confirmed the improvement of
the classification rate after selecting a suitable acoustic feature and setting the number
of hidden layers and number of units for the DNN. This demonstrated the effectiveness
of the proposed approach. When we used filter banks, the classification rate was lower than with MFCCs, and the amount of training data was considered insufficient for the network to extract acoustic features itself. Furthermore, when we used a large number of hidden layers and units for the DNN, the classification rate was low, again suggesting that the amount of training data was not sufficient. Therefore, the results indicate that suitable acoustic features for the DNN input, as well as the number of hidden layers and number of units, need to be selected according to the amount of training data. On the other hand, in the classification experiment between healthy subjects and patients, the improvement of the classification rate was not significant because of the small number of test subjects.
In image recognition, rotated versions of the original images are used to increase the amount of training data. Future work should aim at generating additional respirations by transforming the original recordings in order to increase the amount of training data.

Acknowledgment. This study was supported by “Kawai Foundation for Technology & Music.”

References
1. Gavriely, N., Cugell, D.W.: Breath Sounds Methodology. CRC Press, New York (1995)
2. Kahya, Y.P., Yerer, S., Cerid, O.: A wavelet-based instrument for detection of crackles in
pulmonary sounds. In: EMBC 2001, International Conference of the IEEE Engineering in
Medicine and Biology Society (2001)
3. Marshall, A., Boussakta, S.: Signal analysis of medical acoustic sounds with applications to
chest medicine. J. Franklin Inst. 344, 230–242 (2007)
4. Taplidou, S.A., Hadjileontiadis, L.J.: Wheeze detection based on time-frequency analysis of
breath sounds. Comput. Biol. Med. 37, 1073–1083 (2007)
5. Matsunaga, S., Yamauchi, K., Yamashita, M., Miyahara, S.: Classification between normal
and abnormal respiratory sounds based on maximum likelihood approach. In: Proceedings of
ICASSP 2009, IEEE International Conference on Acoustics, Speech, and Signal Processing,
pp. 517–520 (2009)

6. Yamamoto, H., Matsunaga, S., Yamashita, M., Yamauchi K., Miyahara, S.: Classification
between normal and abnormal respiratory sounds based on stochastic approach. In: Proceedings
of ICA 2010, International Congress on Acoustics, vol. 671 (2010)
7. Yamashita, M., Himeshima, M., Matsunaga, S.: Robust classification between normal and
abnormal lung sounds using adventitious-sound and heart-sound models. In: Proceedings of
ICASSP 2014, IEEE International Conference on Acoustics, Speech, and Signal Processing,
pp. 4451–4455 (2014)
8. Yamashita, M.: Construction of effective HMMs for classification between normal and
abnormal respiration. In: Proceedings of APSIPA 2020 Asia Pacific Signal and Information
Processing Association Annual Summit and Conference, pp. 914–918 (2020)
9. Hinton, G., et al.: Deep neural networks for acoustic modeling in speech recognition: the shared
views of four research groups. IEEE Signal Process. Mag. 6, 82–97 (2012)
Correction to: Advances in Computational
Intelligence

Ignacio Rojas , Gonzalo Joya, and Andreu Català

Correction to:
I. Rojas et al. (Eds.): Advances in Computational Intelligence,
LNCS 12861, https://doi.org/10.1007/978-3-030-85030-2

In the original version of this paper, the affiliation of Gonzalo Joya was presented
incorrectly. This was corrected.
It should be read as follows: University of Málaga, Málaga, Spain
In the original version of this paper, the last name of Andreu Català was misspelled and
the affiliation was incorrect. These errors were corrected.
The correct spelling is with an accent as follows: Andreu Català
The affiliation is Technical University of Catalonia, Barcelona, Spain

The updated version of the book can be found at


https://doi.org/10.1007/978-3-030-85030-2

© Springer Nature Switzerland AG 2021


I. Rojas et al. (Eds.): IWANN 2021, LNCS 12861, p. C1, 2021.
https://doi.org/10.1007/978-3-030-85030-2_51
Author Index

A. Silva, Carlos II-380 Castillo, Silvia I-583


Abdallah, Wejden I-420 Castillo-Valdivieso, Pedro II-330
Abir, Mohammad Ashraful Haque I-597 Català, Andreu I-570, II-356
Achicanoy, Wilson I-583 Cateni, Silvia II-248
Adamczyk, David II-14 Ceschini, Andrea II-306
Alba, Emilio I-24 Changoluisa, Vinicio I-230, I-253
Alcaraz, Raúl I-242 Chen, Tianhua I-89
Alonso-Betanzos, Amparo II-128 Chen, Yuan I-89
Alraddadi, Faisal Saleem II-416 Cheraghian, Ali I-484
Alyamkin, Sergey I-303 Chou, Jia-Hong I-372
Amogne, Zemenu Endalamaw I-372 Chowdhury, Townim I-484
Amoroso, Roberto II-318 Chudakov, Dmitry I-303
Anguita, Davide II-367 Colla, Valentina II-248
Anik, Golam Kibria I-597 Colley, Kathryn II-118
Antosz, Patrycja II-105 Craig, Tony II-118
Arevalo, William I-537 Cucchiara, Rita II-318
Artelt, André I-101
Aruleba, Idowu I-385, I-407 D’Amato, Vincenzo II-367
Atencia, Miguel I-3, II-260 Das, Anindita I-347
de Carvalho, André C. P. L. F. I-471
Bacciu, Davide I-279, II-168
de la Rosa, Francisco López I-265
Bakhanova, Maria I-138
Deepty, Ummeh Habiba I-498
Banos, Oresti I-322, I-537
del Corral Guijarro, Francisco S. II-356
Baraldi, Lorenzo II-318
Delbruck, Tobi II-57
Barbero-Gómez, Javier II-3
Denisov, Andrey I-303
Barco, Alex I-570
Di Luzio, Francesco II-306
Bauer, Christoph I-291
Di Sarli, Daniele II-168
Beketova, Anna II-28
Díaz-Boladeras, Marta I-570
Berbel, Blanca II-81
Dominguez, David II-236
Biadgligne, Yohanens I-443
Dudek, Grzegorz II-196
Birkholz, Peter II-143
Blanco-López, José I-61
Boicea, Marinela II-260 E, Hanyu I-89
Bolón-Canedo, Verónica I-113 Eiermann, Sven II-404
Bonifazi, Giuseppe II-281 Elices, Irene II-81
Burgueño-Romero, A. M. II-392 Elizondo, David I-36
Engelbrecht, Andries I-525, II-183
Cabrera-León, Ylermi II-223
Calderon-Ramirez, Saul I-36 Faure, James I-525
Camurri, Antonio II-367 Feldhans, Robert I-101
Canas-Moreno, Salvador II-57 Fernández-Caballero, Antonio I-219, I-242,
Capobianco, Giuseppe II-281 I-265
Carrillo-Perez, Francisco I-559 Fernández-López, Pablo I-49, I-61, II-223
Casilari, Eduardo II-380 Fontenla-Romero, Oscar II-343

Formoso, Marco II-45 Kanzari, Dalel I-420


Friedrich, Timo I-334 Karim, Mohammed Ariful I-597
Khan, Anika II-429
Gallego-Molina, Nicolás J. II-45 Khan, Ashikur Rahman II-429
Gallicchio, Claudio II-168 Kleyko, Denis II-155
Gan, John Q. I-547 Klüver, Christina II-404
Garcia, Luís P. F. I-471 Klüver, Jürgen II-404
García-Báez, Patricio I-61, II-223
García−Bermúdez, Rodolfo II-380 Lago-Fernández, Luis F. II-416
García-Martínez, Beatriz I-219, I-242 Latorre, Roberto II-81
García-Valdez, Mario II-330 Leite, Argentina I-359
Garcia-Zapirain, Begonya I-583 Liñán-Villafranca, Luis II-330
Garrido-Peña, Alicia II-81 Linares-Barranco, Alejandro II-57
Gelpud, John I-583 Liu, Bingsheng I-89
Gianella, Cristina II-356 Llanas, Xavier I-570
Glösekötter, Peter I-15, I-291 López, María T. I-265
Gómez Zuluaga, Mauricio Andrés I-291 López-García, Guillermo I-24
Goncharenko, Alexander I-303 López-Rubio, Ezequiel I-432
González, Mario II-236 López-Rubio, José Miguel I-432
González, Pedro II-272 Lorena, Ana C. I-471
Gonzalez-Jimenez, J. II-392 Lorenzo-Navarro, Javier II-281
Guijarro-Berdiñas, Bertha II-128, II-343 Luque, Juan L. II-45
Gutiérrez, Pedro Antonio II-3
Macho, Oscar II-356
Hammer, Barbara I-101, I-334 Madani, Kurosh I-420
Herrera Maldonado, Luis Javier I-559 Magalhães, Carlos I-359
Hervás-Martínez, César II-3 Makarov, Ilya I-138, I-456, II-28, II-293
Heusinger, Moritz I-126 Mansour, Mahmud I-311
Hinder, Fabian I-101 Marcu, Andreea I-3
Hotoleanu, Mircea I-3 Marma, Jonas S. II-93
Hui, Lucas Y. W. I-395 Martínez-Murcia, Francisco J. II-45
Hung, Shih-Kai I-547 Martínez-Rodrigo, Arturo I-219, I-242
Maslov, Dmitrii I-456
Ilie, Vlad I-3 Masud, Shehzin II-429
Iliescu, Dominic I-3 Matsuo, Tokuro I-77
Ionescu, Leonard II-260 Mendez, Mauro I-36
Ishii, Naohiro I-77 Menzel, Stefan I-334
Ivbanikaro, Anna E. I-202 Merelo, J. J. II-330
Iwata, Kazunori I-77 Micheli, Alessio II-168
Miñana, Juan José I-165
Jager, Wander II-105 Molina-Cabello, Miguel A. I-36, I-432
Jalalvand, Azarakhsh II-143 Monge, Gerardo I-36
Jalisha, Mahira I-484 Mora, Antonio M. I-151
Jarray, Fethi I-311 Morales Vega, Juan Carlos I-559
Jerez, José M. I-24 Morán-Fernández, Laura I-113
Jesus, María José Del II-272 Moreno, Ginés I-190
Jojoa, Mario I-583 Moreno, Segundo I-151
Jorge, Fábio O. II-93 Mukai, Naoto I-77
Joya, Gonzalo I-3, II-260 Musté, Marta I-570, II-356

Nagy, Rodica I-3 Rodrigo, David I-583


Navarro-Mesa, Juan Luis I-49, I-61, II-223 Rodríguez, Francisco B. I-230, I-253, II-236,
Noman, Abdulla All I-498 II-416
Noor, Asaduzzaman I-498 Rodríguez-Arias, Alejandro II-128
Novais, Paulo I-219 Rodriguez-Leon, Ciro I-537
Nunes, Gabriel F. II-93 Rojas Ruiz, Ignacio I-559
Rojas, I. I-15
Odagiri, Kazuya I-77 Rörup, Tim I-15
Okwu, Modestus O. I-202 Rosato, Antonello II-155, II-306
Omoregbee, Henry O. I-202 Ruican, Dan I-3
Oneto, Luca II-367 Ruiz-Sarmiento, J. R. II-392
Ordikhani, Ahmad I-291
Ortiz, Alberto I-165 Safont, Gonzalo I-178
Ortiz, Andrés II-45 Sahu, Nilkanta I-347
Ortiz, Esaú I-165 Saidi, Rakia I-311
Osipov, Evgeny II-155 Salazar, Addisson I-178
Osowski, Stanislaw II-208 Salazar, Vanessa I-253
Salgado, Pablo I-322
Panella, Massimo II-155, II-306 Salt, Doug II-118
Patru, Ciprian I-3 Salt, Douglas II-105
Pavão, João I-359 Samuel, Olusegun D. I-202
Pérez, Carlos I-570, II-356 Sánchez, Ángel II-236
Pérez, Elsa I-570, II-356 Sánchez-Maroño, Noelia II-128
Pérez-Acosta, Guillermo I-49, I-61 Sánchez-Reolid, Daniel I-265
Pérez-Godoy, María Dolores II-272 Sánchez-Reolid, Roberto I-265
Pérez-Sánchez, Beatriz II-343 Santana-Cabrera, Luciano I-49
Perez-Uribe, Andres I-510 Satizábal, Héctor F. I-510
Perfilieva, Irina II-14 Sattar, Asma I-279
Piñero-Fuentes, Enrique II-57 Schleif, Frank-Michael I-126
Polhill, Gary II-105, II-118 Sen, Prithwish I-347
Pomares, H. I-15 Serranti, Silvia II-281
Pranto, Tahmid Hasan I-498 Shen, Yinghua I-89
Puccinelli, Niccolò II-168 Smaïli, Kamel I-443
Puentes-Marchal, Fernando II-272 Soh, De Wen I-395
Solteiro Pires, E. J. I-359
Quiros, Steve I-36 Somoano, Aris II-356
Sosa-Marrero, Alberto II-223
Rahman, Rashedur M. I-498, I-597, II-429 Sparvoli, Marina II-93
Rahman, Shafin I-484 Stawicki, Piotr II-69
Ramírez, Luis II-356 Steiner, Peter II-143
Ramos-Jiménez, Gonzalo I-432 Stoean, Catalin I-3, II-260
Rangel, Jose E. Torres II-356 Stoean, Ruxandra I-3, II-260
Ravelo-García, Antonio G. I-49 Suárez-Araujo, Carmen Paz I-49, I-61,
Rezeika, Aya II-69 II-223
Riam, Shazid Hasan I-597 Suárez-Díaz, Francisco I-61
Riaza, José A. I-190 Suárez-Díaz, Francisco J. I-49
Ribeiro, João I-359 Succetti, Federico II-306
Ribelles, Nuria I-24
Rios-Navarro, Antonio II-57 Tareq, Azizul Hakim I-597
Rivolli, Adriano I-471 Tartibu, Lagouge K. I-202

Toledano Pavón, Jesús I-559 Villalonga, Claudia I-322, I-537


Torrents-Barrena, Jordina I-36 Viriri, Serestina I-385, I-407
Trocan, Rares I-3 Volosyak, Ivan II-69
Tseytlin, Boris II-293 Volpe, Gualtiero II-367
Volta, Erica II-367
Valero, Óscar I-165
Wang, Fu-Kwun I-372
van Deventer, Stefan II-183
Wilson, Ruth II-118
Vannucci, Marco II-248
Vaquet, Valerie I-101 Xiao, Zhi I-89
Vargas, Martin I-36
Vargas, Nancy I-178 Yamashita, Masaru I-609
Varona, Pablo I-230, II-81
Veredas, Francisco J. I-24 Zamora-Cárdenas, Willard I-36
Vergara, Luis I-178 Zinkhan, Dirk II-404
