Wenlian Lu
Kun Sun
Moti Yung
Feng Liu (Eds.)
LNCS 13005

Science of Cyber Security


Third International Conference, SciSec 2021
Virtual Event, August 13–15, 2021
Revised Selected Papers
Lecture Notes in Computer Science 13005

Founding Editors
Gerhard Goos
Karlsruhe Institute of Technology, Karlsruhe, Germany
Juris Hartmanis
Cornell University, Ithaca, NY, USA

Editorial Board Members


Elisa Bertino
Purdue University, West Lafayette, IN, USA
Wen Gao
Peking University, Beijing, China
Bernhard Steffen
TU Dortmund University, Dortmund, Germany
Gerhard Woeginger
RWTH Aachen, Aachen, Germany
Moti Yung
Columbia University, New York, NY, USA
More information about this subseries at http://www.springer.com/series/7410
Wenlian Lu · Kun Sun · Moti Yung ·
Feng Liu (Eds.)

Science of Cyber Security


Third International Conference, SciSec 2021
Virtual Event, August 13–15, 2021
Revised Selected Papers
Editors

Wenlian Lu
Fudan University
Shanghai, China

Kun Sun
George Mason University
Fairfax, VA, USA

Moti Yung
Computer Science Department
Columbia University
New York, NY, USA

Feng Liu
Chinese Academy of Social Sciences
Beijing, China

ISSN 0302-9743   ISSN 1611-3349 (electronic)
Lecture Notes in Computer Science
ISBN 978-3-030-89136-7   ISBN 978-3-030-89137-4 (eBook)
https://doi.org/10.1007/978-3-030-89137-4
LNCS Sublibrary: SL4 – Security and Cryptology

© Springer Nature Switzerland AG 2021


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now
known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are
believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors
give a warranty, expressed or implied, with respect to the material contained herein or for any errors or
omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface

The third annual International Conference on Science of Cyber Security (SciSec 2021)
was held successfully online during August 13–15, 2021. The mission of SciSec is
to catalyze the research collaborations between the relevant scientific communities
and disciplines that should work together in exploring the foundational aspects of
cybersecurity. We believe that this collaboration is needed in order to deepen our under-
standing of, and build a firm foundation for, the emerging science of cybersecurity
discipline. SciSec is unique in appreciating the importance of multidisciplinary and
interdisciplinary broad research efforts towards the ultimate goal of a sound science of
cybersecurity, which attempts to deeply understand and systematize knowledge in the
field of security.
SciSec 2021 solicited high-quality, original research papers that could justifiably help
develop the science of cybersecurity. Topics of interest included, but were not limited
to, the following:

– Cybersecurity Dynamics
– Cybersecurity Metrics and Their Measurements
– First-principle Cybersecurity Modeling and Analysis (e.g., Dynamical Systems,
Control-Theoretic Modeling, Game-Theoretic Modeling)
– Cybersecurity Data Analytics
– Quantitative Risk Management for Cybersecurity
– Big Data for Cybersecurity
– Artificial Intelligence for Cybersecurity
– Machine Learning for Cybersecurity
– Economics Approaches for Cybersecurity
– Social Sciences Approaches for Cybersecurity
– Statistical Physics Approaches for Cybersecurity
– Complexity Sciences Approaches for Cybersecurity
– Experimental Cybersecurity
– Macroscopic Cybersecurity
– Statistics Approaches for Cybersecurity
– Human Factors for Cybersecurity
– Compositional Security
– Biology-inspired Approaches for Cybersecurity
– Synergistic Approaches for Cybersecurity

SciSec 2021 was hosted by Fudan University, Shanghai, China. Due to the
intensification of the COVID-19 situation around the world, SciSec 2021 was held
entirely online through Tencent Conference and VooV Meeting. The Program Committee
selected 22 papers — 17 full papers and 5 poster papers — from a total of 50 submissions
for presentation at the conference. These papers cover the following subjects: detec-
tion for cybersecurity, machine learning for cybersecurity, and dynamics, network and
inference. We anticipate that the topics covered by the program in the future will be
more systematic and further diversified.
The Program Committee further selected the paper titled “Detecting Internet-
scale Surveillance Devices using RTSP Recessive Features” by Zhaoteng Yan, Zhi Li,
Wenping Bai, Nan Yu, Hongsong Zhu, and Limin Sun and the paper titled “Dismantling
Interdependent Networks Based on Supra-Laplacian Energy” by Wei Lin, Shuming
Zhou, Min Li, and Gaolin Chen for the Distinguished Paper Award. The conference
program also included four invited keynote talks: the first keynote titled “Layers of
Abstractions and Layers of Obstructions and the U2F” was delivered by Moti Yung,
Google and Columbia University, USA; the second keynote titled “Progresses and
Challenges in Federated Learning” was delivered by Gong Zhang, Huawei, China; the
third keynote titled “SARR: A Cybersecurity Metrics and Quantification Framework”
was delivered by Shouhuai Xu, University of Colorado Colorado Springs, USA; while
the fourth keynote was titled “Preliminary Exploration on Several Security Issues in AI”
and was delivered by Yugang Jiang, Fudan University, China. The conference program
presented a panel discussion on “Where are Cybersecurity Boundaries?”
We would like to thank all of the authors of the submitted papers for their interest
in SciSec 2021. We also would like to thank the reviewers, keynote speakers, and
participants for their contributions to the success of SciSec 2021. Our sincere gratitude
further goes to the Program Committee, the Publicity Committee, and the Organizing
Committee, for their hard work and great efforts throughout the entire process of
preparing and managing the event. Furthermore, we are grateful to Fudan University
for its generosity in enabling free registration for attending SciSec 2021.
We hope that you will find the conference proceedings inspiring and that they will
further help you in finding opportunities for your future research.

August 2021 Wenlian Lu


Kun Sun
Moti Yung
Feng Liu
Organization

Steering Committee
Guoping Jiang Nanjing University of Posts and Telecommunications, China
Feng Liu Institute of Information Engineering, Chinese Academy of
Sciences, China
Shouhuai Xu University of Colorado Colorado Springs, USA
Moti Yung Google and Columbia University, USA

Program Committee Co-chairs


Wenlian Lu Fudan University, China
Kun Sun George Mason University, USA
Moti Yung Google and Columbia University, USA

Organization Committee Chair


Lei Shi Fudan University, China

Publicity Co-chairs
Habtamu Abie Norwegian Computing Center, Norway
Guen Chen University of Texas at San Antonio, USA
Noseong Park George Mason University, USA
Chunhua Su University of Aizu, Japan
Jia Xu Nanjing University of Posts and Telecommunications, China
Xiaofan Yang Chongqing University, China
Jeong Hyun Yi Soongsil University, South Korea
Lidong Zhai Institute of Information Engineering, Chinese Academy of
Sciences, China
James Zheng Macquarie University, Australia

Program Committee Members


Habtamu Abie Norwegian Computing Centre, Norway
Richard Brooks Clemson University, USA
Sara Foresti University of Milan, Italy
Ying Fan Beijing Normal University, China
Xinwen Fu University of Massachusetts Lowell, USA
Jianxi Gao Rensselaer Polytechnic Institute, USA
Dieter Gollmann Hamburg University of Technology, Germany
Yujuan Han Shanghai Maritime University, China
Debiao He Wuhan University, China
Daojing He East China Normal University, China
Wei Huo Institute of Information Engineering, Chinese Academy of
Sciences, China
Zbigniew Kalbarczyk Coordinated Science Laboratory, USA
Arash Habibi Lashkari University of New Brunswick, Canada
Lingguang Lei Institute of Information Engineering, Chinese Academy of
Sciences, China
Cong Li Fudan University, China
Xiwei Liu Tongji University, China
Zhuo Lu University of South Florida, USA
Pratyusa K. Manadhata Hewlett Packard Laboratories, USA
Thomas Moyer University of North Carolina at Charlotte, USA
Andrew Odlyzko University of Minnesota, USA
Kazumasa Omote University of Tsukuba, Japan
Noseong Park George Mason University, USA
Kouichi Sakurai Kyushu University, Japan
Lipeng Song North University of China, China
Chunhua Su Osaka University, Japan
Longkun Tang Huaqiao University, China
Lingyu Wang Concordia University, Canada
Zhi Wang Nankai University, China
Chengyi Xia Tianjin University of Technology, China
Min Xiao Nanjing University of Posts and Telecommunications, China
Xin-Jian Xu Shanghai University, China
Jia Xu Nanjing University of Posts and Telecommunications, China
Maochao Xu Illinois State University, USA
Guanhua Yang Binghamton University, USA
Xiaofan Yang Chongqing University, China
Chuan Yue Colorado School of Mines, USA
Jun Zhao Nanyang Technological University, Singapore
Sencun Zhu Pennsylvania State University, USA
Cliff Zou University of Central Florida, USA

Web Chair
Weixia Cai Institute of Information Engineering, Chinese Academy of
Sciences, China

Organizing Committee Members


Dandan Yuan Fudan University, China
Chenyao Zhang Fudan University, China
Contents

Keynote Report

SARR: A Cybersecurity Metrics and Quantification Framework (Keynote) . . . . . 3
   Shouhuai Xu

Detection for Cybersecurity

Detecting Internet-Scale Surveillance Devices Using RTSP Recessive Features . . . . . 21
   Zhaoteng Yan, Zhi Li, Wenping Bai, Nan Yu, Hongsong Zhu, and Limin Sun

An Intrusion Detection Framework for IoT Using Partial Domain Adaptation . . . . . 36
   Yulin Fan, Yang Li, Huajun Cui, Huiran Yang, Yan Zhang, and Weiping Wang

Mining Trojan Detection Based on Multi-dimensional Static Features . . . . . 51
   Zixian Tang, Qiang Wang, Wenhao Li, Huaifeng Bao, Feng Liu, and Wen Wang

Botnet Detection Based on Multilateral Attribute Graph . . . . . 66
   Hua Cheng, Yinda Shen, Tao Cheng, Yiquan Fang, and Jianfan Ling

A New Method for Inferring Ground-Truth Labels and Malware Detector Effectiveness Metrics . . . . . 77
   John Charlton, Pang Du, and Shouhuai Xu

Machine Learning for Cybersecurity

Protecting Data Privacy in Federated Learning Combining Differential Privacy and Weak Encryption . . . . . 95
   Chuanyin Wang, Cunqing Ma, Min Li, Neng Gao, Yifei Zhang, and Zhuoxiang Shen

Using Chinese Natural Language to Configure Authorization Policies in Attribute-Based Access Control System . . . . . 110
   Zhuoxiang Shen, Neng Gao, Zeyi Liu, Min Li, and Chuanyin Wang

A Data-Free Approach for Targeted Universal Adversarial Perturbation . . . . . 126
   Xiaoyu Wang, Tao Bai, and Jun Zhao

Caps-LSTM: A Novel Hierarchical Encrypted VPN Network Traffic Identification Using CapsNet and LSTM . . . . . 139
   Jiyue Tang, Le Yang, Song Liu, Wenmao Liu, Meng Wang, Chonghua Wang, Bo Jiang, and Zhigang Lu

Multi-granularity Mobile Encrypted Traffic Classification Based on Fusion Features . . . . . 154
   Hui Zhang, Gaopeng Gou, Gang Xiong, Chang Liu, Yuewen Tan, and Ke Ye

Stochastic Simulation Techniques for Inference and Sensitivity Analysis of Bayesian Attack Graphs . . . . . 171
   Isaac Matthews, Sadegh Soudjani, and Aad van Moorsel

Simulations of Event-Based Cyber Dynamics via Adversarial Machine Learning . . . . . 187
   Zhaofeng Liu, Yinchong Wang, Huashan Chen, and Wenlian Lu

Dynamics, Network and Inference

Dismantling Interdependent Networks Based on Supra-Laplacian Energy . . . . . 205
   Wei Lin, Shuming Zhou, Min Li, and Gaolin Chen

DWT-DQFT-Based Color Image Blind Watermark with QR Decomposition . . . . . 214
   Liangcheng Qin, Ling Ma, and Xiongjun Fu

A Multi-level Elastic Encryption Protection Model . . . . . 225
   Caimei Wang, Zijian Zhou, Hong Li, Zhengmao Li, and Bowen Huang

An Event-Based Parameter Switching Method for Controlling Cybersecurity Dynamics . . . . . 236
   Zhaofeng Liu, Wenlian Lu, and Yingying Lang

RansomLens: Understanding Ransomware via Causality Analysis on System Provenance Graph . . . . . 252
   Rui Mei, Han-Bing Yan, and Zhi-Hui Han

Author Index . . . . . 269


Keynote Report
SARR: A Cybersecurity Metrics
and Quantification Framework
(Keynote)

Shouhuai Xu

Laboratory for Cybersecurity Dynamics, Department of Computer Science,
University of Colorado Colorado Springs, Colorado Springs, USA
[email protected]
https://xu-lab.org/

Abstract. Cybersecurity Metrics and Quantification is a fundamental
but notoriously hard problem and is undoubtedly one of the pillars under-
lying the emerging Science of Cybersecurity. In this paper, we present a
novel approach to addressing this problem by unifying Security, Agility,
Resilience and Risk (SARR) metrics into a single framework. The SARR
approach and the resulting framework are unique because: (i) it is driven
by the assumptions that are made when modeling, designing, imple-
menting, operating, and defending systems, which are broadly defined
to include infrastructures and enterprise networks; and (ii) it embraces
the uncertainty inherent to the cybersecurity domain. We will review the
status quo by looking into existing metrics and quantification research
through the SARR lens and discuss a range of open problems.

Keywords: Cybersecurity metrics · Cybersecurity quantification ·
Security · Agility · Resilience · Risk · Cybersecurity management

1 Introduction
Effective cybersecurity design, operations, and management ought to rely on
quantitative metrics. This is because effective cybersecurity decision-making and
management demands cybersecurity quantification, which in turn requires us
to tackle the problem of metrics. For example, when a Chief Executive Officer
(CEO) decides whether to increase the enterprise’s cybersecurity investment, the
CEO would ask a simple question: What is the estimated return, ideally mea-
sured in dollar amount, if we increase the cybersecurity budget (say) by $5M
this year? Unfortunately, the status quo is that we cannot answer this question
yet because cybersecurity metrics and quantification remains one of the most
difficult yet fundamental open problems [10,32,38], despite significant efforts
[3,4,6–8,21,30,33,35,37,39,40,59].
Our Contributions. In this paper, we propose a systematic approach to tack-
ling the problem, by unifying Security, Agility, Resilience, and Risks (SARR) met-
rics into a single framework. The approach is assumption-driven and embraces the
uncertainty inherent to the cybersecurity domain. Moreover, we evaluate exist-
ing cybersecurity metrics through the SARR lens and propose a range of open
problems for future research. Our findings include: (i) it is essential to explic-
itly and precisely articulate the assumptions made at the design and operation
phases of systems; (ii) it is important to understand and characterize the relation-
ships between cybersecurity assumptions, because they may not be independent
of each other; (iii) uncertainty is inherent to cybersecurity because defenders can-
not directly observe whether or not assumptions made at the design phase are vio-
lated in the operation phase; (iv) the current understanding of cybersecurity agility
and resilience metrics is superficial, even if defenders can be certain about which
assumptions are violated; (v) cybersecurity risk metrics emerge from the uncer-
tainty inherent to assumptions.
Related Work. From a conceptual point of view, the present study corresponds
to one pillar of the Cybersecurity Dynamics framework [47,48,53,56], which
aims to quantify and analyze cybersecurity from a holistic perspective (in con-
trast to the building-blocks perspective). This approach stresses the importance
of considering the time dimension in cybersecurity, leading to time-dependent
metrics and analysis methods (e.g., [11,18,26,49–51,54,57,58,60]). The SARR
framework is partly inspired by the STRAM framework [8], which systematizes
security metrics, trust metrics, resilience metrics, and agility metrics. The SARR
framework goes far beyond the STRAM framework [8] because STRAM does not
present the underlying connections between the families of metrics. In contrast,
SARR uses assumptions and uncertainty to unify families of metrics, and these
two aspects play no roles in STRAM.
From a technical point of view, the present study focuses on characterizing
what needs to be measured, rather than how to measure, because we treat the
measurement of well-defined metrics as an orthogonal research problem. The
latter can be challenging as well. For example, when we infer the ground-truth
labels of files in the setting of malware detection, we often encounter the situation
that malware detectors give conflicting information (e.g., one detector says a file
is benign but another says the file is malicious) [1,2,13,23,31].
Paper Outline. Section 2 presents the SARR framework. Section 3 discusses
the status quo in cybersecurity metrics and quantification research. Section 4
explores future research directions. Section 5 concludes the present paper.

2 The SARR Framework

2.1 Terminology
Abstractions and Views. Cyberspace is a complex system which mandates
the use of multiple (levels of) abstractions to understand it. We use the
term network broadly to include the entire cyberspace, an infrastructure, an
enterprise network, or a cyber-physical-human network of interest. Networks
can be decomposed horizontally or vertically, leading to two views:
– In the horizontal view, a network can be decomposed into many networked
devices, which are combinations of hardware and software with computing
and networking capabilities. Devices include computers (e.g., servers, sensors
and IoT devices), network devices (e.g., routers and switches), and cyberse-
curity devices which run (e.g.) intrusion detection systems and firewalls. The
horizontal view is often used by cyber defense operators.
– In the vertical view, a network can be decomposed into layers of components,
which are hardware or software sub-systems, possibly provided by different ven-
dors. Examples of components include operating systems (e.g., Microsoft Win-
dows vs. Linux), applications, and security functions (e.g., intrusion detection
systems, malware detectors, and firewalls). We may treat data as components
as well. Each component may be further divided into layers. For example, the
TCP/IP stack can be seen as the communication component, which can be
divided into layers of communication protocols. Each component may incor-
porate or integrate multiple building-blocks, such as the machine learning tech-
niques employed by malware detectors. This distinction is important because
building-block techniques are often carefully analyzed, components are often
proprietary and analyzed only superficially, but networks are analyzed even less
thoroughly, perhaps because they are very complex.
Design vs. Operation. In principle, the lifecycle of a network, device, com-
ponent, or building-block can be divided into a design phase and an operation
phase. The design phase deals with its modeling, design, analysis, implementa-
tion, and testing; for ease of reference, we refer to the entities that conduct these
activities as designers. The operation phase deals with its installation, configu-
ration, operation, maintenance, and defense in the real world; similarly, we refer
to the entities that conduct these activities as operators. The design vs. opera-
tion distinction is important because there can be huge gaps between these two
phases, which will be elaborated later.
Cybersecurity vs. Security Properties and Metrics. We use the term
security properties to describe the standard notions of confidentiality, integrity,
availability, non-repudiation, authentication, etc. We use the term cybersecu-
rity properties to describe security, agility, resilience, risk and possibly other
properties. This means that cybersecurity properties are much broader than
security properties. Cybersecurity quantification indicates precise characteriza-
tion of these cybersecurity properties. For this purpose, we need cybersecurity
metrics. A metric is a function that maps from a set of objects (e.g., networks,
devices, components or building-blocks) to a set of values with a scale (e.g.,
{0, 1} or [0, 1]), reflecting security or cybersecurity properties of the objects [35].
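
To make this definition concrete, the following minimal Python sketch treats a
metric as a typed function from objects to a declared scale; the object model
(e.g., a network with a devices attribute) is purely illustrative and is not
prescribed by the framework.

from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Metric:
    """A metric maps objects (networks, devices, components, building-blocks)
    to values on a declared scale, e.g., {0, 1} or the interval [0, 1]."""
    name: str
    scale: str                           # e.g., "{0,1}" or "[0,1]"
    measure: Callable[[object], float]   # the mapping itself

# Illustrative situation metric: the fraction of compromised devices in a
# network, measured on the continuous scale [0, 1]. The attributes
# `network.devices` and `device.compromised` are hypothetical.
def fraction_compromised(network) -> float:
    devices = list(network.devices)
    return sum(1 for d in devices if d.compromised) / len(devices)

situation_metric = Metric("fraction-compromised", "[0,1]", fraction_compromised)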

2.2 SARR Overview


Figure 1 highlights the framework, which is driven by the assumptions that are
made at the design and operation phases of a network, device, component or
building-block. For a given set of assumptions, there are three kinds of scenarios
according to a spectrum of (un)certainty in regards to the assumptions.

[Figure 1 shows a flowchart: starting from the assumptions (threat model, trust,
etc.), it asks whether they are violated; "no" leads to security metrics (discrete,
e.g., {0,1}), "yes" leads to security metrics (continuous, e.g., [0,1]) together
with agility and resilience metrics, and "maybe" leads to risk metrics (security,
agility, resilience with uncertainty).]

Fig. 1. The SARR framework is driven by assumptions and embraces uncertainty.

1. It is certain that the assumptions are not violated. This often corresponds
to the analyses that are conducted at the design phase, where designers con-
sider a range of security properties (e.g., confidentiality, integrity, availability,
authentication, and non-repudiation) with respect to a certain system model
and a certain threat model. Essentially, these security properties are often
defined over a binary scale, denoted by {0, 1}, indicating whether a property
holds or not under the system model and the threat model.
2. It is certain that some or all assumptions are violated. This often corresponds
to the operation phase, where security properties may be partially or entirely
compromised. Therefore, security properties may be defined over a continu-
ous scale, such as [0, 1] (e.g., the fraction of compromised computers in a
network). In this case, detection of violations would trigger the defender to
take countermeasures to “bounce back” from the violations, leading to the
notion of agility and resilience metrics, which will be elaborated later.
3. It is uncertain whether assumptions are violated or not (i.e., assumptions may
be violated). This naturally leads to risk metrics by associating uncertainties
to security, agility and resilience metrics.
In the rest of this section, we elaborate on these matters.
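
The decision logic of Fig. 1 can be sketched as a small Python function; the
string encoding of the three scenarios below is an illustrative choice made for
the example, not part of the framework itself.

def applicable_metric_families(assumptions_violated: str) -> list:
    """Map the (un)certainty about assumption violation ('no', 'yes', 'maybe')
    to the applicable SARR metric families, mirroring Fig. 1."""
    if assumptions_violated == "no":       # e.g., design-phase analysis
        return ["security metrics (discrete scale, e.g., {0,1})"]
    if assumptions_violated == "yes":      # violations detected at operation time
        return ["security metrics (continuous scale, e.g., [0,1])",
                "agility metrics", "resilience metrics"]
    if assumptions_violated == "maybe":    # the common real-world case
        return ["risk metrics (security, agility, resilience with uncertainty)"]
    raise ValueError("expected 'no', 'yes', or 'maybe'")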

2.3 Assumptions
In order to tame cybersecurity, assumptions may be made, explicitly or implic-
itly, during the design and operation phases of a network, device, component
or building-block. They are fundamental to cybersecurity properties.
Assumptions Associated with the Design Phase. At this phase, assump-
tions can be made with respect to system models, vulnerabilities, attacks (i.e.,
threat models) and defenses. For example, designers often use system models to
describe the interactions between the participating entities, the environment and
the interaction with it (if appropriate), the communication channels between the
participating entities (e.g., authenticated private channel), and the trust that is
embedded into the model (e.g., a participating entity is semi-honest or honest).

Designers use threat models with simplifying assumptions when specifying secu-
rity properties, proposing systems architectures, selecting protocols and mecha-
nisms, analyzing whether a property is attained or not under those assumptions.
Programmers and testers detect/eliminate bugs and vulnerabilities in the course
of developing software, while making various (possibly implicit) assumptions
(e.g., competency of a bug/vulnerability detection tool).
Assumptions Associated with the Operation Phase. During this phase,
various kinds of (possibly implicit) assumptions are often made (e.g., compe-
tency of configurations or defense tools). One example of assumptions that are
often made at the design phase and then inherited at the operation phase is the
attacker’s capability. For example, Byzantine Fault-Tolerance (BFT) protocols,
which can be seen as a building-block, work correctly when no more than one-
third of the replicas are compromised [29]. However, there is no guarantee in the
real world that the attacker cannot go beyond the one-third threshold, effectively
compromising the assurance offered by these powerful building-blocks. This can
be further attributed to the limited capabilities of cyber defense tools, such as
intrusion detection systems and malware detectors.

2.4 Metrics When Assumptions Are Certainly Not Violated

Under the premise that assumptions are complete and are not violated, cyber-
security metrics may degenerate to security metrics in the sense that agility,
resilience and risk may become irrelevant. Moreover, it may be sufficient to use
binary metrics, namely {0, 1}, to quantify security properties. This serves as a
starting point towards tackling cybersecurity metrics because it would be rare
to ascertain in the real world that assumptions are certainly not violated and
that the articulated assumptions are sufficient.
Metrics Associated with the Design Phase. At the design phase, we need
to define metrics to precisely describe the desired security properties. Textbook
knowledge would teach us that the desired properties include confidentiality,
integrity, availability, authentication, non-repudiation, etc. However, they may
not be sufficient. We advocate accurate and rigorous definitions (or specifica-
tions) of metrics, ideally as accurate and rigorous as the definitions given in
modern cryptography [16]. This is important because when accurate and rigor-
ous definitions are not given, it is not possible to conduct rigorous analysis to
establish desired properties. This means that each security property must be pre-
cisely defined with respect to a system model and a threat model. For example,
when we specify an availability property, we should specify it as a property of a
service (e.g., the service offered at port #80) vs. data (e.g., a file in a computer)
in the presence of some attack.
Metrics Associated with the Operation Phase. We need to define met-
rics to precisely describe the required security properties of a network, device,
component, or building-block at the operation phase. For example, availability
metrics at the operation phase may include service response time and service
throughput. Metrics associated with the operation phase are less understood
than their counterparts associated with the design phase.

2.5 Metrics When Assumptions Are Certainly Violated

When assumptions are violated, some or all of the security properties are com-
promised. In order to describe how defenders respond to such violations of
assumptions or compromises of security properties, agility and resilience proper-
ties emerge. Intuitively, agility quantitatively characterizes how fast a defender
responds to cybersecurity situation changes [8,30], and resilience quantitatively
characterizes whether and how the defender can make the network, device, com-
ponent or building-block “bounce back” from the violation of assumptions (i.e.,
correcting the violations) and the compromise of security properties (i.e., making
them hold again). The state-of-the-art is that the notions of agility-by-design and
resilience-by-design are less investigated and understood than security-by-design.
Agility and resilience are inherently associated with the operation phase because
(i) assumptions are the starting point of a design process and (ii) assumptions are
violated in real-world operations but not at the design phase. When assumptions
are violated, we propose quantifying security, agility, and resilience properties.
For quantifying security properties, examples of metrics are described as fol-
lows. (i) To what extent may an assumption have been violated? This may
require quantifying the extent to which a network, device, component, or
building-block is compromised. This is important for example when using BFT
protocols to tolerate attacks, where the fraction of devices that are compro-
mised (e.g., 35% vs. 50%) would make a difference in the defender’s response
to the attacks. (ii) To what extent is a security property compromised? This is
important because a security property may not be all-or-nothing, meaning that
a violation of assumptions may only cause a degradation of a security property.
For example, when a network (or device) is compromised, the attacker may only
be able to steal some, but not all, of the data stored on the network (or device),
causing a partial loss of the confidentiality property.
For quantifying agility, example metrics are described as follows. (i) How agile
is the defender in detecting the violation of an assumption? One assumption can
be that an employed intrusion prevention system can effectively detect a certain
class of attacks. Another assumption can be that the attacker does not identify
any 0-day vulnerability or use any new attack vector that cannot be recognized
by defense tools. (ii) How fast do the desired security properties degrade because
of the violation of assumptions? (iii) How quickly does the defender react to
the violation of assumptions or successful attacks? (iv) How quickly does the
defender bring the network to the required level of security properties?
For quantifying resilience, example metrics are described as follows. (i) What
is the maximum degree of violation in terms of the assumptions or security
properties that would make it possible for the defender to recover the network (or
device or component) and its services without shutting down and re-booting it
from scratch? In order to quantify these, we would need to quantify the maximum
degree of violation with respect to the assumptions that can be tolerated. (ii)
Does a security property degrade gradually or abruptly when assumptions are
violated? (iii) How does the degradation pattern, such as gradual vs. abrupt,
depend on the degree of violations of the assumptions?
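
As an illustration of how such agility and resilience questions translate into
concrete quantities, the following sketch computes simple timeliness metrics and
a degradation depth from one incident timeline; the event names and the [0, 1]
security scale are assumptions made for the example, not metrics prescribed here.

from dataclasses import dataclass

@dataclass
class Incident:
    """Timestamps (e.g., in hours) for one assumption-violation incident."""
    violated_at: float     # the assumption is actually violated
    detected_at: float     # the defender detects the violation
    responded_at: float    # countermeasures start
    restored_at: float     # the required security properties hold again
    worst_security: float  # lowest observed security level on a [0, 1] scale

def agility_and_resilience(inc: Incident) -> dict:
    return {
        "time_to_detect":  inc.detected_at - inc.violated_at,   # how agile is detection?
        "time_to_react":   inc.responded_at - inc.detected_at,  # how quickly does the defender react?
        "time_to_restore": inc.restored_at - inc.responded_at,  # how quickly is the required level restored?
        "max_degradation": 1.0 - inc.worst_security,            # how far did the security property degrade?
    }

# Example: violated at t = 0, detected at t = 4, response at t = 5, restored at t = 12.
print(agility_and_resilience(Incident(0.0, 4.0, 5.0, 12.0, worst_security=0.6)))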

2.6 Metrics When Assumptions May Be Violated


The preceding two scenarios correspond to the two ends of the spectrum of
(un)certainty about the assumptions being violated or not. In the real world,
it is rare that the defender would be certain about whether an assumption is
violated or not. As a consequence, it is rare for the defender to be certain about
whether a security property is compromised or not. Since uncertainty is inher-
ent to the cybersecurity domain, we have to embrace the uncertainty, meaning
that cybersecurity metrics must be defined while bearing in mind the uncer-
tainty factor. We use the term risk to accommodate the security, agility and
resilience metrics that can cope with uncertainty. Some examples of risk met-
rics are described as follows. (i) What is the degree of certainty that a security
property is compromised? In order to quantify this, the defender would need
to quantify the degree of certainty that an assumption is violated. (ii) What is
the degree of certainty when a defense tool flags an event as an attack (e.g., an
incoming network connection is an attack or a file is malicious) or anomaly? This
may be measured as the conditional probability (or trustworthiness), for exam-
ple, Pr(the event is indeed an attack|a detector says an event is an attack). (iii)
What is the degree of certainty that some software contains a zero-day vulnera-
bility that is known to the attacker but not the defender? (iv) What is the degree
of certainty about a threat model (e.g., attacker indeed cannot wage attacks that
are not permitted by the threat model)?
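
The second of these questions, for instance, is naturally answered with Bayes'
rule; the following sketch uses assumed detector characteristics (attack base
rate, true-positive rate, false-positive rate) purely for illustration.

def prob_attack_given_alert(base_rate: float, tpr: float, fpr: float) -> float:
    """Pr(the event is indeed an attack | a detector says it is an attack),
    computed via Bayes' rule from the attack base rate and the detector's
    true-positive and false-positive rates."""
    p_alert = tpr * base_rate + fpr * (1.0 - base_rate)
    return (tpr * base_rate) / p_alert

# Assumed numbers: 1% of events are attacks, TPR = 0.95, FPR = 0.05.
# The result is roughly 0.16, i.e., most alerts would be false alarms --
# one concrete face of the uncertainty discussed above.
print(prob_attack_given_alert(base_rate=0.01, tpr=0.95, fpr=0.05))
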
Observation 1. Uncertainty is inherent to cybersecurity, meaning that we must
define cybersecurity metrics to help defenders quantify cybersecurity risks and
make decisions in their cyber defense operations.

3 Status Quo
In this section, we use the SARR framework as a lens to look into the cyberse-
curity metrics that have been proposed in the literature. For this purpose, we
leverage survey papers [8,35,37] as a source of metrics, while considering more
recent literature published after those survey papers (e.g., [13,30]).

3.1 Assumptions
Assumptions are often articulated more clearly in building-block studies (e.g.,
cryptography) than the other settings of cybersecurity (e.g., what a chosen-
ciphertext attacker can do exactly). However, there are still gaps that are yet to
be bridged. First, assumptions may be stated implicitly. For example, cryptogra-
phy assumes that cryptographic keys are kept secret, either entirely or at least for
the most part (i.e., a partial exposure of a cryptographic key may be tolerated).
However, cryptographic keys in the real world can be
compromised in their entirety (see, e.g., [14,20]). As a consequence, the security
property of digital signatures, known as unforgeability, under the assumption
that the private signing keys are kept secret is compromised. This highlights
the importance of coping with the presence of compromised cryptographic keys
which have not been revoked yet [12,41,52]. Still, the trustworthiness of digital
signatures has yet to be quantified given the uncertainty that the private signing
keys or services may have been compromised without being detected.
Second, assumptions may be inadequate or incomplete. One example of inad-
equacy is the evolution from considering chosen-plaintext attacks to considering
chosen-ciphertext attacks. One example of incompleteness is that earlier threat
models simply did not consider the presence of side-channel attacks, which are
however realistic. This is not surprising because cyber attacks evolve with time,
meaning that threat models also evolve with time [47,48,53].
The preceding examples highlight the gaps between the validity of assump-
tions made at the design phase and the validity of these assumptions in the real
world. These gaps highlight the importance of explicitly and precisely articu-
lating assumptions, because violations of assumptions cause new properties and
metrics to emerge (e.g., the emergence of agility and resilience metrics). Moreover,
the inevitable uncertainty causes the emergence of risk metrics.

Observation 2. In order to tame cybersecurity, it is essential to explicitly and
precisely articulate the assumptions that are made at the design phase and the
operation phase. This is far from being achieved and is a big challenge.

3.2 Security Metrics

In [35], four classes of security metrics are defined: those for quantifying vulnera-
bilities (including user/human, interface-induced, and software vulnerabilities),
those for quantifying attack capabilities (including zero-day, targeted, botnet
attacks, malware, and evasion attacks), those for quantifying the effectiveness
of defenses (including preventive, reactive, proactive defense capabilities), and
those for quantifying situations (e.g., the percentage of compromised comput-
ers at a point in time). It is concluded in [35], and re-affirmed in [55], that the
problem “what should be measured” is largely open.

Observation 3. Our understanding of what should be measured in cybersecurity
is superficial.

3.3 Agility Metrics

In a broader context, the existing metrics that can be adapted to measure agility
are classified into the following categories [8]: those for quantifying timeliness
(including detection time, overall agility quickness) and those for quantifying
usability (including ease of use, usefulness, defense cost).

In the narrower context of attack-defense interactions, a novel family of agility
metrics is proposed in [30] to quantify the co-evolution (or escalation) of cyber
attacks and defenses. Unlike the classification used in [8], the agility metrics
defined in [30] accommodate two dimensions of the attack-defense co-evolution,
namely timeliness and effectiveness. Timeliness metrics describe how quick an
attacker is in evolving its attacks in response to the defender's use of
new strategy and/or techniques (and comparable metrics from the defender’s per-
spective). These metrics include: generation-time, which is the time it takes an
attacker (or defender) to evolve its strategies or techniques from one generation
to another generation as observed by the defender (or attacker), where a genera-
tion may be a new version of a tool (e.g., a new version of malware detector); and
triggering-time, which is the time it takes an attacker (or defender) to evolve into
the next generation of strategy or techniques. Effectiveness metrics quantify how
effective a new generation of attacks (or defenses) are, including: evolutionary-
effectiveness, which describes the effectiveness of the attacker’s (defender’s) strat-
egy or techniques with respect to defender’s (or attacker’s); relative-generational-
impact, which is the effectiveness gain of the current generation of attack (or
defense) over the past generation of attack (or defense).
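
A minimal sketch of how such timeliness and effectiveness metrics could be
computed from observed generations is given below; the timestamping and the
[0, 1] effectiveness scale are assumptions of the example, not of [30].

from dataclasses import dataclass
from typing import List

@dataclass
class Generation:
    """One observed generation of an attack (or defense) strategy/tool.
    `observed_at` is a timestamp; `effectiveness` is on an assumed [0, 1] scale."""
    observed_at: float
    effectiveness: float

def generation_times(gens: List[Generation]) -> List[float]:
    """Timeliness: elapsed time between consecutive generations."""
    return [b.observed_at - a.observed_at for a, b in zip(gens, gens[1:])]

def relative_generational_impact(gens: List[Generation]) -> List[float]:
    """Effectiveness: gain of each generation over its predecessor."""
    return [b.effectiveness - a.effectiveness for a, b in zip(gens, gens[1:])]

# Example: three detector generations observed at days 0, 30, and 45.
gens = [Generation(0, 0.70), Generation(30, 0.82), Generation(45, 0.85)]
print(generation_times(gens))              # [30, 15]
print(relative_generational_impact(gens))  # approximately [0.12, 0.03]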

Observation 4. Our understanding of agility metrics is even more superficial
than our understanding of security metrics.

3.4 Resilience Metrics

By adapting the existing metrics that are defined in other contexts, resilience
metrics may be classified into the following families [8]: those for quantifying
fault-tolerance metrics (including mean-time-to-failure, percolation threshold,
diversity), those for quantifying adaptability (including degree of local deci-
sion, degree of intelligent decision, degree of automation), and those for quan-
tifying recoverability (including mean-time-to-full-recovery, mean-time-between-
failures, mean-time-to-repair, and intrusion response cost). There are no system-
atic studies on resilience metrics.

Observation 5. Our understanding of resilience metrics is even more super-
ficial than our understanding of security metrics.

3.5 Risk Metrics

Risk is often investigated in the setting of hazards and is often defined as a prod-
uct of threat (which is a probability estimated by domain expert or other means),
vulnerability (which is another probability estimated by domain expert or another
means), and consequence (which is the damage caused by the threat when it hap-
pens) [22]. This means that risk is quantified as the expected or mean loss. How-
ever, this approach is not competent for managing the risk incurred by terror-
ist attacks [9] because it cannot deal with, among other things, the dependence
between many events (e.g., cascading failures). This immediately implies that this
approach is not competent for cybersecurity risk management because there are
many kinds of dependencies and interdependencies which make cybersecurity risks
exhibit emergent properties [17,34,36,46]. In order to deal with these problems,
Cybersecurity Dynamics offers a promising approach, especially its predictive
power in forecasting the evolution of dynamical situational awareness attained by
first-principle analyses (e.g., [11,18,19,26–28,42,45,49–51,54,60,61]) and data-
driven analyses (e.g., [5,15,24,25,43,44,57,58]).

4 Future Research Directions

In order to ultimately tackle the Cybersecurity Metrics and Quantification prob-
lem, we highlight some open problems that must be adequately addressed.
Taming Cybersecurity Assumptions. It would be ideal that (i) assumptions
are always explicitly and precisely stated, (ii) assumptions are independent of
each other, and (iii) assumptions made at the design phase are always valid
at the operation phase. However, these are hard to achieve. Alternatively, we
should characterize the relationships between related assumptions. For exam-
ple, an authenticated private communication channel assumes the following: (i)
authenticity of the communication parties, (ii) confidentiality of the communica-
tion contents, and (iii) integrity of the communication contents. These assump-
tions further rely on other, often implicitly made, assumptions. Specifically, the
preceding assumptions (i)–(iii) would have to be based on the assumption that
the communication parties are not compromised when cryptographic mecha-
nisms are used to realize these assumptions; otherwise, assumptions (i)–(iii) are
violated. Therefore, when the threat model assumes that the attacker cannot
compromise any of the communication parties, the security guarantee rigorously
proven in the abstract model may become irrelevant in the real world.
Bridging Design vs. Operation Gaps. There are several gaps between
designers’ views and defenders’ views, especially in terms of their levels of
abstractions. In particular, designers often deal with building-blocks and compo-
nents, but defenders often deal with networks and devices. There are big gaps
between these views. First, designers often make assumptions with the mindset
that these assumptions will not be violated in the real world. As a consequence,
the resulting cybersecurity properties are not only bound to the completeness
and accuracy of the assumptions, but also bound to the premise that the assump-
tions are not violated in practice. Therefore, there is a big gap between the
certainty of assumptions considered by designers and the uncertainty of assump-
tions being violated or not as perceived by defenders. Second, the network-level
and device-level implications of the assumptions that are made when designing
building-blocks and components are often unaddressed. This further amplifies
the uncertainty encountered by defenders in the real world.
The preceding discussion would explain why security properties are often
analyzed in academic research literature but not agility or resilience properties.
Moreover, the preceding discussion would also explain why designers often focus
on achieving preventive defense with no successful attacks. However, defenders
often deal with successful attacks, which break security properties by violating
the assumptions made by designers. This explains why real-world cyber defend-
ers need to leverage preventive defenses, reactive defenses, adaptive defenses,
proactive defenses, and active defenses collectively in order to achieve effective
defenses [47,48]. This also explains why the motivating question mentioned in
the Introduction cannot be answered yet, namely that the current cybersecurity
metrics and quantification knowledge is not sufficient to answer the defender’s
question in regards to the return on cybersecurity investment.
Identifying and Defining Cybersecurity Metrics That Must Be Mea-
sured. As mentioned above, the current understanding of what should be quan-
tified is superficial [35,55]. It is important to define a comprehensive, ideally
complete, suite of metrics under each of the security, agility, resilience, and risk
pillars. Since the literature study is often geared towards designers’ views, exist-
ing metrics are often defined for some purposes but rarely for the purposes of
cyber defense operations. Since academic research is often geared towards settings
where assumptions are not violated, there is a very limited body of knowledge
that can help defenders achieve quantitative cyber defense decision-making and
cybersecurity risk management. In order to bridge these gaps, one candidate
approach is to leverage cybersecurity datasets to define cybersecurity metrics
at multiple levels of abstraction: data vs. knowledge vs. application [48]. Using
Medical Science as an analogy, data-level metrics may be defined to quantify
building-block or “cell” level properties; “cell” level metrics may be leveraged
to define sub-system or “tissue” level properties; “tissue” level metrics may be
further leveraged to define “organ” level metrics; “organ” level metrics may be
further leveraged to define “human body” level metrics. It should be mentioned
that a higher level metric would not be any simple aggregation of some lower
level metrics, because cybersecurity is largely about emergent properties [46,55],
meaning that the phenomenon observed at a higher level of abstraction is the
outcome of interactions between its composing parts.
Seeking Foundations to Distinguish Good from Poor Metrics [35]. It
would not be hard to define cybersecurity metrics, but it is certainly hard to
define “good” cybersecurity metrics. This is because it is hard to define criteria
or seek foundations to evaluate the competency or usefulness of cybersecurity
metrics. In order to tackle this problem, we may need to conduct many case
studies and define metrics at multiple levels of abstractions [55] before we can
draw general insights along this direction. It would be ideal to conduct such case
studies on some killer applications; two candidate killer applications are cyber
defense command-and-control and quantitative cyber risk management [48].
Fostering a Cybersecurity Metrics Research Community. In order to
tackle such a fundamental problem like cybersecurity metrics and quantifica-
tion, it must take a community effort. This can be justified by how the basic
medical science research has supported clinical healthcare practices. For exam-
ple, the basic medical science research creates knowledge to help understand
how the various kinds of metrics (e.g., blood pressure) would reflect a human
being’s health condition (e.g., presence or absence of certain diseases), and this
kind of knowledge is applied to guide the practice of medical diagnosis and
treatment. Analogously, cybersecurity metrics research would need to identify,
invent, and define metrics (e.g., “cybersecurity blood pressure”) that reflect the
cybersecurity situations and can be applied to diagnose the “health conditions”
of networks or devices.
In order to accelerate the fostering of a research community, we can start
with some “grass roots” actions. For example, when one publishes a paper, the
author may strive to clearly articulate the assumptions that are needed by the
new result. Moreover, the author may strive to define metrics that are impor-
tant to quantify the progress made by the new result [35]. Furthermore, when
we teach cybersecurity courses, we should strive to make students know that
much research needs to be done in order to tackle the fundamental problems
of cybersecurity metrics and quantification. For this purpose, we would need to
develop new curriculum materials.
Developing a Science of Cybersecurity Measurement. Well defined cyber-
security metrics need to be measured in the real world, which would demand the
support of principled (rather than heuristic) methods. This problem may seem
trivial at a first glance, which may be true for some metrics in some settings.
However, the accurate measurement of cybersecurity metrics could be very chal-
lenging, which may be analogous to the measurement of light speed or gravita-
tional constant in Physics. To see this, let us consider a simple and well-defined
metric: What is the fraction (or percentage) of the devices in a network that
are compromised at a given point in time t? The measurement of this metric is
challenging in practice when the network is large. The reason is that automated
or semi-automated tools (e.g., intrusion detection systems and/or anti-malware
tools) that can be leveraged for measurement purposes are not necessarily trust-
worthy because of their false-positives and false-negatives.
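
A simple way to see what principled measurement could mean here: if the
measurement tool's true-positive and false-positive rates were known (an
assumption made only for the sake of illustration), the naively measured flagged
fraction could be corrected as follows.

def estimate_compromised_fraction(flagged_fraction: float,
                                  tpr: float, fpr: float) -> float:
    """If the true compromised fraction is p, the expected flagged fraction is
    q = p*TPR + (1 - p)*FPR, hence p = (q - FPR) / (TPR - FPR); the estimate
    is clamped to [0, 1]."""
    if tpr <= fpr:
        raise ValueError("the detector must do better than random (TPR > FPR)")
    p = (flagged_fraction - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, p))

# Assumed numbers: the tool flags 12% of devices, TPR = 0.90, FPR = 0.08;
# the corrected estimate of the compromised fraction is about 4.9%.
print(estimate_compromised_fraction(0.12, tpr=0.90, fpr=0.08))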

5 Conclusion
We have presented a framework to unify security metrics, agility metrics,
resilience metrics, and risk metrics. The framework is driven by the assump-
tions that are made at the design and operations phases, while embracing the
uncertainty about whether these assumptions are violated or not in the real
world. We identified a number of gaps that have not been discussed in the lit-
erature but must be bridged in order to tackle the problem of Cybersecurity
Metrics and Quantification and ultimately tame cybersecurity. In particular, we
must bridge the assumption gap and the uncertainty gap, which are inherent to
the discrepancies between designers’ views at lower levels of abstractions (i.e.,
building-blocks and components) and operators’ views at high levels of abstrac-
tions (i.e., networks and devices). We presented a number of future research
directions. In addition, it is interesting to investigate how to extend the SARR
framework to accommodate other kinds of metrics, such as dependability.

Acknowledgement. We thank Moti Yung for illuminating discussions and Eric Ficke
for proofreading the paper. This work was supported in part by ARO Grant #W911NF-
17-1-0566, NSF Grants #2115134 and #2122631 (#1814825), and by a Grant from the
State of Colorado.

References
1. Charlton, J., Du, P., Cho, J., Xu, S.: Measuring relative accuracy of malware
detectors in the absence of ground truth. In: Proceedings of IEEE MILCOM, pp.
450–455 (2018)
2. Charlton, J., Du, P., Xu, S.: A new method for inferring ground-truth labels. In:
Proceedings of SciSec (2021)
3. Chen, H., Cho, J., Xu, S.: Quantifying the security effectiveness of firewalls and
DMZs. In: Proceedings of HoTSoS 2018, pp. 9:1–9:11 (2018)
4. Chen, H., Cho, J., Xu, S.: Quantifying the security effectiveness of network diver-
sity. In: Proceedings of HoTSoS 2018, p. 24:1 (2018)
5. Chen, Y., Huang, Z., Xu, S., Lai, Y.: Spatiotemporal patterns and predictability
of cyberattacks. PLoS ONE 10(5), e0124472 (2015)
6. Cheng, Y., Deng, J., Li, J., DeLoach, S., Singhal, A., Ou, X.: Metrics of security.
In: Cyber Defense and Situational Awareness, pp. 263–295 (2014)
7. Cho, J., Hurley, P., Xu, S.: Metrics and measurement of trustworthy systems. In:
Proceedings IEEE MILCOM (2016)
8. Cho, J., Xu, S., Hurley, P., Mackay, M., Benjamin, T., Beaumont, M.: STRAM:
measuring the trustworthiness of computer-based systems. ACM Comput. Surv.
51(6), 128:1–128:47 (2019)
9. National Research Council: Review of the Department of Homeland Security’s
Approach to Risk Analysis. The National Academies Press (2010)
10. INFOSEC Research Council: Hard problem list. http://www.infosec-research.org/docs_public/20051130-IRC-HPL-FINAL.pdf (2007)
11. Da, G., Xu, M., Xu, S.: A new approach to modeling and analyzing security of
networked systems. In: Proceedings HotSoS 2014, pp. 6:1–6:12 (2014)
12. Dai, W., Parker, P., Jin, H., Xu, S.: Enhancing data trustworthiness via assured
digital signing. IEEE TDSC 9(6), 838–851 (2012)
13. Du, P., Sun, Z., Chen, H., Cho, J.H., Xu, S.: Statistical estimation of malware
detection metrics in the absence of ground truth. IEEE T-IFS 13(12), 2965–2980
(2018)
14. Durumeric, Z., et al.: The matter of heartbleed. In: Proceedings IMC (2014)
15. Fang, Z., Xu, M., Xu, S., Hu, T.: A framework for predicting data breach risk:
leveraging dependence to cope with sparsity. IEEE T-IFS 16, 2186–2201 (2021)
16. Goldreich, O.: The Foundations of Cryptography, vol. 1. Cambridge University
Press (2001)
17. Haimes, Y.Y.: On the definition of resilience in systems. Risk Anal. 29(4), 498–501
(2009)
18. Han, Y., Lu, W., Xu, S.: Characterizing the power of moving target defense via
cyber epidemic dynamics. In: HotSoS, pp. 1–12 (2014)
19. Han, Y., Lu, W., Xu, S.: Preventive and reactive cyber defense dynamics with
ergodic time-dependent parameters is globally attractive. IEEE TNSE, accepted
for publication (2021)
20. Harrison, K., Xu, S.: Protecting cryptographic keys from memory disclosures. In:
IEEE/IFIP DSN 2007, pp. 137–143 (2007)
21. Homer, J., et al.: Aggregating vulnerability metrics in enterprise networks using
attack graphs. J. Comput. Secur. 21(4), 561–597 (2013)
22. Jensen, U.: Probabilistic risk analysis: foundations and methods. J. Am. Stat.
Assoc. 97(459), 925 (2002)
23. Kantchelian, A., et al.: Better malware ground truth: techniques for weighting
anti-virus vendor labels. In: Proceedings AISec, pp. 45–56 (2015)
24. Li, D., Li, Q., Ye, Y., Xu, S.: SoK: arms race in adversarial malware detection.
CoRR, abs/2005.11671 (2020)
25. Li, D., Li, Q., Ye, Y., Xu, S.: A framework for enhancing deep neural networks
against adversarial malware. IEEE TNSE 8(1), 736–750 (2021)
26. Li, X., Parker, P., Xu, S.: A stochastic model for quantitative security analyses of
networked systems. IEEE TDSC 8(1), 28–43 (2011)
27. Lin, Z., Lu, W., Xu, S.: Unified preventive and reactive cyber defense dynamics is
still globally convergent. IEEE/ACM ToN 27(3), 1098–1111 (2019)
28. Lu, W., Xu, S., Yi, X.: Optimizing active cyber defense dynamics. In: Proceedings
GameSec 2013, pp. 206–225 (2013)
29. Lynch, N.: Distributed Algorithms. Morgan Kaufmann (1996)
30. Mireles, J., Ficke, E., Cho, J., Hurley, P., Xu, S.: Metrics towards measuring cyber
agility. IEEE T-IFS 14(12), 3217–3232 (2019)
31. Morales, J., Xu, S., Sandhu, R.: Analyzing malware detection efficiency with mul-
tiple anti-malware programs. In: Proceedings CyberSecurity (2012)
32. Nicol, D., et al.: The science of security 5 hard problems, August 2015. http://cps-vo.org/node/21590
33. Noel, S., Jajodia, S.: A suite of metrics for network attack graph analytics. In:
Network Security Metrics, pp. 141–176. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66505-4_7
34. Park, J., Seager, T.P., Rao, P.S.C., Convertino, M., Linkov, I.: Integrating risk
and resilience approaches to catastrophe management in engineering systems. Risk
Anal. 33(3), 356–367 (2013)
35. Pendleton, M., Garcia-Lebron, R., Cho, J., Xu, S.: A survey on systems security
metrics. ACM Comput. Surv. 49(4), 62:1–62:35 (2016)
36. Pfleeger, S.L., Cunningham, R.K.: Why measuring security is hard. IEEE Secur.
Priv. 8(4), 46–54 (2010)
37. Ramos, A., Lazar, M., Filho, R.H., Rodrigues, J.J.P.C.: Model-based quantitative
network security metrics: a survey. IEEE Commun. Surv. Tutor. 19(4), 2704–2734
(2017)
38. National Science and Technology Council: Trustworthy cyberspace: strategic plan for the federal cybersecurity research and development program (2011). https://www.nitrd.gov/SUBCOMMITTEE/csia/Fed_Cybersecurity_RD_Strategic_Plan_2011.pdf
39. Wang, L., Jajodia, S., Singhal, A.: Network Security Metrics. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66505-4
40. Wang, L., Jajodia, S., Singhal, A., Cheng, P., Noel, S.: k-zero day safety: a network
security metric for measuring the risk of unknown vulnerabilities. IEEE TDSC
11(1), 30–44 (2014)
41. Xu, L., et al.: KCRS: a blockchain-based key compromise resilient signature system.
In: Proceedings BlockSys, pp. 226–239 (2019)
42. Xu, M., Da, G., Xu, S.: Cyber epidemic models with dependences. Internet Math.
11(1), 62–92 (2015)
43. Xu, M., Hua, L., Xu, S.: A vine copula model for predicting the effectiveness of
cyber defense early-warning. Technometrics 59(4), 508–520 (2017)
44. Xu, M., Schweitzer, K.M., Bateman, R.M., Xu, S.: Modeling and predicting cyber
hacking breaches. IEEE T-IFS 13(11), 2856–2871 (2018)
45. Xu, M., Xu, S.: An extended stochastic model for quantitative security analysis of
networked systems. Internet Math. 8(3), 288–320 (2012)
46. Xu, S.: Emergent behavior in cybersecurity. In: Proceedings HotSoS, pp. 13:1–13:2
(2014)
47. Xu, S.: Cybersecurity dynamics: a foundation for the science of cybersecurity. In:
Proactive and Dynamic Network Defense, pp. 1–31 (2019)
48. Xu, S.: The cybersecurity dynamics way of thinking and landscape (invited paper).
In: ACM Workshop on Moving Target Defense (2020)
49. Xu, S., Lu, W., Xu, L.: Push- and pull-based epidemic spreading in networks:
thresholds and deeper insights. ACM TAAS 7(3), 1–26 (2012)
50. Xu, S., Lu, W., Xu, L., Zhan, Z.: Adaptive epidemic dynamics in networks: thresh-
olds and control. ACM TAAS 8(4), 1–19 (2014)
51. Xu, S., Lu, W., Zhan, Z.: A stochastic model of multivirus dynamics. IEEE Trans.
Dependable Secure Comput. 9(1), 30–45 (2012)
52. Xu, S., Yung, M.: Expecting the unexpected: towards robust credential infrastruc-
ture. In: Financial Crypto, pp. 201–221 (2009)
53. Xu, S.: Cybersecurity dynamics. In: Proceedings HotSoS 2014, pp. 14:1–14:2 (2014)
54. Xu, S., Lu, W., Li, H.: A stochastic model of active cyber defense
dynamics. Internet Math. 11(1), 23–61 (2015)
55. Xu, S., Trivedi, K.: Report of the 2019 SATC pi meeting break-out session on
“cybersecurity metrics: Why is it so hard?” (2019)
56. Xu, S., Yung, M., Wang, J.: Seeking foundations for the science of cyber
security. Inf. Syst. Front. 23, 263–267 (2021)
57. Zhan, Z., Xu, M., Xu, S.: Characterizing honeypot-captured cyber attacks: statis-
tical framework and case study. IEEE T-IFS 8(11), 1775–1789 (2013)
58. Zhan, Z., Xu, M., Xu, S.: Predicting cyber attack rates with extreme
values. IEEE T-IFS 10(8), 1666–1677 (2015)
59. Zhang, M., Wang, L., Jajodia, S., Singhal, A., Albanese, M.: Network diversity: a
security metric for evaluating the resilience of networks against zero-day attacks.
IEEE Trans. Inf. Forensics Secur. 11(5), 1071–1086 (2016)
60. Zheng, R., Lu, W., Xu, S.: Active cyber defense dynamics exhibiting rich phenom-
ena. In: Proceedings HotSoS (2015)
61. Zheng, R., Lu, W., Xu, S.: Preventive and reactive cyber defense dynamics is
globally stable. IEEE TNSE 5(2), 156–170 (2018)
Detection for Cybersecurity
Detecting Internet-Scale Surveillance
Devices Using RTSP Recessive Features

Zhaoteng Yan1,2, Zhi Li1,2(B), Wenping Bai2, Nan Yu2, Hongsong Zhu1,2, and Limin Sun1,2

1 School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
{yanzhaoteng,lizhi,zhuhongsong,sunlimin}@iie.ac.cn
2 Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
{baiwenping,yunan}@iie.ac.cn

Abstract. In recent years, fingerprinting online surveillance devices has
been a hot research topic. However, previous studies and mainstream search
engines still cannot identify the brands of a large share of these devices. In
this work, we propose a novel neural network-based approach for automatically
discovering surveillance devices and identifying their brands in cyberspace. By
using a deep semi-supervised learning algorithm, most unlabeled samples can be
learned from newly explored recessive features of the RTSP protocol. In the
global IPv4 space, we conduct an evaluation on 3,123,489 active RTSP hosts for
training and testing. The experimental results demonstrate that our approach
can discover 2,803,406 surveillance devices, eight times and three times more
than those discovered by Shodan and Zoomeye, respectively. Moreover, our
approach identifies the brands of 2,457,661 devices, at least four times more
than existing methods. Precision and recall both reach 93%.

Keywords: RTSP · Video surveillance devices · Fingerprinting

1 Introduction

Video surveillance devices, including IP cameras, Network Video Recorders (NVRs),
Digital Video Recorders (DVRs), etc., capture real-time video and audio of a
monitored area. For convenience, massive numbers of surveillance devices are
connected to the Internet. Meanwhile, these online embedded devices bring
huge threats to cyberspace because they lack sufficient security protection
[2]. The most effective measure for security assessment is to identify their
manufacturers. Therefore, previous studies [5,7,10,14] and search engines [13,18]
have been devoted to discovering and fingerprinting these devices.
However, most online surveillance devices still cannot be identified. As an
example, this work focuses on video surveillance devices that expose the RTSP
service in cyberspace. As a kind of classic video
transmission protocol, RTSP is the most widely used application protocol that
manufacturers implement as the remote video transmission service for their
surveillance products. According to statistics from 2019, 2,290,633 hosts on the
Internet were running the RTSP service [18]. However, only 526,241 of them
could be identified as surveillance devices, and the large remaining number of
RTSP hosts (1,764,392, about 77% of the total) were tagged as Unknown.
Although these unknown RTSP hosts may include some RTSP streaming servers
or honeypots, the number of unidentified surveillance devices is still very large.
According to our analysis, there are two main reasons why most RTSP hosts
cannot be identified. (1) Single probing load. Current mainstream search
engines (such as Shodan [13], Zoomeye [18], and Censys [5]) and previous studies
[10] employ the OPTIONS request as the only probing load to obtain protocol
banners. In fact, this is only one of the 11 methods originally defined for RTSP
request packets [12], so the identification source is one-dimensional. (2) Simple
dominant features. Previous approaches employ obvious characteristic fields
(commonly the manufacturer name, such as Hikvision or Dahua) as fingerprinting
features. Such features are simple, intuitive, and effective, but only a small
portion of video surveillance devices expose these dominant keywords in their
protocol banners. Moreover, more and more manufacturers have removed the
obvious vendor/model names from the responses of their new products. As a
result, the scope of identification becomes much smaller and the difficulty of
fingerprinting increases greatly.
Motivation: In this paper, we aim to determine whether the Unknown RTSP
hosts on the Internet are surveillance devices or not, and to identify their
manufacturers if they are. The key to our work is mining new, effective non-
dominant features and generating accurate fingerprints from RTSP protocol
banners. We observe that each surveillance manufacturer implements the RTSP
service between its products and streaming servers in its own way. Consequently,
surveillance devices return a variety of response packets when they receive the
same sequence of standard request methods (including OPTIONS, DESCRIBE,
SETUP, etc.). These distinctions can therefore be employed as a new kind of
recessive feature to distinguish surveillance brands, even when the response
packets do not contain obvious characteristic keywords.
Challenges: To achieve this goal, we need to address three main challenges:
– Non-standard products: each manufacturer implements RTSP differently across
  its diverse series of surveillance products, which greatly increases the difficulty
  of feature extraction.
– Too few labeled samples: there is no public ground-truth dataset of surveillance
  devices labeled by RTSP recessive features for training neural networks.
– Uneven sample distribution: the unbalanced market shares of various brands
  make the distribution of training and testing samples uneven.
Method: To address these challenges, we employ a novel method consisting of
three parts: Text-CNN, DS3L, and open-world SSL, abbreviated as TDO.
First, we use an enhanced Text-Convolutional Neural Network (CNN) to
extract recessive features from the encoded, sequential, and normalized <Response>
matrixes of non-standard samples. Then we use DS3L (Deep Safe Semi-
Supervised Learning) to extend the labeled samples. In particular, the open-
world SSL (Semi-Supervised Learning) algorithm is employed to cluster unseen
classes that may not be contained in the labeled brands on the Internet.
Results: To evaluate the performance, we apply our approach to real Internet-
wide public data and generate an experimental dataset containing 3,011,237
active RTSP hosts. After training and testing, 2,803,406 hosts are classified as
discovered surveillance devices. The resulting identification rate (89.75%) is
about eight times and three times higher than that of the well-known search
engines Shodan and Zoomeye [13,18]. Moreover, 2,457,661 discovered devices
are identified across 35 brands, including 8 new brands from unseen classes,
and the number of identified brand-level devices is 7.2 times and 4.7 times
more than that of the representative previous approaches ARE and IoTtracker
[7,14]. The results show that precision and recall remain high, at 93.39% and
93.12%, respectively.
Contributions: Overall, we make the following contributions:
– New recessive features: we propose a method, TDO, for identifying online
  surveillance devices using recessive features of the RTSP protocol.
– Active learning: we propose the first neural network-based approach that
  encodes RTSP <Response> packets into sequential vectorized matrixes for
  recessive feature extraction.
– Effective evaluation in a real experiment: we generate the first dataset
  containing 2,803,406 labeled surveillance devices, about four times as many as
  commercial search engines. Among them, 2,457,661 brand-level samples are
  identified across 35 different brands, almost four times more than previous
  approaches.
– High precision and recall: the evaluation demonstrates that our approach
  achieves higher precision and recall (93%) than six comparable classification
  algorithms.

2 Related Work
Internet-Wide Discovery of Surveillance Devices. Since video surveillance
devices are among the most typical IoT devices, previous studies on discovering
surveillance devices in cyberspace have gone hand in hand with fingerprinting
online IoT devices. Durumeric et al. proposed ZMap, which decreased Internet-wide
scanning time from two years to one hour [6]. Based on ZMap, researchers
proposed many fast Internet-wide scanning mechanisms for IoT devices, including
webcams [3]. At the same time, device search engines such as Shodan [13] and
Censys [5] emerged in succession and provided Internet-wide device searching
services to the public. Because online surveillance devices made up most of the
Mirai-infected botnets in 2016 [2], Antonakakis et al. determined that hundreds
of thousands of IP cameras and DVRs were infected. A specific study on
discovering almost 1.6 million surveillance devices in cyberspace, which comes
closest to our work, was given by Li et al. [10]. Although they also discussed
RTSP as one application service of video surveillance devices, they only used
HTTP webpages as the main fingerprinting source data.
The above-mentioned works mainly focus on fingerprinting surveillance devices
using dominant features, such as obvious vendor and product names [5,7] or
visible webpages [10]. However, these approaches can only cover the portion of
target devices that carry dominant features. For the complex and irregular
cyberspace, such ideal dominant features are obviously inadequate.
Protocol Recessive Features. To augment identification capability, a new line
of work on extracting recessive features has emerged in recent years. Wang et al.
used the HTML DOM tree and CSS styles as enhanced recessive features [14],
which increased the number of identifiable devices by 40.76% compared with
using obvious vendor and product names [7]. The authors of [16,17] proposed
neural networks to learn deep recessive features, such as special string fields,
from protocol banners. These works have partly increased the number of
identifiable online IoT devices. However, previous works are still of little help for
RTSP-based surveillance devices, which is exactly the issue this paper aims to
resolve.

3 Protocol Analysis on RTSP


In this section, we first describe the RTSP <Request, Response> methods. Then, we
introduce recessive feature extraction for Internet-wide measurement of surveil-
lance device discovery.

3.1 <Request, Response> Methods of RTSP


RTSP is a standard C/S-model application protocol, and we use a total of 20
request methods. On the one hand, there are 6 required or recommended methods
and 5 optional methods defined by RTSP [12]; the 6 serialized required or
recommended methods are commonly implemented in video surveillance devices
acting as streaming servers. On the other hand, there are 9 abnormal methods
(including bad method, bad option url, etc.) and one HTTP GET method; these
special methods are generated by us to extend the identification data source.
Each request method carries out a unique function [12], so its corresponding
response is also different. Meanwhile, the <Request, Response> packets follow an
HTTP-like structure. Fig. 1 shows two examples from a Hikvision and a Uniview
IP camera. We observe that devices of different brands typically return six
distinctive Response packets (in status code, layout, and content; some Responses
are even null). What is more, even within the same brand, Response packets
differ across product series and device types. For example, Hikvision NVRs and
IP cameras may be loaded with different firmware in different production years.
Consequently, the diverse yet normalized response packets present two
characteristics, invariance and distinctiveness, which give us the opportunity to
extract useful features for identification.
(Figure content: side-by-side raw RTSP Response packets from a Hikvision and a Uniview IP camera for the OPTIONS, DESCRIBE, SETUP, PLAY, PAUSE, and TEARDOWN requests, illustrating their differing status codes, header layouts, and null responses.)

Fig. 1. Two examples of 6 irregular response packets.

Most significantly, non-interruption is the basic guideline for choosing Request
methods. For example, because OPTIONS does not influence server state, existing
search engines use the OPTIONS Response as identification source data [13,18].
Thanks to RTSP authentication, the other Requests do not interrupt target
devices unless authentication succeeds. Moreover, these unauthenticated Responses
still contain useful content for identification, as shown in Fig. 1. Therefore,
collecting response packets does not adversely affect online surveillance devices.
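For concreteness, a minimal probing sketch in Python is shown below (illustrative only, not the authors' tooling; the host address is a placeholder). It sends a single OPTIONS request, which does not change server state, and reads back the Response banner used as identification source data:

    import socket

    def rtsp_options_banner(host, port=554, timeout=5.0):
        """Send one RTSP OPTIONS request and return the raw Response banner."""
        request = (
            "OPTIONS rtsp://{h}:{p}/ RTSP/1.0\r\n"
            "CSeq: 1\r\n"
            "\r\n"
        ).format(h=host, p=port).encode("ascii")
        chunks = []
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(request)
            sock.settimeout(timeout)
            try:
                while True:
                    data = sock.recv(4096)
                    if not data:
                        break
                    chunks.append(data)
            except socket.timeout:
                pass  # a quiet socket is treated as the end of the banner
        return b"".join(chunks).decode("ascii", errors="replace")

    # banner = rtsp_options_banner("203.0.113.10")  # placeholder address

The remaining probes follow the same pattern with different method names and, for the abnormal cases, deliberately malformed request lines.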

3.2 Features Selection

To identify the brand of a surveillance device without relying on direct brand-name
keywords as observable features, we choose three dimensions of recessive features
in Response packets. First, we use the diverse responses to the 20 sequential
Request methods. Taking Fig. 1 as two typical examples, two IP cameras from two
manufacturers return different responses, and the contents of each header field
show an obvious contrast. Second, we exploit the different response mechanisms
of different brands. As shown in Fig. 1, the PAUSE Response of the Uniview
camera is a normal packet, while the Hikvision camera responds with null; this
non-response is itself a characteristic that distinguishes it from other brands.
Third, the status code is another useful feature. There are 44 kinds of RTSP
status codes, implemented differently by diverse manufacturers, such as “454” in
the Hikvision PLAY Response and “401” in the Uniview PLAY Response.
With regard to the stability of the feature source data, the Responses of a
device rarely change (except for the date, time, and temporary strings in the
“nonce” field) until its firmware is updated.
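A rough sketch of how these three dimensions could be pulled out of one raw Response string is given below (illustrative only; it follows the layout of the examples in Fig. 1 and simply drops the volatile Date header, while stripping the nonce inside WWW-Authenticate is omitted for brevity):

    def response_features(raw_response):
        """Extract (status code, header names, null indicator) from one Response."""
        if not raw_response or not raw_response.strip():
            # a missing Response (e.g. Hikvision's empty PAUSE reply) is itself a feature
            return {"status": None, "headers": (), "is_null": True}
        lines = raw_response.split("\r\n")
        parts = lines[0].split(" ", 2)          # e.g. "RTSP/1.0 401 Unauthorized"
        status = parts[1] if len(parts) > 1 else None
        headers = []
        for line in lines[1:]:
            if ":" not in line:
                continue
            name, _value = line.split(":", 1)
            if name.strip().lower() == "date":  # volatile, excluded from the fingerprint
                continue
            headers.append(name.strip())
        return {"status": status, "headers": tuple(headers), "is_null": False}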
3.3 Challenges of Internet-Wide Measurement

Although the above shows that it is possible to use diverse and structured RTSP
Responses for fingerprinting, three difficulties remain in cyberspace. First, not all
surveillance devices of one brand return the same sequential Response packets,
because of diverse product series, device types, and firmware updates; this makes
the Responses complex and irregular. Second, the market shares of different
manufacturers determine the respective proportions of their products in
cyberspace; this makes the samples unevenly distributed, which easily leads to
over-fitting when classifying brands with small market shares. Third, online
samples are unlabeled, and a common case is that the unlabeled data contains
classes that are not seen in the labeled data.
As our goal is to identify online surveillance devices and detect the brands of
those tagged as Unknown by existing search engines, we focus on obtaining
Responses from accessible devices whose RTSP services are open in cyberspace
without protection. For meticulously protected devices, we do not attempt any
disallowed probing packets, in order to comply with ethics. To achieve this aim,
we need to address the three difficulties above, which constitutes our main
novelty and contribution.

4 Methodology
In this section, we introduce the new TDO architecture for fingerprinting online
surveillance devices based on mixed neural networks and deep learning methods.
As illustrated in Fig. 2, the workflow of our architecture consists of six steps.
(1) Data collection: we first send 20 sequential RTSP Request methods to hosts
with port 554 open, both on our private Intranet and on the public Internet, and
simultaneously collect the Responses from these offline and online devices.
(2) Pre-processing: we clean and normalize each Response as a matrix sample in
a unified format. (3) Input: based on manual labeling experience and fingerprints,
we tag each known device with a label containing its surveillance type and brand;
the samples are thus divided into two categories, labeled and unlabeled, which
are used as the training and testing datasets, respectively. (4) Training: based on
deep learning algorithms, we train the classification model. (5) Classification and
(6) Identification: the trained model determines whether an unlabeled sample is
a surveillance device and identifies its brand.

Fig. 2. The Main System Architecture.

4.1 Data Collection

As illustrated in Fig. 2, our data sources are obtained from two environ-
ments: offline and online. First, concerning the offline data source, we con-
structed a private surveillance Intranet which contains 67 popular surveillance
devices we purchased. In the white-box testing environment, we can explore
<Request, Response> methods of RTSP and ensure the non-interruption in
detail. Most significantly, we have evaluated the feasibility of fingerprinting these
devices by using Responses as a data source. Meanwhile, these few well-labeled
samples can be used as seeds for fingerprinting in cyberspace. Second, to reduce
unnecessary Internet-wide scans and respect network ethics, we collected active
hosts with the RTSP service open on port 554 from search engines [13,18]. Then,
we sent the 20 Request methods to these hosts and collected the corresponding
Responses as the online data source.

4.2 Pre-processing
Although RTSP Response packets are semi-structured text in a specific RTSP
format, they still need to be processed. As shown in Fig. 1, the Response packets
contain some useless symbols, such as “=” and “:”. To make the useful
characteristics more effective, we first transform the text into ASCII codes and
clean it with NLP (Natural Language Processing) tools [1]. Then, considering
that the texts vary in length, we fix the length of each response at no more than
200. Most significantly, we process each response according to the standard RTSP
Response format with 44 fields, filling null for empty fields. After the above steps,
each response is processed as a unified text-matrix, and the sample of each host
is constructed from 20 sequential matrixes.
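As an illustration, the normalization could be coded roughly as follows (a sketch under stated assumptions: the 200-code limit is applied to each field's content, the 44-field list is not reproduced here, and missing Responses simply stay all-zero):

    import numpy as np

    MAX_LEN = 200        # fixed content length (from the paper)
    NUM_FIELDS = 44      # standard RTSP Response fields (list omitted here)
    NUM_RESPONSES = 20   # sequential Request methods per host

    def encode_field(text, max_len=MAX_LEN):
        """ASCII-encode one field value, truncated or zero-padded to max_len."""
        codes = [ord(c) for c in (text or "")[:max_len]]
        return codes + [0] * (max_len - len(codes))

    def encode_host(responses):
        """responses: list of up to NUM_RESPONSES dicts {field_name: text}."""
        sample = np.zeros((NUM_RESPONSES, NUM_FIELDS, MAX_LEN), dtype=np.float32)
        for r, fields in enumerate(responses[:NUM_RESPONSES]):
            for f, name in enumerate(sorted(fields)[:NUM_FIELDS]):
                sample[r, f, :] = encode_field(fields[name])
        return sample    # one (20, 44, 200) text-matrix sample per host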

4.3 Labeling
The normalized samples need to be tagged with labels, which we do with two
approaches based on common features. The label of each host contains three
attributes: whether it is a surveillance device (T), whether it can be identified by
dominant features (A), and its brand name (B). For instance, the label
yi = {Ai, Ti, Bi} of a Hikvision IP camera i is {Yes, Yes, Hikvision}. For a
different class such as Hikvision NVRs, we still tag the same brand but treat it as
a new unseen class. Hence, the classification process is conducted by three
classifiers: the first two are binary, and the number of classes (c ∈ C) of the last
one depends on the actual number of real brands in cyberspace. First, we
manually label the offline dataset, which includes 67 private devices covering 43
brands. This manual labeling can be extended to a part of the online samples
(nearly 30,147) through similar features of the 20 sequential Responses: the same
status codes, similar fields, and similar content. However, these labeled samples
(0.96% of the total) are still not enough for training a classifier model for the
remaining samples. To increase the scale of the labeled samples, we secondly
employ an existing fingerprinting approach [5]: by extracting dominant features
of apparent brand keywords, we labeled 334,651 samples. Combining the above
two approaches and merging duplicates, the labeled dataset Xn contains n
(n = 341,324) samples belonging to 43 brands (10.93% of all samples), which
serves as the training input. The remaining m (m = 2,782,165) unlabeled samples
constitute the testing dataset Ym.

4.4 Training
Considering that Response packets differ from common text, we employ an
enhanced CNN instead of a plain CNN to better learn recessive features from
sequential and logically related Responses. As discussed above, we aim to use
three recessive features (status code, content, response method), which would be
processed as ordinary word vectors by classic CNN models. In a trial test using
the TextCNN program in TensorFlow [9], over-fitting appears after only one
round of training, and the learned features focus on the samples of 9 big brands.
For these two reasons, we adopt an enhanced CNN model [15], directed at
extracting all recessive characteristics of the text-matrixes. Hence, we treat the
20 sequential matrixes of each host like 20 continuous photographs, which
produces a reduced global feature map (xi) in the input layer. In the convolutional
layer, sub-sampling follows to extract local features across several feature maps.
The two-dimensional feature maps are then passed to the fully connected layer,
which links the positional relationships of the maps. Consequently, both global
and local recessive features can be fully learned by the enhanced CNN.
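A minimal PyTorch sketch of this idea, treating the 20 sequential matrixes as 20 input channels of a single "image", might look as follows (the layer sizes are illustrative assumptions, not the authors' exact architecture):

    import torch
    import torch.nn as nn

    class RTSPRecessiveCNN(nn.Module):
        """Toy enhanced CNN over a (20, 44, 200) sample: 20 Responses x 44 fields x 200 codes."""
        def __init__(self, num_classes):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(20, 32, kernel_size=3, padding=1),   # local field/content patterns
                nn.ReLU(),
                nn.MaxPool2d(2),                               # sub-sampling
                nn.Conv2d(32, 64, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.AdaptiveAvgPool2d((4, 4)),                  # global summary of the feature maps
            )
            self.classifier = nn.Linear(64 * 4 * 4, num_classes)

        def forward(self, x):                  # x: (batch, 20, 44, 200)
            h = self.features(x)
            return self.classifier(h.flatten(1))

    # logits = RTSPRecessiveCNN(num_classes=43)(torch.rand(8, 20, 44, 200))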
To address the challenge of uneven distribution in the Internet-wide samples, we
employ DS3L to strengthen the minority samples of small brands while
maintaining the majority samples of big brands [8]. According to our statistics of
the final Internet-wide experimental result, the samples of the Top 5 brands
(Dahua, Hikvision, Xiongmai) occupy nearly two-thirds of the total samples
(3,123,489), and the remaining one-third belong to more than 32 brands. For
instance, the Bottom 10 brands (including Sony, Axis, Netgear, etc.) have only
317 surveillance devices in total, less than 0.02% of the Dahua (Top 1) samples.
Thus, we use DS3L in two stages. First, we set a weight function w(xi; α),
parameterized by α, for the unlabeled samples and find the optimal model θ̂(α)
as follows:
θ̂(α) = arg min_{θ∈Θ} [ Σ_{i=1}^{n} ℓ(h(xi; θ), yi) + Σ_{i=n+1}^{n+m} w(xi; α) Ω(xi; θ) ]        (1)
where θ ∈ Θ denotes the training parameters, h(xi; θ) denotes the trained model,
ℓ refers to the loss function, and Ω(xi; θ) refers to the regularization term.
Then, to maintain the generalization performance on the labeled samples of big
brands, we find the optimal parameter α̂ as follows:

α̂ = arg min_{α∈Bd} Σ_{i=1}^{n} ℓ(h(xi; θ̂(α)), yi)        (2)

where α ∈ Bd denotes the parameter mapping each sample to a weight. The
training process synchronously optimizes the weighted loss through α̂, which
minimizes the degradation of supervised performance on the well-labeled samples.
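In code, the inner objective of Eq. (1) is simply a supervised loss on labeled samples plus a weighted regularization term on unlabeled ones. A schematic PyTorch version is sketched below; the weight network and the consistency-style regularizer are placeholders, not the exact DS3L implementation:

    import torch
    import torch.nn.functional as F

    def ds3l_inner_loss(model, weight_net, x_lab, y_lab, x_unlab):
        """Schematic weighted semi-supervised loss following Eq. (1)."""
        # supervised term: sum_i l(h(x_i; theta), y_i)
        sup = F.cross_entropy(model(x_lab), y_lab)

        # unlabeled term: sum_i w(x_i; alpha) * Omega(x_i; theta)
        # Omega here is a simple consistency penalty under input noise (placeholder)
        logits = model(x_unlab)
        noisy = model(x_unlab + 0.01 * torch.randn_like(x_unlab))
        omega = F.mse_loss(noisy, logits.detach(), reduction="none").mean(dim=1)

        w = torch.sigmoid(weight_net(x_unlab)).squeeze(-1)   # w(x_i; alpha) in [0, 1]
        return sup + (w * omega).mean()

The outer step of Eq. (2) then updates α (the parameters of weight_net) so that the model obtained from the inner step keeps performing well on the labeled data.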
On the other hand, to address the challenge of unseen classes among the
unlabeled samples, we partly employ the open-world SSL algorithm to enhance
the DS3L semi-supervised algorithm for discovering novel classes of unlabeled
brands in real cyberspace [4]. As mentioned above, the labeled samples cover 43
brands, only a small part of the thousands of diverse brands on the Internet [7].
However, DS3L alone may reject new classes of Unknown brands that should be
discovered. Therefore, we denote the number u of novel classes as an unknown
parameter. After clustering the unknown classes (cu), we calculate the similarity
of each new class with the labeled classes (C) using the Levenshtein distance [11].
If cu ∉ C, we assign cu as a novel class. After fingerprinting with expert
experience, we manually label the novel class as a new brand and increase the
total number of classes to (c + 1).
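A simple sketch of the novel-class check: compare a cluster's signature string (e.g. its representative banner skeleton) against those of the labeled classes using the Levenshtein distance, and declare a new brand when no labeled class is close enough (the signature choice and threshold below are made-up illustrations):

    def levenshtein(a, b):
        """Classic dynamic-programming edit distance between two strings."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                   # deletion
                               cur[j - 1] + 1,                # insertion
                               prev[j - 1] + (ca != cb)))     # substitution
            prev = cur
        return prev[-1]

    def is_novel_class(cluster_signature, labeled_signatures, threshold=50):
        """Treat the cluster as an unseen brand if no labeled class is close enough."""
        return all(levenshtein(cluster_signature, s) > threshold
                   for s in labeled_signatures)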

4.5 Classification

After the training process, the trained models serve as three classifiers for
prediction. As mentioned in Sect. 4.3, the three classifiers are logically associated
as follows: the first determines whether a host is a surveillance device, the second
detects whether it can be identified by dominant features, and the third identifies
its brand. We investigated 4 typical machine learning classification algorithms,
including support vector machines (SVM), decision trees, and neural networks.
Considering the logical relationship of the classifiers and the actual performance
(see Sect. 5.2), we select the neural network for classification with three-
dimensional logistic regression. Consequently, the trained classification neural
network predicts whether an unlabeled sample from the testing dataset is a
surveillance device. To finally identify its manufacturer, a multi-class classifier is
needed.
4.6 Identification

In the identification module, we utilize the multi-class (c-class) classifier to detect
the brand Bj of an unknown sample from an online surveillance device. In theory,
the number j of brands determines the number c of classes. In practice, however,
each manufacturer may produce multiple series of surveillance devices in different
decades. This problem (j < c) results in multiple clusters that belong to the same
brand but cannot be classified as the same class. For instance, Hikvision DS-2CD-xx
IP cameras and DS-xx NVRs have different superficial characteristics. As a
result, we introduce pseudo-labeling for these particular clusters: for clusters with
similar features in the neural network, we add an intermediate step that tags
these samples with the same pseudo-label. Then, according to the similarity of
the neurons, we assign the final brand to these clusters based on dual-clustering
and their similarity. Finally, the multi-class classifier determines the brand of a
sample by calculating the probability p(xi) and applying the argmax function to
select the most likely class.

5 Real-World Experiments and Results

In this section, we conduct real-world experiments on the Internet to demonstrate
the validity and accuracy of our approach.

5.1 Experimental Data

Step 1: Data Collection. For offline data, we acquired 67 popular surveillance
devices (including 58 IP cameras, 5 NVRs, and 4 DVRs) from 43 brands (see
Fig. 3), keeping their original firmware; their manufacture years range from 2012
to 2020, aiming to emulate real usage as closely as possible.

Fig. 3. Acquired surveillance devices in our experimental Intranet.

For the online experiment, we utilized bulk data from 2020 provided by search
engines [13,18], which contain 5,046,671 active hosts responding on port 554 to
HTTP GET requests. After applying the <Request, Response> methods, we
received Responses from 3,123,489 hosts as the online data source (the other
1,923,182 hosts did not return Responses, likely because access was blocked by
firewalls or network address translators).
Step 2&3: Pre-processing and Labeling. The sequential banners of the
3,123,489 Responses were processed into 3,123,489 normalized matrixes. After
manual labeling and the previous labeling approach [7], 341,324 samples were
tagged with labels of the form {Device-Type, Feature-Type, Brand}, covering 27
brands. For the other 16 surveillance brands among our acquired devices (such as
Zavio, Tiandy, etc.), no samples opening the RTSP service were found on the
Internet. Thus, the remaining unlabeled samples (2,782,165) account for nearly
89.07% of all samples.
Step 4: Training. As shown in Table 1, we extended the labeled samples in
three stages of the training process. (1) Using the enhanced TextCNN algorithm,
800,251 samples were added to the labeled dataset. (2) Using the DS3L algorithm,
another 271,566 samples were added. (3) Using the open-world SSL algorithm,
98,096 further samples were added; these were labeled from unseen-class samples
covering 8 “new” brands. In total, 1,511,237 samples (48.38% of all samples)
could be labeled by learning recessive features with the semi-supervised learning
algorithms.

Table 1. Labeled samples after the three stages of the training process.

                    # of pre-labels   TextCNN     DS3L        Open-world SSL
Labeled samples     341,324           1,141,575   1,413,141   1,511,237
Unlabeled samples   2,782,165         1,981,914   1,710,348   1,612,252

Step 5&6: Classification and Identification. As shown in Table 3, a total of
2,803,406 samples are classified as surveillance devices in the final testing process.
Among them, 2,457,661 samples are identified with their brands, covering 35
brands. The other 345,745 surveillance devices can only be identified by device
type as H264 DVRs, whose detailed brands cannot be identified (a part of these
DVRs belong to Xiongmai, Dahua, etc.).

5.2 Evaluation
Measurement. To evaluate the performance, we introduce two evaluation
indexes: precision and recall. Precision reflects the proportion of classified devices
that are correct, recall reflects the proportion of actual surveillance devices that
are correctly classified, and the F1-score is their harmonic mean, calculated as
follows:
Precision = TP / (TP + FP),    Recall = TP / (TP + FN),    F1 = 2 · Precision · Recall / (Precision + Recall)        (3)
where True Positive (TP) denotes the number of surveillance devices correctly
classified, False Positive (FP) denotes the number of other devices incorrectly
classified as surveillance devices, and False Negative (FN) denotes the number of
surveillance devices incorrectly classified as other devices. Naturally, high
precision and recall are the desirable outcomes.
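A direct computation of Eq. (3) from the confusion counts (guarding against empty denominators) looks like this:

    def precision_recall_f1(tp, fp, fn):
        """Eq. (3): scores from true-positive, false-positive, false-negative counts."""
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        return precision, recall, f1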
Performance. As shown in Fig. 4, precision and F1-score follow similar trends
as the number of training samples grows, while recall shows the opposite trend.
When the number of training samples reaches 150 thousand, the performance
begins to stabilize.

(a) Performance of type-level. (b) Performance of brand-level.

Fig. 4. Trend of classification performance along with the number of training samples.

For performance comparison, we also evaluate six other classification algorithms:
Naive Bayes, Boosting, Support Vector Machine (SVM), K-Means, C4.5 Decision
Tree, and a simple Neural Network. As shown in Table 2, our approach TDO
achieves the best performance. The main reason is that our approach combines a
neural network with a deep semi-supervised algorithm, which together ensure
high precision and recall, especially for classifying unseen classes.

Table 2. Performance of seven classification algorithms.

                 {Surveillance, -, null}                 {Surveillance, -, Brand}
Approach         Precision(%)  Recall(%)  F1 score(%)    Precision(%)  Recall(%)  F1 score(%)
Naive Bayes      74.71         69.35      71.93          72.19         71.03      71.61
Boosting         71.92         73.22      72.56          72.45         70.38      71.40
SVM              85.36         90.44      87.83          86.33         87.55      86.94
k-Means          83.41         81.87      82.63          81.39         82.34      81.86
C4.5             78.75         81.55      80.13          77.99         80.10      79.03
Neural Network   90.23         91.61      90.91          91.68         90.51      91.09
TDO              93.39         93.12      93.25          92.97         92.33      92.65
5.3 Comparison
To further evaluate how effectively our approach meets the above-mentioned
challenges, we carry out two comparative trials at the type level and the brand level.
Comparison with Search Engines. Due to differences in probing methods and
periods, the number of collected samples differs across search engines. We choose
comparative samples from the year 2020 to be as fair as possible, and compare the
number of identified surveillance devices at the type level. Table 3 shows that the
identification rate of our approach is eight times and three times higher than that
of Shodan and Zoomeye, respectively. The underlying reasons are twofold: both
the recessive features and the deep semi-supervised algorithm help our approach
achieve better classification results.
Comparison with Existing Approaches. Table 3 also compares our approach
with ARE [7] and IoTtracker [14]. Using the same dataset of 3,123,489 samples,
the number of surveillance devices identified with their brands by our approach is
7.2 times and 4.7 times that of ARE [7] and IoTtracker [14], respectively.
Compared with ARE, which generates fingerprints based on dominant features
[7], our recessive features over twenty sequential Responses add a new feature
space. Compared with IoTtracker, which similarly uses recessive features of
semi-structured content, our approach improves the identification results thanks
to the combination of the enhanced TextCNN, DS3L, and open-world algorithms.

Table 3. Comparison with popular search engines and approaches.

{Surveillance, -, null}
Search Engine   # of Devices   # of Samples   Rate (%) of Identification
Shodan          435,641        4,145,206      10.51
Zoomeye         708,184        2,224,485      31.84
TDO             2,803,406      3,123,489      89.75

{Surveillance, -, Brand}
Approach        # of Devices   Rate (%) of Identification
ARE             341,324        10.93
IoTtracker      520,317        16.66
TDO             2,457,661      78.68

5.4 Distribution
We analyzed the distribution of identified results in two dimensions: geography
and brand.
Distribution of Countries. We locate the identified type-level results in 97
countries, with the Top 10 countries accounting for 84% of the 2,803,406
surveillance devices. Table 4 indicates that the largest share of devices (nearly
one-third) is located in China. The two main reasons are: (1) most brands in the
identified results (see the right part of Table 4) are manufactured in China;
(2) the statistics for Taiwan, Hong Kong, etc. are combined into China's because
these regions belong to China.
Table 4. Distribution of Top 10 brands and countries.

{Surveillance, -, null}                {Surveillance, -, Brand}
Country     # of (%) Devices           Brand       # of (%) Devices
China       946,215 (33.75)            Dahua       1,041,941 (42.39)
America     416,457 (14.85)            Hikvision   654,036 (26.61)
Vietnam     239,185 (8.53)             Xiongmai    348,293 (14.17)
Korea       239,104 (8.53)             TVT         119,322 (4.85)
Japan       101,411 (3.62)             D-Link      101,951 (4.15)
Australia   98,509 (3.51)              Foscam      50,886 (2.07)
France      81,301 (2.90)              Hisilicon   43,460 (1.77)
Brazil      80,101 (2.86)              iCatch      42,717 (1.74)
Russia      77,654 (2.77)              Uniview     23,323 (0.95)
Canada      55,748 (1.99)              Samsung     16,195 (0.66)
Other       467,721 (16.68)            Other       15,537 (0.63)
Total       2,803,406 (100)            Total       2,457,661 (100)

Distribution of Brands. With regard to the uneven sample distribution, we
analyzed the distribution of the identified brand-level surveillance devices over
their brands. Table 4 indicates that the Top 10 brands account for 99.37% of the
identified results (2,457,661). Among them, Dahua is the Top 1 brand with the
maximum number of devices. Meanwhile, Chinese brands (including Xiongmai,
TVT, Foscam, etc.) undoubtedly account for more surveillance devices than the
brands of other countries. There are two main reasons: (1) most offline labeled
samples of the surveillance devices acquired for our local Intranet were bought in
China; (2) the distribution reflects that China has the biggest manufacturing
capacity for surveillance devices.

6 Discussion and Conclusion


The main limitation of this work is that it focuses on fingerprinting surveillance
devices using recessive features of the RTSP protocol, so the target devices and
protocol may seem narrow and our approach may appear to lack generality. In
fact, however, our approach can be widely extended to other common protocols
(like the Network Time Protocol) and complex protocols (like industrial
protocols); evaluating these similar protocols is left as future work. Meanwhile,
reducing the number of required requests/responses (like the six irregular packets
in Fig. 1) may be another direction for improvement.
In this paper, we presented a new approach for automatically and accurately
fingerprinting surveillance devices in cyberspace. We explored new recessive
features from RTSP responses, and used neural networks and deep semi-supervised
algorithms for classification. Combining these two effective methods, we found
2,803,406 surveillance devices in the wild, accounting for 89.7% of all samples.
Most significantly, the precision and recall of our experimental results both reach
up to 93%.

Acknowledgments. Supported by the science and technology project of State Grid
Corporation of China (No. 521304190004).

References
1. NLTK: the Natural Language Toolkit. http://www.nltk.org/
2. Antonakakis, M., et al.: Understanding the mirai botnet. In: Proceedings of 26th
USENIX Security Symposium (2017)
3. Bouharb, E., Debbabi, M., Assi, C.: Cyber scanning: a comprehensive survey. In:
IEEE Communications Surveys and Tutorials. vol. 16, pp. 1496–1519 (2014)
4. Cao, K., Brbić, M., Leskovec, J.: Open-world semi-supervised learning. arXiv preprint arXiv:2102.03526 (2021)
5. Durumeric, Z., Adrian, D., Mirian, A., Bailey, M., Halderman, J.A.: A search
engine backed by internet-wide scanning. In: Proceedings of 22nd Computer and
Communications Security (2015)
6. Durumeric, Z., Wustrow, E., Halderman, J.A.: Zmap: fast internet-wide scanning
and its security applications. In: Proceedings of 23th USENIX Security Symposium
(2013)
7. Feng, X., Li, Q., Wang, H., Sun, L.: Acquisitional rule-based engine for discovering
internet-of-thing devices. In: Proceedings of 27th USENIX Security Symposium
(2018)
8. Guo, L.Z., Zhang, Z.Y., Jiang, Y., Li, Y.F., Zhou, Z.H.: Safe deep semi-supervised
learning for unseen-class unlabeled data. In: III, H.D., Singh, A. (eds.) Proceedings
of the 37th International Conference on Machine Learning Proceedings of Machine
Learning Research, vol. 119, pp. 3897–3906. PMLR (2020)
9. Kim, Y.: Convolutional neural networks for sentence classification. In: Empirical
Methods in Natural Language Processing, pp. 1746–1751 (2014)
10. Li, Q., Feng, X., Wang, H., Sun, L.: Automatically discovering surveillance devices
in the cyberspace. In: Proceedings of 8th ACM International Conference on Mul-
timedia System (2017)
11. Michael Gilleland, M.P.S.: Levenshtein distance, in three flavors. https://people.cs.pitt.edu/ (2006)
12. Schulzrinne, H., Rao, A., Lanphier, R.: Real Time Streaming Protocol (RTSP). RFC 2326 (1998)
13. Shodan: https://www.shodan.io/explore/tag/webcam
14. Wang, X., Wang, Y., Feng, X., Zhu, H., Sun, L., Zou, Y.: IoTtracker: an enhanced engine for discovering internet-of-thing devices. In: Proceedings of IEEE WoWMoM (2019)
15. Yan, X., Jacky, K., Bennin, K.E., Qing, M.: Improving bug localization with word
embedding and enhanced convolutional neural networks. Inf. Softw. Technol. 105,
17–29 (2019)
16. Yan, Z., Lv, S., Zhang, Y., Zhu, H., Sun, L.: Remote fingerprinting on internet-wide
printers based on neural network. In: Proceedings of IEEE GLOBECOM (2019)
17. Yang, K., Li, Q., Sun, L.: Towards automatic fingerprinting of IOT devices in the
cyberspace. Comput. Netw. 148, 318–327 (2019)
18. Zoomeye: https://www.zoomeye.org/
An Intrusion Detection Framework for
IoT Using Partial Domain Adaptation

Yulin Fan1,2(B), Yang Li1,2, Huajun Cui1, Huiran Yang1, Yan Zhang1,2, and Weiping Wang1,2

1 Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
[email protected]
2 School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China

Abstract. With the rapid development of the Internet of Things (IoT),
the security problem of IoT is becoming increasingly prominent. Deep
learning (DL) has achieved success in network intrusion detection systems
(NIDS) for IoT: its capability of automatically extracting high-dimensional
features from data and finding associations within the data makes it easy to
identify abnormal activity in network traffic. However, DL methods require a
large amount of labeled data, which is very time-consuming and expensive to
obtain. Due to the privacy of IoT data, it is hard to collect enough data to
train models. Also, the heterogeneity of IoT means that an NID model trained
on data collected from one IoT network cannot be directly applied to another.
To address this problem, domain adaptation (DA) has been used to transfer
knowledge from a domain with huge amounts of labeled data to a domain with
little or no labeled data. However, previous DA methods generally assume the
same label spaces for the source and target domains, which is not realistic in
the complex real environment of IoT. In this paper, we propose an NID
framework using a weighted adversarial nets-based partial domain adaptation
method to address the problem of inconsistent label spaces by mapping the two
domains to a domain-invariant feature space. The proposal can train a highly
accurate NID model through knowledge transfer from the abundant public
labeled datasets of the traditional Internet to the unlabeled dataset of IoT. In
addition, the proposed scheme can detect unknown attacks in the IoT with the
help of knowledge from the traditional Internet. Moreover, the proposed scheme
performs online detection, which makes it more suitable for real IoT
applications. The experimental results demonstrate that our proposed scheme
achieves good attack-detection performance.

Keywords: IoT · Intrusion detection · Partial domain adaptation

1 Introduction
The Internet of Things (IoT) is actively shaping the world. It combines various
sensors and end-devices with the Internet to realize the interconnection of people,
machines, and things at any time and place [1]. As the application of IoT involves
various fields and influences all walks of life, the importance of its network
security is self-evident. Due to the open deployment environment, limited
resources, inherent security loopholes of the network, and the vulnerability of IoT
terminal equipment, the harm and loss caused by attacks can be greater than in
similar situations on the traditional Internet. Therefore, research on IoT security
technology is particularly important.
Network intrusion detection (NID) is a security defense technology that actively
collects and analyzes network information to find out whether a security policy
has been violated. Much research on NIDS has been based on machine learning
(ML) and deep learning (DL). These methods capture packets at the network
layer and extract packet- or flow-based characteristics to train an NID model;
they do not depend on the underlying protocol and provide good adaptability.
At present, several public NID datasets are available for research, such as
DARPA [4] and KDD [3]. However, the privacy and distributed nature of IoT
make it difficult to collect typical IoT datasets. Therefore, some NID schemes for
IoT are based on traditional NID data rather than IoT data [8]. However, labeled
data from the traditional Internet may not be suitable for training DL models for
IoT. Moreover, due to heterogeneity, different IoT networks have different traffic
patterns. Even an NID model trained on data collected from one IoT network
often has poor generalization performance and cannot be directly applied to
another IoT network.
To address the scarcity of labeled data and the differences in data distribution,
domain adaptation (DA) has recently been used in NIDS [10–12]. DA is a branch
of transfer learning [13] that transfers knowledge gained from a source domain
with adequate labeled data to a different but similar target domain with little or
no labeled data [14]. In our case, the source domain is a large labeled NID dataset
collected from the traditional Internet, while the target domain is a relatively
smaller unlabeled NID dataset drawn from a specific IoT network. On the one
hand, our source and target domains differ, since they have different traffic
patterns due to different network protocols, architectures, and application modes.
On the other hand, the attack patterns in IoT are similar to those on the
traditional Internet, such as Man-in-the-Middle attacks and botnets. DA exploits
the similarity between the data to apply knowledge previously learned in the
source domain to the new, unknown target domain. Since most DA methods map
the source and target domains to a common domain-invariant space, the label
spaces of the two domains are required to be the same for feasible transfer. If
they have different label spaces, the effect of DA is greatly damaged, which is
called negative transfer. Ignoring the negative transfer effect leads to many false
positives and even performance degradation. Therefore, the inconsistency of label
spaces between the source and target domains must be considered.
In reality, the source and target domains often have different label spaces,
especially in our IoT research problem. For example, the traditional Internet
suffers from various attacks, including Web attacks (such as XSS and SQL
injection), DDoS, botnets, etc. Due to the large number of end-devices, IoT is
also vulnerable to DDoS and botnet attacks. However, many resource-constrained
IoT networks, such as remote meter reading and smart parking, hardly suffer
from Web attacks. So it is reasonable to assume that the attack types in the IoT
domain are a subset of those in the traditional Internet domain. Here, the classes
that both domains have constitute the shared label space, while the classes
contained only in the source domain constitute the private label space. If
knowledge covering both the private label space (like Web attacks) and the
shared label space (like DDoS) is directly transferred to the IoT, it may cause
negative transfer. Fortunately, the Partial Domain Adaptation (PDA) [18]
method has been proposed to solve the problem of inconsistent label spaces. PDA
is a DA technique that finds the common parts (the shared label space) and
restrains the private label space of the source domain, which has little
relationship with the target domain, to improve transfer performance.
In this paper, we apply a weighted adversarial nets-based PDA method [19] to NIDs for the IoT. It transfers knowledge from one dataset to another even though the label spaces of the two domains differ. It uses an adversarial network structure to map the two domains into a common domain-invariant feature space for domain adaptation, and it reduces the weights of samples from the private label space to restrain negative transfer. Our contributions are as follows:

– To the best of our knowledge, we are the first to address the problem of inconsistent label spaces between the Internet and the IoT and to apply a PDA method to NIDs for the IoT. We use PDA to train a highly accurate NID model on an unlabeled IoT dataset by transferring knowledge from the abundant public labeled datasets of the traditional Internet, in a scenario where the source and target datasets have different label spaces. To foster further research, the source code is public1.
– The proposed NID scheme can identify unknown attacks in the IoT with the help of the abundant public labeled datasets of the traditional Internet.
– We design an online NID framework, trained offline, to detect attacks automatically in real time, which is well suited to real IoT applications.
– We evaluate the methods on two available NID datasets, one from the traditional Internet and one from the IoT. Experimental results show that, with only a few unlabeled IoT samples, the proposed PDA-based NID approach achieves good attack detection performance.

1 Our code is publicly available at https://github.com/rainforest2378/IoT-PDA.git.

2 Background

2.1 Generative Adversarial Networks (GAN)

In the GAN framework [21], two models are trained simultaneously: the generator G and the discriminator D. G captures the data distribution, while D estimates the probability that a sample comes from the true distribution. G and D play a game against each other, and both are strengthened through this competition.
In computer vision, G is a network that generates images. It receives a random noise vector z and generates an image G(z) from this noise. The goal of G is to generate images that look as real as possible in order to deceive D. D takes real images and generated (fake) images as input and outputs the probability that the input x is real, i.e., that it comes from the distribution of real images. The closer D(x) is to 1, the more likely the input image is real. The goal of D is to separate the images generated by G from real images as well as possible. In this way, G and D form a dynamic “game process”, and the minimax loss of GAN can be written as:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim P_{data}}[\log D(x)] + \mathbb{E}_{z \sim P_z}[\log(1 - D(G(z)))]    (1)
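As a concrete illustration of Eq. (1), the PyTorch-style sketch below computes the two terms of the minimax objective for one batch. This is a minimal sketch under our own assumptions: the network architectures, batch size, data dimension and noise dimension are not taken from the paper and are chosen only for illustration.

import torch
import torch.nn as nn

# Illustrative generator G(z) and discriminator D(x); all sizes are assumptions.
noise_dim, data_dim, batch = 64, 128, 32
G = nn.Sequential(nn.Linear(noise_dim, 256), nn.ReLU(), nn.Linear(256, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid())

x_real = torch.randn(batch, data_dim)   # stands in for samples from P_data
z = torch.randn(batch, noise_dim)       # random noise z
x_fake = G(z)                           # generated samples G(z)

# V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))]
loss_D = -(torch.log(D(x_real)).mean() + torch.log(1 - D(x_fake.detach())).mean())  # D maximizes V
loss_G = torch.log(1 - D(x_fake)).mean()                                            # G minimizes the second term

In practice the two losses are minimized alternately with separate optimizers for G and D.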

2.2 Domain Adaptation and Partial Domain Adaptation


Domain adaptation (DA) [13] aims to address the discrepancy between the training dataset (source domain) and the test dataset (target domain). Its key idea is to learn a mapping that captures the similarity between the source and target domains. Since original DA methods rely on matching the marginal distributions of the two domains, their label spaces must be identical for the adaptation to be feasible. In realistic scenarios, however, the source domain often has a different label space from the target domain.
Partial domain adaptation (PDA) [18] is designed for this setting. A natural strategy is to strengthen the influence of source samples in the shared label space while restraining those in the private label space. J. Zhang et al. propose a novel Importance Weighted Adversarial Nets architecture for PDA [19], which builds on the idea of GAN. As shown in the middle panel of Fig. 1, it consists of two feature extractors, Fs and Ft (corresponding to the generator of GAN), and two domain classifiers, D and D0 (corresponding to the discriminator of GAN). It handles the inconsistent label spaces with the help of the first domain classifier D, which measures the importance of each source sample and assigns it an appropriate weight. For example, samples from the private label space receive lower weights to restrain negative transfer. The details are introduced in the next section.

3 The Proposed Framework


The main goal of the proposed scheme is to use PDA to learn a NID model that accurately predicts the class of each sample (benign or a specific attack) in the target domain. In our setting, the source domain is the Internet with labeled NID data, and the target domain is the IoT with insufficient, unlabeled NID data. Our proposal consists of three phases, as shown in Fig. 1. First, we process the original PCAP packets and extract flow-based statistical features to prepare the source and target data; the feature extractor Fs and the classifier are then trained on the source domain. Second, we perform PDA to train the feature extractor Ft of the target domain and two domain discriminators (D and D0). Third, an online intrusion detection model marks target samples as benign or as known attack classes, and can even identify unknown attacks.

3.1 System Model

Fig. 1. Overview of the three phases of our scheme: pre-processing and pre-training, partial domain adaptation, and online intrusion detection. The blue trapezoid Fs indicates the feature extractor of the source domain. The green trapezoid Ft represents the feature extractor of the target domain. The pink trapezoid D indicates the domain classifier that produces the weights, while the brown one D0 indicates the domain classifier that distinguishes source and target samples. A block filled with slashes indicates that the corresponding model is fixed and its parameters are not updated during the current phase. (Color figure online)

We start with the description of our system model. In this paper, the source domain with labeled data is denoted as D_s = {(x_i^s, y_i^s)}_{i=1}^{n_s}, where x_i^s is a source sample drawn from the Internet distribution P_s(x) and i is the index of the source samples. The target domain with unlabeled data is denoted as D_t = {x_j^t}_{j=1}^{n_t}, where x_j^t is a target sample drawn from the IoT distribution P_t(x) and j is the index of the target samples. n_s and n_t are the numbers of source and target samples, respectively. Our proposal focuses on settings with sufficient source data and limited target data, so n_s > n_t. We assume that the feature spaces of the two domains are the same, that is, X_s = X_t. We also assume that the attacks in the IoT are a subset of the traditional Internet attacks, so the label spaces are different and the label space of the target domain is contained in that of the source domain, that is, Y_t ⊆ Y_s. The marginal distributions of the two domains are different, namely P_s(x) ≠ P_t(x). Our task is to train a NID model for the target IoT domain D_t that predicts the labels y^t ∈ Y_t with the help of the source Internet domain D_s.

3.2 Pre-processing and Pre-training

As shown in the left part of Fig. 1, we first preprocess the raw network packets of the traditional Internet and the IoT network, and pretrain a classification model for the traditional Internet. We use the same tool to extract the same statistical features, so that the source and target domains share the same feature space. CICFlowMeter [22] is an open-source tool that generates bidirectional flows (bi-flows) from PCAP files and extracts statistical, time-related features from these flows. In general, a unidirectional flow is a set of packets sharing the same protocol type, source IP address, destination IP address, source port and destination port. A bi-flow can therefore be defined as the set of packets travelling in either direction between two endpoints. To obtain time-related statistics, bi-flows must be delimited in time. Concretely, TCP bi-flows are terminated by the end of the TCP connection (signalled by a FIN packet), while UDP bi-flows are terminated by a flow timeout. The flow timeout value can be set arbitrarily.
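To make the bi-flow definition concrete, the sketch below groups packets into bi-flows using a direction-agnostic 5-tuple key and the termination rules described above. It is only an illustration of the idea under assumed packet fields and an assumed timeout value; it is not CICFlowMeter's actual implementation.

FLOW_TIMEOUT = 120.0  # seconds; the flow timeout can be set arbitrarily

def biflow_key(pkt):
    # pkt is assumed to be a dict with "proto", "src_ip", "src_port", "dst_ip", "dst_port", "ts", "fin"
    ends = sorted([(pkt["src_ip"], pkt["src_port"]), (pkt["dst_ip"], pkt["dst_port"])])
    return (pkt["proto"], ends[0], ends[1])     # identical key for both directions of the flow

def group_biflows(packets):
    active, finished = {}, []
    for pkt in sorted(packets, key=lambda p: p["ts"]):
        key = biflow_key(pkt)
        flow = active.get(key)
        # UDP (and idle TCP) bi-flows are terminated by a flow timeout
        if flow and pkt["ts"] - flow[-1]["ts"] > FLOW_TIMEOUT:
            finished.append(active.pop(key))
        active.setdefault(key, []).append(pkt)
        # TCP bi-flows are terminated by the end of the connection (FIN packet)
        if pkt["proto"] == "TCP" and pkt.get("fin"):
            finished.append(active.pop(key))
    return finished + list(active.values())     # time-related statistics are then computed per bi-flow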
We also pre-train the NID model C(Fs(x^s)) for the source domain in this phase. The feature extractor Fs of the source domain consists of two convolutional layers (with 64 and 128 filters), two max-pooling layers, one flatten layer and two fully connected layers (with 128 and 64 neurons), activated by the ReLU function. The attack classifier C consists of a number of neurons (one per attack class of the source samples) activated by the sigmoid function. Fs extracts high-level features of the source domain, and C classifies the source samples as benign or into attack classes y^s such as DDoS, Web Attack, Port Scan, Botnet and Brute Force. Note that Fs and C will be reused (with fixed parameters) in the subsequent PDA training and online intrusion detection phases, respectively. The model C(Fs(x^s)) for the source domain is obtained by learning the parameters of Fs and C to minimize the loss function:

\min_{F_s, C} L_s = \mathbb{E}_{(x,y) \sim P_s(x,y)} L(C(F_s(x)), y)    (2)

where L_s is the cross-entropy loss of the multi-class classification task for the source domain.
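The following PyTorch sketch mirrors the described architecture of Fs and C. Several details are our assumptions rather than statements from the text: 1-D convolutions over the flow feature vector, the kernel and pooling sizes, the input length of 80 features, and the number of source classes.

import torch
import torch.nn as nn

class SourceFeatureExtractor(nn.Module):   # Fs
    def __init__(self, in_len=80):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool1d(2),
            nn.Flatten(),
        )
        flat = 128 * (in_len // 4)          # length after two poolings of size 2
        self.fc = nn.Sequential(nn.Linear(flat, 128), nn.ReLU(), nn.Linear(128, 64), nn.ReLU())

    def forward(self, x):                   # x: (batch, 1, in_len) flow feature vectors
        return self.fc(self.conv(x))        # returns 64-dimensional features z

class AttackClassifier(nn.Module):          # C
    def __init__(self, num_classes=6):      # benign + source attack classes (assumed count)
        super().__init__()
        self.out = nn.Linear(64, num_classes)

    def forward(self, z):
        return torch.sigmoid(self.out(z))   # sigmoid-activated outputs, as described in the text

Fs, C = SourceFeatureExtractor(), AttackClassifier()
# Pre-training minimizes the cross-entropy loss of C(Fs(x)) over labeled source flows (Eq. (2)).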

3.3 Partial Domain Adaptation

This phase realizes knowledge transfer with partial domain adaptation. The weighted adversarial nets-based PDA method [19] is applied to align the shared label space. Two domain classifiers, D and D0, and two feature extractors, Fs and Ft, are adopted. Fs is obtained in the pre-training phase; Ft shares the same network structure as Fs but has its own parameters. The outputs of the feature extractors are fed into D and D0, which are ordinary neural networks: D consists of two fully connected layers with 20 and 10 neurons, and D0 consists of three fully connected layers with 50, 40 and 10 neurons.
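The following sketch builds these classifiers on top of the feature extractor defined in the previous sketch. Only the hidden-layer sizes (20-10 and 50-40-10) come from the text; the input dimension of 64 and the single sigmoid output unit are our assumptions.

import torch.nn as nn

def make_domain_classifier(hidden_sizes):
    layers, prev = [], 64                     # 64 = feature dimension output by Fs / Ft (assumption)
    for h in hidden_sizes:
        layers += [nn.Linear(prev, h), nn.ReLU()]
        prev = h
    layers += [nn.Linear(prev, 1), nn.Sigmoid()]   # outputs the probability of "source domain"
    return nn.Sequential(*layers)

D = make_domain_classifier([20, 10])          # first domain classifier, used to produce the weights
D0 = make_domain_classifier([50, 40, 10])     # second domain classifier, the adversary of Ft

Ft = SourceFeatureExtractor()                 # same architecture as Fs, separate parameters
Ft.load_state_dict(Fs.state_dict())           # initializing Ft from Fs is our assumption, not stated in the text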

The training process is shown in Algorithm 1 and consists of three parts (learning D, D0 and Ft). First, Fs and Ft take samples x^s and x^t from the source and target data respectively and output features z^s and z^t. D takes z^s labelled as 1 and z^t labelled as 0 as input, and outputs D(z) = p(y = 1 | z) = σ(a(z)), where σ is the logistic sigmoid function and a(z) is the output of the last fully connected layer. The value of D(z) is the predicted probability that the sample comes from the source distribution. Maximizing the loss L_D with respect to D improves its ability to distinguish source from target samples:

\max_D L_D(D, F_s, F_t) = \mathbb{E}_{x \sim P_s(x)}[\log D(F_s(x))] + \mathbb{E}_{x \sim P_t(x)}[\log(1 - D(F_t(x)))]    (3)
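Continuing the sketches above (Fs, Ft and D already defined), one update of D according to Eq. (3) could look as follows; the optimizer, learning rate and the random stand-in batches are our assumptions.

import torch

optimizer_D = torch.optim.Adam(D.parameters(), lr=1e-3)   # assumed optimizer and learning rate
x_s = torch.randn(32, 1, 80)    # stand-in batch of source flow feature vectors
x_t = torch.randn(32, 1, 80)    # stand-in batch of target flow feature vectors
eps = 1e-7                      # numerical safety for log()

z_s = Fs(x_s).detach()          # Fs stays fixed after pre-training
z_t = Ft(x_t).detach()          # D is trained while Ft is held fixed in this step
L_D = torch.log(D(z_s) + eps).mean() + torch.log(1 - D(z_t) + eps).mean()   # Eq. (3)
(-L_D).backward()               # maximizing L_D is implemented as minimizing -L_D
optimizer_D.step()
optimizer_D.zero_grad()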

After this training, D has converged to its optimal value, so D(z) indicates how likely a sample is to come from the shared or the private label space of the source domain. An importance weight ω(z^s) is assigned to each source sample to adjust its effect on the transfer process. The weight is inversely related to D(z) [19]:

\omega(z^s) = 1 - D(z^s) = \frac{1}{P_s(z^s)/P_t(z^s) + 1}    (4)

If the ratio P_s(z^s)/P_t(z^s) is high, the sample can be discriminated almost perfectly from the target IoT domain, which means it is more likely to come from the private label space of the source Internet domain; a smaller weight ω(z^s) is therefore obtained. For example, in our case the samples belonging to the Web Attack and Brute Force classes are assigned smaller weights to restrain their effect on knowledge transfer in PDA. In contrast, a small P_s(z^s)/P_t(z^s) means the sample is more likely to come from the shared label space, such as the DDoS, Botnet and Scan classes in our case. These samples are exactly the ones we need, because they contribute positively to domain adaptation; hence higher weights are assigned to them.
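Continuing the sketch, the importance weights of Eq. (4) reduce to a single line once D has been trained. Detaching the weights so that they act as constants in the next objective, and normalizing them within a batch, are our assumptions about a reasonable implementation rather than details given in the text.

import torch

with torch.no_grad():
    w = 1.0 - D(Fs(x_s))    # omega(z^s) = 1 - D(z^s); close to 0 for private-label-space samples
    w = w / w.mean()        # per-batch normalization keeps the scale of the weighted loss stable (assumption)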
To distinguish the weighted source features z^s from the target features z^t and to optimize Ft, D0 is introduced as the second domain classifier. Ft and D0 play a two-player game to align the shared label space, that is, to reduce the shift on the shared label space. Once the importance weights have been obtained, the objective function is as follows. Maximizing the loss with respect to the parameters of D0 lets it identify the difference between the distributions of the source and target samples:

\min_{F_t} \max_{D_0} L_\omega(D_0, F_s, F_t) = \lambda \, \mathbb{E}_{x \sim P_s(x)}[\omega(z^s) \log D_0(F_s(x))] + \mathbb{E}_{x \sim P_t(x)}[\log(1 - D_0(F_t(x)))]    (5)

where λ is a trade-off parameter that controls how strongly the private label space is suppressed. Note that the importance weights ω(z^s) are applied to the source samples to restrain the effect of samples from the private label space. Minimizing the loss with respect to Ft minimizes the divergence between the weighted z^s and z^t so as to confuse D0:

\min_{F_t} L_\omega(D_0, F_s, F_t) = \mathbb{E}_{x \sim P_t(x)}[\log(1 - D_0(F_t(x)))]    (6)
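Putting the pieces of the sketch together, one adversarial step of the weighted game in Eqs. (5) and (6) could be implemented as below. The optimizers, learning rates and the value of λ are our assumptions, and the weights w come from the previous snippet.

import torch

lam = 1.0                                                  # trade-off parameter lambda (assumed value)
optimizer_D0 = torch.optim.Adam(D0.parameters(), lr=1e-3)  # assumed optimizers
optimizer_Ft = torch.optim.Adam(Ft.parameters(), lr=1e-3)
eps = 1e-7

# (a) Update D0: maximize the weighted loss of Eq. (5), i.e. minimize its negative.
L_w = lam * (w * torch.log(D0(Fs(x_s).detach()) + eps)).mean() \
      + torch.log(1 - D0(Ft(x_t).detach()) + eps).mean()
(-L_w).backward()
optimizer_D0.step()
optimizer_D0.zero_grad()

# (b) Update Ft: minimize the target term of Eq. (6) so that target features confuse D0.
loss_Ft = torch.log(1 - D0(Ft(x_t)) + eps).mean()
loss_Ft.backward()
optimizer_Ft.step()
optimizer_Ft.zero_grad()
optimizer_D0.zero_grad()    # discard the gradients that step (b) accumulated in D0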