
Generation and deployment of honeytokens in relational databases for cyber deception

This article presents a framework for generating and deploying honeytokens in relational databases to enhance cyber deception and early detection of data breaches. It addresses the limitations of existing honeytoken generation methods by utilizing a hierarchical machine learning algorithm to create synthetic data that mimics the statistical and structural properties of real databases. The proposed method aims to improve organizational security by actively monitoring sensitive records and generating alerts upon unauthorized access to honeytokens.


Computers & Security 146 (2024) 104032

Contents lists available at ScienceDirect

Computers & Security


journal homepage: www.elsevier.com/locate/cose

Generation and deployment of honeytokens in relational databases for cyber deception
Nilin Prabhaker, Ghanshyam S. Bopche ∗, Michael Arock
National Institute of Technology, Tiruchirappalli, 620015, Tamil Nadu, India

ARTICLE INFO

Keywords: Data breaches, Identity theft, Database security, Cyber deception, Honeytoken, Data synthesis, Hierarchical modeling algorithms

ABSTRACT

Despite considerable investments in database security, global statistics indicate an exponential increase in data breaches. Organizations are often unaware of data breaches for weeks, months, or even years, which is sufficient time for adversaries to compromise and exfiltrate business or mission-critical data. Recent research suggests using honeytokens for early detection of data breaches in organizations. Existing honeytoken generation methods rely on regular expressions, rule mining, constraint satisfaction, or representation learning, which are complex and limited to a few attributes. We created a framework for generating and deploying honeytokens in relational databases that actively monitor sensitive records and quickly detect data breaches and their misuse. To generate the honeytokens, we used a hierarchical machine learning algorithm that uses a recursive technique to model the parent–child relationships of multi-table databases. The proposed method enables the organization to take remedial action to reduce the impact of data breaches and complements existing database security solutions.

1. Introduction

The exponential increase in the use of database services over the Internet has enhanced the reachability of organizations worldwide. In today's digital age, managing and organizing electronically stored data effectively is crucial for the efficient operations of organizations and government entities. Databases play a vital role in the organization of such aspects. According to the structure of the data being stored, a database can be classified as hierarchical (IBM, 1960; Henry, 1969), relational (Codd, 1970), network (Saxton and Raghavan, 1990), or object-oriented (Atkinson et al., 1990). A relational database (MySQL, Oracle, SQL Server) consists of numerous related tables such as employee, customer, transaction, product, and services, which are crucial for typical organizations and government agencies. The data and information stored in organizational databases are sensitive and often considered the most valuable assets. Protecting such databases is paramount for organizations of any size (small, medium, or large). Attacks such as data exfiltration, identity theft, and compromise of credentials may lead to data loss, loss of privacy, misuse of sensitive data, damage to reputation, and customer mistrust. Among all the potential adversaries, detecting malicious insiders (Saxena et al., 2020) is challenging because they are technically skilled, highly motivated, and have access to extensive resources. The best examples are a disgruntled employee who wishes to sell sensitive information to an overseas competitor for personal benefit or a spy working for a foreign country to compromise and endanger national security. Addressing such insider threats is necessary to obtain foolproof security. Attacks on the databases of government organizations, businesses, and institutions continue to be a critical concern. According to a global report, the average cost of insider threats rose by 31% in the last few years (Saxena et al., 2020). Insider threats are one of the significant reasons behind data breaches in healthcare, finance, education, and government organizations. In addition, the global average data breach cost increased by 15% over the last three years (IBM, 2023). Data breaches resulting from compromised credentials take longer to identify, at an average of 327 days (IBM, 2023). So, active detection of data breaches is required to reduce the impact on organizations and customers.

Let us take the example of a typical enterprise network of an insurance company, as depicted in Fig. 1. It consists of numerous components such as a web server, webmail server, application server, database server, file server, network devices, users (normal and malicious, the latter shown in red), and adversaries. Among all components, the database server is crucial for the organization. The database contains personally identifiable information about employees and customers and stores business and mission-critical data. The attacker (external or internal) wants to bypass the existing security controls to steal the critical information stored in the database server. As defenders, we aim to detect data breaches using existing security controls. A plethora of

∗ Corresponding author.
E-mail addresses: [email protected] (N. Prabhaker), [email protected] (G.S. Bopche), [email protected] (M. Arock).

https://doi.org/10.1016/j.cose.2024.104032
Received 13 May 2024; Received in revised form 10 July 2024; Accepted 31 July 2024
Available online 10 August 2024
0167-4048/© 2024 Elsevier Ltd. All rights are reserved, including those for text and data mining, AI training, and similar technologies.
N. Prabhaker et al. Computers & Security 146 (2024) 104032

Fig. 1. Typical enterprise network of an insurance company. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this
article.)

security solutions are present to protect the databases in the enterprise network. Two types of threat actors can threaten the database: external malicious attackers and disgruntled internal employees. Protection mechanisms such as authentication (Chenchev et al., 2021), privilege management (Chadwick and Otenko, 2002; Botha and Eloff, 2001), firewalls (Al-Shaer and Hamed, 2003), intrusion detection and prevention systems (Scarfone et al., 2007), auditing (Natan, 2005), logging (Haerder and Reuter, 1983), classical cryptography (Diffie and Hellman, 1976; Rivest et al., 1978; Koblitz, 1987; Miller, 1985) and post-quantum cryptography (Bernstein and Lange, 2017) are standard security solutions that are effective against external threats; however, they are not efficient in dealing with insider threats (Maasberg et al., 2020), which occur when existing employees misuse the access provided by the organization. Due to the unpredictable nature of humans, detecting an attacker's behavior is challenging, and predicting the damage that potential data breaches may cause to the organization is also very difficult. Moreover, the sophistication of attacker strategies and the dependency on third parties for risk identification and implementation of security controls are significant concerns. Despite numerous risks, organizations must grant employees and business partners privileges to function efficiently. A security lapse at a single point (any user or computer in the network, as depicted in Fig. 1) may compromise all the confidential information in the entire database. Since humans are the weakest link, compromising human perception may lead to the failure of all other security controls implemented to protect the database system.

Cyber deception (Galanxhi and Nah, 2007; Han et al., 2018; Ferguson-Walter et al., 2023; Javadpour et al., 2024) is an active defense technique used to mislead the attacker towards an unreal target. In enterprises, it serves multiple purposes, ranging from understanding attackers' behavior to tricking them into spending resources and time on fake targets. Deception can be implemented at various layers, such as the system (Albanese et al., 2015), application (Izagirre, 2017), network (Spitzner, 2003a,b; Chiang et al., 2018; Qin et al., 2024), and data layers (Spitzner, 2003c; Yuill et al., 2004; Almeshekah et al., 2015; Dionysiou et al., 2021). Moreover, cyber deception can also be used to secure lightweight PUF-based authentication (Gu et al., 2020). In the age of machine learning and generative AI, there are limited but notable studies where deception has been integrated with adversarial artificial intelligence to safeguard machine learning models (Lopes Antunes and Llopis Sanchez, 2023). These approaches manipulate the decision-making processes of machines by diverting the attacker's focus and misleading their actions. Furthermore, they play a crucial role in predicting and anticipating the behavior of attackers. Dubey et al. (2022) proposed different flavors of side-channel defenses for ML models in hardware blocks based on Boolean masking techniques. In this work, our focus is on data-level cyber deception, heavily used as a proactive cyber defense to mislead, confuse, or slow down potential attackers. Different types of honeytokens, such as fake login credentials, fake emails, and fake credit card numbers, have been proposed to determine whether the critical resources available in the organization are compromised. Honeytokens are fake digital data artifacts that are purposefully inserted into a database or password file to track down unauthorized access to sensitive data. We want to create and deploy indistinguishable honeytokens that mislead the attacker and waste their resources and time. Moreover, the honeytokens must be distributed throughout the organizational network to detect data breaches. Each access to a honeytoken will generate an alert to the administrator indicating a database breach or misuse of sensitive data.

Numerous methods have been proposed in the literature to generate honeytokens in the form of fake user accounts, fake emails, fake credit card numbers, fake records, fake tables, or even fake databases. The authors used various methods to generate honeytokens, focusing on character replacement (Juels and Rivest, 2013), rule mining (Cenys et al., 2005; White, 2009), constraint satisfaction (Onaolapo et al., 2016), or representation learning (Bengio et al., 2013) techniques. However, the methods used to create the honeytokens are specific to a particular database; they need manual intervention and are limited to a single table. A typical enterprise database consists of numerous tables related to each other. Fake records generated for individual tables look distinguishable, and if adversaries exploit the database, they can easily separate the fake records from the real ones. To increase the believability of honeytokens, we generate synthetic data for all the related tables present in a chosen relational database. The defender uses the generated synthetic data as honeytokens to detect organizational data breaches and the active misuse of sensitive data. There are numerous methods in the literature for creating synthetic records, which are used for completing partial tables, data augmentation, and anonymization in various domains such as healthcare and the US Census. However, there is no work in the literature on the detection of data breaches using synthetic data. We have used the HMA synthesizer available in the SDV (Patki et al., 2016) library to generate the synthetic data, which is best suited for generating enterprise databases. The HMA synthesizer works for multi-table synthetic data generation


by resembling a database's statistical and structural properties. After successful generation, the administrator deploys the synthetic data (fake database) to detect data breaches in the organization. The alert generation module will monitor the deployed honeytokens and generate an alert to the administrator. To fulfill all the requirements, we have proposed a deception-based framework that takes the database as input and handles honeytoken generation, deployment, and monitoring; finally, we evaluated the generated honeytokens using the human evaluation technique.

The rest of the paper is organized as follows: Section 2 explores the deception techniques and the use of honeytokens in monitoring and detecting the misuse of sensitive information stored in the database. We discuss the overall workflow of the proposed method in Section 3, along with the deployment strategies for the generated honeytokens. The dataset used for our experiment is described in Section 4. Section 5 deals with the experiments conducted to obtain realistic honeytokens and their deployment, along with monitoring and generation of alerts. Section 6 discusses the challenges we faced during the experiment, the limitations of the techniques used to generate honeytokens, and the future scope. Finally, we conclude our work in Section 7.

2. Related work

Deception has been a survival tactic in nature and warfare for millions of years. Living organisms use various deception strategies, including mimicry, camouflage, and nocturnality, as defensive and survival techniques. Deception has likely been a tactic to mislead the enemy in military operations. The Russo-Ukrainian war is the best example of deception in warfare, where Ukrainian soldiers have used decoy tanks to confuse the object detection techniques of Russian missiles (The Washington Post, 2023). Usually, cyber attackers use deception strategies to play with human perception (Butavicius et al., 2022) to compromise accounts, devices, and software. In contrast, the defender employs deception to safeguard systems by diverting adversaries towards false or decoy targets. The early application of deception tactics in computer defense was pioneered by Cliff Stoll (Stoll, 1989), who employed fictitious files with enticing names, and Eugene Spafford (Spafford, 2011), who utilized the Unix sparse file structure as a deceptive measure. Fred Cohen (Cohen, 1998) developed the first publicly available tool for computer defense, known as the "Deception Toolkit".¹ Usually, organizations use deception strategies to enhance existing security controls. Numerous deception techniques have been proposed in the literature to protect systems, software, networks, or data at different attack phases of advanced persistent threats (APT) (Tankard, 2011). Honeytokens (Spitzner, 2003c) are remarkably flexible: they apply to all phases of the cyber kill chain, take the form of different digital objects, and can be placed anywhere in the organizational environment. Even if the attacker identifies a honeytoken in one place, they may still be trapped somewhere else. The different forms of honeytoken include honeyfiles, honeyaccounts, honeypasswords, honeydata, honeytables, and honeyIPaddresses.

¹ http://all.net/dtk/

In this work, we focus on active data breach detection in organizational databases. The use of deception in database security started with the concept of the honey server, proposed by Cenys et al. (2004), which mainly focused on intruders searching the database or trying to access the databases. The honey server was deployed as a honeypot to analyze the activity of attackers in the network. However, the attacker had to interact with the deployed honeypots for an alert to be generated. If attackers bypass the honeypot, the detection of adversaries will be challenging. To solve the problem of limited interaction in the environment, Cenys et al. (2005) used the concept of honeytokens in their subsequent work. The deployed honeytoken is nothing but fake data in the database, used to confuse the attacker and to generate an alert to the administrator for each access to the honeytoken by a genuine user or an attacker. The proposed method detects internal malicious activity in an organization in three different ways. At first, the authors tried to incorporate honeytokens with a pipeline function, then used triggers, and finally incorporated them with fine-grained auditing. However, the implementation of honeytokens was specific to the Oracle 9i (9iR2) database server.

To create honeytokens for personally identifiable information (PII) stored in a database, White (2009) introduced a statistics-based technique that relies on analyzing real databases to understand the distribution, value ranges, and frequency of the data. It also considers the dependencies between various attributes to generate realistic decoy data. The author generated personally identifiable information and original records to detect insider threats. However, the generated honeytokens are biased towards the frequent entries present in the database. Bercovitch et al. (2011) proposed a framework that automatically takes a table as input to generate a deceptive table. It extracts the patterns in the records, generates honeytokens, and evaluates the generated records using a likelihood calculation. The problem with their approach is that the attacker can also extract the rules to filter out the fake records, so we need believable honeytokens. Padayachee (2015) proposed honeytoken deployment strategies for database security using the AspectJ (Eclipse Foundation, 2023) programming language. The author developed an environment capable of augmenting the existing database system. Moreover, it is easy to maintain and portable. The primary focus was on insider threat detection. However, it is necessary to update the data with environmental changes. To prevent off-site password discovery, Almeshekah et al. (2015) created Ersatzpasswords, which stop password cracking and detect password leakage using a physically unclonable function (PUF) or a hardware security module (HSM) at the authentication server. However, the proposed work is limited to password attributes and their hashes. Shabtai et al. (2016) conducted a behavioral study focusing on data misuse detection using honeytokens and on the number of deployed honeytokens required to achieve optimum performance. The authors highlighted several future directions for the proposed approach. First, they emphasized the need for more comprehensive rules that accurately reflect real database constraints and logic to generate more convincing honeytokens. Additionally, they suggested evaluating the balance between the detection rate and the associated costs of deploying honeytokens. Our proposed work addresses the problem of generating believable fake tables using synthetic data generation techniques and explores their feasibility as honeytokens.

A series of experiments was conducted by Wang et al. (2018) to analyze the security obtained by the Honeygen system developed by Juels and Rivest (Juels and Rivest, 2013). The authors reveal that all methods used in the Honeygen system fail against trawling guessing, advanced trawling guessing, and targeted guessing attacks. Furthermore, the authors suggest that even honeyword techniques fail to provide the expected security, and generating believable decoy passwords is challenging. To solve this problem, Dionysiou et al. (2021) proposed a framework to create indistinguishable passwords (honeywords). The authors leveraged a representation learning (Bengio et al., 2013) technique to create high-quality passwords. However, the tables storing personally identifiable information are also very important and may contain mission-critical information. The major problem arises when attackers bypass the authentication: they will dump the database to get all the actual records of the tables present in the enterprise network. Detection of such data theft and misuse is crucial for the organization. A deep learning-based privacy-preserving honeydata generation technique was proposed by Abay et al. (2019), which tries to mislead the potential attacker away from the actual target. Although their approach is accurate, significant care must be taken to deploy the generated honeytokens in a natural environment. Moreover, the proposed method was limited to a single table, and knowledge about the number


Table 1
Database Honeytoken Generation Techniques.
Author(s) | Proposed work | Technique(s) | Limitation(s)
Cenys et al. (2004) | Fake Database Server | Database objects with different names | Interaction with fake database is mandatory to detect the insider
Cenys et al. (2005) | Fake Records | Incorporated honeytoken with pipeline function, triggers, and fine-grained auditing | Limited to specific database
White (2009) | Fake Records (PII) | Statistical methods such as frequency of records, range of numerical values etc. | Biased towards frequent entries in the database
Bercovitch et al. (2011) | Automatic generation of Honey record | Rule Mining | Finding unique patterns in data is challenging
Juels and Rivest (2013) | Honeywords: Making Password-cracking detectable | Character Replacement (Chaffing) | Need of additional hardened computer server
Padayachee (2015) | Change the attributes at runtime | AspectJ programming | Necessary to change in data with change in environment
Almeshekah et al. (2015) | Ersatzpasswords | Character or Word Replacement | Limited to password
Shabtai et al. (2016) | Behavior Analysis of honeytokens | Rule Mining | Unbelievable honeytoken
Onaolapo et al. (2016) | Honey Accounts to monitor the activity of cybercriminals | Constraint Satisfaction | Limited to 100 gmail accounts
Abay et al. (2019) | Privacy preserving Honeydata generation | Deep Learning | Limited to single table and requirement of columns details in advance
Wang et al. (2020) | A Security analysis of Honeywords | Character Replacement | Generation of indistinguishable honeytoken is very challenging
Dionysiou et al. (2021) | Honey password | Representation learning | Limited to few attributes, sometimes indistinguishable from real records

of columns and the types of the columns in a given dataset must be known in advance. The proposed work tries to solve the problem by generating synthetic data for all the related tables of the database and deploying them as honeytokens. A summary of database honeytoken generation techniques and their limitations is given in Table 1.

Numerous techniques have been proposed in the literature for generating synthetic data, serving various purposes across different domains. These purposes include predicting diseases (Che et al., 2017), maintaining the accessibility and privacy of electronic health records (EHR) (Choi et al., 2017), forecasting drug exposure (Yahi et al., 2017), safeguarding against machine learning-based attacks (like re-identification and membership inference) (Park et al., 2018), and ensuring differential privacy (Jordon and Yoon, 2018) in the generated data. Chen et al. (2019) proposed a GAN-based tabular data augmentation technique to complete the partial table released by the US Census Bureau for further analysis and investigation. The synthetic data satisfies the statistical properties of the actual data and the functional dependencies between tables. The applications of synthetic data generation are diverse. It compensates for the unavailability of data in hospitals. Industrial organizations use it to train and test machine learning and deep learning models. Moreover, it reduces confidentiality and privacy issues in clinical trial research. Synthetic data can also ensure the accuracy and reliability of risk models while preserving privacy. Likewise, numerous examples are present in the literature that address the scarcity of data in various domains. However, no work in the literature focuses on the use of synthetic data to detect the breach and misuse of sensitive information present in organizations and government entities.

In this work, we use synthetic data as honeytokens for active detection of data breaches and misuse of sensitive information. A typical enterprise database is an organized collection of information or data stored electronically in a computer system. It consists of a collection of tables that store interrelated data. So, there is a need for a technique that reproduces the structural and statistical properties of interrelated tables. We use the HMA synthesizer to solve this issue, as it is best suited for multi-table synthetic data generation. Organizations use the generated data as honeytokens to generate alerts on each access. In the next section, we discuss how we have incorporated the Hierarchical Modeling Algorithm to generate the honeytokens, along with their deployment strategies.

3. Honeytoken generation and deployment

The overall workflow of the proposed framework is depicted in Fig. 2. The objective of the proposed framework is to generate indistinguishable fake records (honeytokens) for all the tables in the database, followed by effective deployment of the honeytokens and their monitoring for active detection of data breaches and misuse of sensitive information. It starts with the extraction module, which extracts the tabular data and their metadata from the database. The output of the extraction module is the input to the honeytoken generation module, which consists of an HMA synthesizer that generates the synthetic records for all the tables. After that, the evaluation module takes the generated synthetic data and metadata as input, compares the records with the original entries, and checks the honeytoken quality. To choose believable honeytokens, we ran the experiment multiple times and compared the reports obtained from the evaluation module. We selected the best result based on the evaluation score and saved the generated synthetic records (honeytokens) in tabular format. For active detection of data breaches, the generated honeytokens need to be deployed properly; they must be indistinguishable and mislead the adversary towards an unreal target. The deployment module supports three types of deployment strategies at different levels of abstraction. Furthermore, the monitoring and alert generation module monitors the deployed honeytokens and generates an alert to the administrator upon each access. We discuss each module in detail in the subsequent sections.

Fig. 2. Workflow of proposed synthetic (fake) tabular data generation and monitoring Framework.

3.1. HMA synthesizer

Numerous techniques have been proposed in the literature to generate synthetic data. Among all the models, the HMA synthesizer meets our requirement to generate synthetic relational data from multiple tables present in the database. The HMA synthesizer is a module present in the SDV library that uses a hierarchical machine-learning model to learn the patterns present in the actual data. It is designed to capture the statistical and structural properties of the records present in the database tables. The model takes actual data and multi-table metadata as input and generates synthetic data from them. Essentially, the SDV (Montanez et al., 2018) library consists of three modules: copula, reversible data transform (RDT), and the SDV module, developed by Patki et al. (2016), as depicted in Fig. 3. A copula is a modeling technique that describes a multivariate distribution through the marginal cumulative distribution of each column together with the dependence between columns. It takes numeric data as input and creates a copula model that can generate samples by learning the statistical properties of the actual database. To handle data in the tables that is not in numeric format, the SDV uses the RDT library, which transforms the categorical and textual columns into numerical format and reverse-transforms them to the original format. The SDV library recursively traverses the tables and applies adequate modeling techniques to the respective columns. This helps in learning the relationships among the tables present in the database. Fig. 3 shows the interaction among the modules present in the SDV library.
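The copula idea described above can be illustrated with a minimal sketch. It assumes a Gaussian copula over two numeric columns and uses only the Python standard library; the source does not specify the copula family, and this is not the SDV implementation. The function names (`pearson`, `to_normal_scores`, `from_normal_score`, `sample_gaussian_copula`) and the age and salary values are hypothetical and chosen only for illustration.

```python
import random
from bisect import bisect_right
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def to_normal_scores(values):
    """Map each value to a normal score through the column's empirical CDF."""
    ordered = sorted(values)
    # Dividing by n + 1 keeps every probability strictly inside (0, 1).
    return [STD_NORMAL.inv_cdf(bisect_right(ordered, v) / (len(ordered) + 1))
            for v in values]


def from_normal_score(z, ordered):
    """Map a normal score back to the data scale via the empirical quantile."""
    index = int(STD_NORMAL.cdf(z) * len(ordered))
    return ordered[min(index, len(ordered) - 1)]


def sample_gaussian_copula(col_a, col_b, n_samples, seed=0):
    """Fit a two-column Gaussian copula and draw synthetic (a, b) rows."""
    rng = random.Random(seed)
    rho = pearson(to_normal_scores(col_a), to_normal_scores(col_b))
    rho = max(-1.0, min(1.0, rho))  # guard against floating-point overshoot
    ordered_a, ordered_b = sorted(col_a), sorted(col_b)
    rows = []
    for _ in range(n_samples):
        z_a = rng.gauss(0.0, 1.0)
        # Second score is correlated with the first (2x2 Cholesky factor).
        z_b = rho * z_a + (1.0 - rho * rho) ** 0.5 * rng.gauss(0.0, 1.0)
        rows.append((from_normal_score(z_a, ordered_a),
                     from_normal_score(z_b, ordered_b)))
    return rows


# Hypothetical "real" columns: employee age and salary.
ages = [23, 31, 35, 42, 48, 55, 60]
salaries = [40000, 58000, 52000, 61000, 82000, 75000, 90000]
synthetic_rows = sample_gaussian_copula(ages, salaries, n_samples=200, seed=1)
print(synthetic_rows[:3])
```

The sketch keeps each column's marginal distribution by resampling observed values and carries the dependence between columns through a single correlation parameter. A production synthesizer would typically fit smooth marginals instead of reusing real values, which matters for honeytokens because reused values could expose genuine data.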

3.2. Algorithm

The synthetic data generation process consists of numerous steps, as


depicted in Algorithm 1. In the first step, we took the database as input
in our framework and extracted the actual data in ‘CSV’ format. The
real data consists of five tables: employee, department, department em-
ployee, department manager, and salary. The employee table contains
the employee’s personal information (emp_no, first_name, last_name
birth_date, gender, and hire_date). The department table has the at-
Fig. 3. Interaction among the modules present in SDV library.
tributes such as dept_no and dept_name. The Dept-Emp table contains Source: Adopted from Montanez et al. (2018)
information such as emp_no, dept_no, from_date, and to_date. Similarly,

5
N. Prabhaker et al. Computers & Security 146 (2024) 104032

the department manager table contains the same attributes to maintain the manager records, while the salary table consists of four attributes: emp_no, amount, from_date, and to_date. The pre-processing module takes the real data as input. We added the attributes mobile_number, email, and password to the employee table. The tables are passed as input to the metadata extraction module of the SDV library. Since the SDV library only supports numerical data, we converted all date columns into numerical format. After that, we compared the obtained metadata with the original structure of the database to check its validity. We added constraints to the metadata using the add_constraints function available in the SDV (Montanez et al., 2018) library. The updated metadata and the real data are passed as input to the HMA synthesizer, which learns the inter-table relationships among the tables. The sample function generates the synthetic (fake) data for all the tables. The synthesizer had difficulty generating attributes such as email and password for the employee table, producing values that look unbelievable. The post-processing module therefore updates the records of the generated tables to make them believable. The synthetic data for all the interrelated tables follows statistical properties similar to those of the real database. Finally, we evaluate the result using the metrics available in the SDV (Patki et al., 2016) library, such as the diagnostic and data quality reports. Moreover, to compare the statistical properties of real and synthetic data, we used the table-evaluator library² (Brenninkmeijer, 2024). We then deployed the synthetic (fake) data using the strategies explained in Section 3.3. Although we discuss deployment and monitoring strategies in this paper, our main focus is the generation of believable honeytokens. Monitoring the deployed honeytokens is required to detect data breaches and misuse of sensitive information. We discuss the monitoring and alert generation modules in Section 5.6. In conclusion, the algorithm is designed to generate a specified number of fake records to deploy in the database to detect data breaches. It uses a hierarchical machine learning model to learn the statistical properties of the data and the relationships among tables in the original database.

² https://github.com/Baukebrenninkmeijer/table-evaluator/tree/master

Algorithm 1: Honeytoken Generation
Input: Sample Relational Database (MySQL) with 'n' Tables
Output: Honey Database (MySQL)
Function GenerateHoneyDataBase(Database):
    real_data ← DataAcquisition(Database);
    processed_data ← Preprocessing(real_data);
    metadata ← Metadata_extraction(processed_data);
    valid ← isValid(metadata);
    if ! valid then
        Update_Metadata(metadata);
    end
    Synthesizer ← HMA Synthesizer();
    Synthesizer.fit(processed_data, metadata);
    synthetic_data ← Synthesizer.sample(scale = 1.5);
    synthetic_data ← Postprocessing(synthetic_data);
    eval_result ← Evaluation(real_data, synthetic_data);
    Fake_Database ← synthetic_data;
    return Fake_Database;

3.3. Deployment strategies

We explored the honeytoken generation method in the previous subsection. We now discuss strategies for the effective deployment of the generated honeytokens. The honeytokens must be deployed so that the attacker cannot distinguish the fake objects deployed in the organization. Deployment enhances the administrator's ability to monitor sensitive data and to detect breaches in a timely manner. The strategies for deploying honeytokens in the database are categorized into three levels: honey database, honey tables, and honey records. To deploy honeytokens at the database level, we can use various fake objects such as tables, views, schemas, and log files. All of these objects work as honeytokens, and each user interaction with them generates an alert to the administrator. Moreover, a honey database can be used as a honeypot to analyze the behavior of adversaries. Table-level honeytokens can be achieved by deploying fake tables with attractive names in the original database. The attractive names lure the attacker towards the fake tables. Compared to database-level honeytokens, this strategy is more straightforward to generate but challenging to deploy inside the real database.

The other approach we use to deploy honeytokens is inserting fake records into existing database tables. Record-level deployment can be achieved using numerous strategies, depending on the availability of resources and the infrastructure. This method increases the probability of interaction with the attacker. When adversaries want to use the records in a table, they lack complete information about the records (they are unaware of the environment), so it is difficult for them to distinguish whether the record they use is genuine or fake. For example, if an adversary accesses all the records of a table to send phishing mail, the mail will also be delivered to the fake email IDs. Because this is unexpected access to the records, the administrator will notify the employees and take immediate action to reduce the impact on the organization. The deployment strategies help the defender against various attack vectors. Suppose the adversary has exfiltrated the database of an organization and wants to sell the data on the Darknet. Before doing so, they have to separate out the original data, which costs them time and resources. There is also a possibility that the attacker loses reputation in the Darknet community. If the adversary tries to log into the system with stolen credentials to increase the access privilege of a user account, the deployed honeytoken will generate an alert for the administrators. Likewise, if the adversary tries to obtain services using sensitive information such as email, credit card number, names, or social security number, the deployed honeytoken (fake email) mapped to the monitoring module will generate an alert for the administrators. Honeytokens therefore help in the quick detection of data breaches and misuse of sensitive information.

4. Sample database

Due to the lack of availability of real-time relational data, we used the relational database generated by Fusheng Wang and Carlo Zaniolo at Siemens Corporate Research, publicly available on GitHub (Giuseppe and Patrick, 2023) in '.sql' format. The database contains information about 10,000 employees, their departments, and salary details. It consists of six tables (employee, department, Dept_Emp, Department_Manager, salary, and employee_title) connected by foreign keys. We excluded the employee_title table from our experiment. Table 2 summarizes the number of records and attributes of each table in the 'employee-details' database selected for our experiment. The employee table in the original database consists of employee_number (emp_no), first_name, last_name, birth_date, gender, and hire_date. We added sensitive attributes such as mobile_number, email, and password to the employee table to provide the information required for the experiment. The department table contains department_no and department_name. The Department employee and Department_Manager (dept-mgr) tables store information about the tenure of an employee in a particular department and the tenure of a department manager in the organization, respectively. Finally, the salary table records the amount credited to the employee for a particular duration. The relationships among the tables and their attributes with data types are depicted in Fig. 4.


Fig. 4. Tables with columns and data types. The relationships among tables are shown by the connecting arrows.
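The foreign-key structure of the sample database can be written down explicitly and used to check that generated child records still point at existing parent records. The sketch below is illustrative only: the relationship map is an assumption inferred from the shared column names (emp_no, dept_no) described in Section 4, the lower-case table names are placeholders, and `find_orphans` is a hypothetical helper rather than part of the authors' framework.

```python
# Foreign-key relationships ASSUMED from the shared column names in Section 4:
# (child_table, child_column) -> (parent_table, parent_column).
RELATIONSHIPS = {
    ("dept_emp", "emp_no"): ("employee", "emp_no"),
    ("dept_emp", "dept_no"): ("department", "dept_no"),
    ("dept_manager", "emp_no"): ("employee", "emp_no"),
    ("dept_manager", "dept_no"): ("department", "dept_no"),
    ("salary", "emp_no"): ("employee", "emp_no"),
}


def find_orphans(tables):
    """Return (child_table, column, value) for child rows with no matching parent row."""
    orphans = []
    for (child, child_col), (parent, parent_col) in RELATIONSHIPS.items():
        parent_keys = {row[parent_col] for row in tables.get(parent, [])}
        for row in tables.get(child, []):
            if row[child_col] not in parent_keys:
                orphans.append((child, child_col, row[child_col]))
    return orphans


# Tiny illustrative data set; each table is a list of row dictionaries.
tables = {
    "employee": [{"emp_no": 10001}, {"emp_no": 10002}],
    "department": [{"dept_no": "d005"}, {"dept_no": "d007"}],
    "dept_emp": [
        {"emp_no": 10001, "dept_no": "d005"},
        {"emp_no": 10002, "dept_no": "d007"},
        {"emp_no": 10999, "dept_no": "d005"},  # points at a missing employee
    ],
    "dept_manager": [],
    "salary": [{"emp_no": 10001, "amount": 88958}],
}
orphans = find_orphans(tables)
```

A check of this kind is one way to confirm that honey records inserted into child tables remain consistent with the parent tables they are meant to support.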

Table 2
Tables in employee-details database and their size.
Sr. No.  Table               Records × Columns
1        Employee            10000 × 6
2        Salary              94917 × 4
3        Department          9 × 2
4        Dept-Emp            11051 × 4
5        Department-Manager  24 × 4

5. Experiment and result

Our experiment focuses on the generation of believable synthetic records for financial organizations. Synthetic records can be used as honeytokens to detect data breaches or the misuse of critical information. We implemented Algorithm 1, presented in Section 3, to generate the synthetic data. Then, to detect data breaches, we deployed the honeytokens in the database using the strategies described in Section 3.3. To monitor the deployed honeytokens, we used hash-based authentication techniques that detect the use of honeytokens to enter the system. Moreover, we closely monitor the activity of the deployed personally identifiable honeytokens. We now discuss the steps used in the proposed framework for generating, deploying, and effectively monitoring records.

5.1. Data acquisition

Data acquisition, commonly known as gathering or collecting data, is a critical phase in machine learning. The proposed technique uses database tables as its input data. We used Python libraries such as MySQL-connector and Pandas to obtain the necessary tables and prepare them for machine learning techniques. We used the MySQL connector library to extract the data from the database and saved all the database tables in '.csv' format. Sample records of the different tables of the database are shown in Tables 3–7. Because it is difficult to predict the age of a department in an organization, the employees who are currently working in a department have the default value (1/1/9999) for the to_date attribute, as in the department manager table (Table 7).

5.2. Pre-processing

To fulfill the initial requirement of the experiment, we extended the employee table in the database. Using regular expressions,

Table 3
Pre-processed Sample records of Employee table.
Emp_no Birth_date First_name Last_name Gender Hire_date Mobile_number Passwrd Email
10001 9/2/1953 Georgi Facello M 6/26/1986 6260399536 facello@9604 [email protected]
10002 6/2/1964 Bezalel Simmel F 11/21/1985 7803800842 Bsimmel@6928 [email protected]
10003 12/3/1959 Parto Bamford M 8/28/1986 6260399526 Bamford_898 [email protected]
10004 5/1/1954 Chirstian Koblick M 12/1/1986 9926564079 Ckoblick3876 [email protected]
10005 1/21/1955 Kyoichi Maliniak M 9/12/1989 6260667877 maliniak@0110 [email protected]
10006 4/20/1953 Anneke Preusig F 6/2/1989 9926576226 anneke@99 [email protected]
10007 5/23/1957 Tzvetan Zielinski F 2/10/1989 8112200997 Tzielinski28 [email protected]
10008 2/19/1958 Saniya Kalloufi M 9/15/1994 9399446294 Skalloufi@299 [email protected]
10009 4/19/1952 Sumant Peac F 2/18/1985 9993229841 sumant@1910 [email protected]
10010 6/1/1963 Duangkaew Piveteau F 8/24/1989 9755075358 Dpiveteau@2874 [email protected]
10011 11/7/1953 Mary Sluis F 1/22/1990 9926564090 sluis@9670 [email protected]
10012 10/4/1960 Patricio Bridgland M 12/18/1992 8359714129 patricio@2193 [email protected]
10013 6/7/1963 Eberhardt Terkki M 10/20/1985 8112200947 eberhardt@2316 [email protected]
10014 2/12/1956 Berni Genin M 3/11/1987 6260667969 berni@0497 [email protected]
10015 8/19/1959 Guoxiang Nooteboom M 7/2/1987 6260667946 guoxiang@791 [email protected]
… … … … … … … … …
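The Passwrd and Email values in Table 3 are derived from personal attributes such as names and mobile numbers. A minimal sketch of how such values could be produced is given below. The specific patterns, the helper names `make_password` and `make_email`, and the `example.com` domain are assumptions for illustration; the paper does not list the exact patterns it uses. Here the regular expressions only validate the shape of the generated values.

```python
import random
import re


def make_password(first_name, last_name, mobile_number, rng):
    """Pick one of several name-based password patterns at random (hypothetical patterns)."""
    methods = [
        lambda: f"{last_name.lower()}@{mobile_number[-4:]}",
        lambda: f"{first_name[0].upper()}{last_name.lower()}@{rng.randint(100, 9999)}",
        lambda: f"{first_name.lower()}@{rng.randint(10, 9999)}",
    ]
    return rng.choice(methods)()


def make_email(first_name, last_name, rng, domain="example.com"):
    """Pick one of several name-based local parts at random (hypothetical patterns)."""
    methods = [
        lambda: f"{first_name.lower()}.{last_name.lower()}",
        lambda: f"{first_name[0].lower()}{last_name.lower()}{rng.randint(1, 99)}",
    ]
    return f"{rng.choice(methods)()}@{domain}"


# Regular expressions describing the shapes produced above.
PASSWORD_RE = re.compile(r"^[A-Za-z]+@\d{2,4}$")
EMAIL_RE = re.compile(r"^[a-z]+(\.[a-z]+)?\d{0,2}@example\.com$")

rng = random.Random(7)  # seeded so that runs are repeatable
password = make_password("Georgi", "Facello", "6260399536", rng)
email = make_email("Georgi", "Facello", rng)
```

Choosing among several patterns at random, as Algorithm 2 does with its list of methods, avoids a single uniform format that would make the added columns easy to recognize.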


Table 4
Sample records of Salary table.
Emp_no  Amount  From_date   To_date
10001   88 958  6/22/2002   9/28/2003
10002   65 828  8/3/1996    8/3/1997
10002   65 909  8/3/1997    8/3/1998
10003   43 311  12/1/2001   1/1/2002
10004   40 054  12/1/1986   12/1/1987
10005   91 453  9/9/2000    9/9/2001
10005   94 692  9/9/2001    1/1/2002
10006   40 000  8/5/1990    8/5/1991
10007   88 070  2/7/2002    1/1/2003
10008   46 671  3/11/1998   3/11/1999
10012   40 000  12/18/1992  12/18/1993
…       …       …           …

Table 5
Sample records of Department table.
Dept_no  Dept_name
d001     Marketing
d002     Finance
d003     Human Resources
d004     Production
d005     Development
d006     Quality Management
d007     Sales
d008     Research
d009     Customer Service

Table 6
Sample records of Dept-Emp table.
Emp_no  Dept_no  From_date  To_date
10001   d005     6/26/1986  1/1/9999
10002   d007     8/3/1996   1/1/9999
10003   d004     12/3/1995  1/1/9999
10057   d005     1/15/1992  1/1/9999
10058   d001     4/25/1988  1/1/9999
10059   d002     6/26/1991  1/1/9999
10060   d007     5/28/1989  11/11/1992
10199   d007     2/7/1998   1/1/9999
10200   d004     2/17/1994  10/3/2000
…       …        …          …

Table 7
Sample records of Department-Manager table.
Emp_no  Dept_no  From_date   To_date
10002   d001     1/1/1985    10/1/1991
10039   d001     10/1/1991   1/1/9999
10085   d002     1/1/1985    12/17/1989
10114   d002     12/17/1989  1/1/9999
10183   d003     1/1/1985    3/21/1992
10228   d003     3/21/1992   1/1/9999
10303   d004     1/1/1985    9/9/1988
10420   d004     8/30/1996   1/1/9999
10511   d005     1/1/1985    4/25/1992
…       …        …           …

we added columns for personally identifiable information such as email, password, and phone number. It is common practice among individuals to use personal information to create passwords and email IDs. This behavior stems from the convenience of using familiar information, which makes login credentials easier to remember. People often use their names, family members' names, birth dates, anniversary dates, pet names, hobbies, favorite sports teams, or other personal details as part of their passwords or email IDs. By incorporating elements from their personal lives, individuals believe they can recall their login credentials more easily without relying on external aids like password managers. Based on this analysis, we created different regular expression patterns to generate columns such as email, password, and mobile number, as depicted in Algorithm 2. Moreover, we changed the data type of the 'from_date' and 'to_date' attributes in all tables to numerical columns. Fig. 5 represents the relationship among the tables and metadata in the dataset.

Algorithm 2: EmployeeColumnExtender
Data: Employee_table
Result: Extended_Employee_table
Function Column_Extend(Employee_Columns):
    extcolumn = [ ];
    for emp_no = 10001 to n do
        mobile_number = FakerModule();
        password = createPassword(first_name, last_name, mobile_number, birth_date);
        email = createEmail(first_name, last_name, mobile_number, birth_date);
        extcolumn.append(mobile_number, password, email);
    end
    Extended_employee_table = employee_table.addcolumns(extcolumn);
    return Extended_employee_table;
Function createPassword(A1, …, An):
    methods = [M1, M2, …, Mn]; /* Different methods used to create password */
    chosen_method = random.choice(methods);
    random_Password = chosen_method(A1, …, An);
    return random_Password;
Function createEmail(A1, …, An):
    methods = [M1, M2, …, Mn]; /* Different methods used to create email */
    chosen_method = random.choice(methods);
    random_email = chosen_method(A1, …, An);
    return random_email;

5.3. Metadata extraction

In this phase, we passed all the tables ('.csv' format) as input to the multi-table metadata extraction module in the 'SDV' library, which returns the metadata in dictionary ('.json') format. The metadata consists of the table names, attributes, data types, and the various keys present in the tables that create relationships among them. After that, the metadata was validated against the original structure of the database. If there is any mismatch, we modify the metadata according to the properties of the original tables and their relationships. The updated metadata is passed as input to the synthetic data generation phase.

5.4. Generation of synthetic data

Synthetic data is supposed to follow the properties of real data while making it difficult to trace any real or individual information. According to the input provided to generate the synthetic data, it can be classified into three categories: fully synthetic, partially synthetic, and hybrid synthetic. In the fully synthetic case, the data is generated from scratch using a statistical model without taking real data as input. The partially synthetic method uses real data as input, replacing the original values using masking, perturbation, or imputation techniques. The data generated by the hybrid model combines real and synthetic records. Techniques such as sampling, augmentation, or interpolation can be used to generate the synthetic data. Synthetic data can be used for various purposes such as testing, training, research, and analysis without violating the privacy and confidentiality of real data or subjects. In this work, we use synthetic (fake) data as honeytokens to detect data breaches.

In the first stage, we collected the data from GitHub (Giuseppe and Patrick, 2023). We used Algorithm 2 to add sensitive


Fig. 5. Pre-processed tables with columns and their data types. The relationships among tables are shown by the connecting arrows.
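One step of the pre-processing reflected in Fig. 5 is the conversion of date attributes such as from_date and to_date into numerical columns. The paper does not state which numeric encoding is used, so the sketch below assumes a simple one (the proleptic Gregorian ordinal, that is, a day count) and assumes the M/D/YYYY text format shown in the sample tables. The function names are hypothetical.

```python
from datetime import date, datetime

SENTINEL = "1/1/9999"  # default to_date for tenures that are still open


def date_to_number(text):
    """Convert an M/D/YYYY string into an integer day count (assumed encoding)."""
    return datetime.strptime(text, "%m/%d/%Y").date().toordinal()


def number_to_date(value):
    """Convert a (possibly fractional) sampled number back into an M/D/YYYY string."""
    d = date.fromordinal(int(round(value)))
    return f"{d.month}/{d.day}/{d.year}"


from_num = date_to_number("6/26/1986")
to_num = date_to_number(SENTINEL)
restored = number_to_date(from_num)
```

With an ordered encoding of this kind, the constraint from_date < to_date becomes a plain numeric comparison, and the reverse mapping can be applied during post-processing to restore readable dates.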

Table 8
Sample records of synthetic Employee table.
Emp_no Birth_date First_name Last_name Gender Hire_date Mobile_number Passwrd Email
10001 3/22/1963 Krista Smith F 3/23/1989 6776838827 smiThy9 [email protected]
10002 11/16/1957 Annette Solis M 4/15/1989 7989388236 aNnette010 [email protected]
10003 10/1/1962 Corey Trevino M 9/16/1987 9253701259 Ctrevino293@ [email protected]
10004 5/2/1953 Mackenzie Roman F 12/13/1990 9952509343 manmanfrOm10 [email protected]
10005 7/12/1952 Patrick Sanchez M 9/7/1985 9639283127 patRick%1918 [email protected]
10006 9/12/1964 Jessica Foster M 7/23/1985 9870259826 faStvwR32 [email protected]
10007 3/22/1953 Brian Vazquez F 8/12/1991 8893795456 brian0 [email protected]
10008 12/20/1955 Timothy Williams F 8/29/1990 8512087738 williammuteBi [email protected]
10009 10/27/1959 Andrew Johnson F 6/15/1990 7067624171 andrew[2185 [email protected]
10010 8/2/1960 Robert Brown F 9/23/1990 9340932515 brown3y0 [email protected]
10011 4/21/1953 Jacob Lee M 12/7/1986 9970156865 jacob8 [email protected]
10012 4/4/1958 Brandon Ramirez F 3/27/1997 9970939321 brandon9 [email protected]
10013 9/22/1958 Jacob Huang F 5/11/1992 6758462749 Huang"771 [email protected]
10014 6/18/1960 Julian Torres M 10/11/1985 6676439793 toRreS9 [email protected]
10015 12/21/1962 Amanda White M 10/24/1985 6315032299 XiTxa5 [email protected]
… … … … … … … … …

records to the employee table. After that, we pre-processed the data to make it suitable for the HMA synthesizer. We then extracted the metadata of the database and applied the required constraints present in the original database for better accuracy and statistical coverage. We passed the processed data to the HMA synthesizer along with the metadata. After training the model, we generated samples of the database (scale = 1.50). Here, the scaling factor indicates the multiplicative increase in the number of records relative to the original table. For example, for a table with 100 records and a scaling factor of 1.5, the sample function generates 150 records in the fake table. This is followed by the post-processing step, which makes the generated data indistinguishable. Finally, we evaluated the generated synthetic data using the diagnostic and quality reports. Samples of the obtained synthetic data are shown in Tables 8–12.

Table 9
Sample records of synthetic Salary table.
Emp_no  Amount   From_date   To_date
10001   145 732  3/23/1999   2/12/2000
10002   56 672   9/20/1998   10/9/1998
10002   56 672   8/13/1996   9/1/1996
10003   46 885   10/27/1990  4/1/1991
10004   45 034   8/21/2000   1/5/2002
10004   45 034   9/7/1996    1/22/1998
10005   145 732  8/1/2002    4/19/2004
10005   145 732  9/28/1990   6/24/1993
10006   145 732  3/14/2002   4/19/2004
10007   35 679   3/11/1996   12/6/1998
10008   46 595   1/25/2000   6/14/2001
10009   57 228   8/1/2002    4/19/2004
10009   56 972   5/31/1997   2/25/2000
10010   79 927   8/14/1998   5/10/2001
…       …        …           …

5.5. Evaluation of result

We used the metrics available in the SDV (Patki et al., 2016) library to evaluate the generated synthetic data; these produce the diagnostic and quality reports. The diagnostic report covers three properties of the database: data validity, data structure, and relationship validity, as shown in Table 13. The average score indicates the percentage of properties captured by the tables present in the synthetic database. For data validity, the model captured 97.94% of the original data. The score of 100% for the data structure property indicates that the model reproduces the original structure of the tables. Whereas,


the relationship score (88.71%) indicates that the model faced moderate difficulties in capturing the relationship property among tables. The quality report consists of the column shape, column pair trends, cardinality, and intertable trends properties presented in Table 14. The plots for cardinality among tables, the data validation scores of the columns in the tables, and the column pair plots of the employee table are depicted in Figs. 6–8. The average score indicates the percentage of properties captured by the synthetic data.

Table 10
Sample records of synthetic Department table.
Dept_no  Dept_name
d001     Marketing
d002     Finance
d003     Human Resources
d004     Production
d005     Development
d006     Quality Management
d007     Sales
d008     Research
d009     Customer Service
d010     Loan
d011     Advertisement
d012     Audit
d013     Treasury

Table 11
Sample records of synthetic Department-Employee table.
Emp_no  Dept_no  From_date  To_date
10002   d008     8/21/1986  1/1/9999
10004   d008     3/29/1998  1/1/9999
10006   d002     4/17/1985  8/8/1985
10006   d008     6/21/1985  8/8/1985
10006   d008     3/31/1987  7/21/1996
10008   d012     9/24/1996  1/1/9999
10017   d009     8/8/1985   3/26/1989
10019   d002     8/23/1997  1/1/9999
10021   d013     2/18/2000  1/1/9999
10025   d005     8/8/1985   3/15/1994
10028   d001     9/3/1985   3/14/1988
10564   d010     4/29/1990  1/1/9999
10564   d012     1/17/1992  1/1/9999
10566   d001     4/6/1999   2/6/2002
…       …        …          …

Table 12
Sample records of synthetic Department Manager (dpt-mgr) table.
Emp_no  Dept_no  From_date   To_date
15269   d001     10/5/1989   12/22/2015
22653   d002     10/6/1994   12/5/1999
21198   d003     2/7/1991    11/8/2003
21426   d004     12/2/1993   1/1/9999
14828   d004     5/6/1993    1/1/9999
23406   d005     11/7/1990   11/19/2016
19483   d006     11/18/1989  1/1/9999
10761   d006     5/11/1989   3/21/2011
21622   d012     3/23/1990   1/1/9999
17521   d013     6/21/1991   1/1/9999
20563   d007     10/29/1988  12/19/2003
15347   d007     12/22/1988  1/1/9999
12270   d008     6/10/1991   10/27/1999
11100   d009     7/29/1993   1/1/9999
…       …        …           …

Table 13
Diagnostic Report.
Table properties  Average score (%)
Data Validity     97.94
Data Structure    100
Relationship      88.71
Overall Score     95.55

Table 14
Quality Report.
Column properties   Average score (%)
Column Shape        78.79
Column Pair Trends  94.82
Cardinality         60.92
Intertable trends   81.14
Overall Score       78.92

5.5.1. Statistical evaluation and analysis

Starting with the first evaluation, we look at the mean and standard deviation of the real and synthetic data. Our assumption is that if the synthesizer fails to capture the basic properties of tabular data, the other derived properties will likely fail as well. We plotted the log-transformed values of all the numeric columns present in the tables. Fig. 9 shows the employee table's average mean and standard deviation. The values of the average mean (Fig. 9(a)) for data in the employee table follow the line (𝑦 = 𝑥), which means that the real and fake data in the employee table have similar properties with respect to the average mean. The plot for the average standard deviation of data in the employee table (Fig. 9(b)) also follows the line (𝑦 = 𝑥). The model captures the mean and standard deviation of the real data closely. Similarly, the plots for the related tables, dept-emp and salary, are depicted in Fig. 10 and Fig. 11, respectively. The synthetic data keeps the mean and standard deviation in the same order of magnitude. The synthesizer thus retains the statistical properties of the real data.

The column-wise correlations for the real and fake employee tables, and the difference between their correlation plots, are depicted in Fig. 12. The columns of the employee table represented in Fig. 12(a) are highly correlated with the same attributes. A similar correlation pattern is shown by the fake records, as depicted in Fig. 12(b). The difference between the correlation plots of the real and fake columns, depicted in Fig. 12(c), is very low, which indicates that the columns of the real and fake tables have similar properties. It becomes clear that the HMA synthesizer learned the correlation pattern present in the original employee table.

Similarly, the column-wise correlations and their differences for the salary and dept-emp tables are depicted in Fig. 13 and Fig. 14, respectively. We observed that the synthesizer performs well and can generate realistic synthetic data. For some tables, the synthesizer has difficulty capturing correlation; however, the generated data are indistinguishable.

Principal Component Analysis (PCA) (Wold et al., 1987) is a powerful tool used to reduce the dimension of datasets while preserving crucial information. It transforms the original variables into a set of new uncorrelated variables known as principal components. Here, we took the first two principal components obtained by applying the algorithm to the tables of the database.

Fig. 15 represents the correlation between the values of the first two components of PCA for the real and fake employee tables. It is evident that the HMA synthesizer can retain the features with significant differences. In the case of the Dept-Emp table, shown in Fig. 16, the obtained plots for the real and fake tables are similar to each other. However, due to the small number of records in the dept-mgr table, the model is unable to capture the important features present in the table, as shown in Fig. 17.

The model also retains the features of the salary table, with significant differences in the range of values between the first two PCA components of the real and fake tables, as depicted in Fig. 18. This strengthens our hypothesis of generating honeytokens using an HMA synthesizer.

We used visual evaluation techniques to add insight into the generated synthetic data that is not covered by the quantitative evaluation results. We plot the cumulative sums for the columns present in each table of the database. Figs. 19–21 show the cumulative sum plots for the columns of the tables of the database. Due to less number of records


Fig. 6. Cardinality plot among employee, salary, department manager, and dept_emp tables. (a) represents the number of times an employee shifted to a different department. (b) represents the relationship between department employees and department managers. (c) represents the relationship between the department manager and the employee. (d) represents the number of times the organization has credited the salary to the employee. It is evident that the frequency of records in parent–child relationships follows the pattern for relations (c) and (d); the model has faced challenges in obtaining the cardinality for the relations shown in (a) and (b).
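The cardinality compared in Fig. 6 is the number of child rows attached to each parent row, for example the number of salary records per employee. The sketch below shows one way to compute and compare such distributions. It is a simplified stand-in and not the metric implemented in the SDV library: the similarity here is one minus the total variation distance between the two cardinality histograms, and the function names are hypothetical.

```python
from collections import Counter


def cardinalities(parent_keys, child_foreign_keys):
    """Number of child rows per parent key (0 if a parent has no children)."""
    counts = Counter(child_foreign_keys)
    return [counts.get(key, 0) for key in parent_keys]


def distribution(values):
    """Relative frequency of each distinct value."""
    total = len(values)
    return {value: count / total for value, count in Counter(values).items()}


def similarity(real_cardinality, synthetic_cardinality):
    """1 - total variation distance between the two cardinality distributions."""
    p = distribution(real_cardinality)
    q = distribution(synthetic_cardinality)
    support = set(p) | set(q)
    tvd = 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)
    return 1.0 - tvd


# Illustrative data: employee keys and the emp_no column of a salary table.
real_card = cardinalities([1, 2, 3, 4], [1, 1, 2, 3, 3, 4, 4])
synth_card = cardinalities([1, 2, 3, 4], [1, 1, 1, 2, 3, 4])
score = similarity(real_card, synth_card)
```

A score close to 1 means the synthetic parent rows have about as many children as the real ones, which is the property that the relations in panels (a) and (b) did not capture well.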

available in the department and department manager tables, it is easy to compare the data manually. The curve for the synthetic data depicted in Fig. 19(a) follows a trend similar to that of the real table. For the column (emp_no) (Fig. 19(a)) of the employee table, there is some difference in the plot due to the scaling factor (1.5) applied to generate records in the synthetic data. The other columns follow trends similar to those in the real table.

The cumulative sum plots for the columns in the salary table are depicted in Fig. 20. The synthetic data follows a trend similar to the actual table. In the case of the column (emp_no) present in the employee table, there is some difference in the plot due to the scaling factor (1.5) applied to generate records in the synthetic data. We also observed a difference in the case of the amount column; however, the model is capable of capturing the range of values in the actual data. The other columns follow similar trends and satisfy the predefined constraint (from_date < to_date) present in the actual table. Fig. 21 shows the cumulative sum trends of the real and synthetic data for each column present in the dept-emp table. Here, we observe the same trends for the emp_no column as in the other tables of the database. The synthetic data in the from_date and to_date columns of the dept-emp table has captured the pattern accurately. However, there is a difference in the dept_no attribute, which is due to the scaling factor explained above. The synthesizer is capable of generating new records for new departments that are not available in the real database.

The distribution plot compares the distribution of values of the real and synthetic data for the tables in the database. Fig. 22 shows the distribution of values for the columns present in the employee table. We observed that the synthetic data follows a similar distribution for the birth_date and hire_date columns. However, there is some difference in the emp_no and mobile_number attributes. Fig. 23 shows the probability distribution plots for the numerical columns of the real and fake salary tables. The generated synthetic emp_no follows the real emp_no curve. The model captured the pattern moderately well and reproduced the peak values for the from_date, to_date, and amount attributes. The probability distribution plots for real and fake data in the numerical and categorical columns of the dept-emp table are depicted in Fig. 24. The model reproduces the properties of the real data very well. We observed the creation of a few new departments in the dept_emp table due to the scaling factor (1.5) used to generate the sample data.

5.6. Monitoring and alert generation

We now discuss the functionality of the monitoring and alert generation module. It consists of phases such as authentication, redirection, and alert generation.

• Authentication: The validation of login credentials is a crucial component of any organization. After the indistinguishable honeytokens are deployed to the database, the authentication module handles the authentication process for real and fake users. Distinguishing honeytokens from actual records is a significant problem. To overcome this issue, we recommend adding an extra field to the database table that contains hashes of the primary key attribute and distinguishes honeytokens from genuine entries. The approach we propose inserts extra characters into the original value of the primary key attribute. The additional characters must be added before the hashing procedure. We


Fig. 7. Data validation plot for Dept-Emp, Salary and Employee tables. (a) represents the Boundary Adherence property of the from_date and to_date columns of the department employee table. (b) represents the Boundary Adherence property of the amount, from_date and to_date columns of the salary table. (c) represents the scores for the Key Uniqueness (emp_no), Boundary Adherence (birth_date, hire_date, mobile_number), and Category Adherence (gender) properties of the employee table. It is evident that the numerical columns in the synthetic database tables follow constraints such as Key Uniqueness, Boundary Adherence, and Category Adherence present in the real database.
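The hash-based marking used by the authentication phase of Section 5.6 can be sketched as follows. This is a minimal illustration under stated assumptions and not the authors' implementation. The marker strings, the idea of using a different marker for genuine records, the in-memory table, and the function names are all hypothetical. Only the general scheme comes from the text: extra characters are added to the primary key, the result is hashed with SHA-256, and the hash is stored in an extra field. Passwords are compared in plain text here purely to keep the sketch short.

```python
import hashlib
import hmac

HONEY_MARKER = "#hny"  # hypothetical secret extra characters for honeytokens
REAL_MARKER = "#usr"   # hypothetical extra characters for genuine records


def tag(emp_no, marker):
    """SHA-256 of the primary key with the extra characters appended."""
    return hashlib.sha256(f"{emp_no}{marker}".encode("utf-8")).hexdigest()


def is_honeytoken(emp_no, stored_tag):
    """Recompute the honeytoken hash and compare it with the stored extra field."""
    return hmac.compare_digest(stored_tag, tag(emp_no, HONEY_MARKER))


def authenticate(emp_no, password, table, alerts):
    """Return 'denied', 'granted', or 'redirect_to_honeypot' and record alerts."""
    row = table.get(emp_no)
    if row is None or row["password"] != password:
        return "denied"
    if is_honeytoken(emp_no, row["tag"]):
        alerts.append(f"honeytoken login attempt: emp_no={emp_no}")
        return "redirect_to_honeypot"
    return "granted"


# Illustrative table with one genuine record and one honey record.
table = {
    10001: {"password": "facello@9604", "tag": tag(10001, REAL_MARKER)},
    20001: {"password": "smiThy9", "tag": tag(20001, HONEY_MARKER)},
}
alerts = []
real_result = authenticate(10001, "facello@9604", table, alerts)
honey_result = authenticate(20001, "smiThy9", table, alerts)
```

Because the stored value is a hash of the key plus secret characters, someone who only sees the table cannot tell which rows are honeytokens without knowing the marker. A keyed construction such as HMAC would be a natural hardening, but it goes beyond what the text describes.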

  have used the well-known SHA256 hashing technique. This cryptographic technique converts the updated attribute value into a unique hash code. A separate representation of honeytokens is then created within the database using the hashes of particular records. Based on the unique hash patterns created from the updated attribute values, this method allows the authentication module to distinguish between genuine tokens and honeytokens.
• Redirection: It determines the action to be performed after the authentication process. When the authentication module successfully detects the use of a honeytoken, the system starts a redirection procedure. Redirection comprises forwarding the malicious user to a website that closely resembles the original or to a server that serves as a honeypot. Moreover, the redirection module engages the user in an environment that mimics the appearance and functionality of the original system or services. We created a controlled and carefully monitored environment by forwarding the adversary to the deceptive website or honeypot server to obtain useful insights into suspected behaviors. The primary goal of redirecting the user to a deceptive page or honeypot server is twofold: first, to collect vital intelligence on the attacker's behavior, intentions, and techniques, and second, to reduce the risk of further unauthorized access or malicious actions within the actual system.
• Alert Generation: This module alerts the administrator after successfully identifying a honeytoken. The alert acts as a notification mechanism, informing the system administrator of the unauthorized activity. The alert is sent through email, SMS, pop-up messages, log files, etc., ensuring that the administrator receives information about the suspected threat as soon as possible. After that, the administrator can take quick and appropriate action to reduce the impact of data breaches and misuse of sensitive information. Blocking the attacker's access to prevent future unauthorized activity, investigating the incident, and installing essential security measures to protect the system and its assets are examples of such actions.

Here we are not monitoring the honey records deployed in the child tables. The main goal of deploying entries in related tables is to increase the believability of the records present in the parent table. Our main focus is to detect data breach incidents and the misuse of stolen credentials and personally identifiable information (PII) as quickly as possible. Apart from monitoring honeytokens at the authentication module, we closely monitor sensitive records such as the deployed fake email and mobile_number, the use of which also indicates the possibility of a data breach.

5.7. Evaluation of deployed honeytoken

We performed a human evaluation study on the deployed honeytokens. The collection of tables was given to a group of students; the objective was to distinguish between real and fake entries. This evaluation helped us by providing information about the attackers' tactics for filtering out the honeytokens from the table. It became clear that attackers may use a variety of ways to discover and filter out honeytokens. One such method that has been identified is open-source intelligence (OSINT) to obtain information about a particular

N. Prabhaker et al. Computers & Security 146 (2024) 104032

Fig. 8. Column-pair plots for attributes in the employee table: (a) boxplot comparison between real and fake data for the hire_date and gender columns; (b) boxplot comparison between real and fake data for the mobile_number and gender columns; (c) scatter plot comparison between real and fake data for the mobile_number and hire_date columns. The synthetic data follows trends similar to those present in the real database.

Fig. 9. Average mean and standard deviation plots (log-transformed) for numerical columns of the original and synthetic Employee tables. The mean and standard deviation scores lie along the line x = y, which indicates that the model captures the statistical properties of the actual data.
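
The comparison behind such a plot can be sketched in a few lines. The example below is a minimal illustration, not the evaluation code used in the study: the table layout (a dictionary of column lists), the helper names `log_mean_std` and `max_gap`, the `log10(1 + x)` transform, and the sample values are all assumptions made for this sketch. A gap of zero means a point lies exactly on the line x = y.

```python
import math
import statistics

def log_mean_std(table):
    """Return {column: (log10(1 + |mean|), log10(1 + std))} for numeric columns.

    `table` maps a column name to a list of numbers (hypothetical layout).
    """
    summary = {}
    for name, values in table.items():
        mean = abs(statistics.fmean(values))
        std = statistics.pstdev(values)
        # log10(1 + x) keeps zero means or zero deviations finite
        summary[name] = (math.log10(1 + mean), math.log10(1 + std))
    return summary

def max_gap(real, synthetic):
    """Largest distance from the line x = y over all columns and both statistics."""
    real_summary = log_mean_std(real)
    synth_summary = log_mean_std(synthetic)
    return max(
        abs(r - s)
        for name in real_summary
        for r, s in zip(real_summary[name], synth_summary[name])
    )

# Invented sample values, for illustration only
real = {"emp_no": [10001, 10002, 10003, 10004],
        "amount": [60000, 62000, 66000, 71000]}
fake = {"emp_no": [10011, 10012, 10013, 10014],
        "amount": [61000, 63000, 65000, 72000]}

gap = max_gap(real, fake)
print(f"largest log-scale gap between real and synthetic: {gap:.4f}")
```

A small gap for every column corresponds to points close to the diagonal; a column with a large gap would show up as an outlier in the plot.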


Fig. 10. Average mean and standard deviation plots (log-transformed) for numerical columns of the original and synthetic Dept-Emp tables. The mean and standard deviation scores lie along the line x = y, which indicates that the model captures the statistical properties of the actual data.

Fig. 11. Average mean and standard deviation plots (log-transformed) for numerical columns of the original and synthetic Salary tables. The mean and standard deviation scores do not lie along the line x = y for a few attributes, which means the model has generated outlier values for those attributes.

entry. Our study aims to improve the effectiveness of the honeytoken approach by developing countermeasures to these tactics. Targeting the mobile number, the respective email ID, and the password attribute are probable tactics. By identifying and addressing these particular areas of vulnerability, we aim to increase the honeytokens’ resistance to potential attacks and their general effectiveness as a security measure. Possible strategies that an attacker may use are:
• Attackers can exploit the presence of mobile numbers within the honeytokens as a means to filter out fake entries. They may use one of two main methods: making phone calls or sending SMS messages. These techniques allow attackers to classify the entries into different groups by evaluating the reliability and reachability of the mobile numbers. To solve this problem, an organization can register a new mobile number and map to it the calls and messages of each fake mobile number deployed in the database as a honeytoken. Whenever a call or message comes to those numbers, it indicates a breach in the database.
• Attackers may send test emails or engage in other email-based investigation techniques in an effort to distinguish between real email addresses and synthetic (fake) email addresses. To solve this problem, organizations can identify suspicious activity by closely monitoring any attempts to send or receive emails from these bogus email addresses.
• If any of the generated honeytokens violate the organization’s password policy, the effectiveness of the honeytokens is reduced. In such situations, attackers may use filtering techniques to find and remove honeytokens that do not adhere to the password policy’s requirements. To address this problem, we can deploy passwords that follow the organization’s password policy.

Fig. 12. Column correlation plots for numerical columns in the real and synthetic Employee tables: (a) relation among the columns of the original employee table; (b) relation among the columns of the synthetic (fake) employee table; (c) difference between correlation plots (a) and (b), where the maximum magnitude of the difference is 0.10. It is evident that the model has effectively captured the pattern in the employee table.

6. Discussion

We focused on the generation of honeytokens for enterprise databases having multiple tables connected with foreign key relationships and on their strategic deployment to detect data breaches and misuse of critical information. Since a honeytoken is a fake data object, any use of it in a real-time environment indicates a breach incident and triggers an alert to the administrator. After that, the security administrator can take appropriate action to reduce the impact of data breaches. The experimental results show that the model used to generate honeytokens performs well and reproduces the statistical properties of the actual databases for most attributes. They show the capability of the hierarchical modeling algorithm to capture the structure of tables along with their relationships. However, the model generates random values for some attributes, such as password and email, in the employee table. We post-processed those attributes to closely resemble the properties of genuine data entries. Since we are using the HMA synthesizer to generate the synthetic data, the inherent challenges of the synthesizer also affect the believability of the honeytokens. One of the issues the SDV still faces is that it takes a long time to model an extensive database with many relationships. In the case of multiple-children relationships, the flattening operation creates a vast table, which may take a lot of time to train the model. In the case of multi-parent relationships, the modeling process selects a parent at random, which can make the obtained result differ drastically. The SDV model will face significant challenges in generating databases in which multi-level multi-child relationships occur. The model takes a long time to learn the structural and statistical patterns of actual data. The model also fails to cover the contextual properties between the columns of the tables. Moreover, we explored the attacker’s strategies that


Fig. 13. Column correlation plots for numerical columns in the Salary table: (a) relation among the columns of the original salary table; (b) relation among the columns of the synthetic (fake) salary table; (c) difference between correlation plots (a) and (b), where the maximum magnitude of the difference is 0.30. It is evident that the model has moderately captured the pattern in the salary table.

could be employed to identify the honeytokens and developed robust techniques to counter these strategies. Suppose an attacker uses stolen honeytokens to gain unauthorized access to organizational resources. In that case, the activity of the adversary can be detected and appropriate actions taken, such as blocking the attacker’s access and initiating an investigation.
As a further research direction, more work is needed on effectively integrating fake user profiles for all the deployed honeytoken records. Apart from the numerical, categorical, and text data types, we have yet to focus on data stored in tables as clob, blob, cursor, XML, etc. In the future, we want to explore those attributes to make the tables more believable. Moreover, we need to focus on enhancing synthetic data generation techniques and on other factors that affect the believability of honeytokens, such as:
• Scale and Diversity: The effectiveness of honeytokens could be further evaluated by conducting experiments on a larger scale and using a more diverse set of participants. It would help validate the robustness of the algorithm across different user groups.
• Adversarial Analysis: Conducting comprehensive adversarial analysis can help identify potential weaknesses and vulnerabilities in honeytoken detection. This analysis could involve simulated attacks and exploring attacker strategies to devise more advanced honeytoken generation techniques.
• Dynamic Honeytoken Generation: Exploring the generation of dynamically changing honeytokens that adapt to evolving attacker behaviors could provide an additional defense against sophisticated adversaries.
• Honeytoken Placement Strategies: Investigating optimal placement strategies for honeytokens within a network or system can maximize their effectiveness in detecting unauthorized access attempts. It includes identifying critical areas or sensitive assets that require heightened honeytoken coverage.

By addressing these areas in future research, honeytokens can be refined and integrated into organizations’ security frameworks to strengthen their defense against data breaches and malicious insider activities. In addition to honeytoken generation for tabular data stored in the database, we want to explore other formats (for geospatial coordinates, temperature, humidity, etc.) that are predominantly used in sensors and embedded systems. Moreover, as large language models (LLMs) are now widespread, we plan to explore the use of LLMs for the generation of honeytokens.
We have discussed the generation of believable honeytokens and their deployment techniques. Now we will discuss the feasibility of deception in the era of traditional and post-quantum cryptography for active detection of data breaches.
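
The detection flow discussed in this section (honeytoken records recognized by their hashes, a redirect when a honeytoken is presented at login, and an alert to the administrator when a honey credential or honey contact is used) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `HoneytokenMonitor`, the function `record_hash`, the use of SHA-256 over the username and password, the return values, and the sample records are all hypothetical.

```python
import hashlib

def record_hash(username, password):
    """Hash the attribute values that identify a record.

    SHA-256 over "username NUL password" is an assumption for illustration;
    a production system would use a salted, slow password-hashing scheme.
    """
    return hashlib.sha256(f"{username}\x00{password}".encode("utf-8")).hexdigest()

class HoneytokenMonitor:
    """Keeps hashes of deployed honeytokens and raises alerts when they are used."""

    def __init__(self, honey_credentials, honey_contacts):
        # Separate representation of honeytokens: only their hashes are stored
        self.honey_hashes = {record_hash(u, p) for u, p in honey_credentials}
        # Fake email addresses and mobile numbers deployed in the tables
        self.honey_contacts = set(honey_contacts)
        self.alerts = []  # stand-in for email, SMS, pop-up, or log-file alerts

    def check_login(self, username, password):
        """Decide what happens after a login attempt."""
        if record_hash(username, password) in self.honey_hashes:
            self.alerts.append(f"honeytoken login attempt: {username}")
            return "redirect_to_honeypot"
        return "continue_normal_authentication"

    def check_contact(self, address):
        """Report whether an incoming call, SMS, or email targets a honey contact."""
        if address in self.honey_contacts:
            self.alerts.append(f"honey contact used: {address}")
            return True
        return False

# Invented sample honeytoken, for illustration only
monitor = HoneytokenMonitor(
    honey_credentials=[("h.token", "Spring#2024")],
    honey_contacts=["h.token@example.com", "+910000000000"],
)
print(monitor.check_login("h.token", "Spring#2024"))
print(monitor.check_contact("+910000000000"))
print(monitor.alerts)
```

Keeping only hashes in the monitor means that the list of honeytokens is not readable from the monitor itself, while any use of a honey credential or honey contact still produces an alert.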


Fig. 14. Column correlation plots for numerical columns in the Dept-Emp table: (a) relation among the columns of the original dept-emp table; (b) relation among the columns of the synthetic (fake) dept-emp table; (c) difference between correlation plots (a) and (b), where the maximum magnitude of the difference is 0.20. It is evident that the model has effectively captured the pattern in the dept-emp table.
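
The "maximum magnitude of difference" reported for panel (c) of the correlation plots can be computed as sketched below. This is an illustrative sketch only: it assumes Pearson correlation, a dictionary-of-lists table layout, and invented sample values; the helper names `pearson` and `max_correlation_difference` are hypothetical and are not taken from the evaluation tool used in the study.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Raises ZeroDivisionError if either column is constant (zero variance).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def max_correlation_difference(real, synthetic):
    """Largest |corr_real - corr_synthetic| over all pairs of shared columns."""
    columns = sorted(set(real) & set(synthetic))
    return max(
        abs(pearson(real[a], real[b]) - pearson(synthetic[a], synthetic[b]))
        for a, b in combinations(columns, 2)
    )

# Invented sample values (dates encoded as integers), for illustration only
real = {"emp_no": [1, 2, 3, 4, 5],
        "from_date": [10, 20, 30, 40, 50],
        "to_date": [12, 19, 33, 41, 48]}
synthetic = {"emp_no": [1, 2, 3, 4, 5],
             "from_date": [11, 19, 31, 42, 49],
             "to_date": [15, 18, 30, 44, 47]}

max_diff = max_correlation_difference(real, synthetic)
print(f"maximum correlation difference: {max_diff:.3f}")
```

A value near 0 means the synthetic table preserves the pairwise relations of the real table; the theoretical upper bound is 2.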

Fig. 15. PCA plot (log transformed) for employee table.


Fig. 16. PCA plot (log transformed) for Dept-Emp tables.

Fig. 17. PCA plot (log transformed) for Dept-Manager tables.

Fig. 18. PCA plot (log transformed) for Salary tables.


Fig. 19. Cumulative sum plots for numerical columns in the Employee table: (a) cumulative sum of the values of emp_no; (b) cumulative sum of the values of birth_date; (c) cumulative sum of the values of hire_date; (d) cumulative sum of the values of the mobile_number attribute. It is evident that the generated emp_no values are believable and follow similar trends.

6.1. Post-quantum cryptography and deception

Cryptographic algorithms are a cornerstone of cybersecurity, offering vital mechanisms to protect data and secure communication channels across various domains. Numerous techniques are employed to safeguard data whether it is stored, transmitted, or actively used, effectively countering external threats. However, managing and securely distributing cryptographic keys, especially among privileged users, remains a critical challenge. If a super user or technical staff member inadvertently or deliberately exposes a secret key, the security of the data is immediately compromised. This breach could go undetected, leaving us unaware of whether our database has been accessed or altered by unauthorized entities. The proposed deception technique works as a last line of defense that reinforces the current encryption schemes. It cannot be directly compared with cryptographic algorithms, as their functional components are entirely different. In encryption, research attention focuses mainly on strengthening the key by increasing its length and complexity. However, the secret key is handed over to an agent who may be vulnerable to social engineering attacks. Our intuition behind the proposed technique is not to discard the existing security controls but to reinforce them. Some works in the literature combine encryption schemes with deception technology (Juels and Ristenpart, 2014; Omolara et al., 2019) to provide security beyond brute-force attacks.
With the advent of post-quantum cryptography (Bernstein and Lange, 2017), it has been shown that widely used encryption schemes are vulnerable to attacks by quantum computers. To address this threat, researchers have proposed new algorithms such as FPGA-based isogenies on elliptic curves (Koziel et al., 2017), NTT-based polynomial multiplication accelerators (Bisheh-Niasar et al., 2021b), the Edwards curve digital signature algorithm (EdDSA) (Bisheh-Niasar et al., 2021a), and NEON-SIDH (Koziel et al., 2016), designed to withstand attacks from both classical and quantum computers. However, a common challenge persists in both traditional and post-quantum cryptography: the management of shared secret keys among users, which leaves these systems susceptible to malicious insiders.
As far as the protection of data stored on embedded systems is concerned, the proposed lightweight cryptography algorithms are also susceptible to attacks using quantum computers. To protect the data on embedded systems, post-quantum cryptographic algorithms such as Curve448 and Ed448 on Cortex-M4 (Anastasova et al., 2022) and SIKE on Cortex-M4 (Jalali et al., 2018) have been proposed in the literature. However, it is very challenging to protect data from malicious insider threats and social engineering attacks. We therefore need to deploy security solutions in such a way that, even if an adversary succeeds in data exfiltration, we can detect the data breach and take remedial action as early as possible to reduce the adverse impact on the organization. In embedded systems, the storage of honeytokens may consume memory. We can reduce this requirement by deploying fewer entries in the database. It is essential to develop methods that work in conjunction with post-quantum cryptographic algorithms to actively monitor and analyze adversarial behavior. In the future, we want to explore incorporating deception techniques into existing post-quantum cryptographic algorithms, which would strengthen the capability of cryptographic techniques to detect insider theft and actively detect data breaches.


Fig. 20. Cumulative sum plots for numerical columns in the Salary table: (a) cumulative sum of the values of emp_no; (b) cumulative sum of the values of amount; (c) cumulative sum of the values of from_date; (d) cumulative sum of the values of the to_date attribute. The generated fake data follows trends similar to those present in the real tables.

6.2. Use of deception against attacks on machine learning algorithms

Machine learning (ML) models are a relatively new target compared to cryptography and face adversarial attacks, data poisoning attacks, side-channel attacks, etc. To protect ML models from attacks, a few works have been proposed that use deception techniques, mainly focusing on adversarial attacks (Lopes Antunes and Llopis Sanchez, 2023), side-channel attacks (Dubey et al., 2022), or modeling attacks (Gu et al., 2020). However, the use of deception to protect machine learning models is still at an early stage. The models trained for personal health monitoring, industrial control, avionics systems, and missile guidance systems are especially critical: a small mistake may lead to a catastrophic result or even loss of life. In the future, we want to explore the feasibility of deception-based techniques that help make ML models robust against different attacks such as data poisoning (Mozaffari-Kermani et al., 2015), model poisoning (Liu et al., 2017), and backdoor attacks (Gu et al., 2019). The generated synthetic data can also be used as a test set to analyze the resilience of a trained model before deployment.

7. Conclusion

In this work, we explore the applicability of a hierarchical modeling algorithm (HMA Synthesizer) to generate honeytokens in a multi-table environment and provide deployment strategies to enhance the existing security controls. The deployed realistic honeytokens can be an effective way to detect database breaches early and to monitor the misuse of stolen data. Any use of a deployed honeytoken in real time indicates a possible breach and triggers an alert for the administrator. After that, the security administrator can take action to mitigate the risk of the breach incident. Despite having numerous advantages over other security controls, obtaining a fully automatic deceptive environment is challenging. Hence, deception cannot replace the existing security controls. The incorporation of deceptive objects, along with other security solutions, may enhance the security of the enterprise. Overall, our study’s results highlight the effectiveness of honeytokens as a means to detect malicious use of stolen credentials and potential data breaches. The ability to generate genuine-looking honeytokens enhances their practicality and strengthens the security measures employed by organizations to safeguard their sensitive data.

CRediT authorship contribution statement

Nilin Prabhaker: Writing – review & editing, Writing – original draft, Visualization, Validation, Methodology, Investigation, Formal analysis, Data curation, Conceptualization. Ghanshyam S. Bopche: Validation, Supervision. Michael Arock: Supervision.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.


Fig. 21. Cumulative sum plots for numerical columns in the Dept-Emp table: (a) cumulative sum of the values of emp_no; (b) cumulative sum of the values of dept_no; (c) cumulative sum of the values of from_date; (d) cumulative sum of the values of the to_date attribute. It is evident that, as with the other tables, the dept-emp table generated by the model follows the trends present in the real data.


Fig. 22. Probability distribution plots for numerical columns in the Employee table: (a) distribution of emp_no values in real and fake data; (b) distribution of the birth_date attribute; (c) distribution of the hire_date attribute; (d) distribution of the mobile_number attribute. The model appears to have retained the statistical distribution features of the original database.
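
The visual comparison in such distribution plots can be complemented by one number per column. The sketch below uses the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical cumulative distribution functions). This is a swapped-in summary chosen for illustration and is not a measure reported in the paper; the function name `ks_statistic` and the sample values are assumptions.

```python
from bisect import bisect_right

def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest absolute difference between the empirical CDFs of
    the two samples: 0.0 for identical samples, 1.0 for disjoint ranges.
    """
    real_sorted, synth_sorted = sorted(real), sorted(synthetic)
    # The empirical CDFs only change at observed values, so checking those suffices
    points = sorted(set(real_sorted) | set(synth_sorted))
    return max(
        abs(bisect_right(real_sorted, x) / len(real_sorted)
            - bisect_right(synth_sorted, x) / len(synth_sorted))
        for x in points
    )

# Invented hire_date values encoded as years, for illustration only
real_hire_year = [1985, 1986, 1986, 1989, 1990, 1992, 1995, 1999]
fake_hire_year = [1985, 1987, 1988, 1989, 1991, 1992, 1996, 1998]

print(f"KS statistic for hire year: {ks_statistic(real_hire_year, fake_hire_year):.3f}")
```

Columns where the synthetic distribution piles up on one value, as described for some Salary attributes, would produce a noticeably larger statistic than columns whose distributions overlap.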


Fig. 23. Probability distribution plots for numerical columns in the Salary table: (a) distribution of emp_no values in real and fake data; (b) distribution of the from_date attribute; (c) distribution of the to_date attribute; (d) distribution of the credited amount attribute. It is evident that the model had difficulty generating a similar distribution for a few attributes and frequently replicated the maximum value.


Fig. 24. Probability distribution plots for columns in the dept_emp table: (a) distribution of emp_no values in real and fake data; (b) distribution of the dept_no attribute, where new entries for a newly created department can be seen; (c) distribution of the from_date attribute; (d) log-transformed distribution of the to_date attribute.

References

Abay, N.C., Akcora, C.G., Zhou, Y., Kantarcioglu, M., Thuraisingham, B., 2019. Using deep learning to generate relational honeydata. Auton. Cyber Decep. Reason. Adapt. Plan. Eval. HoneyThings 3–19.
Al-Shaer, E.S., Hamed, H.H., 2003. Firewall policy advisor for anomaly discovery and rule editing. Integr. Netw. Manag. VIII: Managing it all 17–30.
Albanese, M., Battista, E., Jajodia, S., 2015. A deception based approach for defeating OS and service fingerprinting. In: 2015 IEEE Conference on Communications and Network Security. CNS, IEEE, pp. 317–325.
Almeshekah, M.H., Gutierrez, C.N., Atallah, M.J., Spafford, E.H., 2015. ErsatzPasswords: Ending password cracking and detecting password leakage. In: Proceedings of the 31st Annual Computer Security Applications Conference. ACSAC ’15, Association for Computing Machinery, New York, NY, USA, pp. 311–320.
Anastasova, M., Azarderakhsh, R., Kermani, M.M., Beshaj, L., 2022. Time-efficient finite field microarchitecture design for curve448 and ed448 on cortex-M4. In: International Conference on Information Security and Cryptology. Springer, pp. 292–314.
Atkinson, M., DeWitt, D., Maier, D., Bancilhon, F., Dittrich, K., Zdonik, S., 1990. The object-oriented database system manifesto. In: Deductive and Object-Oriented Databases. Elsevier, pp. 223–240.
Bengio, Y., Courville, A., Vincent, P., 2013. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35 (8), 1798–1828.
Bercovitch, M., Renford, M., Hasson, L., Shabtai, A., Rokach, L., Elovici, Y., 2011. HoneyGen: An automated honeytokens generator. In: Proceedings of 2011 IEEE International Conference on Intelligence and Security Informatics. IEEE, pp. 131–136.
Bernstein, D.J., Lange, T., 2017. Post-quantum cryptography. Nature 549 (7671), 188–194.
Bisheh-Niasar, M., Azarderakhsh, R., Mozaffari-Kermani, M., 2021a. Cryptographic accelerators for digital signature based on Ed25519. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 29 (7), 1297–1305.
Bisheh-Niasar, M., Azarderakhsh, R., Mozaffari-Kermani, M., 2021b. High-speed NTT-based polynomial multiplication accelerator for post-quantum cryptography. In: 2021 IEEE 28th Symposium on Computer Arithmetic. ARITH, pp. 94–101.
Botha, R.A., Eloff, J.H.P., 2001. Separation of duties for access control enforcement in workflow environments. IBM Syst. J. 40 (3), 666–682.
Brenninkmeijer, B., 2024. Table evaluator. https://baukebrenninkmeijer.github.io/table-evaluator/. (Accessed 7 May 2024).
Butavicius, M., Taib, R., Han, S.J., 2022. Why people keep falling for phishing scams: The effects of time pressure and deception cues on the detection of phishing emails. Comput. Secur. 123, 102937.
Cenys, A., Rainys, D., Radvilavicius, L., Bielko, A., Semiconductor Physics Inst Vilnius (Lithuania), 2004. Development of honeypot system emulating functions of database server. In: RTO IST Symposium.
Cenys, A., Rainys, D., Radvilavius, L., Gotanin, N., 2005. Implementation of honeytoken module in dbms oracle 9ir2 enterprise edition for internal malicious activity detection. IEEE Comput. Soc. TC Secur. Priv. 1–13.
Chadwick, D.W., Otenko, A., 2002. The PERMIS x.509 role based privilege management infrastructure. In: Proceedings of the Seventh ACM Symposium on Access Control Models and Technologies. pp. 135–140.
Che, Z., Cheng, Y., Zhai, S., Sun, Z., 2017. Boosting deep learning risk prediction with generative adversarial networks for electronic health records. In: 2017 IEEE International Conference on Data Mining. ICDM, IEEE, pp. 787–792.
Chen, H., Jajodia, S., Liu, J., Park, N., Sokolov, V., Subrahmanian, V., 2019. FakeTables: Using GANs to generate functional dependency preserving tables with bounded real data. In: IJCAI. pp. 2074–2080.
Chenchev, I., Aleksieva-Petrova, A., Petrov, M., 2021. Authentication mechanisms and classification: A literature survey. In: Intelligent Computing: Proceedings of the 2021 Computing Conference, vol. 3. Springer, pp. 1051–1070.
Chiang, C.J., Venkatesan, S., Sugrim, S., Youzwak, J.A., Chadha, R., Colbert, E.I., Cam, H., Albanese, M., 2018. On defensive cyber deception: A case study using SDN. In: MILCOM 2018-2018 IEEE Military Communications Conference. MILCOM, IEEE, pp. 110–115.
Choi, E., Biswal, S., Malin, B., Duke, J., Stewart, W.F., Sun, J., 2017. Generating multi-label discrete patient records using generative adversarial networks. In: Machine Learning for Healthcare Conference. PMLR, pp. 286–305.
Codd, E.F., 1970. A relational model of data for large shared data banks. Commun. ACM 13 (6), 377–387.


Cohen, F., 1998. A note on the role of deception in information protection. Comput. Secur. 17 (6), 483–506.
Diffie, W., Hellman, M., 1976. New directions in cryptography. IEEE Trans. Inform. Theory 22 (6), 644–654.
Dionysiou, A., Vassiliades, V., Athanasopoulos, E., 2021. Honeygen: Generating honeywords using representation learning. In: ACM Asia Conference on Computer and Communications Security. pp. 265–279.
Dubey, A., Cammarota, R., Suresh, V., Aysu, A., 2022. Guarding machine learning hardware against physical side-channel attacks. ACM J. Emerg. Technol. Comput. Syst. (JETC) 18 (3), 1–31.
Eclipse Foundation, 2023. Aspectj programming. https://eclipse.dev/aspectj/. Online, (Accessed 25 September 2023).
Ferguson-Walter, K.J., Major, M.M., Johnson, C.K., Johnson, C.J., Scott, D.D., Gutzwiller, R.S., Shade, T., 2023. Cyber expert feedback: Experiences, expectations, and opinions about cyber deception. Comput. Secur. 130, 103268.
Galanxhi, H., Nah, F.F.-H., 2007. Deception in cyberspace: A comparison of text-only vs. Avatar-supported medium. Int. J. Hum.-Comput. Stud. 65 (9), 770–783.
Giuseppe, M., Patrick, C., 2023. Employee sample database for MySQL. https://github.com/bytebase/employee-sample-database/tree/main/mysql. Online, (Accessed 7 May 2024).
Gu, C., Chang, C.-H., Liu, W., Yu, S., Wang, Y., O’Neill, M., 2020. A modeling attack resistant deception technique for securing lightweight-PUF-based authentication. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 40 (6), 1183–1196.
Gu, T., Liu, K., Dolan-Gavitt, B., Garg, S., 2019. Badnets: Evaluating backdooring attacks on deep neural networks. IEEE Access 7, 47230–47244.
Haerder, T., Reuter, A., 1983. Principles of transaction-oriented database recovery. ACM Comput. Surv. (CSUR) 15 (4), 287–317.
Han, X., Kheir, N., Balzarotti, D., 2018. Deception techniques in computer security: A research perspective. ACM Comput. Surv. 51 (4), 1–36.
Henry, W.R., 1969. Hierarchical structure for data management. IBM Syst. J. 8 (1), 2–15.
IBM, 1960. Information management systems. https://www.ibm.com/history/information-management-system.
IBM, 2023. Cost of a data breach report 2023. https://www.ibm.com/reports/data-breach.
Izagirre, M., 2017. Deception strategies for web application security: application-layer approaches and a testing platform.
Jalali, A., Azarderakhsh, R., Kermani, M.M., 2018. NEON SIKE: Supersingular isogeny key encapsulation on ARMv7. In: Security, Privacy, and Applied Cryptography Engineering: 8th International Conference, SPACE 2018, Kanpur, India, December 15-19, 2018, Proceedings 8. Springer, pp. 37–51.
Javadpour, A., Ja’fari, F., Taleb, T., Shojafar, M., Benzaïd, C., 2024. A comprehensive survey on cyber deception techniques to improve honeypot performance. Comput. Secur. 103792.
Jordon, J., Yoon, J., 2018. PATE-GAN: Generating synthetic data with differential privacy guarantees. In: International Conference on Learning Representations.
Juels, A., Ristenpart, T., 2014. Honey encryption: Security beyond the brute-force bound. In: Advances in Cryptology–EUROCRYPT 2014: 33rd Annual International Conference on the Theory and Applications of Cryptographic Techniques, Copenhagen, Denmark, May 11-15, 2014. Proceedings 33. Springer, pp. 293–310.
Juels, A., Rivest, R.L., 2013. Honeywords: Making password-cracking detectable. In: Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security. pp. 145–160.
Koblitz, N., 1987. Elliptic curve cryptosystems. Math. Comput. 48 (177), 203–209.
Koziel, B., Azarderakhsh, R., Mozaffari Kermani, M., Jao, D., 2017. Post-quantum cryptography on FPGA based on isogenies on elliptic curves. IEEE Trans. Circuits Syst. I. Regul. Pap. 64 (1), 86–99.
Koziel, B., Jalali, A., Azarderakhsh, R., Jao, D., Mozaffari-Kermani, M., 2016. NEON-SIDH: Efficient implementation of supersingular isogeny diffie-hellman key exchange protocol on ARM. In: Cryptology and Network Security: 15th International Conference, CANS 2016, Milan, Italy, November 14-16, 2016, Proceedings 15. Springer, pp. 88–103.
Liu, Y., Ma, S., Aafer, Y., Lee, W.-C., Zhai, J., Wang, W., Zhang, X., 2017. Trojaning attack on neural networks.
Onaolapo, J., Mariconti, E., Stringhini, G., 2016. What happens after you are pwnd: Understanding the use of leaked webmail credentials in the wild. In: Proceedings of the 2016 Internet Measurement Conference. pp. 65–79.
Padayachee, K., 2015. Aspectising honeytokens to contain the insider threat. IET Inf. Secur. 9 (4), 240–247.
Park, N., Mohammadi, M., Gorde, K., Jajodia, S., 2018. Data synthesis based on generative adversarial networks. arXiv preprint arXiv:1806.03384.
Patki, N., Wedge, R., Veeramachaneni, K., 2016. The synthetic data vault. In: IEEE International Conference on Data Science and Advanced Analytics. DSAA, pp. 399–410.
Qin, X., Jiang, F., Dong, C., Doss, R., 2024. A hybrid cyber defense framework for reconnaissance attack in industrial control systems. Comput. Secur. 136, 103506.
Rivest, R.L., Shamir, A., Adleman, L., 1978. A method for obtaining digital signatures and public-key cryptosystems. Commun. ACM 21 (2), 120–126.
Saxena, N., Hayes, E., Bertino, E., Ojo, P., Choo, K.-K.R., Burnap, P., 2020. Impact and key challenges of insider threats on organizations and critical businesses. Electronics 9 (9), 1460.
Saxton, L.V., Raghavan, V.V., 1990. Design of an integrated information retrieval/database management system. IEEE Trans. Knowl. Data Eng. 2 (02), 210–219.
Scarfone, K., Mell, P., et al., 2007. Guide to intrusion detection and prevention systems (idps). NIST Special Publ. 800 (2007), 94.
Shabtai, A., Bercovitch, M., Rokach, L., Gal, Y., Elovici, Y., Shmueli, E., 2016. Behavioral study of users when interacting with active honeytokens. ACM Trans. Inf. Syst. Secur. 18 (3), 1–21.
Spafford, E.H., 2011. More than passive defense. https://www.cerias.purdue.edu/site/blog/post/. Online, (Accessed 7 May 2024).
Spitzner, L., 2003a. The honeynet project: Trapping the hackers. IEEE Secur. Priv. 1 (2), 15–23.
Spitzner, L., 2003b. Honeypots: Tracking Hackers, vol. 1. Addison-Wesley Reading.
Spitzner, L., 2003c. Honeytokens: The other honeypot. In: Security Focus Information.
Stoll, C., 1989. The Cuckoo’s Egg: Tracking a Spy through the Maze of Computer Espionage. Simon and Schuster.
Tankard, C., 2011. Advanced persistent threats and how to monitor and deter them. Netw. Secur. 2011 (8), 16–19.
The Washington Post, 2023. Ukraine lures Russian missiles with decoy of U.S. rocket system. https://www.washingtonpost.com/world/2022/08/30/ukraine-russia-himars-decoy-artillery/. Online, (Accessed 1 August 2023).
Wang, Y., Bilinski, P., Bremond, F., Dantcheva, A., 2020. Imaginator: Conditional spatio-temporal gan for video generation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 1160–1169.
Wang, D., Cheng, H., Wang, P., Yan, J., Huang, X., 2018. A security analysis of honeywords. In: Network and Distributed System Security Symposium. URL https://api.semanticscholar.org/CorpusID:51804454.
White, J., 2009. Creating personally identifiable honeytokens. In: Innovations and Advances in Computer Sciences and Engineering. Springer, pp. 227–232.
Wold, S., Esbensen, K., Geladi, P., 1987. Principal component analysis. Chemometr. Intell. Lab. Syst. 2 (1), 37–52, Proceedings of the Multivariate Statistical Workshop for Geologists and Geochemists.
Yahi, A., Vanguri, R., Elhadad, N., 2017. Generative adversarial networks for electronic health records: A framework for exploring and evaluating methods for predicting drug-induced laboratory test trajectories. arXiv preprint arXiv:1712.00164.
Yuill, J., Zappe, M., Denning, D., Feer, F., 2004. Honeyfiles: Deceptive files for intrusion detection. In: Proceedings from the 5th Annual IEEE SMC Information Assurance Workshop. IEEE, West Point, NY, USA, pp. 116–122.

Nilin Prabhaker received his B.Sc. (Computer Science) degree from Nilambar Pitambar University, Jharkhand, India. He earned his MCA from Indira Gandhi National Open University (IGNOU), New Delhi, India. He received the Junior Research Fellow (UGC-JRF) award on 31st December 2019. He is currently pursuing his Ph.D. in the Department of Computer Applications, National Institute of Technology, Tiruchirappalli, India. His research interests lie in the area of Cyber Security, Cyber Deception, Machine Learning, Deep Learning and Generative AI.
Lopes Antunes, D., Llopis Sanchez, S., 2023. The age of fighting machines: the use Dr. Ghanshyam S. Bopche: Assistant Professor, Department of Computer Applications,
of cyber deception for adversarial artificial intelligence in cyber defence. In: National Institute of Technology, Tiruchirappalli. He received his MCA and B.Sc.
Proceedings of the 18th International Conference on Availability, Reliability and (Electronics) degree from Nagpur University, Maharashtra, India. He earned his Ph.D.
Security. pp. 1–6. from Institute for Development and Research in Banking Technology (IDRBT) and
Maasberg, M., Van Slyke, C., Ellis, S., Beebe, N., 2020. The dark triad and insider University of Hyderabad, India. He was exchange research scholar at the State
threats in cyber security. Commun. ACM 63 (12), 64–80. University of New York (SUNY) at Buffalo, NY, USA during 2015. His research interests
Miller, V.S., 1985. Use of elliptic curves in cryptography. In: Conference on the Theory lie in the area of Computer and Network Security, Cyber Deception, Information
and Application of Cryptographic Techniques. Springer, pp. 417–426. Assurance, and Digital Forensics.
Montanez, A., et al., 2018. SDV: An Open Source Library for Synthetic Data Generation
(Ph.D. thesis). Massachusetts Institute of Technology. Dr.Michael Arock: Professor and Head, Department of Computer Applications, National
Mozaffari-Kermani, M., Sur-Kolay, S., Raghunathan, A., Jha, N.K., 2015. Systematic Institute of Technology, Tiruchirappalli. He graduated as a B.Sc. (Mathematics) from
poisoning attacks on and defenses for machine learning in healthcare. IEEE J. GTN Arts College, Dindigul, Madurai Kamaraj University and as an MCA from
Biomed. Health Inf. 19 (6), 1893–1905. St.Joseph’s College, Bharathidasan University and earned his Ph.D. from NITT under
Natan, R.B., 2005. Implementing database security and auditing. Elsevier. Bharathidasan University. His Doctoral thesis is on Design and Analysis of Parallel
Omolara, A.E., Jantan, A., Abiodun, O.I., Dada, K.V., Arshad, H., Emmanuel, E., 2019. Algorithms on CREW PRAM and LARPBS models. His specialization is Parallel Algorith-
A deception model robust to eavesdropping over communication for social network mics. His areas of Interest include Data Structures and Algorithms, High Performance
systems. IEEE Access 7, 100881–100898. Computing and Bioinformatics.
