COCOA: A Synthetic Data Generator For Testing Anonymization Techniques
1 Introduction
The increasing availability of microdata has attracted the interest of organizations in collecting and sharing this data for mining purposes. However, current legislation requires personal data to be protected from inappropriate use or disclosure. Anonymization techniques help to disseminate data in a safe manner while preserving enough utility for its reuse. For this reason, a plethora of methods for anonymizing microdata exists in the literature. Each of these techniques claims a particular superiority over the others (e.g., improving data utility or reducing computational costs). However, their performance can vary when they are tested with different datasets [13]. This makes it difficult for data controllers to generalize the conclusions of a performance evaluation and decide which algorithm is best suited for their requirements.
All performance evaluations are limited in one way or another (due to time/effort/cost constraints). A common limitation is the number and diversity of the datasets used, as the process of obtaining good-quality datasets can be burdensome and time-consuming. For example, access to real microdata is highly restricted to protect the privacy of individuals. Agencies grant access only for a certain period and, once this expires, the provided files must be destroyed [2, 4, 10], which does not always allow for reproducibility of experimental results.
3.1 Overview
The goal of this research work has been to develop a framework for synthetically generating datasets (COCOA) that can be used to diversify the set of characteristics available in microdata. This strategy helps researchers to improve the testing of anonymization techniques. In Fig. 1, we depict the conceptual view of our solution. It can be seen how COCOA follows an iterative process (see Section 3.2) to build a set of datasets based on the information base provided by the user. The information base is composed of all the input parameters required by the chosen dataset domain (e.g., dataset size).
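As an illustration only, the information base can be thought of as a simple parameter map plus a presence check; the names below are hypothetical sketches, not COCOA's actual API.

```python
# Hypothetical sketch of an information base: the user-supplied
# parameters that the chosen domain needs before generation starts.
information_base = {
    "domain": "irish_census",    # which domain from the domain base to use
    "dataset_size": 5000,        # number of tuples to generate
    "output_path": "census_5k.csv",
}

def validate_information_base(info, required=("domain", "dataset_size")):
    """Check that every parameter required by the chosen domain is present."""
    missing = [key for key in required if key not in info]
    if missing:
        raise ValueError(f"missing parameters: {missing}")
    return True
```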
The key element of COCOA is its domain base, which encapsulates the expert knowledge of the supported business domains (e.g., census, healthcare, finance). This element allows COCOA to be easily extensible and capable of incorporating multiple business cases (even for the same domain, e.g., Irish census, USA census), which might be suitable for different test scenarios. In this context, a domain defines the rules and constraints required to generate a dataset that preserves the functional dependencies of a business case. Each domain is characterized by a name, a set of attributes and their corresponding attribute generators.
Fig. 1. COCOA - Conceptual View.
To generate data, the domains make use of the available set of attribute generators. These elements are supporting logic offering various strategies to generate the values of attributes. For example, a generator might focus on diversifying the data in an attribute by fitting it to a data distribution (e.g., a normal distribution). In this example, several generators can be combined to offer different data distributions (as the appropriate distribution might vary depending on the usage scenario). If an attribute generator requires particular settings to work properly (e.g., default values for its applicable parameters), this information (e.g., the mean and variance for a normal distribution) can also be captured by the framework (as an attribute generator setting).
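The interplay between a distribution-based generator and its settings can be sketched as follows; the factory functions and setting keys are illustrative assumptions, not COCOA's actual code.

```python
import random

def make_normal_generator(settings):
    """Distribution-based attribute generator: draws each value from a
    normal distribution whose parameters come from the generator settings,
    falling back to defaults when none are supplied."""
    mean = settings.get("mean", 0.0)
    stddev = settings.get("stddev", 1.0)
    return lambda: random.gauss(mean, stddev)

def make_uniform_generator(settings):
    """A second distribution strategy; several such generators can coexist
    so each attribute uses the distribution that fits its usage scenario."""
    low = settings.get("low", 0.0)
    high = settings.get("high", 1.0)
    return lambda: random.uniform(low, high)
```

For instance, `age = make_normal_generator({"mean": 38, "stddev": 12})` yields a callable that produces one plausible age value per call.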
entity object is created to represent the new tuple. Next, its attributes
are populated with the new values and the tuple entity is added to the
dataset. Moreover, any exceptions are internally handled and reported.
This core process continues iteratively until the new dataset has been
fully generated. As a final step, the dataset is saved to disk.
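The core process above (create a tuple entity, populate its attributes, append it to the dataset, handle exceptions internally, and finally save to disk) can be sketched as a loop; the function and names are illustrative, not COCOA's actual implementation.

```python
import csv

def generate_dataset(size, generators, output_path):
    """Core generation loop: for each new tuple, create an entity,
    populate its attributes via the attribute generators, and append it
    to the dataset; exceptions are collected rather than aborting the
    run, and the finished dataset is saved to disk as a final step."""
    dataset, errors = [], []
    for _ in range(size):
        tuple_entity = {}
        try:
            for attribute, generator in generators.items():
                tuple_entity[attribute] = generator()
            dataset.append(tuple_entity)
        except Exception as exc:   # handled internally and reported
            errors.append(exc)
    with open(output_path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(generators))
        writer.writeheader()
        writer.writerows(dataset)
    return dataset, errors
```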
3.3 Architecture
COCOA is complemented by the architecture presented in the component diagram [9] of Fig. 2b. COCOA is composed of three main components. The generic component contains the control logic and all supporting functionality that is independent of the supported domains and attribute generators (e.g., the monitor and validate phases of the core process). The logic that interfaces with the domains needs to be customized per domain (by defining a tuple entity and a dataset generator), and the same applies to the supported attribute generators. Therefore, these two pieces of logic are encapsulated in their respective components to minimize the required code changes. To complement this design strategy, the components are only accessed through interfaces. This is exemplified in Fig. 2c, which presents the high-level structure of the attribute generator component. It contains a main interface, IGenerator, which exposes all required actions, and an abstract class for all the common functionality. This hierarchy can then be extended to support specific attribute generators.
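The interface-plus-abstract-class hierarchy of Fig. 2c might look like the following; apart from the name IGenerator (taken from the text), the classes and methods are assumptions for illustration.

```python
from abc import ABC, abstractmethod

class IGenerator(ABC):
    """Main interface: exposes the actions every attribute generator supports."""
    @abstractmethod
    def generate(self):
        """Produce one value for the attribute."""

class AbstractGenerator(IGenerator):
    """Common functionality shared by all generators (here, settings
    handling), so concrete classes implement only their own strategy."""
    def __init__(self, settings=None):
        self.settings = dict(settings or {})

    def setting(self, key, default=None):
        return self.settings.get(key, default)

class ConstantGenerator(AbstractGenerator):
    """One concrete extension of the hierarchy: returns a fixed value."""
    def generate(self):
        return self.setting("value")
```

Because clients depend only on IGenerator, a new generator can be dropped in without touching the generic component.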
4 Experimental Evaluation
Here we present the experiments performed to assess the benefits and
costs of using COCOA. Firstly, we evaluated how well COCOA achieved
its objective of generating diverse datasets within a domain. Secondly, we
assessed how useful the resulting datasets are to strengthen the testing
of an anonymization algorithm. Thirdly, we evaluated COCOA’s costs.
Fig. 3. PC1 vs PC2 for the Insurance dataset with sizes (a) 5K and (b) 100K.
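The PC1/PC2 scatter plots of Fig. 3 are a standard PCA projection; a minimal sketch of such a projection, assuming the dataset has already been encoded as a numeric matrix (this is an illustration, not the paper's evaluation code):

```python
import numpy as np

def first_two_pcs(X):
    """Project the rows of X onto the first two principal components
    (PCA computed via SVD of the mean-centered data matrix)."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T   # shape (n_samples, 2): the PC1 and PC2 scores
```

A wide spread of points in this two-dimensional plane indicates diverse tuples.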
and execution time. For instance, the generation of the largest datasets
(i.e., 100K) took an average of 4 sec (with a standard deviation of 1.5
sec). Similarly, the CPU never exceeded 10% (meaning that there was a
considerable amount of idle resources to support larger dataset sizes). In
terms of memory, COCOA only used approximately 35% of the available
memory. It also never triggered a MaGC, which was another indicator
that the memory settings were always appropriate for COCOA. Costs
were also analyzed per domain. Although comparable across domains, the biggest costs were experienced by the German Credit domain, because it contains the largest number of attributes. A second factor influencing the execution time was the complexity of the attribute generators. In this sense, the distribution-based generators tend to be less expensive (in terms of resources) than the other two types.
(Figure: data generation execution time (ms), CPU usage, and memory usage (MB) versus dataset size, for sizes 5K-100K.)
References
1. COCOA Datasets, https://github.com/ucd-pel/COCOA/
2. CSO. Access to Microdata, http://www.cso.ie/en/aboutus/dissemination/
accesstomicrodatarulespoliciesandprocedures/accesstomicrodata/
3. Data Benerator Tool, http://databene.org/databene-benerator
4. Eurostat. Access to Microdata, http://ec.europa.eu/eurostat/web/microdata
5. OpenForecast Library, http://www.stevengould.org/software/openforecast/
index.shtml
6. Payscale USA, http://www.payscale.com/research/US/
7. simPop: Simulation of Synthetic Populations for Survey Data Considering Auxil-
iary Information, https://cran.r-project.org/web/packages/simPop
8. synthpop: Generating Synthetic Versions of Sensitive Microdata for Statistical Dis-
closure Control, https://cran.r-project.org/web/packages/synthpop
9. UML basics: The component diagram, https://www.ibm.com/developerworks/
rational/library/dec04/bell/
10. US Census. Restricted-Use Microdata, http://www.census.gov/research/data/
restricted_use_microdata.html
11. UTD Anonymization ToolBox, http://cs.utdallas.edu/dspl/cgi-bin/
toolbox/
12. Ayala-Rivera, V., McDonagh, P., Cerqueus, T., Murphy, L.: Synthetic data gener-
ation using benerator tool. arXiv preprint arXiv:1311.3312 (2013)
13. Ayala-Rivera, V., McDonagh, P., Cerqueus, T., Murphy, L.: A systematic compar-
ison and evaluation of k-anonymization algorithms for practitioners. Transactions
on Data Privacy 7(3), 337–370 (2014)
14. Blackburn, S.M., Garner, R., Hoffmann, C., Khang, A.M., McKinley, K.S.,
Bentzur, R., Diwan, A., Feinberg, D., Frampton, D., Guyer, S.Z., et al.: The da-
capo benchmarks: Java benchmarking development and analysis. In: ACM Sigplan
Notices. vol. 41, pp. 169–190. ACM (2006)
15. Chow, K., Wright, A., Lai, K.: Characterization of java workloads by principal
components analysis and indirect branches. In: Proceedings of the Workshop on
Workload Characterization (WWC-1998), held in conjunction with the 31st Annual
ACM/IEEE International Symposium on Microarchitecture (MICRO-31). pp. 11–
19 (1998)
16. Eeckhout, L., Georges, A., De Bosschere, K.: How java programs interact with vir-
tual machines at the microarchitectural level. In: ACM SIGPLAN Notices. vol. 38,
pp. 169–186. ACM (2003)
17. Hoag, J.E., Thompson, C.W.: A parallel general-purpose synthetic data generator.
ACM SIGMOD Record 36(1), 19–24 (2007)
18. Hu, J., Reiter, J.P., Wang, Q.: Disclosure risk evaluation for fully synthetic cate-
gorical data. In: Privacy in Statistical Databases. pp. 185–199. Springer (2014)
19. Iyengar, V.S.: Transforming data to satisfy privacy constraints. In: Int. Conf. on
Knowledge Discovery and Data Mining. pp. 279–288 (2002)
20. LeFevre, K., DeWitt, D., Ramakrishnan, R.: Mondrian Multidimensional K-
Anonymity. In: Int. Conf. Data Eng. p. 25 (2006)
21. Lichman, M.: UCI Machine Learning Repository (2013)
22. Machanavajjhala, A., Kifer, D., Abowd, J., Gehrke, J., Vilhuber, L.: Privacy: The-
ory meets Practice on the Map. In: Int. Conf. Data Eng. pp. 277–286 (2008)
23. Mateo-Sanz, J.M., Martı́nez-Ballesté, A., Domingo-Ferrer, J.: Fast generation of
accurate synthetic microdata. In: Privacy in Statistical Databases. pp. 298–306.
Springer (2004)
24. Pedersen, K.H., Torp, K., Wind, R.: Simple and realistic data generation. In: Pro-
ceedings of the 32nd International Conference on Very Large Data Bases. Associ-
ation for Computing Machinery (2006)
25. Portillo-Dominguez, A.O., Perry, P., Magoni, D., Wang, M., Murphy, J.: Trini:
an adaptive load balancing strategy based on garbage collection for clustered java
systems. Software: Practice and Experience (2016)
26. Rubin, D.B.: Discussion of statistical disclosure limitation. J. Off. Stat. 9(2), 461–
468 (1993)
27. Samarati, P.: Protecting respondents' identities in microdata release. Trans. on
Knowledge and Data Engineering 13(6), 1010–1027 (2001)
28. Sweeney, L.: k-Anonymity: A model for protecting privacy. Int. J. Uncertain. Fuzzi-
ness Knowl.-Based Syst 10(05), 557–570 (Oct 2002)
29. Walck, C.: Handbook on statistical distributions for experimentalists (2007)
Appendix A Structure of the Irish Census and Insurance
Domains
Tables 1 and 2, respectively, list the attributes and the type of generators used for producing
data for the Irish census and insurance domains (discussed in Section 3.5).