COCOA: A Synthetic Data Generator For Testing Anonymization Techniques
1 Introduction
The increasing availability of microdata has attracted the interest of organizations in collecting and sharing this data for mining purposes. However, current legislation requires personal data to be protected from inappropriate use or disclosure. Anonymization techniques help to disseminate data in a safe manner while preserving enough utility for its reuse. For this reason, a plethora of methods for anonymizing microdata exists in the literature. Each of these techniques claims a particular superiority over the others (e.g., improving data utility or reducing computational costs). However, their performance can vary when they are tested with different datasets [13]. This makes it difficult for data controllers to generalize the conclusions of a performance evaluation and decide which algorithm is best suited for their requirements.
All performance evaluations are limited in one way or another (due to time/effort/cost constraints). A common limitation is the number and diversity of the datasets used, as the process of obtaining good-quality datasets can be burdensome and time-consuming. For example, access to real microdata is highly restricted to protect the privacy of individuals. Agencies grant access only for a certain period and, once this expires, the provided files must be destroyed [2, 4, 10], which does not always allow for reproducibility of experimental results.
3.1 Overview
The goal of this research work has been to develop a framework for synthetically generating datasets (COCOA) that can be used to diversify the set of characteristics available in microdata. This strategy helps researchers to improve the testing of anonymization techniques. In Fig. 1, we depict the conceptual view of our solution. It can be seen how COCOA follows an iterative process (see Section 3.2) to build a set of datasets based on the information base provided by the user. The information base is composed of all the input parameters required by the chosen dataset domain (e.g., dataset size).
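As an illustration only, the information base can be thought of as a simple parameter map plus a presence check; the names below are hypothetical sketches, not COCOA's actual API.

```python
# Hypothetical sketch of an information base: the user-supplied
# parameters that the chosen domain needs before generation starts.
information_base = {
    "domain": "irish_census",    # which domain from the domain base to use
    "dataset_size": 5000,        # number of tuples to generate
    "output_path": "census_5k.csv",
}

def validate_information_base(info, required=("domain", "dataset_size")):
    """Check that every parameter required by the chosen domain is present."""
    missing = [key for key in required if key not in info]
    if missing:
        raise ValueError(f"missing parameters: {missing}")
    return True
```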
The key element of COCOA is its domain base, which encapsulates the expert knowledge of the supported business domains (e.g., census, healthcare, finance). This element allows COCOA to be easily extensible and capable of incorporating multiple business cases (even for the same domain, e.g., Irish census, USA census), which might be suitable for different test scenarios. In this context, a domain defines the rules and constraints required to generate a dataset that preserves the functional dependencies of a business case. Each domain is characterized by a name, a set of attributes and their corresponding attribute generators.
Fig. 1. COCOA - Conceptual View.
To generate data, the domains make use of the available set of attribute generators. These elements are supporting logic offering various strategies to generate the values of attributes. For example, a generator might focus on diversifying the data in an attribute by fitting it to a data distribution (e.g., a normal distribution). In this example, several generators can be combined to offer different data distributions (as the appropriate distribution might vary depending on the usage scenario). If an attribute generator requires particular settings to work properly (e.g., default values for its applicable parameters), this information (e.g., the mean and variance for a normal distribution) can also be captured by the framework (as an attribute generator setting).
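The interplay between a distribution-based generator and its settings can be sketched as follows; the factory functions and setting keys are illustrative assumptions, not COCOA's actual code.

```python
import random

def make_normal_generator(settings):
    """Distribution-based attribute generator: draws each value from a
    normal distribution whose parameters come from the generator settings,
    falling back to defaults when none are supplied."""
    mean = settings.get("mean", 0.0)
    stddev = settings.get("stddev", 1.0)
    return lambda: random.gauss(mean, stddev)

def make_uniform_generator(settings):
    """A second distribution strategy; several such generators can coexist
    so each attribute uses the distribution that fits its usage scenario."""
    low = settings.get("low", 0.0)
    high = settings.get("high", 1.0)
    return lambda: random.uniform(low, high)
```

For instance, `age = make_normal_generator({"mean": 38, "stddev": 12})` yields a callable that produces one plausible age value per call.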
entity object is created to represent the new tuple. Next, its attributes
are populated with the new values and the tuple entity is added to the
dataset. Moreover, any exceptions are internally handled and reported.
This core process continues iteratively until the new dataset has been
fully generated. As a final step, the dataset is saved to disk.
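The core process above (create a tuple entity, populate its attributes, append it to the dataset, handle exceptions internally, and finally save to disk) can be sketched as a loop; the function and names are illustrative, not COCOA's actual implementation.

```python
import csv

def generate_dataset(size, generators, output_path):
    """Core generation loop: for each new tuple, create an entity,
    populate its attributes via the attribute generators, and append it
    to the dataset; exceptions are collected rather than aborting the
    run, and the finished dataset is saved to disk as a final step."""
    dataset, errors = [], []
    for _ in range(size):
        tuple_entity = {}
        try:
            for attribute, generator in generators.items():
                tuple_entity[attribute] = generator()
            dataset.append(tuple_entity)
        except Exception as exc:   # handled internally and reported
            errors.append(exc)
    with open(output_path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(generators))
        writer.writeheader()
        writer.writerows(dataset)
    return dataset, errors
```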
3.3 Architecture
COCOA is complemented by the architecture presented in the component diagram [9] of Fig. 2b. COCOA is composed of three main components. The generic component contains the control logic and all supporting functionality that is independent of the supported domains and attribute generators (e.g., the monitor and validate phases of the core process). The logic that interfaces with the domains needs to be customized per domain (by defining a tuple entity and a dataset generator), and the same applies to the supported attribute generators. Therefore, these two pieces of logic are encapsulated in their respective components to minimize the required code changes. To complement this design strategy, the components are only accessed through interfaces. This is exemplified in Fig. 2c, which presents the high-level structure of the attribute generator component. It contains a main interface, IGenerator, which exposes all required actions, and an abstract class for all the common functionality. This hierarchy can then be extended to support specific attribute generators.
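The interface-plus-abstract-class hierarchy of Fig. 2c might look like the following; apart from the name IGenerator (taken from the text), the classes and methods are assumptions for illustration.

```python
from abc import ABC, abstractmethod

class IGenerator(ABC):
    """Main interface: exposes the actions every attribute generator supports."""
    @abstractmethod
    def generate(self):
        """Produce one value for the attribute."""

class AbstractGenerator(IGenerator):
    """Common functionality shared by all generators (here, settings
    handling), so concrete classes implement only their own strategy."""
    def __init__(self, settings=None):
        self.settings = dict(settings or {})

    def setting(self, key, default=None):
        return self.settings.get(key, default)

class ConstantGenerator(AbstractGenerator):
    """One concrete extension of the hierarchy: returns a fixed value."""
    def generate(self):
        return self.setting("value")
```

Because clients depend only on IGenerator, a new generator can be dropped in without touching the generic component.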
4 Experimental Evaluation
Here we present the experiments performed to assess the benefits and
costs of using COCOA. Firstly, we evaluated how well COCOA achieved
its objective of generating diverse datasets within a domain. Secondly, we
assessed how useful the resulting datasets are to strengthen the testing
of an anonymization algorithm. Thirdly, we evaluated COCOA’s costs.
Fig. 3. PC1 vs PC2 for the Insurance dataset with sizes (a) 5K and (b) 100K.
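The PC1/PC2 scatter plots of Fig. 3 are a standard PCA projection; a minimal sketch of such a projection, assuming the dataset has already been encoded as a numeric matrix (this is an illustration, not the paper's evaluation code):

```python
import numpy as np

def first_two_pcs(X):
    """Project the rows of X onto the first two principal components
    (PCA computed via SVD of the mean-centered data matrix)."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T   # shape (n_samples, 2): the PC1 and PC2 scores
```

A wide spread of points in this two-dimensional plane indicates diverse tuples.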
and execution time. For instance, the generation of the largest datasets
(i.e., 100K) took an average of 4 sec (with a standard deviation of 1.5
sec). Similarly, the CPU never exceeded 10% (meaning that there was a
considerable amount of idle resources to support larger dataset sizes). In
terms of memory, COCOA only used approximately 35% of the available
memory. It also never triggered a MaGC, which was another indicator
that the memory settings were always appropriate for COCOA. Costs
were also analyzed per domain. Although comparable across domains, the biggest costs were experienced by the German Credit domain, because it contains the largest number of attributes. A second factor influencing the execution time was the complexity of the attribute generators. In this sense, the distribution-based generators tend to be less expensive (in terms of resources) than the other two types.
(Figure: data generation execution time (ms), CPU usage, and memory usage (MB) versus dataset size, for sizes 5K-100K.)
References
1. COCOA Datasets, https://github.com/ucd-pel/COCOA/
2. CSO. Access to Microdata, http://www.cso.ie/en/aboutus/dissemination/
accesstomicrodatarulespoliciesandprocedures/accesstomicrodata/
3. Data Benerator Tool, http://databene.org/databene-benerator
4. Eurostat. Access to Microdata, http://ec.europa.eu/eurostat/web/microdata
5. OpenForecast Library, http://www.stevengould.org/software/openforecast/
index.shtml
6. Payscale USA, http://www.payscale.com/research/US/
7. simPop: Simulation of Synthetic Populations for Survey Data Considering Auxil-
iary Information, https://cran.r-project.org/web/packages/simPop
8. synthpop: Generating Synthetic Versions of Sensitive Microdata for Statistical Dis-
closure Control, https://cran.r-project.org/web/packages/synthpop
9. UML basics: The component diagram, https://www.ibm.com/developerworks/
rational/library/dec04/bell/
10. US Census. Restricted-Use Microdata, http://www.census.gov/research/data/
restricted_use_microdata.html
11. UTD Anonymization ToolBox, http://cs.utdallas.edu/dspl/cgi-bin/
toolbox/
12. Ayala-Rivera, V., McDonagh, P., Cerqueus, T., Murphy, L.: Synthetic data gener-
ation using benerator tool. arXiv preprint arXiv:1311.3312 (2013)
13. Ayala-Rivera, V., McDonagh, P., Cerqueus, T., Murphy, L.: A systematic compar-
ison and evaluation of k-anonymization algorithms for practitioners. Transactions
on Data Privacy 7(3), 337–370 (2014)
14. Blackburn, S.M., Garner, R., Hoffmann, C., Khang, A.M., McKinley, K.S.,
Bentzur, R., Diwan, A., Feinberg, D., Frampton, D., Guyer, S.Z., et al.: The da-
capo benchmarks: Java benchmarking development and analysis. In: ACM Sigplan
Notices. vol. 41, pp. 169–190. ACM (2006)
15. Chow, K., Wright, A., Lai, K.: Characterization of java workloads by principal
components analysis and indirect branches. In: Proceedings of the Workshop on
Workload Characterization (WWC-1998), held in conjunction with the 31st Annual
ACM/IEEE International Symposium on Microarchitecture (MICRO-31). pp. 11–
19 (1998)
16. Eeckhout, L., Georges, A., De Bosschere, K.: How java programs interact with vir-
tual machines at the microarchitectural level. In: ACM SIGPLAN Notices. vol. 38,
pp. 169–186. ACM (2003)
17. Hoag, J.E., Thompson, C.W.: A parallel general-purpose synthetic data generator.
ACM SIGMOD Record 36(1), 19–24 (2007)
18. Hu, J., Reiter, J.P., Wang, Q.: Disclosure risk evaluation for fully synthetic cate-
gorical data. In: Privacy in Statistical Databases. pp. 185–199. Springer (2014)
19. Iyengar, V.S.: Transforming data to satisfy privacy constraints. In: Int. Conf. on
Knowledge Discovery and Data Mining. pp. 279–288 (2002)
20. LeFevre, K., DeWitt, D., Ramakrishnan, R.: Mondrian Multidimensional K-
Anonymity. In: Int. Conf. Data Eng. p. 25 (2006)
21. Lichman, M.: UCI Machine Learning Repository (2013)
22. Machanavajjhala, A., Kifer, D., Abowd, J., Gehrke, J., Vilhuber, L.: Privacy: The-
ory meets Practice on the Map. In: Int. Conf. Data Eng. pp. 277–286 (2008)
23. Mateo-Sanz, J.M., Martı́nez-Ballesté, A., Domingo-Ferrer, J.: Fast generation of
accurate synthetic microdata. In: Privacy in Statistical Databases. pp. 298–306.
Springer (2004)
24. Pedersen, K.H., Torp, K., Wind, R.: Simple and realistic data generation. In: Pro-
ceedings of the 32nd International Conference on Very Large Data Bases. Associ-
ation for Computing Machinery (2006)
25. Portillo-Dominguez, A.O., Perry, P., Magoni, D., Wang, M., Murphy, J.: Trini:
an adaptive load balancing strategy based on garbage collection for clustered java
systems. Software: Practice and Experience (2016)
26. Rubin, D.B.: Discussion of statistical disclosure limitation. J. Off. Stat. 9(2), 461–
468 (1993)
27. Samarati, P.: Protecting respondents' identities in microdata release. Trans. on
Knowledge and Data Engineering 13(6), 1010–1027 (2001)
28. Sweeney, L.: k-Anonymity: A model for protecting privacy. Int. J. Uncertain. Fuzzi-
ness Knowl.-Based Syst 10(05), 557–570 (Oct 2002)
29. Walck, C.: Handbook on statistical distributions for experimentalists (2007)
Appendix A Structure of the Irish Census and Insurance
Domains
Tables 1 and 2, respectively, list the attributes and the type of generators used for producing
data for the Irish census and insurance domains (discussed in Section 3.5).