Automatic Construction of Generalization Hierarchies
1 Introduction
Microdata (i.e., records about individuals) is a valuable resource for organiza-
tions. By exploiting it, companies acquire knowledge to improve or create new
business models. For this reason, many organizations are actively collecting and
publishing data. However, data must be anonymized before being shared for anal-
ysis as it may contain sensitive personal information (e.g., medical conditions)
that can bring harm to the involved parties if it is disclosed (e.g., negative pub-
licity, fines, identity theft). Privacy-Preserving Data Publishing (PPDP) offers
methods for publishing data without compromising individuals’ confidentiality,
while trying to retain the data utility for a variety of tasks [5].
k-Anonymity is a fundamental principle for protecting privacy in the release of
microdata [5,21]. It requires that each record appears with at least k occurrences
with respect to the quasi-identifiers (QIDs), i.e., attributes that can be linked to
external information to reidentify individuals in anonymized datasets. Generalization
is the most widely used technique to achieve k-anonymity [21]. It consists
of replacing the original QIDs’ values with less precise (but semantically consistent)
ones, reducing the risk of reidentification (e.g., “surgeon” with “doctor”).
Generalization is usually conducted using concept hierarchies, known as Value
Generalization Hierarchies (VGHs), which indicate the transformations that an
attribute can undergo. Fig. 1 shows an example of a VGH. The leaves (L0) cor-
respond to the real values of an attribute in the dataset, and the ancestors (L1
to L3) correspond to the candidate values used for generalization.
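As a minimal illustration of these two notions, the sketch below (with toy records and a hypothetical one-level VGH, not taken from the paper) checks whether a set of records satisfies k-anonymity over a QID, and shows how generalizing values through a VGH mapping can restore it:

```python
from collections import Counter

def is_k_anonymous(records, qids, k):
    """Check whether every QID value combination occurs at least k times."""
    counts = Counter(tuple(r[q] for q in qids) for r in records)
    return all(c >= k for c in counts.values())

# Toy dataset: "job" is the quasi-identifier (hypothetical values).
records = [
    {"job": "surgeon"}, {"job": "dentist"},
    {"job": "lawyer"}, {"job": "judge"},
]
print(is_k_anonymous(records, ["job"], 2))  # False: every value occurs once

# A hypothetical one-level VGH: replace leaves with their L1 ancestors.
vgh = {"surgeon": "doctor", "dentist": "doctor",
       "lawyer": "professional", "judge": "professional"}
generalized = [{"job": vgh[r["job"]]} for r in records]
print(is_k_anonymous(generalized, ["job"], 2))  # True: each ancestor occurs twice
```

After generalization, "doctor" and "professional" each cover two records, so 2-anonymity holds with respect to the QID.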
VGH design is a burdensome process for data publishers (i.e., people involved
in the dissemination of data in a safe and useful manner; hereinafter referred to
as users), as one VGH needs to be created per QID, based on the input dataset.
If the input values change, VGHs must be modified accordingly, which requires
additional manual effort. While it is feasible to create VGHs of small size, the
effort considerably increases when larger VGHs are required (e.g., open-ended
surveys), or in scenarios where data constantly changes (e.g., streaming data). To
tackle this issue, various approaches to generate VGHs automatically have been
proposed [8]. However, most of them are designed for numerical attributes, while
methods applicable to categorical data remain scarce. Numerical approaches
typically create intervals that fit the distribution of the input data. Thus, they
are not suitable for categorical data, as they ignore its inherent semantics (a key
factor in preserving its meaning). The construction of categorical VGHs presents
even more challenges [12]: disambiguating the concepts’ senses, defining
meaningful labels to represent clustered lower-level concepts, etc.
Traditionally, categorical VGHs are designed by users based on their own
knowledge and experience, as it is commonly assumed that they are fully capable
of bringing adequate domain expertise to the construction of VGHs [8]. A key
problem of this practice is that the quality of VGHs is evaluated in a subjective
and informal way. This issue can lead to misclassifications or inconsistencies,
which significantly impact the quality of the anonymized data. To mitigate this
issue, knowledge engineers often participate in the evaluation process. However,
the process may become expensive due to the limited availability of experts and
the laborious work involved. Consequently, the design of VGHs is normally a
highly error-prone and time-consuming process.
Considering these challenges, our paper has the following contributions:
1. A knowledge-based framework (AIKA) to automatically construct and eval-
uate categorical VGHs for anonymization, which considers users’ preferences.
2. A comprehensive practical evaluation of AIKA, consisting of a prototype
and a set of experiments to assess the benefits of AIKA for the creation and
evaluation of VGHs for anonymization, as well as the costs of using AIKA.
3. A case-study comparing the quality and efficiency of the VGHs generated
by AIKA against VGHs manually created.
2 Related Work
Several methods for creating “good” VGHs (i.e., those that yield a good util-
ity in the data after anonymization) have been proposed in literature. However,
most of them focus on numerical attributes. For instance, the authors of [8]
propose the on-the-fly generation of generalization hierarchies for numerical attributes.
3 AIKA Framework
In this section, we provide the context of our solution and describe the methods
proposed for the automatic construction and evaluation of VGHs for PPDP.
3.1 Overview
To address the need for assisting the users in the design of VGHs, we followed
a typical design science research approach [17] to develop our solution. It con-
sists of a knowledge-based framework (AIKA) for the automatic construction
and evaluation of VGHs to be used in data anonymization. Our goal is to offer
a mechanism that not only reduces the human effort and expertise required to
design and evaluate VGHs, but also improves the quality of the generated VGHs.
Fig. 2 depicts the contextual view of AIKA in PPDP: (1) A trusted entity collects
personal data and is required to publish it. Thus, datasets must be anonymized
before being disseminated. (2) The user selects the QIDs to be generalized from
the datasets. (3) For each QID, the user manually creates candidate VGHs mod-
eling their corresponding domains. (4) Once the user is confident about the cre-
ated VGHs, they are used to anonymize the data. (5) The user then evaluates
the utility and disclosure risk of the anonymized data. (6) If they are accept-
able, the data is released. Otherwise, a new anonymization cycle starts (Step 3).
AIKA fits into Step 3, where the VGHs are designed. (3a) AIKA consists of two
components: a constructor and an evaluator. The constructor (see Section 3.2)
automatically generates various candidate VGHs for a particular domain by ex-
ploiting information from a knowledge base and the original dataset. Note that
the constructor does not generate a single “optimal” VGH but a set of VGHs
that can fulfill the needs of different use cases. The candidate VGHs are passed
to the evaluator (see Section 3.3), where the VGHs are objectively assessed with
quantifiable metrics from multiple perspectives. (3b) The user can inspect the
VGHs and adjust (or re-evaluate) them as needed. (4) After evaluation, the best
VGHs can be used to drive the data anonymization with stronger guarantees that
those VGHs will help to retain the desired levels of data usefulness and disclosure
risk (hence eliminating the need for costly trial-and-error anonymization cycles).
3.2 VGH Constructor
The constructor uses WordNet as the KB, and Wu and Palmer as the semantic
similarity metric (two of the most widely used resources in knowledge-based
systems) [16]. Note that relying on a single ontology does not represent a
limitation for AIKA, as several works support the integration of ontologies [20].
Also, the ontology used by AIKA is configurable.
(1) Words Extraction and Word Sense Disambiguation (WSD).
First, the constructor identifies the leaf nodes of the VGH (by extracting the
distinct values of the QID from the input dataset) and calculates their frequencies
of occurrence. Next, WSD is performed, i.e., determining the right sense of
each word. In AIKA, we use the adapted Lesk algorithm [7], a gloss-based
method that relies on the definition of a word (using WordNet as the gloss
dictionary). This technique is suitable for microdata anonymization because
there is no background context (e.g., documents or a corpus) that can be exploited.
To mitigate any remaining noise (i.e., incorrect senses), AIKA allows
users to provide (or adjust) the senses of the individual words.
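The gloss-overlap idea behind the Lesk family of algorithms can be sketched as follows. This is a simplified variant over a toy gloss dictionary (the entries and sense names are hypothetical); AIKA itself uses WordNet glosses and the refinements of the adapted algorithm [7]:

```python
# Toy gloss dictionary (hypothetical entries standing in for WordNet glosses).
GLOSSES = {
    "bank": {
        "bank.n.01": "sloping land beside a body of water river",
        "bank.n.02": "financial institution that accepts deposits money",
    },
    "loan": {"loan.n.01": "money lent at interest by a financial institution"},
}

STOPWORDS = {"a", "of", "by", "at", "that", "the"}

def tokens(text):
    """Content words of a gloss."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def lesk(word, context_words):
    """Pick the sense whose gloss overlaps most with the context glosses."""
    context = set()
    for cw in context_words:
        for gloss in GLOSSES.get(cw, {}).values():
            context |= tokens(gloss)
    best, best_overlap = None, -1
    for sense, gloss in GLOSSES[word].items():
        overlap = len(tokens(gloss) & context)
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best

print(lesk("bank", ["loan"]))  # bank.n.02: the financial sense wins
```

Here the gloss of "loan" shares "money", "financial", and "institution" with the second sense of "bank", so that sense is selected.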
(2) Construction of “base” VGH. To start the generation of VGHs,
AIKA extracts the minimal hierarchy that subsumes all the leaf values from the
ontology. That is, for each leaf, it extracts the hypernym tree from WordNet.
Then, all branches are merged into the “base” VGH. This VGH forms the basis
for all other candidate VGHs, which will be later derived from it. Using the
subsumption hierarchy is appropriate in our scenario as it reflects the principle
of specialization/generalization used by data generalization techniques.
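The merge of the per-leaf hypernym trees into a "base" VGH can be sketched as follows, assuming the VGH is represented as a child-to-parent map and using hypothetical WordNet-style hypernym paths:

```python
def build_base_vgh(hypernym_paths):
    """Merge per-leaf hypernym paths (root -> ... -> leaf) into one tree.

    Returns a child -> parent map representing the 'base' VGH.
    """
    parent = {}
    for path in hypernym_paths:
        for ancestor, child in zip(path, path[1:]):
            parent[child] = ancestor  # shared prefixes are merged automatically
    return parent

# Hypothetical hypernym paths for three leaf values of a "job" QID.
paths = [
    ["entity", "person", "professional", "doctor", "surgeon"],
    ["entity", "person", "professional", "doctor", "dentist"],
    ["entity", "person", "professional", "lawyer"],
]
vgh = build_base_vgh(paths)
print(vgh["surgeon"])  # doctor
print(vgh["lawyer"])   # professional
```

Because the paths share the prefix entity/person/professional, the merged map contains each shared edge only once, yielding the minimal subsumption hierarchy over the leaves.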
(3) Automatic Adjustments. This step consists in applying a series of
automatic transformations to the “base” VGH with the objective of deriving
multiple candidate VGHs that can be used to fulfill the requirements of different
use cases. This is because the released anonymized data is intended to be used by
multiple parties for different purposes. In general, such transformations vary the
taxonomic structure and degree of data semantics of the “base” VGH, hence the
characteristics of the derived candidate VGHs are diversified. Below, we describe
the different types of adjustments performed by the VGH constructor:
a. Reduce abstractness (Fig. 4a) prunes the hierarchy at the lowest level
where all the branches are connected. This adjustment naturally meets the mono-
tonicity property [13] extensively used in anonymization: if the generalization T ∗
at level i preserves privacy, then every generalization of T ∗ at level i + 1 also pre-
serves privacy. That is, all successors of an anonymous state are also anonymous.
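A minimal sketch of this pruning, again assuming a child-to-parent map for the VGH (an assumption of this sketch, not the paper's data structure): the hierarchy is re-rooted at the lowest common ancestor of the leaves, i.e., the lowest level where all branches are already connected.

```python
def path_to_root(parent, node):
    """Return the path [node, ..., root] following child -> parent links."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def reduce_abstractness(parent, leaves):
    """Re-root the VGH at the lowest common ancestor of the leaves,
    dropping the levels above the point where all branches connect."""
    paths = [path_to_root(parent, leaf)[::-1] for leaf in leaves]  # root-first
    lca = None
    for level in zip(*paths):
        if len(set(level)) == 1:
            lca = level[0]  # still a single shared ancestor at this level
        else:
            break
    pruned = {c: p for c, p in parent.items()
              if c != lca and lca in path_to_root(parent, c)}
    return pruned, lca

# Toy base VGH as a child -> parent map (hypothetical values).
parent = {"person": "entity", "professional": "person",
          "doctor": "professional", "lawyer": "professional",
          "surgeon": "doctor", "dentist": "doctor"}
pruned, root = reduce_abstractness(parent, ["surgeon", "dentist", "lawyer"])
print(root)  # professional
```

The levels "entity" and "person" add abstractness without separating any branches, so the pruned VGH starts at "professional".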
b. Reduce outliers (shown in Fig. 4b) avoids over-generalizing the data
by reducing the possible outliers in the VGH (e.g., due to data sparseness).
The aim is to tailor the VGHs for a given syntactic privacy model (e.g., k-
value for k-anonymity) so that the privacy condition is satisfied at the lowest
possible level (where the information loss is lower). When it is possible (i.e., the
frequency sum of the outliers is ≥k and the semantic consistency of the VGH
is respected), the outliers can be aggregated into groups so that k is satisfied.
The new node (common ancestor of a group of outliers) can be one of three
possibilities: one of the parents of the outliers; one of the outliers, promoted as
parent; or the root node replicated (implying the full suppression of the values).
All these alternatives are viable depending on the data anonymization scenario.
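The feasibility check behind this adjustment can be sketched as follows (the frequencies are hypothetical, and the semantic-consistency check and the choice of the new common ancestor are omitted):

```python
def reduce_outliers(frequencies, k):
    """Identify leaf values with frequency < k and check whether their
    combined frequency reaches k, so they can be grouped under a new
    common ancestor. Returns the candidate group, or None if grouping
    alone cannot satisfy k."""
    outliers = {v: f for v, f in frequencies.items() if f < k}
    if sum(outliers.values()) >= k:
        return set(outliers)
    return None

# Hypothetical leaf frequencies for a "job" QID, with k = 5.
freqs = {"surgeon": 10, "dentist": 2, "chiropractor": 1, "podiatrist": 2}
group = reduce_outliers(freqs, 5)
print(sorted(group))  # ['chiropractor', 'dentist', 'podiatrist']
```

Here the three sparse values sum to exactly k = 5, so aggregating them under one ancestor lets the privacy condition be satisfied one level lower than full generalization would require.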
the user to indicate the importance of each aspect. The best VGH is the one
that maximizes E(V ) given the chosen weights. This is given by (2):
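A plausible form for (2), assuming a simple weighted aggregation of the per-aspect scores $s_i(V)$ (the semq, priv, distrn, and isize aspects) with the user-chosen weights $w_i$ (this is a reconstruction under that assumption, not necessarily the paper's exact formula):

```latex
E(V) = \sum_{i} w_i \, s_i(V), \qquad \text{with } \sum_{i} w_i = 1,\; w_i \ge 0.
```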
4 Experimental Evaluation
The experiments had three objectives: (1) to assess the benefits of using AIKA
(i.e., its capability to create good quality VGHs and estimate their effectiveness
in anonymization); (2) to assess the costs of using AIKA (in terms of computa-
tional resources); and (3) to compare AIKA’s benefits and costs against those of
manually generated VGHs. As evaluation data, we used four publicly available
datasets: Adult [14] consists of census information; German Credit [14] contains
credit applicants’ information; Chicago Homicide [1] has information about
homicides filed by the Chicago police; and Insurance [2] contains personal
information useful for risk assessment. For each dataset, we chose the categorical attributes with
the most heterogeneous values as QIDs (Table 1) to diversify the tested domains;
then, we generated VGHs for them using AIKA. To assess the performance of
the VGHs in anonymization, we used the commonly-used anonymization algo-
rithm Datafly [21] (from the UTD Anonymization Toolbox [3]). We also tested
a broad range of privacy levels, varying the k-values ∈ [2..100]. All experiments
were run on a computer with an Intel Core i7-4702HQ CPU at 2.20 GHz, 8 GB
of RAM, Windows 8.1 64-bit, and HotSpot JVM 1.7 with a 1GB heap. Finally,
AIKA’s prototype was developed in Java, internally using the WS4J library [4].
To measure data utility, we used SSE, GenILoss, and CAVG, associated with the
semq, distrn, and isize aspects, respectively. To measure the data disclosure risk
(DR), we used record similarity [15], associated with the priv aspect. Due to space constraints,
we only present the most relevant results (as this experiment involved the gener-
ation/evaluation of approximately 1.4K VGHs and 138K anonymized solutions).
To assess how well the properties of the VGHs were captured by AIKA’s eval-
uator, we calculated the degree of correlation between the VGH quality scores
and the quality of the anonymized datasets. For this purpose, we used the Spear-
man’s rank order correlation (rSpm ), which measures the strength of a monotonic
(but not necessarily linear) relationship between paired data. rSpm can take val-
ues from -1 to +1. The closer the value is to ±1, the stronger the relationship.
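Spearman's coefficient is simply the Pearson correlation computed on the ranks of the paired values. A self-contained sketch (the scores below are illustrative, not the paper's data):

```python
def rankdata(xs):
    """Assign 1-based ranks, averaging ranks over ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1  # extend over the tie group
        avg = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Illustrative VGH quality scores vs. anonymized-data quality.
scores = [0.9, 0.7, 0.5, 0.3, 0.1]
quality = [0.8, 0.9, 0.4, 0.3, 0.2]
print(round(spearman(scores, quality), 2))  # 0.9
```

Because only ranks matter, the measure captures monotonic (not necessarily linear) agreement, which is why it suits comparing VGH quality rankings against data quality rankings.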
The results showed that AIKA worked well (Fig. 6), as a strong correlation
(i.e., rSpm ≥ 0.60) was achieved by all metric/aspect combinations when
a high weight was used (e.g., 75% and 100%). Fig. 6a shows the results for the
semq aspect. There, it can also be noticed how the correlation level gradually
decreases, following a trend similar to the decrease in the semq weight. This is
a consequence of considering other aspects and exemplifies the trade-offs
experienced in anonymization (i.e., one sacrifices utility to enforce privacy). This
behavior is also reflected in the standard deviations of the low weights, which
tend to be higher than those of higher weights. Figs. 6b, 6c, and 6d depict the
results of the other aspects. It can be noticed that the aspects behaved similarly,
as they achieved comparable levels (and trends) of correlation.
To complement this analysis, an example of the correlation plots is shown
in Fig. 7, where it can be noticed how the VGH quality rankings closely resemble
the disclosure risk of the corresponding anonymized datasets (rSpm = 0.73).
[Plots omitted: each panel shows the Spearman correlation (0 to 1) against the aspect weights (0 to 100%).]
(a) Corr. semq and SSE. (b) Corr. priv and DR. (c) Corr. distrn and GenILoss. (d) Corr. isize and CAVG.
Fig. 6: Correlations between VGH evaluator criteria and data quality metrics
[Plots omitted. Fig. 7 plots the disclosure risk of the VGHs ranked by priv; Fig. 8 plots the correlation of semq and SSE per dataset (adult, german, homicide, insurance).]
Fig. 7: priv (w2 = 100%) vs DR (rSpm = 0.73). Fig. 8: Corr. semq and SSE per dataset.
[Figs. 9 and 10 omitted. Fig. 9 shows the winning percentage per attribute (Att3 to Att6) of the Insurance dataset; Fig. 10 compares, on a log scale, the unitary and overall average times of the VGH design strategies (AIKA C+E vs. manual).]
To compare the quality of the two VGH sets, we first evaluated them using
AIKA (with the 35 sets of weights previously discussed) and analyzed their
corresponding quality rankings. This analysis showed that the A-VGHs drastically
outperformed the M-VGHs, as in more than 95% of the 140 cases, an A-VGH
was ranked #1. This is depicted in Fig. 9, which shows the number of wins
(i.e., ranks #1) achieved by each VGH type. We also compared their differences
in rankings and reward scores. This showed that when an A-VGH was not the
best (i.e., did not win), the ranking difference was minimal (only one place). In
contrast, M-VGHs always lost by several places (an average of 14). The same
behavior was observed in terms of reward scores. Also, in the few cases where
M-VGHs won, those VGHs were created by the participants who invested the
longest time designing the VGHs (meaning that they were expensive wins).
Next, we assessed the time-savings gained by AIKA. First, the time required
by AIKA to create/evaluate (C+E) one VGH was compared against the time re-
ported by the participants. This comparison showed that AIKA offers significant
time-savings, as its unitary cost was 99.99% smaller. We also compared the time
required to create all VGHs of each type. This also proved AIKA’s usefulness,
as the time-savings were also significant (an average decrease of 99.95%). These
results are depicted in Fig. 10. It is also worth noting that: (i) the manual effort
only considers the intrinsic evaluation performed during the construction of the
VGHs; if any extrinsic evaluation were performed, the time-savings would be
even higher; (ii) AIKA created/evaluated more VGHs (an average of 100) than the
participants (16), meaning that the domains were more exhaustively explored.
Moreover, the applicability of AIKA can be broader. Thus, we plan to apply it
to other areas where concepts are hierarchically ordered and data semantics is
the main property to be preserved.
References
1. Chicago Homicides. https://data.cityofchicago.org
2. Insurance. https://github.com/ucd-pel/Datasets/tree/master/Insurance
3. UTD ToolBox. http://cs.utdallas.edu/dspl/cgi-bin/toolbox/
4. WS4J library. https://code.google.com/p/ws4j/
5. Ayala-Rivera, V., McDonagh, P., Cerqueus, T., Murphy, L.: A Systematic Com-
parison and Evaluation of k-Anonymization Algorithms for Practitioners. Trans.
Data Priv. 7(3), 337–370 (2014)
6. Ayala-Rivera, V., McDonagh, P., Cerqueus, T., Murphy, L.: Ontology-Based Qual-
ity Evaluation of Value Generalization Hierarchies for Data Anonymization. In:
PSD (2014)
7. Banerjee, S., Pedersen, T.: An Adapted Lesk Algorithm for Word Sense Disam-
biguation Using WordNet. In: CICLing. pp. 136–145 (2002)
8. Campan, A., Cooper, N., Truta, T. M.: On-the-fly generalization hierarchies for
numerical attributes revisited. In: Secur. Data Manag. pp. 18–32 (2011)
9. D’Aquin, M., Noy, N.F.: Where to Publish and Find Ontologies? A Survey of
Ontology Libraries. Web Semantics 11, 96–111 (2012)
10. Domingo-Ferrer, J., Sánchez, D., Rufian-Torrell, G.: Anonymization of nominal
data based on semantic marginality. Information Sciences 242, 35–48 (2013)
11. Kröll, M., Fukazawa, Y., Ota, J., Strohmaier, M.: Concept Hierarchies of Health-
Related Human Goals. In: KSEM. pp. 124–135 (2011)
12. Lee, S., Huh, S.-Y., McNiel, R. D.: Automatic generation of concept hierarchies
using WordNet. Expert Syst. Appl. 35(3), 1132–1144 (2008)
13. LeFevre, K., DeWitt, D.J., Ramakrishnan, R.: Incognito: Efficient full-domain k-
anonymity. In: Int. Conf. Manag. Data. pp. 49–60 (2005)
14. Lichman, M.: UCI Machine Learning Repository (2013)
15. Martínez, S., Sánchez, D., Valls, A., Batet, M.: Privacy protection of textual at-
tributes through a semantic-based masking method. Inf. Fusion 13, 304–314 (2012)
16. Meng, L., Huang, R., Gu, J.: A Review of Semantic Similarity Measures in Word-
Net. Int. J. Hybrid Inf. Technol. 6(1), 1–12 (2013)
17. Peffers, K., Tuunanen, T., Gengler, C.E., Rossi, M., Hui, W., Virtanen, V., Bragge,
J.: The Design Science Research Process: A Model for Producing and Presenting
Information Systems Research. In: DESRIST. vol. 24, pp. 83–106 (2006)
18. Portillo-Dominguez, A.O., Wang, M., Magoni, D., Perry, P., Murphy, J.: Load
balancing of java applications by forecasting garbage collections. In: ISPDC (2014)
19. Sánchez, D., Batet, M., Martínez, S., Domingo-Ferrer, J.: Semantic variance: An
intuitive measure for ontology accuracy evaluation. EAAI 39, 89–99 (2015)
20. Solé-Ribalta, A., Sánchez, D., Batet, M., Serratosa, F.: Towards the estimation
of feature-based semantic similarity using multiple ontologies. Knowledge-Based
Systems 55, 101–113 (2014)
21. Sweeney, L.: Achieving k-anonymity privacy protection using generalization and
suppression. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10(5), 571–588 (2002)
22. Wang, Y., Liu, W., Bell, D.: A Concept Hierarchy Based Ontology Mapping Ap-
proach. In: KSEM. pp. 101–113 (2010)