Data Scrambling: A Net 2000 Ltd. White Paper
Implementation Details
This paper is a general discussion of the issues and techniques involved in the preparation of
anonymous Oracle HR test instances. Net 2000 Ltd. sells a software tool called Data Masker
which resolves the issues discussed herein and makes the masking of the information in Oracle
HR schemas a simple and repeatable process. Pre-built rule sets and evaluation copies of the
software are available. Typical run times are between one and four hours for a complete
masking operation.
Having said that, this paper really is a generic survey of the issues involved in sanitizing the
data in an Oracle HR schema and there will be no further reference to any software. If you wish
to know more, or have any questions about the issues and techniques described below, please
contact us.
Table of Contents
Disclaimer
Why Provide Anonymous Information in Test Oracle HR Schemas?
Sanitizing the Oracle HR Schema
  Overview
  Techniques
  Issues
Oracle HR Tables Which Require Masking
  Important Note
The PER_ALL_PEOPLE_F Table
  Overview
  Decisions
  Specific actions on PER_ALL_PEOPLE_F Columns
The PER_ALL_ASSIGNMENTS_F Table
  Overview
  Specific actions on PER_ALL_ASSIGNMENTS_F Columns
The HR_COMMENTS Table
  Overview
Disclaimer
The contents of this document are for general information purposes only and
are not intended to constitute professional advice of any description. The
provision of this information does not create a business or professional services
relationship. Net 2000 Ltd. makes no claim, representation, promise,
undertaking or warranty regarding the accuracy, timeliness, completeness,
suitability or fitness for any purpose, merchantability, up-to-dateness or any
other aspect of the information contained in this paper, all of which is provided
"as is" and "as available" without any warranty of any kind.
Oracle HR schemas vary widely and each has a unique configuration. Readers
should take appropriate professional advice prior to performing any actions.
Anonymous Data for Test Oracle HR Schemas A Net 2000 Ltd. White Paper
Copyright Net 2000 Ltd. 2005 http://www.Net2000Ltd.com
on the data then a useful technique to provide enhanced security is to modify the data
so that no sensitive information remains. The process of modifying the data to remove
data sensitivity issues is known by a number of names: data masking, data
sanitization, data scrubbing or data cleansing.
Regardless of the name used, the general technique is to modify the existing data in
such a way as to remove all identifiable distinguishing characteristics, thus rendering
the data anonymous yet still usable as a test system.
In practice, masking the data in such a way that it remains functional requires a
variety of techniques. These techniques are discussed in detail in the companion white
paper available from Net 2000 Ltd. entitled "Data Sanitization Techniques". This paper
will concentrate on the specific issues found in Oracle HR schemas and the actions
needed to sanitize the data in a selected group of tables. As the data in many of the
tables is intricately interlinked and highly denormalized, each operation applied to the
schema must preserve the relationships. The methodology required is, in many cases,
extremely subtle.
range and distribution of values in the column within viable limits. For example, a
column of birth dates might have a random variance of 10% placed on it. Some
values would be higher, some lower, but all would stay close to their
original range.
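The variance idea can be sketched as follows. This is a minimal illustration only: the paper does not say how a 10% variance on a date is measured, so the sketch assumes a fixed window of days (the hypothetical `max_days` parameter), and the function name and sample dates are invented for the example.

```python
import random
from datetime import date, timedelta


def vary_date(original, max_days, rng):
    """Shift a date by a random whole number of days in [-max_days, +max_days]."""
    return original + timedelta(days=rng.randint(-max_days, max_days))


# Fixed seed only so the sketch is repeatable; sample dates are made up.
rng = random.Random(2005)
birth_dates = [date(1960, 3, 14), date(1975, 11, 2), date(1982, 7, 30)]

# Assumption: a window of 180 days either side keeps values plausible.
varied_dates = [vary_date(d, max_days=180, rng=rng) for d in birth_dates]
```

Some dates move earlier and some later, but each stays within the chosen window of its original value.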
Issues
Relevant Data. In test Oracle HR systems, most data will eventually appear on a front
end screen in one form or another. To be useful, the sanitized values must
resemble the look-and-feel of the original information. For example, surnames
should be replaced with random surnames. Usually it is not acceptable to insert
random collections of meaningless text.
issue either generate data items that meet the validity standard or shuffle the data
in the column among the rows so that no row contains its original data but each
data item is valid internally. It is a matter of judgement whether shuffling the
column data among the rows provides sufficient sanitization for the data.
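One way to shuffle a column so that no row keeps its own value is Sattolo's algorithm, which produces a permutation consisting of a single cycle. The sketch below is an illustration under assumptions, not a description of any particular tool: the function name and sample surnames are hypothetical, and if the column contains duplicate values a row can still receive a value equal to its original.

```python
import random


def sattolo_shuffle(values, rng):
    """Return a copy of values permuted so that no position keeps its own item.

    Sattolo's algorithm: like Fisher-Yates, but the swap partner is always
    strictly below the current index, which yields a single-cycle permutation.
    """
    items = list(values)
    for i in range(len(items) - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, never i itself
        items[i], items[j] = items[j], items[i]
    return items


# Hypothetical column contents; the seed is fixed only for repeatability.
surnames = ["Smith", "Patel", "Nguyen", "Garcia", "Okafor"]
shuffled = sattolo_shuffle(surnames, random.Random(1))
```

Every value in `shuffled` is a real value from the column, so it remains internally valid, but each row now holds a value taken from a different row.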
Free Format Data. Textual data such as letters, memos, disciplinary notes etc. are
practically impossible to sanitize in situ. Unless the masking algorithms are
extremely clever, or the format of the text is fixed, it is probable that some
information will be missed during the sanitization process. The usual way of
dealing with free format data is to replace all values with randomly generated
meaningless text (or simply null them) and then update certain selected data items
with carefully hand-sanitized examples. This will give the users of the test system
some realistic looking information to work on while preserving anonymity in the
remainder.
Consistent Masking. A common requirement for an Oracle HR sanitization process is
to ensure that the output is consistent across multiple runs. In practice this means
that if the name of employee Joe Smith gets changed to Bill Jones then the next
time the database is cloned and sanitized Joe Smith should again appear as Bill
Jones (not Jim Williams). Training teams, in particular, tend to require this feature
as they use a lot of pre-scripted examples.
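Repeatable output of this kind can be obtained by deriving the replacement from the original value with a keyed hash instead of a random draw. The following is a sketch under stated assumptions: the replacement list, the secret and the function name are hypothetical, and different originals may map to the same replacement, which is usually acceptable for names.

```python
import hashlib
import hmac

# Hypothetical replacement list and secret; both are assumptions for this sketch.
SURNAMES = ["Jones", "Williams", "Taylor", "Brown", "Davies", "Evans"]
SECRET = b"keep-this-value-out-of-the-test-instance"


def consistent_pick(original, choices, secret):
    """Pick the same replacement for the same original value on every run."""
    digest = hmac.new(secret, original.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(choices)
    return choices[index]


first_run = consistent_pick("Smith", SURNAMES, SECRET)
second_run = consistent_pick("Smith", SURNAMES, SECRET)  # same clone, later date
```

As long as the secret and the replacement list stay unchanged between runs, `first_run` and `second_run` are identical, so pre-scripted training examples keep working after each refresh.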
Isolated Case Phenomena. The aim of sanitizing the Oracle HR schema
is to preserve the privacy of the individual records. In general, anonymity is
derived from the presence of a large number of similar records. If a record stands
out in any way it could be attributable to an individual. For example, could the
record for the organization's CEO be determined by finding the largest salary in
the table? Sometimes each record is its own special case. An unmasked birth date
could readily be attributable to a specific individual. Whether this issue is
important depends largely on how the remainder of the information is masked.
Meta Information. Sometimes even if information is not attributable to a specific
individual, the collection of information might well be sensitive. For example,
salary figures may be anonymous because the associated employee names have
been masked, but does it matter if someone can add up the salary figures for a
department? It's a judgement which has to be made considering the specific
circumstances.
WHERE Clause Skips. When conducting masking operations, be careful how the data
to be sanitized is selected with WHERE clauses. It is easy, by making assumptions
about the content of data in the row, to leave data in some rows in its original
state. As an example, consider a table with a FIRST_NAME column and a GENDER
column. Don't replace all the FIRST_NAME fields where GENDER='M' with male
first names and the FIRST_NAME fields where GENDER='F' with female first
names unless you are absolutely sure that the GENDER column can contain only
'M' or 'F'. It is entirely possible that the GENDER field may contain some other
character (including null). Masking only the 'M' and 'F' GENDER fields will
leave the FIRST_NAME field in some rows unmasked. It is far better to mask all
rows with one option (male first names, for example) and then go through a
second time to mask every FIRST_NAME field where GENDER='F' with female
first names. This ensures that all rows have some sort of masking operation
applied regardless of the state of the GENDER field. WHERE Clause Skips can
lead to some quite insidious omissions; be sure to use full coverage to ensure
every record gets masked.
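The two-pass approach described above can be demonstrated with a small in-memory database. This sketch uses SQLite through Python's standard library purely as a stand-in for Oracle; the table name, sample rows and the single replacement name per pass are assumptions made to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (id INTEGER PRIMARY KEY, first_name TEXT, gender TEXT)"
)
# Hypothetical rows, including a null, an unexpected code and a dirty value.
conn.executemany(
    "INSERT INTO people (first_name, gender) VALUES (?, ?)",
    [("Joseph", "M"), ("Maria", "F"), ("Robin", None), ("Alexis", "U"), ("Dana", "f ")],
)

# Pass 1: no WHERE clause, so every row is masked whatever GENDER contains.
conn.execute("UPDATE people SET first_name = 'John'")
# Pass 2: refine only the rows known to be female.
conn.execute("UPDATE people SET first_name = 'Jane' WHERE gender = 'F'")

masked = conn.execute(
    "SELECT first_name, gender FROM people ORDER BY id"
).fetchall()
```

Had the first pass been restricted to `gender = 'M'`, the rows with a null, `'U'` or `'f '` in the GENDER column would have kept their original first names.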
Granularity. Is it necessary to sanitize absolutely everything? Or is masking enough
data to prevent attribution sufficient? For example, do job titles have to be
masked? Perhaps just removing a few examples of the Isolated Case Phenomena
is sufficient. Either way, decisions have to be made as to the depth of cleansing
requirements for the storage of information and implemented flex fields for this
purpose. These fields store site specific information and their usage varies widely
between implementations. An analysis of the flex field contents is required in
order to ensure the data is completely scrubbed clean of personal details.
Speed. Some of the tables are big. One has to be careful how the masking operation is
performed, otherwise it will take an inordinate amount of time to complete. For
As a bare minimum, the data in the above tables will need to be sanitized. Usually
other tables will need to be sanitized as well. The advice in the following section is only a
suggestion as to a masking approach for some of the trickier columns in the above
tables. Doubtless there are other methodologies. Please be aware the discussion of the
columns for the above tables is not complete! There are numerous important columns
that have not been discussed for space reasons; for the most part their data
sanitization requirements are pretty straightforward.
A careful analysis of each table and its contents will be required in order to
completely mask the data. Typically, multiple operations are required for each table.
The PER_ALL_PEOPLE_F table, for example, is a particularly complex case. Be sure
to check the flex fields in the tables to see if they require masking. Since the usage of
these columns is implementation-defined, the only way to determine if they require
sanitization is to look at them and find out what sort of data is in there.
Important Note
Needless to say (but we are going to say it anyway), only mask rows in test
instances. Even then, only mask schemas that you can recreate when necessary. In
general, masking operations are not reversible - there is no undo other than a
complete restore from backup.
classic Isolated Case Phenomena to worry about. There are a number of sites (police
forces, military and other paramilitary organizations) that have a variety of titles, with
progressively fewer holders as the rank increases. For example, there is probably
no need to mask the title "Constable" but the single "Chief Constable" record might
get re-titled to an already existing type, or simply deleted, to prevent
identification.
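Re-titling rare values to a common one can be sketched as a simple frequency check. The threshold, the fallback title and the function name below are assumptions for illustration; a real decision on which titles count as isolated cases is a judgement for each site.

```python
from collections import Counter


def retitle_rare(titles, min_count, fallback):
    """Replace any title held by fewer than min_count rows with a common one."""
    counts = Counter(titles)
    return [t if counts[t] >= min_count else fallback for t in titles]


# Hypothetical distribution: many constables, a few sergeants, one chief.
titles = ["Constable"] * 5 + ["Sergeant"] * 3 + ["Chief Constable"]
masked_titles = retitle_rare(titles, min_count=2, fallback="Constable")
```

After the call the single "Chief Constable" row carries the title "Constable", so it no longer stands out from the other records.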
Specific actions on PER_ALL_PEOPLE_F Columns
PERSON_ID This column forms part of the primary key. It is an internal Oracle
identifier and is generally meaningless. It can be masked, but there will be Table-Table
Data Synchronization issues with over 60 tables. These tables will not be listed
here, but if you wish to have this information just execute the query below:
select table_name from user_tab_columns
where column_name='PERSON_ID' order by table_name;
SEX Decide if this column is to be masked. If it is, then typically every record gets
set to 'M' and then a defined percentage (50% perhaps) is updated to an 'F' value.
Typically the PREVIOUS_LAST_NAME field needs to be synchronized so that the
majority of non-null PREVIOUS_LAST_NAME columns are associated with 'F' entries
since this mirrors reality. It may also be necessary to update other tables containing
details (such as pregnancy leave taken) to correlate with the re-gendered rows.
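The "everything to M, then a percentage to F" step for the SEX column might look like the sketch below. It is an illustration only: the function name is hypothetical, and assigning one value per PERSON_ID (so that all rows for the same person agree) is an assumption carried over from the Table-Internal synchronization described for the other columns.

```python
import random


def reassign_sex(person_ids, female_fraction, rng):
    """Set every person to 'M', then flip a chosen fraction of them to 'F'."""
    ids = sorted(set(person_ids))
    sex_by_person = {pid: "M" for pid in ids}          # full coverage first
    n_female = round(len(ids) * female_fraction)
    for pid in rng.sample(ids, n_female):               # then the refinement
        sex_by_person[pid] = "F"
    return sex_by_person


# Hypothetical PERSON_ID values 1..10; the seed is fixed only for repeatability.
sex_by_person = reassign_sex(range(1, 11), female_fraction=0.5, rng=random.Random(7))
```

The resulting dictionary can then drive the updates to PREVIOUS_LAST_NAME and the name columns so that they agree with the new values.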
EMPLOYEE_NUMBER Usually this column is rendered anonymous to prevent simple
lookups on known values. This is a varchar2 field and its structure is site-defined;
any replacement value must conform to the same formatting, otherwise the Intelligent
Key issue will manifest itself and many things in the resulting schema will break. Care
must also be taken not to update these values to collide with existing ranges, otherwise
unrelated rows can become associated with each other. The EMPLOYEE_NUMBER will
need to be synchronized Table-Internal and also Table-Table with the
PER_ALL_ASSIGNMENTS_F.ASSIGNMENT_NUMBER column at minimum and usually
others depending on the implementation of the system.
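A format-preserving replacement for a site-defined key can be sketched by randomizing only the digits and leaving every other character in place. This is a minimal sketch under assumptions: the sample numbers and function names are hypothetical, a real site format may carry check digits or meaningful digit groups that this does not respect, and the mapping is built once so the same old-to-new pairs can be applied to every table that holds the value.

```python
import random
import string


def randomize_digits(value, rng):
    """Replace each digit with a random digit; keep letters and punctuation."""
    return "".join(rng.choice(string.digits) if ch.isdigit() else ch for ch in value)


def mask_employee_numbers(numbers, rng, max_attempts=1000):
    """Build an old -> new mapping that avoids existing and already-issued values."""
    existing = set(numbers)
    used = set(existing)
    mapping = {}
    for original in sorted(existing):
        for _ in range(max_attempts):
            candidate = randomize_digits(original, rng)
            if candidate not in used:
                break
        else:
            raise ValueError(f"no free replacement found for {original!r}")
        used.add(candidate)
        mapping[original] = candidate
    return mapping


# Hypothetical employee numbers in two site-defined formats.
numbers = ["EMP-00123", "EMP-00124", "C-4471"]
mapping = mask_employee_numbers(numbers, random.Random(11))
```

Because new values are never allowed to match an existing or previously issued number, unrelated rows cannot become associated with each other through the masked key.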
FIRST_NAME This sensitive column needs Row-Internal synchronization with the
SEX column so that M records get male first names and F records get female first
names. Also requires Table-Internal synchronization with the other rows with the
same PERSON_ID.
MIDDLE_NAMES Always masked and usually needs Row-Internal synchronization
with the SEX column so that M records get male names and F records get female
names. Also requires Table-Internal synchronization with the other rows with the
same PERSON_ID.
LAST_NAME This column requires Table-Internal synchronization with the other
rows with the same PERSON_ID.
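The Row-Internal and Table-Internal requirements for the name columns can be combined by choosing one replacement per PERSON_ID and reusing it for every row belonging to that person. In the sketch below the name lists, the row layout (a list of dictionaries) and the function name are all assumptions; any SEX value other than 'F' falls through to the male list so that no row is skipped.

```python
import random

# Hypothetical replacement lists.
MALE_NAMES = ["Bill", "Jim", "Alan", "Peter"]
FEMALE_NAMES = ["Anne", "Carol", "Helen", "Ruth"]
SURNAMES = ["Jones", "Williams", "Taylor", "Brown"]


def mask_names(rows, rng):
    """Give every row for the same PERSON_ID the same sex-appropriate names."""
    chosen = {}
    for row in rows:
        pid = row["person_id"]
        if pid not in chosen:
            # Anything that is not 'F' (including null) gets a male name.
            pool = FEMALE_NAMES if row["sex"] == "F" else MALE_NAMES
            chosen[pid] = (rng.choice(pool), rng.choice(SURNAMES))
        row["first_name"], row["last_name"] = chosen[pid]
    return rows


rows = [
    {"person_id": 1, "sex": "M", "first_name": "Joe", "last_name": "Smith"},
    {"person_id": 1, "sex": "M", "first_name": "Joe", "last_name": "Smith"},
    {"person_id": 2, "sex": "F", "first_name": "Ann", "last_name": "Lee"},
    {"person_id": 3, "sex": None, "first_name": "Sam", "last_name": "Roy"},
]
masked_rows = mask_names(rows, random.Random(3))
```

The two rows for PERSON_ID 1 end up with identical first and last names, which is the Table-Internal synchronization the column descriptions call for.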
KNOWN_AS A sparsely populated column that is always sanitized. A useful way of
approaching this column is to null all values and substitute in a random percentage.
Usually needs Row-Internal synchronization with the SEX column so that M records
get male first names and F records get female first names. May require Table-Internal
synchronization with the other rows with the same PERSON_ID depending on
the previous last name of married (or divorced) females. A useful way of approaching
this column is to null all values and substitute in a random percentage. May require
Table-Internal synchronization with the other rows with the same PERSON_ID.
NATIONAL_IDENTIFIER Usually a governmental ID data item which must be
rendered anonymous (for example: SSN in the USA and NI number in the UK). This
column will require Table-Internal synchronization with the other rows. Masking
these types of ID usually involves coping with the Intelligent Key problem discussed
previously.
EMAIL_ADDRESS Needs to be cleansed in some manner. Probably should be
updated with something random that looks like an email address.
DATE_OF_BIRTH Often overlooked, but a birth date is readily known and is a unique
identifier in many cases. Probably should be masked, but take care not to put in values
which are too old or too young. It is unlikely there are many five-year-olds on most HR
systems and values which are out of range may well introduce validity issues on the front
end screens.
START_DATE A possible candidate. Make sure that this date is not sanitized to be
unreasonable given the DATE_OF_BIRTH. It makes the internal HR validity checks
unhappy if people are employed before they are born.
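Keeping DATE_OF_BIRTH and START_DATE mutually plausible is easiest when the start date is generated from the masked birth date and not independently of it. The sketch below is an assumption-laden illustration: the age limits of 18 and 65, the reference date and the function names are all hypothetical and should be replaced by whatever the site's validity checks require.

```python
import random
from datetime import date, timedelta


def random_date(start, end, rng):
    """Return a random date between start and end, inclusive."""
    return start + timedelta(days=rng.randint(0, (end - start).days))


def mask_birth_and_start(as_of, rng, min_age=18, max_age=65):
    """Generate a birth date in a working-age range and a later start date."""
    earliest_birth = date(as_of.year - max_age, 1, 1)
    latest_birth = date(as_of.year - min_age - 1, 12, 31)
    birth = random_date(earliest_birth, latest_birth, rng)
    # Start no earlier than 1 January of the year after the person reaches
    # min_age, which avoids leap-day arithmetic and guarantees start > birth.
    earliest_start = date(birth.year + min_age + 1, 1, 1)
    start = random_date(earliest_start, as_of, rng)
    return birth, start


rng = random.Random(42)  # fixed seed only so the sketch is repeatable
pairs = [mask_birth_and_start(date(2005, 6, 30), rng) for _ in range(200)]
```

Every generated pair has the start date after the birth date and no later than the reference date, so nobody is employed before they are born.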
TITLE May or may not be required to be masked; see the discussion in the
Decisions section above. Masked TITLE columns may have to be synchronized with the
SEX field as appropriate and will require Table-Internal synchronization based on
distinct PERSON_IDs.
FULL_NAME A denormalized field requiring Row-Internal synchronization which
contains the formatted contents of the LAST_NAME, FIRST_NAME, KNOWN_AS and
TITLE fields. This data item must be built after the masking operations on the
dependent fields are complete; see the previous discussion of the Sequential
Operations issue.
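Rebuilding the denormalized field as the final step can be sketched as below. The output format shown ("LAST_NAME, TITLE FIRST_NAME (KNOWN_AS)") is an assumption made for illustration; the actual FULL_NAME format depends on the implementation and should be copied from existing rows before masking. The function name and sample row are hypothetical.

```python
def build_full_name(last_name, first_name, title=None, known_as=None):
    """Assemble FULL_NAME from already-masked parts (format is an assumption)."""
    full_name = last_name + ","
    if title:
        full_name += " " + title
    if first_name:
        full_name += " " + first_name
    if known_as:
        full_name += " (" + known_as + ")"
    return full_name


# Step 1 (not shown): mask LAST_NAME, FIRST_NAME, KNOWN_AS and TITLE.
masked_row = {"last_name": "Jones", "first_name": "Bill", "title": "Mr.", "known_as": None}

# Step 2: only now derive FULL_NAME from the masked values.
masked_row["full_name"] = build_full_name(
    masked_row["last_name"],
    masked_row["first_name"],
    masked_row["title"],
    masked_row["known_as"],
)
```

Running the rebuild before the component columns are masked would copy the original names straight back into FULL_NAME, which is why the order of operations matters.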
Typically the first line is given a realistic looking random street address and the
remainder set to null unless there is a requirement for multi-line street addresses.
TOWN_OR_CITY It is much more useful to the end users if this column can be set
from a list of town or city names rather than just random text.
REGION_[1,2,3] These usually hold information such as a state or county
designator. As with the street address group of columns, the first value is given a
realistic looking state name and the remainder set to null.
COUNTRY It is debatable whether this column needs to be masked and is probably a
decision that can be taken at implementation time. If this column is masked, be aware
that the value used is required by the front end screens to be present in a pre-approved
list. Usually this field is updated to a common value to eliminate all occurrences of
the Isolated Case Phenomena.
POSTAL_CODE This field needs to be masked and is usually an Intelligent Key. This
means any replacement data must satisfy the validity checks or the front end screens
will work improperly.
TELEPHONE_NUMBER_[1,2,3] These are the telephone numbers of the employee
and as such are highly sensitive. Typically the first line is given a realistic looking
phone number and the rest are given null values.
Summary
Given the legal and organizational operating environment of today, many test and
development HR databases will require some form of sanitization in order to render
the informational content anonymous.
There are a variety of techniques available, and an even larger number of issues of
which to be aware. Some of the most critical issues are the Row-Internal, Table-Internal
and Table-Table Data Synchronization requirements.
The demands of the Oracle HR schema require a sophisticated approach to the
problem. In this paper the PER_ALL_PEOPLE_F table is the focus for the majority of
the really complex masking requirements. A number of decisions for the
PER_ALL_PEOPLE_F table were discussed and the outcome of these decisions has a
major effect on the type of sanitization performed.
Some of the other important tables in the Oracle HR schema (from a data sanitization
point of view) were listed and a discussion of how masking operations might be
performed on selected columns from these tables was undertaken.