Unit 5-DBP
DATABASE SECURITY
Database Security Issues – Discretionary Access Control Based on Granting and Revoking Privileges – Mandatory Access Control and
Role-Based Access Control for Multilevel Security –SQL Injection – Statistical Database Security – Flow Control – Encryption and Public
Key Infrastructures – Preserving Data Privacy – Challenges to Maintaining Database Security – Database Survivability – Oracle Label-
Based Security.
There have been numerous incidents in which hackers targeted companies holding personal customer details. The Equifax,
Facebook, Yahoo, Apple, Gmail, Slack, and eBay data breaches were in the news in the past few years, to name just a few.
Such rampant activity has raised the need for cybersecurity software and web app testing, which aim to protect the data that
people share with online businesses. If these measures are applied, hackers will be denied access to the records and
documents held in online databases. Complying with the GDPR also goes a long way toward strengthening user data
protection.
Here is a list of vulnerabilities commonly found in database-driven systems and our tips for how to eliminate them.
Irregularities in Databases
It is inconsistencies that lead to vulnerabilities. Test website security and verify data protection on a regular basis. If any
discrepancies are found, they have to be fixed as soon as possible. Your developers should be aware of any threat that might affect the
database. Though this is not easy work, proper tracking can keep the information secure.
Despite being aware of the need for security testing, numerous businesses still fail to implement it. Fatal mistakes usually
appear during the development stages, but also during app integration or while patching and updating the database.
Cybercriminals take advantage of these failures to make a profit, and as a result your business is at risk of being compromised.
Revoking of Privileges
In some cases it is desirable to grant a privilege to a user temporarily. For example, the owner of a relation may want to grant
the SELECT privilege to a user for a specific task and then revoke that privilege once the task is completed. Hence, a
mechanism for revoking privileges is needed. In SQL a REVOKE command is included for the purpose of canceling
privileges.
The CREATETAB (create table) privilege gives account A1 the capability to create new database tables (base relations) and
is hence an account privilege. This privilege was part of earlier versions of SQL but is now left to each individual system
implementation to define.
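In systems that support it, the DBA can grant this account privilege to A1 as follows:
GRANT CREATETAB TO A1;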
In SQL2 the same effect can be accomplished by having the DBA issue a CREATE SCHEMA command, as follows:
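CREATE SCHEMA EXAMPLE AUTHORIZATION A1;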
User account A1 can now create tables under the schema called EXAMPLE. To continue our example, suppose
that A1 creates the two base relations EMPLOYEE and DEPARTMENT shown in Figure 24.1; A1 is then the owner of these
two relations and hence has all the relation privileges on each of them.
Next, suppose that account A1 wants to grant to account A2 the privilege to insert and delete tuples in both of these relations.
However, A1 does not want A2 to be able to propagate these privileges to additional accounts. A1 can issue the following
command:
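GRANT INSERT, DELETE ON EMPLOYEE, DEPARTMENT TO A2;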
Notice that the owner account A1 of a relation automatically has the GRANT OPTION, allowing it to grant privileges on the
relation to other accounts. However, account A2 cannot grant INSERT and DELETE privileges on
the EMPLOYEE and DEPARTMENT tables because A2 was not given the GRANT OPTION in the preceding command.
Next, suppose that A1 wants to allow account A3 to retrieve information from either of the two tables and also to be able to
propagate the SELECT privilege to other accounts. A1 can issue the following command:
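GRANT SELECT ON EMPLOYEE, DEPARTMENT TO A3 WITH GRANT OPTION;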
The clause WITH GRANT OPTION means that A3 can now propagate the privilege to other accounts by using GRANT. For
example, A3 can grant the SELECT privilege on the EMPLOYEE relation to A4 by issuing the following command:
GRANT SELECT ON EMPLOYEE TO A4;
Notice that A4 cannot propagate the SELECT privilege to other accounts because the GRANT OPTION was not given to A4.
Now suppose that A1 decides to revoke the SELECT privilege on the EMPLOYEE relation from A3; A1 then can issue this
command:
REVOKE SELECT ON EMPLOYEE FROM A3;
The DBMS must now revoke the SELECT privilege on EMPLOYEE from A3, and it must also automatically
revoke the SELECT privilege on EMPLOYEE from A4. This is because A3 granted that privilege to A4, but A3 no longer has
the privilege itself.
Next, suppose that A1 wants to give back to A3 a limited capability to SELECT from the EMPLOYEE relation and wants to
allow A3 to be able to propagate the privilege. The limitation is to retrieve only the Name, Bdate, and Address attributes and
only for the tuples with Dno = 5. A1 then can create the following view:
CREATE VIEW A3EMPLOYEE AS
SELECT Name, Bdate, Address
FROM EMPLOYEE
WHERE Dno = 5;
After the view is created, A1 can grant SELECT on the view A3EMPLOYEE to A3 as follows:
GRANT SELECT ON A3EMPLOYEE TO A3 WITH GRANT OPTION;
Finally, suppose that A1 wants to allow A4 to update only the Salary attribute of EMPLOYEE; A1 can then issue the
following command:
GRANT UPDATE ON EMPLOYEE (Salary) TO A4;
The UPDATE and INSERT privileges can specify particular attributes that may be updated or inserted in a relation. Other
privileges (SELECT, DELETE) are not attribute specific, because this specificity can easily be controlled by creating the
appropriate views that include only the desired attributes and granting the corresponding privileges on the views. However,
because updating views is not always possible (see Chapter 5), the UPDATE and INSERT privileges are given the option to
specify the particular attributes of a base relation that may be updated.
The discretionary access control technique of granting and revoking privileges on relations has traditionally been the main
security mechanism for relational database systems. This is an all-or-nothing method: A user either has or does not have a
certain privilege. In many applications, an additional security policy is needed that classifies data and users based on security
classes. This approach, known as mandatory access control (MAC), would typically be combined with the discretionary
access control mechanisms. It is important to note that most commercial DBMSs currently provide mechanisms only for
discretionary access control. However, the need for multilevel security exists in government, military, and intelligence
applications, as well as in many industrial and corporate applications. Some DBMS vendors—for example, Oracle—have
released special versions of their RDBMSs that incorporate mandatory access control for government use.
Typical security classes are top secret (TS), secret (S), confidential (C), and unclassified (U), where TS is the highest level
and U the lowest. Other more complex security classification schemes exist, in which the security classes are organized in a
lattice. For simplicity, we will use the system with four security classification levels, where TS ≥ S ≥ C ≥ U, to illustrate our
discussion. The commonly used model for multilevel security, known as the Bell-LaPadula model, classifies
each subject (user, account, program) and object (relation, tuple, column, view, operation) into one of the security
classifications TS, S, C, or U. We will refer to the clearance (classification) of a subject S as class(S) and to
the classification of an object O as class(O). Two restrictions are enforced on data access based on the subject/object
classifications:
1. A subject S is not allowed read access to an object O unless class(S) ≥ class(O). This is known as the simple security
property.
2. A subject S is not allowed to write an object O unless class(S) ≤ class(O). This is known as the star property (or *-
property).
The first restriction is intuitive and enforces the obvious rule that no subject can read an object whose security classification is
higher than the subject’s security clearance. The second restriction is less intuitive. It prohibits a subject from writing an object
at a lower security classification than the subject’s security clearance. Violation of this rule would allow information to flow
from higher to lower classifications, which violates a basic tenet of multilevel security. For example, a user (subject) with TS
clearance may make a copy of an object with classification TS and then write it back as a new object with classification U,
thus making it visible throughout the system.
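As an informal illustration, the two restrictions can be expressed in a few lines of Python. This is a minimal sketch, assuming only the simple four-level ordering TS ≥ S ≥ C ≥ U used in this discussion; the names are illustrative.

# Bell-LaPadula access checks over a totally ordered set of classes.
LEVELS = {"U": 0, "C": 1, "S": 2, "TS": 3}

def can_read(subject, obj):
    # Simple security property: no read up.
    return LEVELS[subject] >= LEVELS[obj]

def can_write(subject, obj):
    # Star property: no write down.
    return LEVELS[subject] <= LEVELS[obj]

# A TS subject may read an S object but may not write it back at level S.
assert can_read("TS", "S") and not can_write("TS", "S")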
To incorporate multilevel security notions into the relational database model, it is common to consider attribute values and
tuples as data objects. Hence, each attribute A is associated with a classification attribute C in the schema, and each attribute
value in a tuple is associated with a corresponding security classification. In addition, in some models, a tuple
classification attribute TC is added to the relation attributes to provide a classification for each tuple as a whole. The model
we describe here is known as the multilevel model, because it allows classifications at multiple security levels. A multilevel
relation schema R with n attributes would be represented as:
R(A1, C1, A2, C2, ..., An, Cn, TC)
where each Ci represents the classification attribute associated with attribute Ai.
The value of the tuple classification attribute TC in each tuple t is the highest of all attribute classification values Ci within t;
it provides a general classification for the tuple as a whole, whereas each attribute classification Ci provides a finer security
classification for each attribute value within the tuple.
The apparent key of a multilevel relation is the set of attributes that would have formed the primary key in a regular (single-
level) relation. A multilevel relation will appear to contain different data to subjects (users) with different clearance levels. In
some cases, it is possible to store a single tuple in the relation at a higher classification level and produce the corresponding
tuples at a lower-level classification through a process known as filtering. In other cases, it is necessary to store two or more
tuples at different classification levels with the same value for the apparent key.
This leads to the concept of polyinstantiation, where several tuples can have the same apparent key value but have different
attribute values for users at different clearance levels.
We illustrate these concepts with the simple example of a multilevel relation shown in Figure 24.2(a), where we display the
classification attribute values next to each attribute’s value. Assume that the Name attribute is the apparent key, and consider
the query SELECT * FROM EMPLOYEE. A user with security clearance S would see the same relation shown in Figure
24.2(a), since all tuple classifications are less than or equal to S. However, a user with security clearance C would not be
allowed to see the values for Salary of ‘Brown’ and Job_performance of ‘Smith’, since they have higher classification. The
tuples would be filtered to appear as shown in Figure 24.2(b), with Salary and Job_performance appearing as null. For a user
with security clearance U, the filtering allows only the Name attribute of ‘Smith’ to appear, with all the other
attributes appearing as null (Figure 24.2(c)). Thus, filtering introduces null values for attribute values whose security
classification is higher than the user’s security clearance.
In general, the entity integrity rule for multilevel relations states that all attributes that are members of the apparent key must
not be null and must have the same security classification within each individual tuple. Additionally, all other attribute values
in the tuple must have a security classification greater than or equal to that of the apparent key. This constraint ensures that a
user can see the key if the user is permitted to see any part of the tuple. Other integrity rules, called null
integrity and interinstance integrity, informally ensure that if a tuple value at some security level can be filtered (derived)
from a higher-classified tuple, then it is sufficient to store the higher-classified tuple in the multilevel relation.
To illustrate polyinstantiation further, suppose that a user with security clearance C tries to update the value
of Job_performance of ‘Smith’ in Figure 24.2 to ‘Excellent’; this corresponds to the following SQL update being submitted
by that user:
UPDATE EMPLOYEE
SET Job_performance = 'Excellent'
WHERE Name = 'Smith';
Since the user has clearance C, the DBMS cannot overwrite the existing Job_performance value, which is classified S. Instead, the update results in polyinstantiation: a second tuple for 'Smith' is created at classification C with the Job_performance value 'Excellent'.
SQL Injection
SQL injection is a code injection technique that might destroy your database.
SQL injection is the placement of malicious code in SQL statements, via web page input.
Look at the following example which creates a SELECT statement by adding a variable (txtUserId) to a select string. The variable is
fetched from user input (getRequestString):
Example
txtUserId = getRequestString("UserId");
txtSQL = "SELECT * FROM Users WHERE UserId = " + txtUserId;
The rest of this chapter describes the potential dangers of using user input in SQL statements.
SQL Injection Based on 1=1 is Always True
Look at the example above again. The original purpose of the code was to create an SQL statement to select a user, with a given
user id.
If there is nothing to prevent a user from entering "wrong" input, the user can enter some "smart" input like this:
UserId: 105 OR 1=1
Then the SQL statement will look like this:
SELECT * FROM Users WHERE UserId = 105 OR 1=1;
The SQL above is valid and will return ALL rows from the "Users" table, since OR 1=1 is always TRUE.
Does the example above look dangerous? What if the "Users" table contains names and passwords?
The SQL statement above is much the same as this:
SELECT UserId, Name, Password FROM Users WHERE UserId = 105 or 1=1;
A hacker might get access to all the user names and passwords in a database, by simply inserting 105 OR 1=1 into the input field.
SQL Injection Based on ""="" is Always True
Here is an example of a user login on a web site:
Username: John Doe
Password: myPass
Example
uName = getRequestString("username");
uPass = getRequestString("userpassword");
sql = 'SELECT * FROM Users WHERE Name ="' + uName + '" AND Pass ="' + uPass + '"'
Result
SELECT * FROM Users WHERE Name ="John Doe" AND Pass ="myPass"
A hacker might get access to user names and passwords in a database by simply inserting " OR ""=" into the user name or password
text box:
UserName: " or ""="
Password: " or ""="
The code at the server will create a valid SQL statement like this:
Result
SELECT * FROM Users WHERE Name ="" or ""="" AND Pass ="" or ""=""
The SQL above is valid and will return all rows from the "Users" table, since OR ""="" is always TRUE.
SQL Injection Based on Batched SQL Statements
A batch of SQL statements is a group of two or more SQL statements, separated by semicolons.
The SQL statement below will return all rows from the "Users" table, then delete the "Suppliers" table.
Example
SELECT * FROM Users; DROP TABLE Suppliers
Example
txtUserId = getRequestString("UserId");
txtSQL = "SELECT * FROM Users WHERE UserId = " + txtUserId;
If the user enters the following input:
User id: 105; DROP TABLE Suppliers
the server will create the following valid SQL statement:
Result
SELECT * FROM Users WHERE UserId = 105; DROP TABLE Suppliers;
Use SQL Parameters for Protection
SQL parameters are values that are added to an SQL query at execution time, in a controlled manner.
Another Example
txtNam = getRequestString("CustomerName");
txtAdd = getRequestString("Address");
txtCit = getRequestString("City");
txtSQL = "INSERT INTO Customers (CustomerName, Address, City) VALUES (@0, @1, @2)";
db.Execute(txtSQL, txtNam, txtAdd, txtCit);
Examples
The following example shows how to build a parameterized query in a common web language.
txtUserId = getRequestString("UserId");
sql = "SELECT * FROM Customers WHERE CustomerId = @0";
command = new SqlCommand(sql);
command.Parameters.AddWithValue("@0", txtUserId);
command.ExecuteReader();
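For comparison, here is a minimal parameterized query in Python using the standard library sqlite3 module; the database file name is an illustrative assumption, and the table follows the examples above.

import sqlite3

conn = sqlite3.connect("shop.db")  # hypothetical database file
user_id = input("UserId: ")        # untrusted user input

# The ? placeholder keeps the input out of the SQL text entirely, so an
# entry like 105 OR 1=1 is treated as a literal value, not as SQL code.
rows = conn.execute(
    "SELECT * FROM Users WHERE UserId = ?", (user_id,)
).fetchall()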
Flow Control:
Measures of Control
The measures of control can be broadly divided into the following categories −
Access Control − Access control includes security mechanisms in a database management system to protect against
unauthorized access. A user can gain access to the database only through valid user accounts, after clearing the login
process. Each user account is password protected.
Flow Control − Distributed systems encompass a lot of data flow from one site to another and also within a site. Flow
control prevents data from being transferred in such a way that it can be accessed by unauthorized agents. A flow
policy lists out the channels through which information can flow. It also defines security classes for data as well as
transactions.
Data Encryption − Data encryption refers to coding data when sensitive data is to be communicated over public
channels. Even if an unauthorized agent gains access to the data, he cannot understand it since it is in an
incomprehensible format.
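As a concrete illustration of the data encryption control above, here is a minimal symmetric-encryption sketch in Python; it assumes the third-party cryptography package is installed, and the message is illustrative only.

from cryptography.fernet import Fernet

key = Fernet.generate_key()  # must itself be distributed and stored securely
f = Fernet(key)

token = f.encrypt(b"sensitive data sent over a public channel")
assert f.decrypt(token) == b"sensitive data sent over a public channel"

Note that the scheme is only as strong as the handling of the key, which is the subject of the next section.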
Key Management
It goes without saying that the security of any cryptosystem depends upon how securely its keys are managed. Without secure
procedures for the handling of cryptographic keys, the benefits of the use of strong cryptographic schemes are potentially
lost.
It is observed that cryptographic schemes are rarely compromised through weaknesses in their design. However, they are
often compromised through poor key management.
There are some important aspects of key management which are as follows −
Cryptographic keys are nothing but special pieces of data. Key management refers to the secure administration of
cryptographic keys.
Key management deals with the entire key lifecycle, from generation and distribution through use and storage to
eventual destruction.
There are two specific requirements of key management for public key cryptography.
o Secrecy of private keys. Throughout the key lifecycle, secret keys must remain secret from all parties except
the owner and those authorized to use them.
o Assurance of public keys. In public key cryptography, the public keys are in the open domain and seen as public
pieces of data. By default there are no assurances of whether a public key is correct, with whom it can be
associated, or what it can be used for. Thus key management of public keys needs to focus much more
explicitly on assurance of the purpose of public keys.
The most crucial requirement, ‘assurance of public key’, can be achieved through the public-key infrastructure (PKI), a key
management system for supporting public-key cryptography.
Digital Certificate
By analogy, a certificate can be considered as the ID card issued to a person. People use ID cards such as a driver's license or
passport to prove their identity. A digital certificate does the same basic thing in the electronic world, but with one difference.
Digital certificates are not only issued to people; they can be issued to computers, software packages, or anything else that
needs to prove its identity in the electronic world.
Digital certificates are based on the ITU (International Telecommunication Union) standard X.509 which defines a
standard certificate format for public key certificates and certification validation. Hence digital certificates are
sometimes also referred to as X.509 certificates.
The public key pertaining to the user client is stored in the digital certificate by the Certification Authority (CA) along with
other relevant information such as client information, expiration date, usage, issuer, etc.
The CA digitally signs this entire information and includes the digital signature in the certificate.
Anyone who needs assurance about the public key and associated information of the client carries out the signature
validation process using the CA's public key. Successful validation assures that the public key given in the certificate
belongs to the person whose details are given in the certificate.
The process of obtaining a digital certificate by a person/entity is depicted in the following illustration.
As shown in the illustration, the CA accepts the application from a client to certify his public key. The CA, after duly verifying
the identity of the client, issues a digital certificate to that client.
Key Functions of CA
The key functions of a CA are as follows −
Generating key pairs − The CA may generate a key pair independently or jointly with the client.
Issuing digital certificates − The CA could be thought of as the PKI equivalent of a passport agency − the CA issues
a certificate after the client provides the credentials to confirm his identity. The CA then signs the certificate to prevent
modification of the details contained in the certificate.
Publishing Certificates − The CA needs to publish certificates so that users can find them. There are two ways of
achieving this. One is to publish certificates in the equivalent of an electronic telephone directory. The other is to send
your certificate out to those people you think might need it by one means or another.
Verifying Certificates − The CA makes its public key available in the environment to assist verification of its signature
on clients' digital certificates.
Revocation of Certificates − At times, a CA revokes a certificate it has issued due to some reason such as compromise of
the private key by the user or loss of trust in the client. After revocation, the CA maintains a list of all revoked certificates
that is available to the environment.
Classes of Certificates
There are four typical classes of certificate −
Class 1 − These certificates can be easily acquired by supplying an email address.
Class 2 − These certificates require additional personal information to be supplied.
Class 3 − These certificates can only be purchased after checks have been made about the requestor’s identity.
Class 4 − They may be used by governments and financial organizations needing very high levels of trust.
Hierarchy of CA
With vast networks and requirements of global communications, it is practically not feasible to have only one trusted CA
from whom all users obtain their certificates. Secondly, the availability of only one CA may lead to difficulties if that CA is
compromised.
In such cases, the hierarchical certification model is of interest, since it allows public key certificates to be used in environments
where two communicating parties do not have trust relationships with the same CA.
The root CA is at the top of the CA hierarchy and the root CA's certificate is a self-signed certificate.
The CAs which are directly subordinate to the root CA (for example, CA1 and CA2) have CA certificates that are
signed by the root CA.
The CAs under the subordinate CAs in the hierarchy (for example, CA5 and CA6) have their CA certificates signed
by the higher-level subordinate CAs.
Certificate authority (CA) hierarchies are reflected in certificate chains. A certificate chain traces a path of certificates from
a branch in the hierarchy to the root of the hierarchy.
The following illustration shows a CA hierarchy with a certificate chain leading from an entity certificate through two
subordinate CA certificates (CA6 and CA3) to the CA certificate for the root CA.
Verifying a certificate chain is the process of ensuring that a specific certificate chain is valid, correctly signed, and
trustworthy. The following procedure verifies a certificate chain, beginning with the certificate that is presented for
authentication −
A client whose authenticity is being verified supplies his certificate, generally along with the chain of certificates up
to the Root CA.
The verifier takes the certificate and validates it using the public key of the issuer. The issuer's public key is found in
the issuer's certificate, which is next to the client's certificate in the chain.
Now, if the higher CA who has signed the issuer's certificate is trusted by the verifier, verification is successful and
stops here.
Otherwise, the issuer's certificate is verified in the same manner as the client's certificate in the steps above. This process
continues until either a trusted CA is found along the chain or the Root CA is reached.
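To make the procedure concrete, here is a minimal sketch in Python using the third-party cryptography package. It assumes every certificate in the chain is RSA-signed and checks signatures only; a real validator would also check validity periods, name chaining, extensions, and revocation status.

# chain[0] is the client certificate; chain[-1] is the self-signed Root CA.
from cryptography import x509
from cryptography.hazmat.primitives.asymmetric import padding

def verify_chain(chain):
    # Pair each certificate with its issuer; the root is checked against
    # itself, since a root certificate is self-signed.
    for cert, issuer in zip(chain, chain[1:] + [chain[-1]]):
        issuer.public_key().verify(
            cert.signature,              # signature placed by the issuer
            cert.tbs_certificate_bytes,  # the signed portion of the certificate
            padding.PKCS1v15(),
            cert.signature_hash_algorithm,
        )                                # raises InvalidSignature on failure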
Preserving Data Privacy
Abstract
Incredible amounts of data are being generated by various organizations like hospitals, banks, e-commerce, retail and supply
chain, etc. by virtue of digital technology. Not only humans but machines also contribute to data, in the form of closed circuit
television streaming, web site logs, etc. Tons of data are generated every minute by social media and smart phones. The
voluminous data generated from these various sources can be processed and analyzed to support decision making. However, data
analytics is prone to privacy violations. One of the applications of data analytics is recommendation systems, which are widely
used by e-commerce sites like Amazon and Flipkart for suggesting products to customers based on their buying habits, leading to
inference attacks. Although data analytics is useful in decision making, it can lead to serious privacy concerns. Hence privacy
preserving data analytics has become very important. This paper examines various privacy threats, privacy preservation techniques
and models with their limitations, and also proposes a data lake based modernistic privacy preservation technique to handle privacy
preservation in unstructured data.
Introduction
There is an exponential growth in the volume and variety of data due to diverse applications of computers in all domain areas.
The growth has been achieved due to the affordable availability of computer technology, storage, and network connectivity. The
large scale data, which also includes person specific private and sensitive data like gender, zip code, disease, caste, shopping
cart, religion, etc., is being stored in the public domain. The data holder can release this data to a third party data analyst to gain
deeper insights and identify hidden patterns which are useful in making important decisions that may help in improving
businesses, providing value added services to customers, prediction, forecasting and recommendation. One of the prominent
applications of data analytics is recommendation systems, which are widely used by e-commerce sites like Amazon and Flipkart for
suggesting products to customers based on their buying habits. Facebook suggests friends, places to visit and even movies
based on our interests. However, releasing user activity data may lead to inference attacks like identifying gender
based on user activity. We have studied a number of privacy preserving techniques which are being employed to protect
against privacy threats. Each of these techniques has its own merits and demerits. This paper explores the merits and demerits
of each of these techniques and also describes the research challenges in the area of privacy preservation. There always exists
a trade-off between data utility and privacy. This paper also proposes a data lake based modernistic privacy preservation
technique to handle privacy preservation in unstructured data with maximum data utility.
Privacy threats in data analytics
Privacy is the ability of an individual to determine what data can be shared, and to employ access control. If the data is in the public
domain, then it is a threat to individual privacy, since the data is held by a data holder. A data holder can be a social networking
application, website, mobile app, e-commerce site, bank, hospital, etc. It is the responsibility of the data holder to ensure the
privacy of the users' data. Apart from the data held in the public domain, knowingly or unknowingly users themselves contribute to
data leakage. For example, most mobile apps seek access to our contacts, files, camera, etc., and without reading the
privacy statement we agree to all terms and conditions, thereby contributing to data leakage.
Hence there is a need to educate smart phone users regarding privacy and privacy threats. Some of the key privacy threats
include (1) surveillance; (2) disclosure; (3) discrimination; (4) personal embarrassment and abuse.
Surveillance
Many organizations including retail, e-commerce, etc. study their customers' buying habits and try to come up with various
offers and value added services. Based on opinion data and sentiment analysis, social media sites provide
recommendations of new friends, places to visit, people to follow, etc. This is possible only when they continuously monitor
their customers' transactions. This is a serious privacy threat, as no individual accepts surveillance.
Disclosure
Consider a hospital holding patients' data which include (Zip, gender, age, disease). The data holder has released the data to a
third party for analysis after anonymizing sensitive person specific data so that the person cannot be identified. The third party
data analyst can map this information with freely available external data sources like census data and can identify a person
suffering from some disorder. This is how the private information of a person can be disclosed, which is considered to be a serious
privacy breach.
Discrimination
Discrimination is the bias or inequality which can happen when some private information of a person is disclosed. For instance,
statistical analysis of electoral results proved that people of one community were completely against the party, which formed
the government. Now the government can neglect that community or can have bias over them.
Data analytics activity can affect data privacy. Many countries are enforcing privacy preservation laws. Lack of awareness is
also a major reason for privacy attacks. For example, many smart phone users are not aware of the information that is stolen
from their phones by many apps. Previous research shows only 17% of smart phone users are aware of privacy threats.
Privacy preservation methods
Many privacy preserving techniques have been developed, but most of them are based on anonymization of data. A list of privacy
preservation techniques is given below.
K anonymity
L diversity
T closeness
Randomization
Data distribution
Cryptographic techniques
Multidimensional Sensitivity Based Anonymization (MDSBA).
K anonymity
Anonymization is the process of modifying data before it is given for data analytics, so that de-identification is not possible,
and an attempt to de-identify by mapping the anonymized data with external data sources will lead to K indistinguishable
records. K anonymity is prone to two attacks, namely the homogeneity attack and the background knowledge attack. Some of
the algorithms applied to ensure anonymization include Incognito and Mondrian. K anonymity is applied on the patient data
shown in Table 1. The table shows the data before anonymization.
ID   Zip     Age   Disease
6    57906   47    Cancer
8    57673   36    Cancer
9    57607   32    Cancer
K anonymity algorithm is applied with k value as 3 to ensure 3 indistinguishable records when an attempt is made to identify
a particular person’s data. K anonymity is applied on the two attributes viz. Zip and age shown in Table 1. The result of
applying anonymization on Zip and age attributes is shown in Table 2.
ID   Zip     Age   Disease
8    576**   3*    Cancer
9    576**   3*    Cancer
The above technique has used generalization to achieve anonymization. Suppose we know that John is 27 years old and lives
in the 57677 zip code; then we can conclude that John has a cardiac problem even after anonymization, as shown in Table 2. This is
called a homogeneity attack. Alternatively, if John is 36 years old and it is known that John does not have cancer, then definitely
John must have a cardiac problem. This is called a background knowledge attack. Achieving K anonymity can be done either
by using generalization or suppression. K anonymity can be optimized if minimal generalization can be done without huge
data loss. Identity disclosure is the major privacy threat, and protection against it cannot be guaranteed by K anonymity. Personalized privacy
is the most important aspect of individual privacy.
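The generalization step can be sketched in a few lines of Python. This is a toy illustration only; the masking rules and the k value are assumptions, not a full K anonymity algorithm such as Incognito or Mondrian.

from collections import Counter

def generalize(record):
    # Mask the low-order Zip digits and the last digit of Age,
    # as in Table 2 (e.g., 57673 -> 576**, 36 -> 3*).
    zip_code, age, disease = record
    return (zip_code[:3] + "**", str(age)[0] + "*", disease)

def is_k_anonymous(records, k):
    # Every quasi-identifier combination must occur at least k times.
    counts = Counter((z, a) for z, a, _ in map(generalize, records))
    return all(c >= k for c in counts.values())

rows = [("57906", 47, "Cancer"), ("57673", 36, "Cancer"), ("57607", 32, "Cancer")]
print(is_k_anonymous(rows, k=3))  # False: 57906/47 generalizes into a class of its own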
L diversity
To address homogeneity attack, another technique called L diversity has been proposed. As per L diversity there must be L
well represented values for the sensitive attribute (disease) in each equivalence class.
Implementing L diversity is not possible every time because of the variety of data. L diversity is also prone to the skewness attack:
when the overall distribution of data is skewed into a few equivalence classes, attribute disclosure cannot be ensured. For example,
if all the records are distributed into only three equivalence classes, then the semantic closeness of these values may lead to
attribute disclosure. L diversity may also lead to a similarity attack. From Table 3 it can be noticed that if we know that John is
27 years old and lives in the 57677 zip, then definitely John is in the low income group, because the salaries of all three persons in
the 576** zip are low compared to others in the table. This is called a similarity attack.
T closeness
Another improvement to L diversity is the T closeness measure, where an equivalence class is considered to have ‘T
closeness’ if the distance between the distribution of the sensitive attribute in the class and the distribution of the
attribute in the whole table is no more than a threshold T, and all equivalence classes are required to have T closeness.
T closeness can be calculated on every attribute with respect to the sensitive attribute.
From Table 4 it can be observed that even if we know John is 27 years old, it will still be difficult to estimate whether
John has a cardiac problem or whether he is in the low income group. T closeness may protect against attribute
disclosure, but implementing T closeness may not give a proper distribution of the data every time.
Randomization technique
Randomization is the process of adding noise to the data, which is generally done using a probability distribution.
Randomization is applied in surveys, sentiment analysis, etc. Randomization does not need knowledge of the other
records in the data. It can be applied during data collection and preprocessing. There is no anonymization
overhead in randomization. However, applying randomization on large datasets is not feasible because of time
complexity and loss of data utility, which has been shown in our experiment described below.
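The basic idea of noise addition can be sketched as follows in Python; the attribute, values and noise scale are illustrative assumptions only.

import random

salaries = [30000, 45000, 52000, 61000, 75000]  # hypothetical attribute values

# Perturb each value with zero-mean Gaussian noise: individual values
# are masked, while aggregates such as the mean stay close on average.
noisy = [s + random.gauss(0, 5000) for s in salaries]

print(sum(salaries) / len(salaries))  # true mean
print(sum(noisy) / len(noisy))        # approximately the true mean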
We loaded 10k records from an employee database into the Hadoop Distributed File System and processed
them by executing a MapReduce job. We experimented with classifying the employees based on their salary
and age groups. In order to apply randomization, we added noise in the form of 5k randomly generated records
to make a database of 15k records, and the following observations were made after running the MapReduce job.
More Mappers and Reducers were used as the data volume increased.
Results before and after randomization were significantly different.
Some of the records which are outliers remain unaffected by randomization and are vulnerable to
adversary attack.
Privacy preservation at the cost of data utility is not appreciated, and hence randomization may not be
suitable for privacy preservation, especially against attribute disclosure.
Data distribution technique
In this technique, the data is distributed across many sites. Distribution of the data can be done in two ways:
Horizontal distribution of data When data is distributed across many sites with the same attributes, the distribution is
said to be horizontal distribution, which is described in Fig. 1.
Fig. 1. Distribution of sales data across different sites
Horizontal distribution of data can be applied only when some aggregate functions or operations are to be applied
on the data without actually sharing the data. For example, if a retail store wants to analyse its sales across
various branches, it may employ analytics which do computations on aggregate data. However, as
part of data analysis the data holder may need to share the data with a third party analyst, which may lead to a privacy
breach. Classification and clustering algorithms can be applied on distributed data, but this does not ensure privacy.
If the data is distributed across different sites which belong to different organizations, then the results of aggregate
functions may help one party in detecting the data held by the other parties. In such situations we expect all
participating sites to be honest with each other.
Vertical distribution of data When person specific information is distributed across different sites under the
custody of different organizations, the distribution is called vertical distribution, as shown in Fig. 2. For
example, in crime investigations, police officials would like to know the details of a particular criminal,
including health, professional, financial, and personal information. All this information may not be available at one site. Such a
distribution is called vertical distribution, where each site holds a few of the attributes of a person. When some
analytics has to be done, the data has to be pooled in from all these sites, and there is a vulnerability of privacy breach.
Cryptographic techniques
The data holder may encrypt the data before releasing it for analytics. But encrypting large scale data using conventional
encryption techniques is highly difficult and must be applied only during data collection. Differential privacy techniques
have already been applied where some aggregate computations on the data are done without actually sharing the inputs. For
example, if x and y are two data items, then a function F(x, y) will be computed to gain some aggregate information from both
x and y without actually sharing x and y. This can be applied when x and y are held by different parties, as in the case of
vertical distribution. However, if the data is at a single location under the custody of a single organization, then differential
privacy cannot be employed. Another similar technique, called secure multiparty computation, has been used but has proved to be
inadequate for privacy preservation. Data utility will be low if encryption is applied during data analytics. Thus encryption is
not only difficult to implement, but it also reduces data utility.
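The idea of computing an aggregate F(x, y) without exchanging the raw inputs can be illustrated with a toy additive-masking protocol in Python. This is a simplified sketch with assumed values; real secure multiparty computation protocols are considerably more involved.

import random

x, y = 1200, 3400            # hypothetical private inputs of parties A and B
r = random.randrange(10**6)  # A's secret random mask

a_to_b = x + r               # A sends x hidden under the mask; B never sees x
b_to_a = a_to_b + y          # B adds its own input and returns the masked sum
total = b_to_a - r           # A removes the mask, obtaining F(x, y) = x + y

assert total == x + y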
Multidimensional Sensitivity Based Anonymization is an improved anonymization technique such that it can be applied on
large data sets with reduced loss of information and predefined quasi identifiers. As part of this technique, the Apache
MapReduce framework has been used to handle large data sets. In the conventional Hadoop Distributed File System, the data is
divided into blocks of either 64 MB or 128 MB each and distributed across different nodes without considering the data
inside the blocks. As part of the Multidimensional Sensitivity Based Anonymization technique, the data is split into different bags
based on the probability distribution of the quasi identifiers by making use of filters in the Apache Pig scripting language.
Multidimensional Sensitivity Based Anonymization makes use of bottom up generalization, but on a set of attributes with
certain class values, where the class represents a sensitive attribute. Data distribution is made more effective when compared to the
conventional method of blocks. Data anonymization was done using four quasi identifiers using Apache Pig.
Since the data is vertically partitioned into different groups, it can protect against the background knowledge attack if each bag contains
only a few attributes. This method also makes it difficult to map the data with external sources to disclose any person specific
information.
In this method, the implementation was done using Apache Pig. Apache Pig is a scripting language, hence the development effort
is less. However, the code efficiency of Apache Pig is relatively lower when compared to a MapReduce job, because ultimately every
Apache Pig script has to be converted into a MapReduce job. Multidimensional Sensitivity Based Anonymization is more
appropriate for large scale data, but only when the data is at rest. Multidimensional Sensitivity Based Anonymization cannot
be applied to streaming data.
Analysis
Various privacy preservation techniques have been studied with respect to features including type of data, data utility, attribute
preservation and complexity. The comparison of various privacy preservation techniques is shown in Table 5.
As part of a systematic literature review, it has been observed that all existing mechanisms of privacy preservation deal with
structured data. More than 80% of the data being generated today is unstructured. As such, there is a need to address the
following challenges.
1. Develop concrete solutions to protect privacy in both structured and unstructured data.
2. Develop scalable and robust techniques to handle large scale heterogeneous data sets.
3. Allow data to stay in its native form, without the need for transformation, while data analytics is carried
out with privacy preservation ensured.
4. Develop new techniques apart from anonymization to ensure protection against key privacy threats,
which include identity disclosure, discrimination, surveillance, etc.
5. Maximize data utility while ensuring data privacy.
Conclusion
No concrete solution for unstructured data has been developed yet. Conventional data mining algorithms can be applied for
classification and clustering problems but cannot be used for privacy preservation, especially when dealing with person specific
information. Machine learning and soft computing techniques can be used to develop new and more appropriate solutions to
privacy problems, which include identity disclosure that can lead to personal embarrassment and abuse.
There is a strong need for law enforcement by the governments of all countries to ensure individual privacy. The European Union is
making an attempt to enforce a privacy preservation law. Apart from technological solutions, there is a strong need to create
awareness among people regarding privacy hazards so they can safeguard themselves from privacy breaches. One of the serious
privacy threats is the smart phone. A lot of personal information in the form of contacts, messages, chats and files is being accessed
by many apps running on our smart phones without our knowledge. Most of the time people do not even read the privacy
statement before installing an app. Hence there is a strong need to educate people on the various vulnerabilities which can
contribute to leakage of private information.
We propose a novel privacy preservation model based on the Data Lake concept to hold a variety of data from diverse sources. A data
lake is a repository that holds data from diverse sources in their raw format. Data ingestion from a variety of sources can be done
using Apache Flume, and an intelligent algorithm based on machine learning can be applied to identify sensitive attributes
dynamically. The algorithm will be trained with existing data sets with known sensitive attributes, and rigorous training of the
model will help in predicting the sensitive attributes in a given data set. The accuracy of the model can be improved by adding
more layers of training, leading to deep learning techniques. Advanced computing frameworks like Apache Spark, a distributed
massively parallel computing framework with in-memory processing, can be used to implement the privacy preserving algorithms
and ensure very fast processing. The proposed model is shown in Fig. 3.
Fig. 3
In the data lake, the data can remain in its native form, which is either structured or unstructured. When the data has to be processed,
it can be transformed into Hive tables. A Hadoop MapReduce job using machine learning can be executed on the data to
classify the sensitive attributes. The data can be vertically distributed to separate the sensitive attributes from the rest of the data,
and tokenization can be applied to map the vertically distributed data. The data without any sensitive attributes can be published for
data analytics.
Abbreviations
CCTV:
closed circuit television
MDSBA:
Multidimensional Sensitivity Based Anonymization
Oracle Label-Based Security
The need for more sophisticated controls on access to sensitive data is becoming increasingly important as
organizations address emerging security requirements around data consolidation, privacy and compliance.
Maintaining separate databases for highly sensitive customer data is costly and creates unnecessary
administrative overhead. However, consolidating databases sometimes means combining sensitive financial,
HR, medical or project data from multiple locations into a single database for reduced costs, easier
management and better scalability. Oracle Label Security provides the ability to tag data with a data label
or a data classification so that the database inherently knows what data a user or role is authorized for,
and allows data from different sources to be combined in the same table as a larger data set without
compromising security.
Access to sensitive data is controlled by comparing the data label with the requesting user's label or access
clearance. A user label or access clearance can be thought of as an extension to standard database privileges
and roles. Oracle Label Security is centrally enforced within the database, below the application layer,
providing strong security and eliminating the need for complicated application views.
Typical use cases include:
Financial companies with customers that span multiple countries with strong government privacy controls
Complying with the U.S. State Department's International Traffic in Arms Regulations (ITAR)
Restricting data processing, tracking consent and handling right-to-erasure requests under the EU GDPR
The user label consists of three components − a level, zero or more compartments
and zero or more groups. This label is assigned as part of the user authorization and
is not modifiable by the user.
Session labels also consist of the same three components and may differ from the
user label based on the session that was established by the user. For example, if the user
has a Top Secret level component but the user logged in from a Secret workstation,
the session label level would be Secret.
Data security labels have the same three components as the user and session labels.
Levels indicate the sensitivity level of the data and the authorization for a user to access sensitive
data. The user (and session) level must be equal to or greater than the data level to
access that record.
Data can have zero or more groups in the group component. The user/session label
needs to have at least one group that matches one of a data record's groups to access the
data record. For example, if the data record has Boston, Chicago and New York as
groups, then the session label needs only Boston (or one of the other two groups) to
access the data.
Label Security policies are a combination of User labels, Data labels and protected objects.
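The level and group checks described above can be sketched in Python. This is an informal illustration only; actual OLS enforcement happens inside the database, and the level ordering and group names below are assumptions (compartments are omitted for brevity).

# Session access check: the session level must dominate the data level,
# and at least one session group must match one of the row's groups.
def can_access(session_level, session_groups, data_level, data_groups,
               order={"U": 0, "C": 1, "S": 2, "TS": 3}):
    level_ok = order[session_level] >= order[data_level]
    group_ok = not data_groups or bool(set(session_groups) & set(data_groups))
    return level_ok and group_ok

# The example above: a row labeled with Boston, Chicago and New York is
# visible to a session that holds only the Boston group.
print(can_access("S", {"Boston"}, "C", {"Boston", "Chicago", "New York"}))  # True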
A VPD policy can be written so that it only becomes active when a certain column
(the 'sensitive' column) is part of a SQL statement against a protected table. With
the 'column sensitivity' switch on, VPD either returns only those rows that include
information in the sensitive column the user is allowed to see, or it returns all rows,
with all cells in the sensitive column being empty except those values the user is
allowed to see.
Trusted stored procedures are procedures that are granted either the OLS privilege
'FULL' or 'READ'. When a trusted stored program unit is executed, the policy
privileges in force are a union of the invoking user's privileges and the program unit's
privileges.
Beginning with Oracle Database 11gR1, the functionality of Oracle Policy Manager
(and most other security related administration tools) has been made available in
Oracle Enterprise Manager Cloud Control, enabling administrators to create and
manage OLS policies in a modern, convenient and integrated environment.
Only apply sensitivity labels to those tables that really need protection. When
multiple tables are joined to retrieve sensitive data, look for the driving table.
Usually, there is only a small set of different data classification labels; if the table
is mostly used for READ operations, try building a bitmap index over the (hidden)
OLS column, and add this index to the existing indexes in that table.
Review the Oracle Label Security whitepaper, available on the product OTN
webpage, as it contains a thorough discussion of performance considerations with
Oracle Label Security.
Oracle Label Security also integrates with other Oracle security features. It can provide user labels to be used as factors within
Oracle Database Vault, and security labels can be assigned to Real Application
Security users. It also integrates with Oracle Advanced Security Data
Redaction, enabling security clearances to be used in data redaction policies.