Big Data Privacy
Each of the data sets EI, QI, and SD is a matrix with m rows and
i, j, and k columns, respectively.
We need to keep an eye on the index j (representing QI), which
plays a major role in keeping the data confidential.
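The partition into EI, QI, and SD can be sketched in Python as follows. This is only an illustration: the column names, the sample records, and the helper `split_columns` are hypothetical and do not come from the source tables.

```python
# Hypothetical column assignment; the names are assumptions for illustration.
EI_COLS = ["name", "ssn"]            # explicit identifiers, i = 2 columns
QI_COLS = ["dob", "gender", "zip"]   # quasi-identifiers,    j = 3 columns
SD_COLS = ["income"]                 # sensitive data,       k = 1 column

# Made-up sample data with m = 3 rows.
records = [
    {"name": "Alice Smith", "ssn": "123-45-6789", "dob": "1980-02-14",
     "gender": "F", "zip": "56001", "income": 52000},
    {"name": "Bob Jones", "ssn": "987-65-4321", "dob": "1975-11-03",
     "gender": "M", "zip": "56004", "income": 61000},
    {"name": "Carol Lee", "ssn": "555-12-3456", "dob": "1990-07-21",
     "gender": "F", "zip": "56010", "income": 48000},
]


def split_columns(rows, cols):
    """Return an m x len(cols) matrix (a list of rows) holding only `cols`."""
    return [[row[c] for c in cols] for row in rows]


EI = split_columns(records, EI_COLS)
QI = split_columns(records, QI_COLS)
SD = split_columns(records, SD_COLS)
```

All three matrices share the same m rows, so row r of EI, QI, and SD still describes the same record owner. That shared row index is the link anonymization has to weaken.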
Protecting Sensitive Data
Apart from assuring their customers’ privacy, organizations also
have to comply with the regulations of the region or country in which
they operate, as mentioned earlier.
Most countries have strong privacy laws to protect citizens’
personal data.
Organizations that fail to protect the privacy of their customers or
do not comply with the regulations face stiff financial penalties,
loss of reputation, loss of customers, and legal issues.
This is the primary reason organizations pay so much attention to
data privacy.
Data protection techniques, such as cryptography and
anonymization, are used prior to sharing data.
Privacy and Anonymity
Anonymization is a process of logically separating the identifying
information (PII) from sensitive data.
Referring to Table 1.3, the anonymization approach ensures that EI
and QI are logically separated from SD.
As a result, an adversary will not be able to easily identify the
record owner from his sensitive data.
Privacy and anonymity are flip sides of the same coin.
There is a subtle difference between privacy and anonymity.
The word privacy is also used in a generic way to mean anonymity,
and there are specific use cases for both of them.
Table 1.4 illustrates an anonymized table where PII is protected
and sensitive data are left in their original form.
Sensitive data should be in original form so that the data can be
used to mine useful knowledge.
Anonymization is a two-step process: data masking and
de-identification.
• Data masking is a technique applied to systematically substitute,
suppress, or scramble data that call out an individual, such as
names, IDs, account numbers, SSNs, etc.
• Masking techniques are simple techniques that perturb original
data.
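The three masking operations named above (substitute, suppress, scramble) can be sketched as small Python functions. The function names, the lookup table, and the sample values are hypothetical, and a real masking tool would also need to handle consistency across tables and reversibility.

```python
import random


def substitute(value, mapping):
    """Replace a value with a stand-in taken from a fixed lookup table."""
    return mapping.get(value, "UNKNOWN")


def suppress(value, keep_last=4, mask_char="X"):
    """Mask every letter or digit except the last `keep_last`; keep separators."""
    total = sum(ch.isalnum() for ch in value)
    to_mask = max(total - keep_last, 0)
    out = []
    for ch in value:
        if ch.isalnum() and to_mask > 0:
            out.append(mask_char)
            to_mask -= 1
        else:
            out.append(ch)
    return "".join(out)


def scramble(value, rng):
    """Shuffle the characters of a value using the supplied random generator."""
    chars = list(value)
    rng.shuffle(chars)
    return "".join(chars)


# Hypothetical inputs for illustration only.
pseudonyms = {"Alice Smith": "Person-001"}
masked_name = substitute("Alice Smith", pseudonyms)
masked_ssn = suppress("123-45-6789")
masked_account = scramble("4001238", random.Random(0))
```

Here `masked_name` becomes a pseudonym, `masked_ssn` keeps only its last four digits, and `masked_account` contains the same digits in a different order. Passing a seeded `random.Random` makes the scramble repeatable, which is convenient for testing but is a design choice of this sketch, not a requirement of masking.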
De-identification is applied to QI fields. QI fields such as date of
birth, gender, and zip code have the capacity to uniquely identify
individuals.
Combine them with SD, such as income, and a Warren Buffett or Bill
Gates is easily identified in the data set.
By de-identifying, the values of QI are modified carefully so that
the relationship is still maintained but identities cannot be inferred.
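One common way to modify QI values is generalization, that is, replacing a precise value with a coarser one. The source does not say which technique it has in mind, so the sketch below is an assumption, and the field names and helper functions are hypothetical.

```python
def generalize_dob(dob):
    """Keep only the birth year of an ISO date string (YYYY-MM-DD)."""
    return dob[:4]


def generalize_zip(zip_code, keep=3):
    """Keep the first `keep` characters of a zip code and star out the rest."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)


def de_identify(record):
    """Return a copy of `record` with coarser QI values; SD is left untouched."""
    out = dict(record)
    out["dob"] = generalize_dob(record["dob"])
    out["zip"] = generalize_zip(record["zip"])
    return out


# Hypothetical record: QI fields (dob, gender, zip) plus one SD field (income).
original = {"dob": "1980-02-14", "gender": "F", "zip": "56001", "income": 52000}
released = de_identify(original)
```

The released record still links a birth year, a gender, and a zip prefix to an income, so the relationship between QI and SD remains usable for analysis, while more individuals now share the same QI values. How coarse the generalization must be depends on the data set; this sketch does not measure that.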
• The original data set is D, which is anonymized, resulting in data set