04 - Chapter 3 - Privacy

This document discusses information security and privacy concerns related to big data and data mining. It addresses the four key users in the data mining process: data providers, data collectors, data miners, and decision makers. For data providers, major concerns are controlling data sensitivity and privacy, while data collectors aim to collect useful data while preserving privacy. Techniques like limiting access, trading privacy for benefits, and providing false data can help address these issues. Privacy-preserving data mining also aims to safeguard sensitive information during data mining.

Uploaded by

Taif Alkaabi
Copyright
© All Rights Reserved

INFORMATION SECURITY IN BIG DATA: PRIVACY AND DATA MINING


CONTENT
1. Introduction
2. Data Provider
3. Data Collector
4. Data Miner
5. Decision Maker
6. Future Research Areas
7. Conclusion
8. References

INTRODUCTION
 Data mining is the process of discovering interesting patterns
and knowledge from large amounts of data
 Data mining has been successfully applied to many domains,
such as business intelligence, Web search, scientific discovery,
digital libraries, etc.
 The term “data mining” is often treated as a synonym for
another term, “knowledge discovery from data” (KDD), which
highlights the goal of the mining process.

2. Big data and privacy concerns

https://www.youtube.com/watch?v=uaaC57tcci0

Social Dilemma
 Communications technologies and Big Data analysis have
facilitated the intrusion of privacy by devising and strengthening
audio-visual surveillance and “dataveillance”.
 Governments have used these technologies for continuous and
massive collection and collation of data from our private spaces.
 Big Data phenomena are a constellation of data storage and
processing extensions to modern communications technologies
that have given rise to new modes of privacy intrusion
that were not anticipated when much more primitive
communications and eavesdropping technologies gave rise to
the existing privacy laws.

Big data defined

What exactly is big data?


The definition of big data is data that contains greater variety, arriving in
increasing volumes and with more velocity. This is also known as the three Vs.
Put simply, big data is larger, more complex data sets, especially from new data
sources. These data sets are so voluminous that traditional data processing
software just can’t manage them. But these massive volumes of data can be used
to address business problems you wouldn’t have been able to tackle before.
(Oracle)
The three Vs of big data

Volume
The amount of data matters. With big data, you’ll have to process high volumes of low-density, unstructured data. This
can be data of unknown value, such as Twitter data feeds, clickstreams on a web page or a mobile app, or sensor-
enabled equipment. For some organizations, this might be tens of terabytes of data. For others, it may be hundreds of
petabytes.

Velocity
Velocity is the fast rate at which data is received and (perhaps) acted on. Normally, the highest velocity of data streams
directly into memory versus being written to disk. Some internet-enabled smart products operate in real time or near
real time and will require real-time evaluation and action.

Variety
Variety refers to the many types of data that are available. Traditional data types were structured and fit neatly in a
relational database. With the rise of big data, data comes in new unstructured data types. Unstructured and
semistructured data types, such as text, audio, and video, require additional preprocessing to derive meaning and
support metadata.

https://www.oracle.com/ae/big-data/what-is-big-data/
https://www.marketsandmarkets.com/
Big data can be used:
- to identify more general trends and correlations
- it can also be processed in order to directly affect individuals.
It is not the volume, velocity, variety or veracity that
worries me, but the uses of the information.
- The uses of the data are not determined before collection.

Risks

Big Data may also pose significant risks for the protection of
personal data and the right to privacy:
a) the sheer scale of data collection, tracking and profiling;
b) the security of data;
c) the transparency, which implies sufficient information given to
individuals;
d) inaccuracy, discrimination, exclusion and economic
imbalance;
e) increased possibilities of government surveillance

THE CHALLENGE OF BIG DATA FOR DATA PROTECTION

It is no exaggeration to say that we are nothing more than a
collection of data to most of the institutions, and many of the
people, with whom we deal.
Big data poses enormous challenges for data protection, for both
processors and regulators. It simultaneously changes the
context and raises the stakes for data protection.

Big data also shows the importance of
harmonization, or even standardization, in data
protection standards. As personal data are universally
collected and shared across sectorial and national
boundaries, inconsistent data protection laws pose
increasing threats to individuals, institutions, and
society

Perhaps the greatest impact of big data is the pressure it
brings for new thoughtful, informed, multinational
debate about the key principles that should
undergird data protection. Most data protection
laws continue to rely on the 1980 OECD Guidelines

Data Mining and Society

How does data mining impact society? What steps can data mining take to preserve
the privacy of individuals? Do we use data mining in our daily lives without even knowing
that we do? These questions raise the following issues:
Social impacts of data mining: With data mining penetrating our everyday lives, it is
important to study the impact of data mining on society. How can we use data mining
technology to benefit society? How can we guard against its misuse? The improper
disclosure or use of data and the potential violation of individual privacy and data
protection rights are areas of concern that need to be addressed.

Privacy-preserving data mining: Data mining will help scientific discovery, business
management, economy recovery, and security protection (e.g., the real-time discovery
of intruders and cyberattacks). However, it poses the risk of disclosing an
individual’s personal information. Studies on privacy-preserving data publishing and
data mining are ongoing. The philosophy is to observe data sensitivity and preserve
people’s privacy while performing successful data mining.
Invisible data mining: We cannot expect everyone in society to learn and master
data mining techniques. More and more systems should have data mining functions
built within so that people can perform data mining or use data mining results
simply by mouse clicking, without any knowledge of data mining algorithms.
Intelligent search engines and Internet-based stores perform such invisible data
mining by incorporating data mining into their components to improve their
functionality and performance. This is done often unbeknownst to the user. For
example, when purchasing items online, users may be unaware that the store is
likely collecting data on the buying patterns of its customers, which may be used to
recommend other items for purchase in the future.
 Individual's privacy may be violated due to the
unauthorized access to personal data.
 To deal with the privacy issues in data mining, a sub-field
of data mining, referred to as privacy-preserving data
mining (PPDM), has emerged.
 The aim of PPDM is to safeguard sensitive information
from unsanctioned disclosure, and preserve the utility of
the data.

The 4 types of users in the data mining process:
 Data Provider: the user who owns some data that are desired
by the data mining task.
 Data Collector: the user who collects data from data
providers and then publishes the data to the data miner.
 Data Miner: the user who performs data mining tasks on the
data.
 Decision Maker: the user who makes decisions based on
the data mining results in order to achieve certain goals.

DATA PROVIDER
 CONCERN
 The major concern of a data provider is whether he can control
the sensitivity of the data he provides to others.
 On one hand, the provider should be able to make his very
private data inaccessible to the data collector.
 On the other hand, if the provider has to provide some data to
the data collector, he wants to hide his sensitive information as
much as possible and get enough compensation for the
possible loss in privacy.

APPROACHES TO PRIVACY PROTECTION

1. LIMIT THE ACCESS

 Security tools developed for the Internet
environment can protect data:
 Anti-tracking extensions. Popular anti-tracking extensions
include Disconnect, Do Not Track Me, Ghostery, etc.
 Advertisement and script blockers. Example tools
include AdBlock Plus, NoScript, FlashBlock, etc.
 Encryption tools, such as MailCloak and TorChat.

2. TRADE PRIVACY FOR BENEFIT

 The data provider may be willing to hand over some of his
private data in exchange for certain benefits,
such as better services or monetary rewards.
 The data provider
needs to know how to negotiate with the data collector, so that
he will get enough compensation for any possible loss in privacy.

3. PROVIDE FALSE DATA

 Using “sockpuppets” to hide one's true activities

 Using a fake identity to create phony information

 Using security tools to mask one's identity

DATA COLLECTOR
 CONCERN

 The major concern of the data collector is to guarantee
that the modified data contain no sensitive
information but still preserve high utility.

APPROACHES

1. BASICS OF PRIVACY PRESERVING DATA PUBLISHING (PPDP)
 PPDP mainly studies anonymization approaches for publishing
useful data while preserving privacy. Each record consists of the
following 4 types of attributes:
 Identifier (ID): Attributes that can directly and uniquely identify
an individual, such as name, ID number and mobile number.
 Quasi-identifier (QID): Attributes that can be linked with external
data to re-identify individual records, such as gender, age and zip
code.
 Sensitive Attribute (SA): Attributes that an individual wants to
conceal, such as disease and salary.
 Non-sensitive Attribute (NSA): Attributes other than ID, QID and SA.
[Figure: an original table and its 2-anonymous version]
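The idea behind a 2-anonymous table can be checked mechanically. The sketch below assumes the usual definition of k-anonymity: every combination of quasi-identifier values that occurs in the table is shared by at least k records. The toy records, the column names and the `generalize` helper are hypothetical and serve only as an illustration, not as the slide's own tables.

```python
from collections import Counter

def k_of(records, qid_columns):
    """Return the largest k for which the table is k-anonymous on qid_columns."""
    groups = Counter(tuple(r[c] for c in qid_columns) for r in records)
    return min(groups.values()) if groups else 0

def is_k_anonymous(records, qid_columns, k):
    return k_of(records, qid_columns) >= k

# Hypothetical original table: the identifier (name) is already removed,
# but every record is still unique on its quasi-identifiers.
original = [
    {"gender": "F", "age": 29, "zip": "13053", "disease": "flu"},
    {"gender": "F", "age": 27, "zip": "13068", "disease": "cancer"},
    {"gender": "M", "age": 35, "zip": "14850", "disease": "flu"},
    {"gender": "M", "age": 38, "zip": "14853", "disease": "asthma"},
]

def generalize(record):
    """Coarsen the quasi-identifiers: age to a decade, zip to a 3-digit prefix."""
    decade = record["age"] // 10 * 10
    return {
        "gender": record["gender"],
        "age": f"{decade}-{decade + 9}",
        "zip": record["zip"][:3] + "**",
        "disease": record["disease"],  # the sensitive attribute is kept as-is
    }

QID = ("gender", "age", "zip")
anonymized = [generalize(r) for r in original]

print(k_of(original, QID))    # 1: each record can be singled out
print(k_of(anonymized, QID))  # 2: each record hides among 2 candidates
```

Generalization is only one of several anonymization operations; it is used here because it keeps the example short.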
2. PRIVACY PRESERVING PUBLISHING OF SOCIAL NETWORK DATA

 PPDP in the context of social networks mainly deals with
anonymizing graph data,
which is much more challenging than anonymizing relational
table data.

3. ATTACK MODEL

PRIVACY MODELS

 If a network satisfies k-NMF anonymity, then for each edge e
there will be at least k - 1 other edges with the same number of
mutual friends as e. It can be guaranteed that the probability of
an edge being identified is not greater than 1/k.

[Figure: a complete graph on four vertices (1 to 4) with six edges a to f. The endpoints of every edge share 2 mutual friends, so the graph satisfies 6-NMF anonymity.]
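The k-NMF condition can be verified with a few lines of code. The sketch below is a minimal illustration, assuming an undirected graph given as a list of edges; the function names are hypothetical. It counts the mutual friends of each edge's endpoints and reports the largest k for which the definition above holds, using the four-vertex example in which all six edges have 2 mutual friends.

```python
from collections import Counter
from itertools import combinations

def mutual_friend_counts(edges):
    """Map each edge (u, v) to the number of neighbors shared by u and v."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    return {(u, v): len(neighbors[u] & neighbors[v]) for u, v in edges}

def nmf_k(edges):
    """Largest k such that the graph satisfies k-NMF anonymity."""
    group_sizes = Counter(mutual_friend_counts(edges).values())
    return min(group_sizes.values()) if group_sizes else 0

# Complete graph on vertices 1..4: six edges, each with 2 mutual friends.
complete_graph = list(combinations([1, 2, 3, 4], 2))
print(nmf_k(complete_graph))  # 6

# A triangle with one pendant edge: the pendant edge (3, 4) is the only
# edge with 0 mutual friends, so it can be singled out (k = 1).
triangle_with_tail = [(1, 2), (2, 3), (1, 3), (3, 4)]
print(nmf_k(triangle_with_tail))  # 1
```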

DATA MINER

 CONCERN

 The primary concern of the data miner is how to prevent sensitive
information from appearing in the mining results.

 To perform privacy-preserving data mining, the data miner
usually needs to modify the data he got from the data collector.

APPROACHES

1. PRIVACY PRESERVING ASSOCIATION RULE MINING

 Various kinds of approaches have been proposed to perform
association rule hiding:
 Heuristic distortion approaches
 Heuristic blocking approaches
 Probabilistic distortion approaches
 Exact database distortion approaches
 Reconstruction-based approaches
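As a rough illustration of the first family, heuristic distortion, the sketch below lowers the support of one sensitive itemset by deleting a chosen item from supporting transactions until the itemset is no longer frequent. This is a simplified, hypothetical example: the function names, the choice of the "victim" item and the toy transactions are assumptions, and published heuristics pick the transactions and items far more carefully to limit side effects on non-sensitive rules.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def hide_itemset(transactions, sensitive, victim, min_support):
    """Delete `victim` from supporting transactions until the sensitive
    itemset's support drops below min_support. Returns a sanitized copy."""
    sanitized = [set(t) for t in transactions]
    sensitive = set(sensitive)
    for t in sanitized:
        if support(sanitized, sensitive) < min_support:
            break  # the itemset is no longer frequent, stop distorting
        if sensitive <= t:
            t.discard(victim)
    return sanitized

# Hypothetical market-basket data; {bread, milk} is treated as sensitive.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "beer"},
    {"bread", "milk", "eggs"},
    {"beer", "eggs"},
    {"bread", "eggs"},
]
sanitized = hide_itemset(transactions, {"bread", "milk"}, "milk", min_support=0.5)

print(support(transactions, {"bread", "milk"}))  # frequent before sanitization
print(support(sanitized, {"bread", "milk"}))     # below the threshold afterwards
```

Only one transaction is modified in this example, and the support of "bread" on its own is untouched, which is the kind of utility preservation these heuristics aim for.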

2. PRIVACY PRESERVING CLASSIFICATION

 Classification is a form of data analysis that extracts models
describing important data classes.

 To realize privacy-preserving decision tree mining:
 Dowd et al. proposed a data perturbation technique based on
random substitutions.
 Brickell and Shmatikov presented a cryptographically secure
protocol for privacy-preserving construction of decision trees.

DECISION MAKER

 CONCERN

• The privacy concerns of the decision maker are as follows:
 how to prevent unwanted disclosure of sensitive mining
results
 how to evaluate the credibility of the received mining results

APPROACHES

 Legal measures.
For example, making a contract with the data miner to forbid
the miner from disclosing the mining results to a third party.

 The decision maker can utilize methodologies from data
provenance, credibility analysis of web information, or other
related research fields.

DATA PROVENANCE
 The information that helps determine the derivation history of
the data, starting from the original source

 Two kinds of information:
 the ancestral data from which current data evolved
 the transformations applied to ancestral data that helped to
produce current data

 With such information, people can better understand the data
and judge the credibility of the data.
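The two kinds of provenance information can be pictured as a small record type. The sketch below is a hypothetical data structure, not a standard provenance format: each record names its ancestral data and the transformation that produced it, so a decision maker can walk the derivation history back to the original source.

```python
from dataclasses import dataclass, field

@dataclass
class ProvenanceRecord:
    data_id: str
    ancestors: list = field(default_factory=list)  # ancestral ProvenanceRecords
    transformation: str = "original source"        # how this data was produced

    def history(self):
        """List the derivation steps, starting from the original source."""
        steps = []
        for parent in self.ancestors:
            steps.extend(parent.history())
        steps.append(f"{self.data_id}: {self.transformation}")
        return steps

# Hypothetical derivation chain for a mining result.
raw = ProvenanceRecord("survey_raw")
cleaned = ProvenanceRecord("survey_clean", [raw], "removed duplicate rows")
report = ProvenanceRecord("mining_result", [cleaned], "association rule mining")

for step in report.history():
    print(step)
```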

WEB INFORMATION CREDIBILITY

 Five criteria help Internet users differentiate false information from the
truth:
1. Authority: the real author of false information is usually unclear.
2. Accuracy: false information does not contain accurate data.
3. Objectivity: false information is often prejudicial.
4. Currency: for false information, the data about its source, time and place
of origin is incomplete, out of date, or missing.
5. Coverage: false information usually contains no effective links to other
information online.

Privacy and Security Constraints

 Individual Privacy
 Nobody should know more about any entity after the data
mining than they did before
 Approaches: Data Obfuscation, Value swapping
 Organization Privacy
 Protect knowledge about a collection of entities
 Individual entity values may be known to all parties
 Which entities are at which site may be secret
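Value swapping, listed above as an approach to individual privacy, can be sketched as follows. This is a deliberately simplified illustration: it permutes one sensitive column across all records, whereas practical swapping schemes typically swap only a fraction of values, often between similar records. The function name, the fixed seed and the toy records are assumptions.

```python
import random
from collections import Counter

def swap_values(records, column, seed=0):
    """Return copies of the records with the values of `column` permuted
    among them, which weakens the link between a record and its value."""
    rng = random.Random(seed)
    values = [r[column] for r in records]
    rng.shuffle(values)
    return [{**r, column: v} for r, v in zip(records, values)]

# Hypothetical records: zip is a quasi-identifier, disease is sensitive.
patients = [
    {"zip": "13053", "disease": "flu"},
    {"zip": "13068", "disease": "cancer"},
    {"zip": "14850", "disease": "asthma"},
    {"zip": "14853", "disease": "flu"},
]
swapped = swap_values(patients, "disease")

# Summary statistics survive: the overall disease counts are unchanged.
before = Counter(r["disease"] for r in patients)
after = Counter(r["disease"] for r in swapped)
print(before == after)  # True
```

Because the column totals are preserved, summary results such as frequency counts can still be mined from the swapped data.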

Privacy constraints don’t prevent data mining

 Goal of data mining is summary results


 Association rules
 Classifiers
 Clusters
 The results alone need not violate privacy
 Contain no individually identifiable values
 Reflect overall results, not individual organizations
The problem is computing the results without access to
the data!

Example:
Association Rules

 Assume data is horizontally partitioned


 Each site has complete information on a set of entities
 Same attributes at each site
 If goal is to avoid disclosing entities, problem is easy
 Basic idea: Two-Phase Algorithm
 First phase: Compute candidate rules
 Frequent globally ⇒ frequent at some site
 Second phase: Compute frequency of candidates
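The two-phase idea above can be sketched in plain Python. This is a minimal, non-secure illustration under stated assumptions: each site is a list of transactions, support is a fraction of transactions, and itemsets are limited to a small size. Sites exchange only itemsets and counts, never individual transactions; the cryptographic protections of a real protocol are left out, and all names are hypothetical.

```python
from itertools import combinations

def local_counts(transactions, max_size=2):
    """Count every itemset of up to max_size items at one site."""
    counts = {}
    for t in transactions:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(t), size):
                counts[itemset] = counts.get(itemset, 0) + 1
    return counts

def two_phase_frequent(sites, min_support, max_size=2):
    per_site = [local_counts(s, max_size) for s in sites]

    # Phase 1: candidates are the itemsets frequent at some site, because
    # a globally frequent itemset must be frequent at one site or more.
    candidates = set()
    for site, counts in zip(sites, per_site):
        threshold = min_support * len(site)
        candidates |= {i for i, c in counts.items() if c >= threshold}

    # Phase 2: add up each candidate's count over all sites.
    total = sum(len(s) for s in sites)
    result = {}
    for itemset in candidates:
        count = sum(counts.get(itemset, 0) for counts in per_site)
        if count >= min_support * total:
            result[itemset] = count / total
    return result

# Hypothetical horizontally partitioned data: same attributes, different entities.
site_a = [{"bread", "milk"}, {"bread", "milk"}, {"bread"}, {"eggs"}]
site_b = [{"beer", "eggs"}, {"beer", "eggs"}, {"beer"}, {"bread", "milk"}]

print(two_phase_frequent([site_a, site_b], min_support=0.5))
# {('bread',): 0.5}
```

Itemsets such as ("bread", "milk") are frequent at one site and therefore become candidates, but the second phase discards them because their global support is below the threshold.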

Privacy-Preserving Data Mining: Who?

 Government / public agencies. Example:


 The Centers for Disease Control want to identify disease outbreaks
 Insurance companies have data on disease incidents, seriousness, patient
background, etc.
 But can/should they release this information?
 Industry Collaborations / Trade Groups. Example:
 An industry trade group may want to identify best practices to help
members
 But some practices are trade secrets
 How do we provide “commodity” results to all (Manufacturing using
chemical supplies from supplier X have high failure rates), while still
preserving secrets (manufacturing process Y gives low failure rates)?

Privacy-Preserving Data Mining: Who?

 Multinational Corporations
 A company would like to mine its data for globally valid
results
 But national laws may prevent transborder data sharing
 Public use of private data
 Data mining enables research studies of large populations
 But these populations are reluctant to release personal
information

Outline

 Privacy and Security Constraints


 Types: Individual, collection, result limitation
 Sources: Regulatory, Contractual, Secrecy
 Classes of solutions
 Data obfuscation
 Summarization
 Data separation
 When do we address these issues?

Technical Solutions

 Data Obfuscation based techniques


 Reconstructing distributions for developing classifiers
 Association rules from modified data
 Data Separation based techniques
 Overview of Secure Multiparty Computation
 Secure decision tree construction
 Secure association rules
 Secure clustering
 What if the secrets are in the results?

Individual Privacy:
Protect the “record”

 Individual item in database must not be disclosed


 Not necessarily a person
 Information about a corporation
 Transaction record
 Disclosure of parts of record may be allowed
 Individually identifiable information

Individually Identifiable Information

 Data that can’t be traced to an individual is not viewed
as private
 Remove “identifiers”
 But can we ensure it can’t be traced?
 Candidate Key in non-identifier information
 Unique values for some individuals
Data Mining enables such tracing!

Collection Privacy

 Disclosure of individual data may be okay


 Telephone book
 De-identified records
 Releasing the whole collection may cause problems
 Trade secrets – corporate plans
 Rules that reveal knowledge about the holder of data

Collection Privacy Example:
Corporate Phone Book
 Telephone Directory discloses
how to contact an individual
 Intended use
 Data Mining can find more
 Relative sizes of departments
 Use to predict corporate plans?
 Possible Solution: Obfuscation
 Fake entries in phone book
 Doesn’t prevent intended use
 Key: Define Intended Use
 Not always easy!
[Figure: data mining applied to the phone book reveals an unexpectedly high number of energy traders]

Sources of Constraints

 Regulatory requirements
 Contractual constraints
 Posted privacy policy
 Corporate agreements
 Secrecy concerns
 Secrets whose release could jeopardize plans
 Public Relations – “bad press”

European Union Data Protection Directives

 Directive 95/46/EC
 Passed European Parliament 24 October 1995
 Goal is to ensure free flow of information
 Must preserve privacy needs of member states
 Effective October 1998
 GDPR - General Data Protection Regulation
 seeks to regulate the use and disclosure of the personal data of all individuals within the EU
member states (28 at the time it was passed). Passed into law in May 2016, it became enforceable on May
25, 2018.
 Unlike most privacy regulations in the U.S., the EU defines the term “personal data” broadly—
it includes “any information relating to an identified or identifiable natural person (the ‘data
subject’).”
 This means that even the most basic contact information, such as business card details or simply a
name and email address, falls under the GDPR’s protections. Public sources of information, such
as a residential phone listing, are not exempted from the GDPR’s restrictions.

Technology Threats to Data Privacy

• The growing popularity and development of data mining technologies
bring a serious threat to the security of individuals' sensitive information.

• An emerging research topic in data mining, known as privacy-preserving
data mining (PPDM), has been extensively studied in recent years.

• The basic idea of PPDM is to modify the data in such a way that
data mining algorithms can be performed effectively without compromising the
security of sensitive information contained in the data.

REFERENCES

 Lei Xu, Chunxiao Jiang, Jian Wang, Jian Yuan, and Yong
Ren, “Information Security in Big Data: Privacy and Data Mining,”
IEEE Access, vol. 2.

 J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and
Techniques. San Mateo, CA, USA: Morgan Kaufmann, 2006.

Thank you
