WDS Unit 5 Notes
Privacy Preserving Data Mining (PPDM) aims to extract useful knowledge from large
datasets while ensuring that sensitive information remains protected. The techniques
in PPDM can be broadly categorized into three main approaches: data
anonymization, data perturbation, and secure multi-party computation.
1. Data Anonymization
Data anonymization techniques modify the data in a way that prevents the
identification of individual records while preserving the data's utility for analysis. Key
methods include:
k-Anonymity
l-Diversity
t-Closeness
Data perturbation techniques involve altering the data in a way that prevents
disclosure of sensitive information while allowing meaningful patterns to be mined.
Noise Addition
Data Swapping
Method: Swaps values between records in the dataset to break the link
between the data and the individuals.
Application: Used to protect categorical and numerical data.
Advantages: Preserves the overall data distribution.
Challenges: Can distort relationships between attributes (for example,
correlations between the swapped column and the others) if not done carefully.
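The swapping method described above can be sketched in a few lines. This is a minimal illustration, not a production technique: the function name `swap_column`, the field names, and the sample records are all hypothetical. It shuffles one column across all records, so the set of values in that column (and therefore its overall distribution) is unchanged, while the link between each value and its original record is broken.

```python
import random

def swap_column(records, column, seed=None):
    """Return a copy of `records` with the values of `column` shuffled across rows."""
    rng = random.Random(seed)
    values = [row[column] for row in records]
    rng.shuffle(values)  # the same values, reassigned to different rows
    return [{**row, column: value} for row, value in zip(records, values)]

# Hypothetical sample data.
records = [
    {"id": 1, "zip": "10001", "salary": 52000},
    {"id": 2, "zip": "10002", "salary": 61000},
    {"id": 3, "zip": "10003", "salary": 58000},
    {"id": 4, "zip": "10004", "salary": 75000},
]
swapped = swap_column(records, "salary", seed=7)
```

Note that a plain shuffle may leave some values in their original rows, and it ignores any relationship between salary and the other columns, which is where the risk of distortion comes from.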
Randomization
Homomorphic Encryption
Secret Sharing
Privacy vs. Utility Trade-off: All PPDM techniques need to balance the
trade-off between maintaining the utility of the data and ensuring privacy.
Evaluation Metrics: Common metrics for evaluating PPDM techniques include
information loss and data utility (usefulness) and the risk of
re-identification (privacy).
Context-specific Techniques: The choice of technique often depends on the
specific context and requirements of the data mining task and the sensitivity
of the data involved.
Applications of PPDM
Key Concepts
1. Bayesian Inference
Prior Distribution: Represents the initial beliefs about the data before
observing any evidence.
Posterior Distribution: Updated beliefs after considering the observed
data.
Importance: Helps in understanding how new data affects the
probability of re-identification.
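As a small worked sketch of the prior-to-posterior update, assume an adversary who considers a handful of candidate individuals and observes some attribute values in a published record. The candidate names, the prior, and the likelihood values below are all made up for illustration.

```python
def posterior(prior, likelihood):
    """Bayes' rule over discrete candidates: posterior is proportional to prior * likelihood."""
    unnormalized = {c: prior[c] * likelihood[c] for c in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("the observed evidence has zero probability under this prior")
    return {c: p / total for c, p in unnormalized.items()}

# Hypothetical prior: four candidates, equally likely before any evidence.
prior = {"alice": 0.25, "bob": 0.25, "carol": 0.25, "dave": 0.25}
# Hypothetical likelihoods: how probable the observed attribute values are
# for each candidate (e.g. a published age range that fits alice and bob well).
likelihood = {"alice": 0.9, "bob": 0.9, "carol": 0.1, "dave": 0.1}

post = posterior(prior, likelihood)
```

With these numbers the probability that the record belongs to alice rises from 0.25 to 0.45, which shows how a single released attribute can shift the re-identification probability.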
3. Privacy Metrics
Risk Assessment: Measures the risk that a published dataset allows
re-identification of individuals.
Example Metric: Re-identification probability.
Utility Metrics: Evaluates the usefulness of the anonymized data for
analysis purposes.
Example Metric: Information loss or data utility score.
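Both kinds of metric can be sketched under simple assumptions. For risk, the sketch assumes an adversary who knows a target's quasi-identifier values and picks uniformly among the matching records, so each record's risk is 1 divided by the size of its group. For utility, it assumes a numeric attribute generalized to intervals and measures the average interval width as a fraction of the attribute's full range. The function names and the sample table are hypothetical.

```python
from collections import Counter

def reidentification_risk(records, quasi_identifiers):
    """Return (max_risk, avg_risk); a record's risk is 1 / size of its matching group."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    sizes = Counter(keys)
    risks = [1 / sizes[k] for k in keys]
    return max(risks), sum(risks) / len(risks)

def range_information_loss(intervals, domain_min, domain_max):
    """Average width of the generalized intervals as a fraction of the domain width."""
    domain_width = domain_max - domain_min
    return sum((hi - lo) / domain_width for lo, hi in intervals) / len(intervals)

# Hypothetical published table: three records share one group, one stands alone.
published = [
    {"age": "30-39", "zip": "100**"},
    {"age": "30-39", "zip": "100**"},
    {"age": "30-39", "zip": "100**"},
    {"age": "40-49", "zip": "100**"},
]
max_risk, avg_risk = reidentification_risk(published, ["age", "zip"])
# Ages assumed to lie between 20 and 60 in the original data.
loss = range_information_loss([(30, 39), (30, 39), (30, 39), (40, 49)], 20, 60)
```

Here the lone record in the 40-49 group drives the maximum risk to 1.0, even though the average risk is lower, which is why both figures are usually worth reporting.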
4. Differential Privacy
2. Probabilistic Modeling
3. Privacy-Preserving Queries
Practical Approaches
1. Bayesian Networks for Data Anonymization
Key Concepts
2. Location Anonymization
3. Policy Enforcement
Spatial Cloaking:
Concept: User’s location is reported as a larger, less specific
area.
Example: Instead of exact GPS coordinates, report a city or
neighborhood.
Benefit: Reduces the risk of precise tracking while allowing
service functionality.
Location Obfuscation:
Concept: Introduce small, random errors to the reported
location.
Example: Adding random noise to the exact coordinates.
Benefit: Makes it difficult for adversaries to pinpoint the exact
location.
Pseudonymization:
Concept: Replace real user identities with pseudonyms.
Application: Users interact with LBS using temporary
pseudonyms.
Benefit: Prevents long-term tracking of individuals.
Mix Zones:
Concept: Areas where users can change pseudonyms to break
the link between old and new identities.
Implementation: Strategic placement of mix zones where users'
movements are harder to correlate.
Benefit: Enhances privacy by periodically changing user
identifiers.
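Spatial cloaking and location obfuscation, as described above, can be sketched as two small functions. This is an illustration under simplifying assumptions: cloaking snaps a point to the centre of a fixed grid cell, and obfuscation adds a bounded uniform offset measured in degrees (which ignores that a degree of longitude covers less ground at higher latitudes). The function names, cell size, and offset bound are hypothetical choices.

```python
import math
import random

def cloak(lat, lon, cell_deg=0.1):
    """Spatial cloaking: report the centre of the grid cell that contains the point."""
    def snap(value):
        return (math.floor(value / cell_deg) + 0.5) * cell_deg
    return snap(lat), snap(lon)

def obfuscate(lat, lon, max_offset_deg=0.01, rng=None):
    """Location obfuscation: add a bounded, uniformly random offset to each coordinate."""
    rng = rng or random.Random()
    return (lat + rng.uniform(-max_offset_deg, max_offset_deg),
            lon + rng.uniform(-max_offset_deg, max_offset_deg))

exact = (40.7128, -74.0060)
cloaked = cloak(*exact)                           # same answer for every point in the cell
noisy = obfuscate(*exact, rng=random.Random(3))   # a nearby but different point
```

Cloaking is deterministic, so repeated reports from the same cell reveal nothing new, whereas repeated obfuscated reports from a fixed position could be averaged by an adversary to estimate the true location.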
3. Privacy-preserving Protocols
Dummy Requests:
Concept: Users send multiple fake location requests alongside
the real one.
Benefit: Obscures the real query among several dummies,
making it harder to identify the actual location.
Spatial and Temporal Cloaking:
Concept: Delay or alter the precision of location data to hide
exact movements.
Benefit: Provides a balance between service accuracy and user
privacy.
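The dummy request idea can be sketched as follows. The sketch assumes dummies are drawn uniformly from a bounding box around the service area, which is a simplification; the function name `build_query_batch` and the coordinates are hypothetical.

```python
import random

def build_query_batch(real_location, num_dummies, bounds, rng=None):
    """Return a shuffled list holding the real location plus `num_dummies` fake ones."""
    rng = rng or random.Random()
    (lat_min, lat_max), (lon_min, lon_max) = bounds
    batch = [real_location]
    for _ in range(num_dummies):
        batch.append((rng.uniform(lat_min, lat_max), rng.uniform(lon_min, lon_max)))
    rng.shuffle(batch)  # hide the position of the real query within the batch
    return batch

real = (40.7128, -74.0060)
bounds = ((40.6, 40.9), (-74.1, -73.9))
batch = build_query_batch(real, num_dummies=4, bounds=bounds, rng=random.Random(1))
```

The client sends every location in the batch and keeps only the answer for the real one. Uniformly random dummies can land in implausible places, which may let an adversary filter them out, and each extra dummy adds network and server cost.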
Practical Implementations
Balance Between Privacy and Utility: The primary goal is to protect user
privacy while maintaining the utility of location-based services.
Dynamic and Context-aware: Access control policies need to be dynamic
and adaptable to changing contexts to effectively protect privacy.
User Empowerment: Users should have control over their location data and
be able to manage privacy settings according to their preferences.
Robust Enforcement: Ensuring that privacy policies are strictly enforced
through continuous monitoring and audits is crucial for maintaining trust and
compliance.
Applications
Key Concepts
1. Data Anonymization
2. Anonymization Frameworks
3. Privacy Models
k-Anonymity
Definition: Ensures that each record is indistinguishable from at
least k−1 other records based on certain identifying
attributes.
Techniques: Generalization and suppression to achieve
equivalence classes.
Limitations: Vulnerable to homogeneity and background-knowledge
attacks when the sensitive values within an equivalence class lack
diversity.
l-Diversity
Definition: Enhances k-anonymity by ensuring that each
equivalence class has at least l diverse values for sensitive
attributes.
Techniques: Grouping records to ensure sensitive attribute
diversity.
Limitations: Hard to satisfy when the sensitive attribute has a
skewed distribution, because rare values cannot be spread across
every equivalence class.
t-Closeness
Definition: Ensures that the distribution of a sensitive attribute
in any equivalence class is close to the distribution of the
attribute in the overall dataset.
Techniques: Measures and minimizes the distance between
distributions using metrics like Earth Mover's Distance.
Advantages: Limits what an adversary can learn about the
sensitive attribute from knowing a record's equivalence class.
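The three privacy models above can be checked on a small table with a sketch like the one below. It makes several assumptions: l-diversity is taken in its simplest "distinct values" form, and the sensitive attribute is categorical with all values treated as equally far apart, in which case Earth Mover's Distance reduces to half the sum of absolute differences between the two distributions. All function names and the sample table are hypothetical.

```python
from collections import Counter, defaultdict

def equivalence_classes(records, quasi_identifiers):
    """Group records that share the same quasi-identifier values."""
    classes = defaultdict(list)
    for r in records:
        classes[tuple(r[q] for q in quasi_identifiers)].append(r)
    return classes

def is_k_anonymous(records, quasi_identifiers, k):
    classes = equivalence_classes(records, quasi_identifiers)
    return all(len(group) >= k for group in classes.values())

def is_l_diverse(records, quasi_identifiers, sensitive, l):
    """Distinct l-diversity: every class holds at least l different sensitive values."""
    classes = equivalence_classes(records, quasi_identifiers)
    return all(len({r[sensitive] for r in group}) >= l for group in classes.values())

def max_distance_to_overall(records, quasi_identifiers, sensitive):
    """Largest distance between a class's sensitive distribution and the overall one."""
    def distribution(rows):
        counts = Counter(r[sensitive] for r in rows)
        return {value: n / len(rows) for value, n in counts.items()}
    overall = distribution(records)
    worst = 0.0
    for group in equivalence_classes(records, quasi_identifiers).values():
        local = distribution(group)
        distance = 0.5 * sum(abs(local.get(v, 0.0) - p) for v, p in overall.items())
        worst = max(worst, distance)
    return worst

# Hypothetical table, already generalized on age and zip.
table = [
    {"age": "30-39", "zip": "100**", "disease": "flu"},
    {"age": "30-39", "zip": "100**", "disease": "flu"},
    {"age": "40-49", "zip": "100**", "disease": "flu"},
    {"age": "40-49", "zip": "100**", "disease": "cancer"},
]
qi = ["age", "zip"]
k_ok = is_k_anonymous(table, qi, k=2)
l_ok = is_l_diverse(table, qi, "disease", l=2)
t_needed = max_distance_to_overall(table, qi, "disease")
```

This table is 2-anonymous but not 2-diverse: everyone in the 30-39 class has the same disease, so knowing someone is in that class reveals their diagnosis. The table satisfies t-closeness only for thresholds t at or above the value in `t_needed`.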
Key Techniques in Anonymization
1. Generalization
Concept: Replace specific values with more general ones to reduce the
risk of re-identification.
Example: Replacing exact ages with age ranges (e.g., 30-35 instead of
32).
Application: Effective for categorical and numerical data.
2. Suppression
Concept: Remove specific values or entire records that pose a high risk
of re-identification.
Example: Suppressing rare or unique combinations of attributes.
Application: Used selectively to protect high-risk data points.
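Generalization and suppression are often combined: values are generalized first, and any records that still sit in groups that are too small are suppressed. The sketch below assumes a single quasi-identifier (age), fixed-width ranges, and a group-size threshold k; the function names and sample data are hypothetical.

```python
from collections import Counter

def generalize_age(age, width=10):
    """Replace an exact age with the fixed-width range that contains it."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def anonymize(records, k, width=10):
    """Generalize ages, then suppress records whose age range has fewer than k members."""
    generalized = [{**r, "age": generalize_age(r["age"], width)} for r in records]
    sizes = Counter(r["age"] for r in generalized)
    return [r for r in generalized if sizes[r["age"]] >= k]

# Hypothetical records: the 71-year-old is unique even after generalization.
people = [
    {"age": 32, "diagnosis": "flu"},
    {"age": 35, "diagnosis": "asthma"},
    {"age": 38, "diagnosis": "flu"},
    {"age": 71, "diagnosis": "diabetes"},
]
released = anonymize(people, k=2)
```

Wider ranges reduce the number of suppressed records but lose more detail, so the range width and k are usually tuned together.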
3. Data Perturbation
Principles of Anonymization
2. Data Utility
Goal: Ensure the anonymized data remains useful for its intended
purposes.
Metrics: Information loss metrics, data utility scores, and usability
evaluations.
3. Privacy Guarantee
Practical Implementations
1. Anonymization Algorithms
2. Anonymization Tools
Applications
By applying these frameworks and principles, organizations can publish data that is
both useful for analysis and safe from privacy breaches, adhering to ethical standards
and regulatory requirements.
Key Concepts
1. Pseudonymization
Concept: Mix zones are special areas where users can change pseudonyms to
break the link between their old and new identities.
Implementation: Deploying mix zones at strategic locations such as
intersections or public transport hubs.
Advantages: Increases anonymity by preventing continuous tracking across
pseudonym changes.
Challenges: Placement and density of mix zones need careful consideration to
be effective.
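A pseudonym change at a mix zone can be sketched with a toy model. The assumptions here are that space is divided into named zones, that users send no location reports while inside a mix zone, and that a fresh random pseudonym is chosen there. The class name `PseudonymManager` and the zone names are hypothetical.

```python
import secrets

class PseudonymManager:
    """Toy model: users keep a pseudonym until they pass through a mix zone."""

    def __init__(self, mix_zones):
        self.mix_zones = set(mix_zones)
        self._current = {}  # user id -> current pseudonym

    def report(self, user_id, zone):
        """Return the pseudonym to attach to a report from `zone`, or None if silent."""
        if zone in self.mix_zones:
            # Inside a mix zone: send nothing and pick a fresh pseudonym for later.
            self._current[user_id] = secrets.token_hex(8)
            return None
        if user_id not in self._current:
            self._current[user_id] = secrets.token_hex(8)
        return self._current[user_id]

manager = PseudonymManager(mix_zones={"central_station"})
before_1 = manager.report("user-1", "main_street")
before_2 = manager.report("user-1", "main_street")
inside = manager.report("user-1", "central_station")
after = manager.report("user-1", "park_road")
```

The old and new pseudonyms are unrelated random values, but the unlinking only works if several users are in the zone at around the same time; a single user entering and leaving is trivially matched.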
Definition: Dummy location generation produces fake location data alongside
real data to obscure actual user movements.
Techniques: Sending multiple fake location requests to LBS providers.
Advantages: Makes it difficult for adversaries to distinguish between real and
fake locations.
Challenges: Balancing the number of dummy locations to avoid excessive
resource use.
3. Location Obfuscation
Practical Implementations
3. Privacy-preserving Algorithms
Applications
Healthcare: Protecting the privacy of patients using location-based health monitoring
services.
Urban Mobility: Enhancing privacy for users of public transport and ride-sharing
services.
Social Networking: Providing location-based social networking features without
compromising user privacy.
By applying these techniques, LBS providers can ensure that users enjoy the benefits of
location-based services while maintaining their privacy and trust.
Key Concepts
2. Security Policies
Access Control: Rules determining who can access what resources and
under what conditions.
Data Encryption: Protecting data in transit and at rest to prevent
unauthorized access.
Authentication: Verifying user identity before granting access to
resources.
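The access control and authentication rules above can be combined in a small policy check. This sketch assumes role-based rules with one condition (permitted hours of the day) and treats authentication as a boolean decided elsewhere; the `Rule` fields, role names, and resource names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    role: str
    resource: str
    allowed_hours: range  # hours of the day (0-23) during which access is permitted

def is_allowed(rules, role, resource, hour, authenticated):
    """Grant access only to authenticated users who match a rule whose condition holds."""
    if not authenticated:
        return False
    return any(
        rule.role == role and rule.resource == resource and hour in rule.allowed_hours
        for rule in rules
    )

# Hypothetical policy: dispatchers may read vehicle locations from 08:00 to 17:59.
rules = [Rule(role="dispatcher", resource="vehicle_location", allowed_hours=range(8, 18))]
daytime = is_allowed(rules, "dispatcher", "vehicle_location", hour=9, authenticated=True)
night = is_allowed(rules, "dispatcher", "vehicle_location", hour=20, authenticated=True)
```

Access is denied by default: a request succeeds only if some rule explicitly covers the role, the resource, and the current condition.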
3. Privacy Policies
2. Data Encryption
1. Optimized Algorithms
2. Context-aware Mechanisms
3. Resource Management
Energy-efficient Protocols: Design protocols that minimize energy
consumption, such as reducing the frequency of cryptographic
operations.
Caching and Preprocessing: Cache frequent data access requests and
preprocess data to reduce the computational burden on mobile
devices.
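The caching idea can be illustrated with Python's built-in `functools.lru_cache`. The function `lookup_nearby` is a hypothetical stand-in for an expensive operation (for example, a network request or a cryptographic computation), and the counter exists only to show how often the real work runs.

```python
from functools import lru_cache

calls = {"count": 0}  # how many times the expensive work actually ran

@lru_cache(maxsize=128)
def lookup_nearby(cell_id):
    """Stand-in for an expensive request; repeated arguments are served from the cache."""
    calls["count"] += 1
    return f"results-for-{cell_id}"

first = lookup_nearby("cell-42")   # runs the expensive work
second = lookup_nearby("cell-42")  # served from the cache
```

A bounded cache size keeps memory use predictable on a mobile device, at the cost of recomputing entries that have been evicted.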
Applications