Unit 4
1. Data Governance
Data governance is the framework that ensures data is
managed as a valuable organizational asset. It encompasses
policies, processes, and standards to ensure data quality,
security, and compliance.
Key Principles
Accountability: Assigning roles and responsibilities for
data ownership and stewardship.
Data Quality: Maintaining accurate, complete, and timely
data.
Transparency: Ensuring clear documentation of data-
related policies and processes.
Compliance: Adhering to laws, regulations, and
standards relevant to the organization's industry.
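As a rough illustration of the Data Quality principle, the sketch below
scores a batch of records for completeness and timeliness. The function
name quality_report, the updated_at field, and the max_age threshold are
assumptions made for this example, not part of any standard or tool.

```python
from datetime import datetime, timedelta, timezone


def quality_report(records, required_fields, max_age, now=None):
    """Return completeness and timeliness ratios (0.0 to 1.0) for dict records.

    Hypothetical helper: a record is "complete" when every required field is
    present and non-empty, and "timely" when its updated_at timestamp is no
    older than max_age.
    """
    now = now or datetime.now(timezone.utc)
    total = len(records)
    if total == 0:
        # An empty batch has nothing that violates either rule.
        return {"completeness": 1.0, "timeliness": 1.0}

    complete = sum(
        1 for r in records
        if all(r.get(f) not in (None, "") for f in required_fields)
    )
    timely = sum(
        1 for r in records
        if r.get("updated_at") is not None and now - r["updated_at"] <= max_age
    )
    return {"completeness": complete / total, "timeliness": timely / total}
```

In practice such ratios would be compared against thresholds agreed in
the governance policy, with accuracy checked separately against a
trusted reference source.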
Core Components
Policies and Standards: Rules governing data access,
usage, and sharing.
Data Stewardship: Designated personnel responsible for
specific data assets.
Technology: Tools like data catalogs, governance
platforms, and metadata management systems.
Benefits
Improved decision-making.
Enhanced regulatory compliance.
Mitigation of risks like data breaches.
2. Data Lineage
Data lineage is the record of data's life cycle: where it
originates, how it is transformed, and how it moves across systems.
Importance
Transparency: Helps stakeholders understand how data is
processed and transformed.
Debugging and Issue Resolution: Identifying errors and
tracking their sources.
Compliance: Demonstrating the data's journey to
regulators.
Core Elements
Source Data: The origin of the data.
Transformations: Changes made to data during
processing.
Destinations: Endpoints where data is stored or used.
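The three core elements can be modelled as a small directed graph. The
sketch below is a minimal in-memory illustration, not how Apache Atlas
or any other lineage tool works internally; the class LineageGraph and
the dataset names are hypothetical.

```python
from collections import defaultdict


class LineageGraph:
    """Toy lineage store: edges point from a source to a destination dataset."""

    def __init__(self):
        # destination -> list of (source, transformation description)
        self._upstream = defaultdict(list)

    def add_edge(self, source, destination, transformation):
        self._upstream[destination].append((source, transformation))

    def trace(self, dataset):
        """Return every upstream dataset of `dataset`, nearest first."""
        seen, order, queue = set(), [], [dataset]
        while queue:
            current = queue.pop(0)
            for source, _ in self._upstream.get(current, []):
                if source not in seen:
                    seen.add(source)
                    order.append(source)
                    queue.append(source)
        return order

    def origins(self, dataset):
        """Upstream datasets with no recorded upstream of their own (source data)."""
        return [d for d in self.trace(dataset) if not self._upstream.get(d)]


graph = LineageGraph()
graph.add_edge("crm.customers_raw", "staging.customers_clean", "deduplicate rows")
graph.add_edge("staging.customers_clean", "analytics.customer_summary", "aggregate by region")
graph.add_edge("erp.orders_raw", "analytics.customer_summary", "join on customer_id")

print(graph.trace("analytics.customer_summary"))
print(graph.origins("analytics.customer_summary"))
```

Walking the graph backwards from a destination is what supports the
debugging use case: an error seen in a report can be followed through
each transformation to the source that introduced it.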
Tools
Open-source tools: Apache Atlas, Amundsen.
Proprietary tools: Informatica, Collibra.
4. Compliance Requirements
Compliance in data engineering ensures adherence to legal,
ethical, and regulatory standards.
Key Regulations
GDPR (General Data Protection Regulation): EU regulation
governing how the personal data of individuals in the EU is
collected, processed, and protected.
CCPA (California Consumer Privacy Act): Grants
California residents rights over their personal data.
HIPAA (Health Insurance Portability and Accountability
Act): Protects health information in the US.
SOX (Sarbanes-Oxley Act): Requires US public companies to
maintain accurate financial records and internal controls over
financial reporting.
Compliance Strategies
Data minimization and encryption.
Access controls and audit trails.
Regular compliance audits.
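The strategies above can be sketched with the Python standard library
alone. This is an illustration under stated assumptions, not a
compliance-grade implementation: minimize, pseudonymize, and
AccessController are hypothetical names, and keyed hashing (HMAC) is
used here as a pseudonymization step in place of encryption, which
would normally rely on a vetted cryptography library.

```python
import hashlib
import hmac
from datetime import datetime, timezone


def minimize(record, allowed_fields):
    """Data minimization: keep only the fields the use case actually needs."""
    return {k: v for k, v in record.items() if k in allowed_fields}


def pseudonymize(value, secret_key):
    """Replace an identifier with a keyed SHA-256 digest (not reversible, not encryption)."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


class AccessController:
    """Role-based read checks that record every decision in an audit trail."""

    def __init__(self, permissions):
        self._permissions = permissions  # role -> set of readable dataset names
        self.audit_log = []

    def can_read(self, role, dataset):
        allowed = dataset in self._permissions.get(role, set())
        # Log denied attempts as well as granted ones.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "role": role,
            "dataset": dataset,
            "allowed": allowed,
        })
        return allowed
```

Logging denied requests alongside granted ones is a deliberate choice:
a regular compliance audit usually needs to see attempted access, not
only successful access.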
5. Metadata Management
Metadata is "data about data" and is critical for
understanding, managing, and using data.
Types of Metadata
1. Technical Metadata: Schema, file size, format.
2. Business Metadata: Definitions, business rules.
3. Operational Metadata: Data lineage, logs.
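One way to picture the three types is as separate parts of a single
record attached to each dataset. The dataclasses below are a
hypothetical layout chosen for this example; real metadata tools define
their own, richer models.

```python
from dataclasses import dataclass, field


@dataclass
class TechnicalMetadata:
    schema: dict       # column name -> data type
    file_format: str   # e.g. "parquet"
    size_bytes: int


@dataclass
class BusinessMetadata:
    definition: str
    business_rules: list = field(default_factory=list)


@dataclass
class OperationalMetadata:
    upstream_datasets: list = field(default_factory=list)  # lineage
    run_log: list = field(default_factory=list)            # job log entries


@dataclass
class DatasetMetadata:
    name: str
    technical: TechnicalMetadata
    business: BusinessMetadata
    operational: OperationalMetadata


orders = DatasetMetadata(
    name="sales.orders",
    technical=TechnicalMetadata({"order_id": "int", "amount": "decimal"}, "parquet", 2048),
    business=BusinessMetadata("One row per confirmed customer order.", ["amount >= 0"]),
    operational=OperationalMetadata(["erp.orders_raw"], ["nightly load succeeded"]),
)
```

Keeping the three parts separate reflects who typically maintains them:
systems populate technical and operational metadata automatically,
while business metadata is usually written by data stewards.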
Benefits
Easier data discovery.
Enhanced data governance.
Better decision-making support.
Tools
Apache Atlas, Amundsen, Informatica, Collibra.
6. Data Cataloging
A data catalog is a curated inventory of data assets that helps
users discover, access, and understand data.
Features
Searchable Interface: Users can locate datasets easily.
Data Lineage Tracking: Shows the origin and
transformations of datasets.
Data Quality Indicators: Highlights the reliability of
datasets.
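The features listed above can be illustrated with a toy in-memory
catalog. The sketch assumes a hypothetical CatalogEntry with a
quality_score between 0 and 1 and does not reflect the API of AWS Glue
or any other catalog service.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    description: str
    tags: set = field(default_factory=set)
    upstream: list = field(default_factory=list)  # lineage: source datasets
    quality_score: float = 0.0                    # hypothetical 0..1 indicator


class DataCatalog:
    """Toy catalog: register entries, then search them by keyword."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def search(self, keyword):
        """Case-insensitive match on name, description, or tags; best quality first."""
        kw = keyword.lower()
        hits = [
            e for e in self._entries.values()
            if kw in e.name.lower()
            or kw in e.description.lower()
            or any(kw in t.lower() for t in e.tags)
        ]
        return sorted(hits, key=lambda e: e.quality_score, reverse=True)
```

Sorting results by the quality indicator is one possible design choice;
it nudges users toward the more reliable of several similar datasets.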
Advantages
Accelerates data democratization.
Enhances collaboration between teams.
Facilitates compliance efforts.
Tools
AWS Glue Data Catalog, Google Cloud Data Catalog, Azure
Data Catalog.