_Data_Engineering_101_1731168906

Uploaded by

Nidhi Ahuja
Copyright © All Rights Reserved

DATA ENGINEERING 101
IMPORTANT CONCEPTS TO KNOW

DATA WAREHOUSING
Centralized storage that combines data from multiple sources, designed for query and analysis.

Building a data warehouse using Snowflake to integrate and store sales, marketing, and CRM data.

Shwetank Singh
GritSetGrow - GSGLearn.com

ETL / ELT PROCESSES

Extract, Transform, Load (ETL) or Extract, Load, Transform (ELT) for data integration.

Using Azure Data Factory to extract, transform, and load data from an on-premises SQL Server database into Azure Synapse Analytics.
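
A minimal sketch of the ETL pattern, using Python's built-in sqlite3 module as a stand-in for both the source and the target system. The table names, columns, and the cents-to-amount transform are illustrative assumptions; none of this is Azure Data Factory or Synapse code.

```python
import sqlite3

def extract(source):
    # Extract: pull raw rows from the (hypothetical) source table.
    return source.execute("SELECT id, amount_cents, country FROM raw_orders").fetchall()

def transform(rows):
    # Transform: convert cents to a decimal amount and normalize country codes.
    return [(oid, cents / 100.0, country.strip().upper()) for oid, cents, country in rows]

def load(target, rows):
    # Load: write the cleaned rows into the target table.
    target.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    target.commit()

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER, country TEXT)")
source.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", [(1, 1999, " us"), (2, 500, "in ")])

target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE orders (id INTEGER, amount REAL, country TEXT)")

load(target, transform(extract(source)))
loaded = target.execute("SELECT id, amount, country FROM orders ORDER BY id").fetchall()
```

In ELT the last two steps swap: raw rows are loaded first and then transformed inside the target system.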


DATA MODELING

Designing data models such as star schema and snowflake schema to optimize for queries.

Creating star schemas for a retail company to model sales and products.
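
A small sketch of a star schema for the retail case, assuming one fact table and one dimension table. The names fact_sales and dim_product and the sample rows are made up for illustration, and SQLite stands in for a real warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension table: one row per product (descriptive attributes).
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    -- Fact table: one row per sale, with a foreign key to the dimension.
    CREATE TABLE fact_sales (
        sale_id INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES dim_product(product_id),
        quantity INTEGER,
        revenue REAL
    );
    INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
    INSERT INTO fact_sales VALUES (1, 1, 2, 2000.0), (2, 1, 1, 1000.0), (3, 2, 4, 800.0);
""")

# Typical star-schema query: join the fact to a dimension, then aggregate.
revenue_by_category = conn.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
```

A snowflake schema would go one step further and split dim_product into further normalized tables, for example a separate category table.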


BATCH PROCESSING

Processing data in chunks or batches on a scheduled basis.

Using Apache Spark to process sales data from yesterday's transactions.


STREAM PROCESSING

Processing data in real time or near-real time as it's produced.

Using Apache Kafka with Apache Flink for real-time fraud detection on transaction data.
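
A toy sketch of the idea in plain Python, not Kafka or Flink: events are consumed one at a time and a sliding window per card flags bursts of activity. The rule (more than 3 transactions within 60 seconds) and the event shape are assumptions chosen only for illustration.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60      # hypothetical rule parameters
MAX_TXNS_IN_WINDOW = 3

def detect_bursts(events):
    """Yield (card_id, timestamp) whenever a card exceeds the limit within the window.

    `events` is any iterable of (timestamp_seconds, card_id) pairs in time order,
    consumed one at a time, as a stream processor would.
    """
    recent = defaultdict(deque)  # card_id -> timestamps still inside the window
    for ts, card in events:
        window = recent[card]
        window.append(ts)
        # Evict timestamps that have fallen out of the sliding window.
        while window and window[0] <= ts - WINDOW_SECONDS:
            window.popleft()
        if len(window) > MAX_TXNS_IN_WINDOW:
            yield card, ts

stream = [(0, "A"), (10, "A"), (15, "B"), (20, "A"), (30, "A"), (100, "A")]
alerts = list(detect_bursts(stream))
```

Because detect_bursts is a generator, each alert is produced as soon as the offending event arrives rather than after the whole input has been read.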


DATA LAKES

Centralized repositories for storing structured, semi-structured, and unstructured data at scale.

Building a data lake using Azure Data Lake to store IoT data, social media feeds, and logs.


DATA LAKEHOUSE

Architecture that combines data lakes and data warehouses for unified analytics.

Using Delta Lake to enable both batch and real-time analytics on the same data in Azure.


DATA PARTITIONING

Splitting large datasets into smaller partitions to improve query performance and manageability.

Partitioning files in an S3 bucket by year, month, and day for better query performance in Amazon Athena.
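
A sketch of how a date-partitioned layout might look, assuming Hive-style key=value folder names. The prefix, file names, and the prune helper are hypothetical; prune only imitates in plain Python what a query engine does when it skips partitions.

```python
from datetime import date

def partition_key(prefix, day, filename):
    # Hive-style key=value folders, a layout that query engines can use to
    # skip partitions that a query's filter rules out.
    return f"{prefix}/year={day.year}/month={day.month:02d}/day={day.day:02d}/{filename}"

def prune(keys, year, month):
    # Toy version of partition pruning: keep only keys in the requested month.
    wanted = f"/year={year}/month={month:02d}/"
    return [k for k in keys if wanted in k]

keys = [
    partition_key("sales", date(2024, 1, 5), "part-0.parquet"),
    partition_key("sales", date(2024, 1, 6), "part-0.parquet"),
    partition_key("sales", date(2024, 2, 1), "part-0.parquet"),
]
january = prune(keys, 2024, 1)
```

A query filtered to January 2024 then only needs to read the two files under month=01 instead of the whole dataset.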


DATA SHARDING

Breaking down large datasets horizontally across multiple databases to improve scalability.

Sharding user data across multiple PostgreSQL instances to distribute the load.
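
One simple way to route users to shards is hash-based sharding, sketched below. The shard names are placeholders, and the modulo scheme is only one option among several.

```python
import hashlib

SHARDS = ["pg-shard-0", "pg-shard-1", "pg-shard-2"]  # hypothetical instance names

def shard_for(user_id, shards=SHARDS):
    # Use a stable hash so the same user maps to the same shard across runs
    # and machines (Python's built-in hash() is randomized per process for
    # strings, so it is not suitable here).
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]

placement = {uid: shard_for(uid) for uid in range(1000)}
```

With plain modulo routing, changing the number of shards remaps most keys; consistent hashing is a common way to limit that movement.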


DATA PIPELINES

Automated workflows that move data between systems on a schedule or in response to a trigger.

Building an Airflow DAG to automate the ETL process for daily sales data.


DATA QUALITY

Ensuring accuracy, completeness, consistency, and validity of data.

Implementing data validation rules to check for null values or duplicates using Great Expectations.


DATA LINEAGE

Tracking the flow of data from source to destination for audit and troubleshooting.

Using tools like Apache Atlas or DataHub to track data lineage in an ETL pipeline.


SCHEMA EVOLUTION

The ability to adapt to changes in the structure of data sources and their schemas.

Handling new columns in a table in a Spark job without breaking the existing pipeline.
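
The same idea in plain Python rather than Spark: a record with an unexpected column is conformed to the known schema instead of causing a failure. The column names, defaults, and the conform helper are assumptions made for this sketch.

```python
EXPECTED_COLUMNS = {"order_id": None, "amount": 0.0}  # known column -> default

def conform(record, expected=EXPECTED_COLUMNS):
    """Return (row, extras): the row has exactly the expected columns;
    extras holds any new, unrecognized columns instead of failing on them."""
    row = {col: record.get(col, default) for col, default in expected.items()}
    extras = {k: v for k, v in record.items() if k not in expected}
    return row, extras

old_record = {"order_id": 1, "amount": 9.5}
new_record = {"order_id": 2, "amount": 4.0, "coupon_code": "SPRING"}  # source added a column

row_old, extras_old = conform(old_record)
row_new, extras_new = conform(new_record)
```

Keeping the extras separately lets the pipeline continue while someone decides whether the new column should become part of the expected schema.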


CHANGE DATA CAPTURE (CDC)

Capturing and tracking changes in source data for real-time updates.

Using Debezium to track changes in a MySQL database and publish them to a Kafka topic.
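
A sketch of what a downstream consumer might do with change events: replay them in order to keep a replica in sync. The event shape (op, before, after) is a simplified assumption loosely modeled on the kind of change events CDC tools emit, not a complete Debezium envelope, and no Kafka client is involved.

```python
def apply_change(table, event):
    """Apply one change event to `table`, a dict keyed by primary key."""
    op = event["op"]
    if op in ("c", "u"):          # create or update: upsert the new row image
        row = event["after"]
        table[row["id"]] = row
    elif op == "d":               # delete: remove using the old row image
        table.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unknown op: {op!r}")

events = [
    {"op": "c", "before": None, "after": {"id": 1, "email": "a@example.com"}},
    {"op": "c", "before": None, "after": {"id": 2, "email": "b@example.com"}},
    {"op": "u", "before": {"id": 1, "email": "a@example.com"},
                "after": {"id": 1, "email": "a2@example.com"}},
    {"op": "d", "before": {"id": 2, "email": "b@example.com"}, "after": None},
]

replica = {}
for e in events:
    apply_change(replica, e)
```

Order matters here: applying the same events in a different sequence could leave the replica in a different state.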


DATA GOVERNANCE

Defining rules, policies, and standards for managing and accessing data.

Implementing data governance with data catalog tools like Collibra or Alation.


DATA SECURITY

Protecting data at rest and in transit, and ensuring secure access.

Encrypting sensitive data with AES-256 while it is stored in an Amazon S3 bucket.


DATA ANONYMIZATION

Removing personally identifiable information to maintain user privacy.

Masking user IDs and phone numbers before sending them to the analytics team.
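
A minimal sketch of the masking step, assuming user IDs are replaced by a keyed hash and phone numbers keep only their last two digits. The key, field names, and masking rules are illustrative choices, not a compliance recommendation.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-managed-secret"  # hypothetical; keep out of source control

def pseudonymize_id(user_id):
    # Keyed hash: the same input always maps to the same token, so joins still
    # work, but the token cannot be recomputed without the key.
    return hmac.new(SECRET_KEY, str(user_id).encode("utf-8"), hashlib.sha256).hexdigest()[:16]

def mask_phone(phone):
    # Keep only the last two digits; replace every other digit with '*'.
    total_digits = sum(ch.isdigit() for ch in phone)
    digits_seen = 0
    out = []
    for ch in phone:
        if ch.isdigit():
            digits_seen += 1
            out.append(ch if digits_seen > total_digits - 2 else "*")
        else:
            out.append(ch)
    return "".join(out)

record = {"user_id": 1234, "phone": "+1 415-555-0137", "plan": "pro"}
safe = {**record, "user_id": pseudonymize_id(record["user_id"]), "phone": mask_phone(record["phone"])}
```

Strictly speaking, a keyed hash is pseudonymization: whoever holds the key can still link tokens back to users, so the key needs the same protection as the original IDs.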


METADATA MANAGEMENT

Storing and managing information about the data, like source, structure, and usage.

Using Hive Metastore to manage metadata for tables in an Apache Hadoop cluster.


SCALABLE STORAGE

Designing storage solutions to handle growing data volumes.

Using Amazon S3 or Azure Blob Storage for scalable storage of raw data files.


COLUMNAR STORAGE

Storing data in a columnar format to improve the performance of analytical queries.

Using the Parquet file format for efficient data processing with Apache Spark.


DATA NORMALIZATION

Organizing data to reduce redundancy and improve integrity.

Structuring customer data in a 3NF schema in an RDBMS to eliminate redundancy.


NOSQL DATABASES

Non-relational databases for storing semi-structured or unstructured data.

Using MongoDB to store JSON documents for product catalog data.


DISTRIBUTED SYSTEMS

Systems that divide tasks among multiple machines for performance and availability.

Using Hadoop Distributed File System (HDFS) for storing petabytes of data across nodes.


DATA REPLICATION

Creating redundant copies of data for failover and availability.

Using Azure Cosmos DB's geo-replication to ensure high availability across regions.


DATA BACKUP

Creating backup copies of data for disaster recovery purposes.

Scheduling PostgreSQL backups using the automated backup feature of Amazon RDS.


LOAD BALANCING

Distributing data processing workloads across nodes to avoid bottlenecks.

Using Kubernetes to distribute data processing pods in a Spark cluster.


WORKFLOW ORCHESTRATION

Managing dependencies, scheduling, and monitoring data pipeline workflows.

Using Apache Airflow to orchestrate data pipeline tasks with DAGs.
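
To illustrate what a DAG buys you, here is a sketch using the standard library's graphlib rather than Airflow: tasks declare their upstream dependencies and are run in an order that respects them. The task names and the run helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each task maps to the set of tasks it depends on (hypothetical task names).
dag = {
    "extract_sales": set(),
    "extract_customers": set(),
    "transform": {"extract_sales", "extract_customers"},
    "load": {"transform"},
    "report": {"load"},
}

def run(dag, tasks):
    """Run every task once, only after all of its upstream tasks have finished."""
    order = []
    for name in TopologicalSorter(dag).static_order():
        tasks[name]()          # a real orchestrator would also retry, log, and alert
        order.append(name)
    return order

log = []
tasks = {name: (lambda n=name: log.append(n)) for name in dag}
order = run(dag, tasks)
```

The two extract tasks have no dependency on each other, so an orchestrator is free to run them in parallel; this sketch simply runs them one after the other.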


DATA VALIDATION

Testing data for expected values and types before processing.

Writing data validation scripts in Python to verify incoming data for missing or incorrect values.
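
A minimal sketch of such a validation script. The required fields and the rule that amounts must be non-negative numbers are assumptions for the example, not rules from any particular dataset.

```python
def validate(record):
    """Return a list of problems found in one incoming record (empty if valid)."""
    problems = []
    for field in ("order_id", "amount", "currency"):   # hypothetical required fields
        if record.get(field) in (None, ""):
            problems.append(f"missing {field}")
    amount = record.get("amount")
    if amount not in (None, ""):
        if isinstance(amount, bool) or not isinstance(amount, (int, float)):
            problems.append("amount is not a number")
        elif amount < 0:
            problems.append("amount is negative")
    return problems

incoming = [
    {"order_id": 1, "amount": 25.0, "currency": "USD"},
    {"order_id": 2, "amount": -5, "currency": "USD"},
    {"order_id": None, "amount": "12", "currency": ""},
]
report = {i: validate(r) for i, r in enumerate(incoming)}
```

Returning a list of problems instead of raising on the first one makes it possible to report everything wrong with a record in a single pass.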


DATA TRANSFORMATION

Converting raw data into a usable format (e.g., aggregation, joins).

Using PySpark to join user and transaction data for downstream analysis.
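
The join-then-aggregate pattern, sketched in plain Python as a stand-in for PySpark so that it runs without a cluster. The sample users, transactions, and field names are invented for illustration.

```python
from collections import defaultdict

users = [{"user_id": 1, "country": "US"}, {"user_id": 2, "country": "IN"}]
transactions = [
    {"user_id": 1, "amount": 30.0},
    {"user_id": 2, "amount": 12.5},
    {"user_id": 1, "amount": 20.0},
    {"user_id": 3, "amount": 99.0},   # no matching user: dropped by the inner join
]

# Hash join: index the smaller side by key, then probe it with each transaction.
users_by_id = {u["user_id"]: u for u in users}
joined = [
    {**t, "country": users_by_id[t["user_id"]]["country"]}
    for t in transactions
    if t["user_id"] in users_by_id
]

# Aggregation: total spend per country.
spend_by_country = defaultdict(float)
for row in joined:
    spend_by_country[row["country"]] += row["amount"]
```

This is an inner join, so the transaction for user 3 disappears; a left join would keep it with an empty country instead.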


OLAP VS. OLTP

OLAP (online analytical processing) is used for analytics; OLTP (online transaction processing) is used for day-to-day transaction processing.

Using OLAP for querying historical sales data and OLTP for processing real-time purchase orders.


SQL QUERY OPTIMIZATION

Improving SQL queries for better performance and faster execution.

Adding indexes and rewriting SQL joins to optimize query performance in a relational database.
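
A sketch of the effect of adding an index, using SQLite because it ships with Python. The orders table and index name are made up, and EXPLAIN QUERY PLAN is SQLite-specific; other databases expose their plans differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is a human-readable
    # description of how SQLite intends to execute the query.
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT id, total FROM orders WHERE customer_id = 42"
plan_before = plan(query)   # without an index, the filter needs a full table scan
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = plan(query)    # with the index, SQLite can search instead of scan
```

Printing plan_before and plan_after side by side is a quick way to confirm that an index is actually being used before relying on it.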


SERVERLESS DATA PROCESSING

Processing data without managing underlying servers.

Using AWS Glue to run serverless ETL jobs on demand.


EVENT-DRIVEN PROCESSING

Reacting to events to trigger data processing workflows.

Using AWS Lambda to trigger data transformation based on new S3 object uploads.
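
A sketch of a handler in the style of an AWS Lambda function that reacts to S3 upload notifications. The bucket name, object key, and the transform_object helper are hypothetical, and the sample event is hand-written and trimmed to the fields the handler reads.

```python
from urllib.parse import unquote_plus

def transform_object(bucket, key):
    # Placeholder for the real transformation (hypothetical helper).
    return f"processed s3://{bucket}/{key}"

def handler(event, context):
    """Entry point in the style of an AWS Lambda handler for S3 notifications."""
    results = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        results.append(transform_object(bucket, key))
    return results

# Hand-written sample event, trimmed to the fields the handler reads.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "raw-data"}, "object": {"key": "sales/2024/day+1.csv"}}}
    ]
}
output = handler(sample_event, None)
```

Calling the handler with a sample event like this makes it easy to test the logic locally before wiring it to a real bucket.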


JOB SCHEDULING

Automating the execution of jobs at specific intervals.

Scheduling nightly ETL jobs using Azure Data Factory's pipeline scheduling feature.


DATA SCHEMA REGISTRY

Storing schema definitions for data to facilitate consistent data exchange.

Using Confluent Schema Registry for managing Avro schemas in Kafka topics.


DATA VERSIONING

Keeping track of changes to data and datasets for reproducibility.

Versioning datasets using Delta Lake to track changes to the data over time.


DATA CATALOGING

Indexing data assets to make them searchable and discoverable.

Using Azure Purview to catalog all datasets in a data warehouse environment.


DATA INGESTION

Pulling in data from different data sources for storage or processing.

Using Apache NiFi for ingesting sensor data into an HDFS data lake.


DATA LAKE STORAGE

Using data lakes for flexible data storage, especially for unstructured data.

Storing IoT sensor data in Azure Data Lake for processing later with Spark.


DATA CLEANSING

Correcting or removing corrupt, incomplete, or inaccurate data.

Removing null values and outliers from customer data using Python pandas.
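
A sketch of the two cleansing steps using only the standard library instead of pandas, so it runs without extra dependencies. The sample records and the 1.5 * IQR outlier rule are assumptions; other thresholds are equally common.

```python
import statistics

customers = [
    {"id": 1, "age": 34}, {"id": 2, "age": None}, {"id": 3, "age": 29},
    {"id": 4, "age": 41}, {"id": 5, "age": 38}, {"id": 6, "age": 27},
    {"id": 7, "age": 240},   # almost certainly a data-entry error
]

# Step 1: drop rows with a missing value.
non_null = [c for c in customers if c["age"] is not None]

# Step 2: drop outliers using the 1.5 * IQR rule (one common convention).
q1, _, q3 = statistics.quantiles([c["age"] for c in non_null], n=4)
low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
clean = [c for c in non_null if low <= c["age"] <= high]
```

Whether to drop, cap, or correct an outlier depends on the dataset; dropping is simply the easiest option to show.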


MASTER DATA MANAGEMENT

Creating a single, trusted source of key business data.

Maintaining a unified view of customer data using Informatica MDM.


CLOUD DATA SERVICES

Leveraging cloud services for data storage and processing.

Using Amazon Redshift for data warehousing and Azure Synapse for data analytics.
