
data engineering design patterns

The document discusses data engineering patterns and principles, emphasizing the importance of standardization and leveraging existing knowledge to improve engineering processes. It outlines the hierarchy of needs for data projects, including tools, architecture, and culture, while also addressing challenges in data ingestion, storage, and preparation. Additionally, it highlights the significance of adopting DataOps practices and the Last Responsible Moment principle in cloud analytics.

Uploaded by

Ravi Sankar
Copyright
© All Rights Reserved

Data Engineering

Patterns and Principles


Valdas Maksimavičius
Software Development Data Projects
Would you be confident in a self-driving car, knowing that your software is running it?
Standardize and increase the descriptive power of engineering processes by applying patterns

Or in other words: stand on the shoulders of giants and stop reinventing the wheel


Why does my brain need patterns?
● The left side of your brain is responsible for analytical thinking, science, math, etc.
● It uses known building blocks to model the surrounding world
● If you like a tabular representation of data, you will try to model everything as a table
● As an engineer, expand your tool belt by learning new patterns and new building blocks to solve business problems better.

Source: https://www.health.harvard.edu/blog/right-brainleft-brain-right-2017082512222
About me
● IT Architect at Cognizant
● Data Engineering, Data Science,
Cloud Computing, Agile teams
● Financial, Manufacturing,
Logistics, Retail industries
● Organizer of Vilnius Microsoft Data
Platform Meetup & Hack4Vilnius Hackathon
● Blogging on www.valdas.blog
Maslow’s hierarchy of needs
● Self-actualization: personal growth and fulfillment, experiencing purpose and meaning, realising all inner potentials
● Esteem needs: unique individual, self-respect, etc.
● Love and belonging needs: receive and give love, appreciation, friendship
● Safety needs: security, employment, protection against hunger and violence
● Biological and physiological needs: basic life needs such as air, food, drink, shelter, warmth, sex, sleep, etc.
Maslow’s hierarchy of needs for data projects
● Design patterns, tools & principles
● Existing team skillset: databases, programming, etc.
● Data strategy & architecture: defensive vs offensive strategy, use cases
● Enterprise architecture: buy vs build, cloud readiness
● Business drivers: business goals and objectives
● Culture: core values, way of working
Maslow’s hierarchy of needs for data projects - simplified view for today’s presentation

● Tools & principles: best practices, naming, patterns
● Data architecture: ingestion, storage, and consumption; how data is collected, stored, transformed, distributed, and consumed
● Culture: core values, way of working
Culture, way of working, values
DevOps culture
1. Foster a Collaborative Environment
2. Impose End-to-End Responsibility - you build it, you ship it
3. Encourage Continuous Improvement
4. Automate (Almost) Everything
5. Focus on the Customer’s Needs
6. Embrace Failure, and Learn From it
7. Unite Teams — and Expertise

Source: https://www.cmswire.com/information-management/7-key-principles-for-a-successful-devops-culture/
Data architecture
If you are building a data platform in the cloud, remember that a low barrier to entry overshadows complexity
Big Data cloud architecture references

Source: https://azure.microsoft.com/en-in/solutions/architecture/modern-data-warehouse/
Architecture example
[Figure: architecture diagram. Sources: digital portals, LOB, CRM, core systems, graph, image, social, IoT. Stages: INGEST (data orchestration and monitoring), STORE (big data store), PREP & TRAIN (transform, clean & train), DEPLOY & SERVE (results). Consumers: external systems, cloud, reporting.]
Data ingestion
Application integration approaches
File Transfer
Have each application produce files of shared data for others to consume, and consume files that others have produced.

Shared Database
Have the applications store the data they wish to share in a common database.

Remote Procedure Invocation
Have each application expose some of its procedures so that they can be invoked remotely, and have applications invoke those to run behavior and exchange data.

Messaging
Have each application connect to a common messaging system, and exchange data and invoke behavior using messages.
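
The Messaging approach can be illustrated with a minimal sketch. It assumes a hypothetical in-memory `MessageBus` class standing in for a real messaging system; the topic name and handlers are made up for illustration and are not from the talk.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class MessageBus:
    """Hypothetical in-memory stand-in for a shared messaging system."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        # Consumers register interest in a topic, not in a specific producer.
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> int:
        # Deliver the message to every subscriber and return the delivery count.
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)


bus = MessageBus()
received_by_crm: List[dict] = []
received_by_reporting: List[dict] = []

# Two applications consume the same event without knowing about each other.
bus.subscribe("customer.updated", received_by_crm.append)
bus.subscribe("customer.updated", received_by_reporting.append)

delivered = bus.publish("customer.updated", {"customer_id": 42, "city": "Vilnius"})
```

The producer only knows the topic, so consumers can be added or removed without changing it. A real broker would add persistence, retries, and asynchronous delivery, which this sketch leaves out.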
Ingestion challenges

● Multiple data source load and prioritization -> push vs pull strategy

● Ingested data indexing and tagging -> metadata collection is mandatory

● Data validation and cleansing -> separate business from processing logic

● Data transformation and compression -> different compression and file types
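
Three of the points above (metadata collection, validation kept separate from processing logic, and compression) can be sketched together. This is an illustration under stated assumptions: the rule names, the record fields, and the metadata keys are hypothetical, and gzip-compressed JSON is just one possible format choice.

```python
import gzip
import hashlib
import json
from datetime import datetime, timezone
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
Rule = Callable[[Record], bool]

# Business rules are declared as data, separate from the code that applies them.
RULES: Dict[str, Rule] = {
    "has_id": lambda r: r.get("id") is not None,
    "amount_non_negative": lambda r: isinstance(r.get("amount"), (int, float))
    and r["amount"] >= 0,
}


def validate(records: List[Record], rules: Dict[str, Rule]) -> Tuple[List[Record], List[Record]]:
    """Generic processing logic: split records into valid and rejected."""
    valid, rejected = [], []
    for record in records:
        failed = [name for name, rule in rules.items() if not rule(record)]
        if failed:
            rejected.append({**record, "_failed_rules": failed})
        else:
            valid.append(record)
    return valid, rejected


def ingest(source: str, records: List[Record]) -> Tuple[bytes, dict]:
    """Validate, compress, and describe one batch."""
    valid, rejected = validate(records, RULES)
    payload = gzip.compress(json.dumps(valid, sort_keys=True).encode("utf-8"))
    metadata = {
        "source": source,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "record_count": len(valid),
        "rejected_count": len(rejected),
        "format": "json",
        "compression": "gzip",
        "sha256": hashlib.sha256(payload).hexdigest(),
    }
    return payload, metadata


batch = [{"id": 1, "amount": 10.5}, {"id": None, "amount": 3}, {"id": 2, "amount": -1}]
payload, metadata = ingest("crm", batch)
```

Because the rules live in a dictionary, a new business rule is a one-line change that does not touch `validate` or `ingest`.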
Choose privacy protection patterns
● Privacy protection at the ingress
● Privacy protection at the egress

Source: https://www.valdas.blog/2019/08/06/privacy-gdpr-implementation-in-azure/
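
One possible reading of privacy protection at the ingress is to pseudonymize sensitive fields before a record is stored. The sketch below is an assumption, not the approach from the linked post: the field names, the hard-coded key, and the choice of HMAC-SHA256 are all illustrative.

```python
import hashlib
import hmac
from typing import Dict, Iterable

# Assumption: in practice the key would come from a secret store, not source code.
SECRET_KEY = b"example-key-for-illustration-only"
PII_FIELDS = ("email", "phone")  # hypothetical field names


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Replace a value with a keyed hash: stable for joins, original value not stored."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def protect_at_ingress(
    record: Dict[str, str], pii_fields: Iterable[str] = PII_FIELDS
) -> Dict[str, str]:
    """Return a copy of the record with PII fields pseudonymized before storage."""
    protected = dict(record)
    for name in pii_fields:
        if name in protected:
            protected[name] = pseudonymize(protected[name])
    return protected


raw = {"customer_id": "42", "email": "jane@example.com", "city": "Vilnius"}
stored = protect_at_ingress(raw)
```

Protection at the egress would presumably apply a similar function when data is read or exported instead, keeping the raw values in storage behind tighter access control.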
Data storage
Use cloud storage offerings instead of Hadoop
Data Warehouse vs Data Lake
Aspect | Data Warehouse | Data Lake
Requirements | Relational requirements | Diverse data, scalability, low cost
Data Value | Data of recognised high value | Candidate data of potential value
Data Processing | Mostly refined calculated data | Mostly detailed source data
Business Entities | Known entities, tracked over time | Raw material for discovering entities and facts
Data Standards | Data conforms to enterprise standards | Fidelity to original format and condition
Data Integration | Data integration upfront | Data prep on demand
Transformation | Data transformed, in principle | Data repurposed later, as needs arise
Schema Definition | Schema-on-write | Schema-on-read
Metadata Management | Metadata improvement | Metadata developed on read
Data Warehouse vs Data Lake

Source: Microsoft
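
The schema-on-write versus schema-on-read distinction can be made concrete with a small sketch. The `Warehouse` and `Lake` classes and the `SCHEMA` dictionary are hypothetical toy stand-ins, not real storage engines.

```python
import json
from typing import Dict, List

SCHEMA = {"id": int, "amount": float}  # hypothetical schema for illustration


def conform(record: dict, schema: Dict[str, type]) -> dict:
    """Cast a record to the schema; raises if a field is missing or cannot be cast."""
    return {name: cast(record[name]) for name, cast in schema.items()}


class Warehouse:
    """Schema-on-write: records are validated before they are stored."""

    def __init__(self) -> None:
        self.rows: List[dict] = []

    def write(self, record: dict) -> None:
        self.rows.append(conform(record, SCHEMA))


class Lake:
    """Schema-on-read: raw text is stored as-is; a schema is applied when reading."""

    def __init__(self) -> None:
        self.raw: List[str] = []

    def write(self, line: str) -> None:
        self.raw.append(line)

    def read(self, schema: Dict[str, type]) -> List[dict]:
        rows = []
        for line in self.raw:
            try:
                rows.append(conform(json.loads(line), schema))
            except (KeyError, TypeError, ValueError):
                continue  # skip records that do not fit this reader's schema
        return rows


lake = Lake()
lake.write('{"id": "1", "amount": "9.99", "note": "kept in raw form"}')
lake.write('{"id": "oops"}')  # accepted at write time, filtered at read time

warehouse = Warehouse()
warehouse.write({"id": "1", "amount": "9.99"})
```

The lake accepts both lines and defers the decision about structure to each reader, while the warehouse would reject the malformed record at write time.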
Data preparation & training
Offer self-service tools
[Figure: pipeline stages: Collect raw data, Curate data, Train & Score, Take Insights Into Actions. Labels: SQL, automated pipeline, self-service exploration. Modelling steps: make hypothesis, identify variables, split data, build model, validate model.]
Use on-demand resources
Serve results to end consumers
Apply domain and product thinking

● Model to describe a domain
● Unified language
● Raw or transformed datasets
● Domain team is responsible for its lifecycle, SLA
● Discoverable, addressable, trustworthy, self-describing, interoperable, secure
● Each producer is responsible for sharing data products with the organization
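
The data product properties listed above could be captured in a small descriptor and catalog. This is a sketch under stated assumptions: `DataProduct`, `Catalog`, every field name, and the `lake://` address format are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class DataProduct:
    """Hypothetical descriptor for a dataset published by a domain team."""

    name: str
    domain: str
    owner_team: str           # domain team responsible for lifecycle and SLA
    address: str              # addressable: a stable location for consumers
    schema: Dict[str, str]    # self-describing: column name -> type
    freshness_sla_hours: int  # trustworthy: a stated service level
    tags: List[str] = field(default_factory=list)


class Catalog:
    """Hypothetical registry that makes data products discoverable."""

    def __init__(self) -> None:
        self._products: Dict[str, DataProduct] = {}

    def publish(self, product: DataProduct) -> None:
        # Producers share their products; an undescribed product is rejected.
        if not product.schema:
            raise ValueError("a data product must describe its schema")
        self._products[f"{product.domain}.{product.name}"] = product

    def find(self, tag: str) -> List[DataProduct]:
        return [p for p in self._products.values() if tag in p.tags]

    def get(self, qualified_name: str) -> Optional[DataProduct]:
        return self._products.get(qualified_name)


catalog = Catalog()
catalog.publish(
    DataProduct(
        name="orders",
        domain="sales",
        owner_team="sales-data",
        address="lake://sales/orders/v1",  # made-up address format
        schema={"order_id": "string", "amount": "decimal"},
        freshness_sla_hours=24,
        tags=["orders", "finance"],
    )
)
```

Consumers look products up by tag or qualified name, for example `catalog.find("finance")` or `catalog.get("sales.orders")`, rather than asking the producing team where the data lives.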
Principles, best practices, tools
Get familiar with DataOps
Get familiar with DataOps - Examples
Delay commitments and keep important decisions open

● The principle of the Last Responsible Moment originates from Lean Software Development
● It emphasises holding off on important actions and crucial decisions for as long as possible.
Why is the Last Responsible Moment important in cloud analytics?

Expect new improvements and upgrades all the time
https://www.linkedin.com/in/valdasm/
Twitter: @VMaksimavicius
