Practical Implementation of a Data Lake: Translating Customer Expectations
into Tangible Technical Goals, 1st Edition
Nayanjyoti Paul
Practical Implementation of a Data Lake: Translating Customer
Expectations into Tangible Technical Goals
Nayanjyoti Paul
Edison, NJ, USA

ISBN-13 (pbk): 978-1-4842-9734-6 ISBN-13 (electronic): 978-1-4842-9735-3

https://doi.org/10.1007/978-1-4842-9735-3

Copyright © 2023 by Nayanjyoti Paul

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or
part of the material is concerned, specifically the rights of translation, reprinting, reuse of
illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way,
and transmission or information storage and retrieval, electronic adaptation, computer software,
or by similar or dissimilar methodology now known or hereafter developed.
Trademarked names, logos, and images may appear in this book. Rather than use a trademark
symbol with every occurrence of a trademarked name, logo, or image we use the names, logos,
and images only in an editorial fashion and to the benefit of the trademark owner, with no
intention of infringement of the trademark.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if
they are not identified as such, is not to be taken as an expression of opinion as to whether or not
they are subject to proprietary rights.
While the advice and information in this book are believed to be true and accurate at the date of
publication, neither the authors nor the editors nor the publisher can accept any legal
responsibility for any errors or omissions that may be made. The publisher makes no warranty,
express or implied, with respect to the material contained herein.
Managing Director, Apress Media LLC: Welmoed Spahr
Acquisitions Editor: Celestin Suresh John
Development Editor: James Markham
Coordinating Editor: Mark Powers
Cover designed by eStudioCalamar
Cover image by Arek Socha on Pixabay (www.pixabay.com)
Distributed to the book trade worldwide by Apress Media, LLC, 1 New York Plaza, New York, NY
10004, U.S.A. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail [email protected],
or visit www.springeronline.com. Apress Media, LLC is a California LLC and the sole member
(owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance
Inc is a Delaware corporation.
For information on translations, please e-mail [email protected]; for
reprint, paperback, or audio rights, please e-mail [email protected].
Apress titles may be purchased in bulk for academic, corporate, or promotional use. eBook
versions and licenses are also available for most titles. For more information, reference our Print
and eBook Bulk Sales web page at http://www.apress.com/bulk-sales.
Any source code or other supplementary material referenced by the author in this book is
available to readers on GitHub (https://github.com/Apress). For more detailed information,
please visit https://www.apress.com/gp/services/source-code.
Paper in this product is recyclable.
Table of Contents

About the Author
About the Technical Reviewer
Preface
Introduction

Chapter 1: Understanding “the Ask”
Objective: Asking the Right Questions
The Recommendations
Decide on the Migration Path, Modernization Techniques, Enhancements, and the Cloud Vendor
Assess the Current Challenges
Understand Why Modernizing Data Platforms Is Hard
Determine the Top Five Issues to Solve
Determine What Is Available On-Premise vs. on the Cloud
Create the Meetings Needed Throughout the Project
Define Common Terms and Jargon
Key Takeaways

Chapter 2: Enabling the Security Model
Objective: Identifying the Security Considerations
The Recommendations
PII Columns: RBAC, ABAC Features
Central Access Control
Authentication and Authorization (SAML vs. PING, etc.)
Strategy for Data Obfuscation
GDPR and Other Data Privacy
Ownership of the Platform, Interaction with Other Stakeholders (CISO, Legal Teams, etc.)
Legal/Contractual Obligations on Getting/Connecting Data from a Third Party on the Cloud
Key Takeaways

Chapter 3: Enabling the Organizational Structure
Objective: Identifying the Organizational Structure and Role
The Recommendations
Example Template for the Project
Key Takeaways

Chapter 4: The Data Lake Setup
Objective: Detailed Design of the Data Lake
The Recommendations
Structuring the Different Zones in the Data Lake
Defining the Folder Structure of the Zones with a Hierarchy
Managing Data Sensitivity as Part of the Folder Structure Design
Setting the Encryption/Data Management Keys for Organizing Data
Looking at Data Management Principles
Understanding Data Flows
Setting the Right Access Control for Each Zone
Understanding File Formats and Structures in Each Zone
Key Takeaways

Chapter 5: Production Playground
Objective: Production Playground
The Recommendations
What Is a Production Playground?
What Issues Will This Address?
What Is a Production Playground Not?
What Does the Production Playground Consist Of?
Key Takeaways

Chapter 6: Production Operationalization
Objective: Production Operationalization
The Recommendations
Key Takeaways

Chapter 7: Miscellaneous
Objective: Advice to Follow
Recommendations
Managing a Central Framework Along with Project-Specific Extensions
Allowing Project Teams to Build “User-Defined Procedures” and Contribute to the Central Framework
Advantages and Disadvantages of a Single vs. Multi-account Strategy
Creating a New Organizational Unit AWS Account vs. Onboard Teams to a Central IT Managed AWS Account
Considerations for Integrating with Schedulers
Choosing a Data Warehouse Technology
Managing Autoscaling
Managing Disaster Recovery
AWS Accounts Used for Delivery
Data Platform Cost Controls
Common Anti-patterns to Avoid
Poor Metadata Management
Key Takeaways

Index
About the Author
Nayanjyoti Paul is an associate director and
chief Azure architect for GenAI and LLM
CoE for Accenture. He is the product owner
and creator of patented assets. Presently, he
leads multiple projects as a lead architect
around generative AI, large language models,
data analytics, and machine learning. Nayan
is a certified master technology architect,
certified data scientist, and certified Databricks
champion with additional AWS and Azure
certifications. He has been a speaker at conferences like Strata Conference,
DataWorks Summit, and AWS re:Invent. He also delivers guest lectures at
universities.

About the Technical Reviewer
Arunkumar is an architect with 20+ years of
experience in the IT industry. He has worked
with a wide variety of technologies in the
data, cloud, and AI spaces. He has experience
working in a variety of industries such as
banking, telecom, healthcare, and avionics.
As a lifelong learner, he enjoys taking on new
fields of study and challenging himself to
master the necessary skills and knowledge.

Preface
This book explains how to implement a data lake strategy, covering the
technical and business challenges architects commonly face. It also
illustrates how and why client requirements should drive architectural
decisions.
Drawing upon a specific case from my own experience, I begin with
the consideration from which all subsequent decisions should flow: what
does your customer need?
I also describe the importance of identifying key stakeholders and the
key points to focus on when starting a project. Next, I take you through
the business and technical requirements-gathering process and how to
translate customer expectations into tangible technical goals.
From there, you’ll gain insight into the security model that will allow
you to establish security and legal guardrails, as well as different aspects of
security from the end user’s perspective. You’ll learn which organizational
roles need to be onboarded into the data lake, their responsibilities,
the services they need access to, and how the hierarchy of escalations
should work.
Subsequent chapters explore how to divide your data lakes into zones,
organize data for security and access, manage data sensitivity, and use
techniques for data obfuscation. Audit and logging capabilities in the
data lake are also covered before a deep dive into designing data lakes to
handle multiple file formats and access patterns. The book concludes by
focusing on production operationalization and solutions to implement a
production setup.

After completing this book, you will understand how to implement a
data lake and the best practices to employ while doing so, and you will be
armed with practical tips to solve business problems.

What You Will Learn

Specifically, by reading this book, you will

• Understand the challenges associated with
implementing a data lake

• Explore the architectural patterns and processes used
to design a new data lake

• Design and implement data lake capabilities

• Associate business requirements with technical
deliverables to drive success

Who This Book Is For

This book was written for data scientists and architects, machine learning
engineers, and software engineers.

Introduction
I landed at the airport and took an Uber to my customer’s office. I was
supposed to meet with the program manager on the customer side. After
the initial process and getting myself “checked in,” I entered the conference
room that was booked for our team. I knew most of the team from other
projects, but I was meeting a few of them for the first time. After the usual
greetings and a few of my colleagues congratulating me on my new role, I
was ready for the day to unfold.
This customer was a big organization, and there was a clear
“separation of concerns” from multiple teams. The schedule was set up,
and our first tasks were to get acquainted with the different organizational
units, identify the key stakeholders, and understand the stakeholders’
primary “asks.” It was important for my team to understand the key
organizational units and have one-on-one initial discussions. We needed
to connect with the following people and teams:

–– We needed to know the owner of this platform. This
typically means identifying who will own the data lake
as a platform from the customer’s point of view: who
will pay the bills and ultimately be the key decision-maker
for all technical and business decisions. We
identified the senior VP of engineering as the key
stakeholder. We set up a one-hour call with him to
understand his expectations and his vision of the
future-state data lake.


–– We wanted to know the team that was handling all the
data and analytics today. As the customer had an
on-premise footprint, we wanted to know the engineering
team who had been managing the entire data and
analytics platform on-premise up to now. Eventually
they would be cross-trained and become the data
engineering team in the cloud after we delivered the data lake.
Because they managed all the information about source
systems, data onboarding processes, current business
reporting needs, etc., we needed to understand the
current business processes of this team and document
them so that we could draw some parallels for what it
might take to transition those workloads and business
requirements into the cloud as part of this journey. We
invited the engineering leads to an initial one-hour call.
–– We needed to connect with the chief information
security officer (CISO) and her team. Venturing into the
cloud was new territory for my customer. Apart from the
technical questions and recommendations, we needed
to understand the business, contractual, and general
organizational obligations of what was permitted (and
what was not) from a security standpoint. We knew that
every organization has a set of predefined policies that
must be followed. Some of these guidelines come from
geography (like GDPR), some come from industry (like
HIPAA or financial data restrictions), and others may
come from data residency (like data sitting in the
customer’s own on-premise data center versus the
public cloud). Nevertheless, we needed to connect with
this team and understand what these policies meant for
this customer and what considerations we needed to
take into account when designing the platform as a whole.
We ended up setting up another one-hour call with
this team.

–– Next we set up a call with the “cloud engineering” team.
This was a new team, and they had started some
groundwork in laying out the “laws of the land,” mostly
in terms of the network, whitelisted services, getting access
to a cloud network, access and onboarding of resources
to the cloud system, etc. We wanted to be acquainted
with the current process. Also, from a delivery point of
view, this project was a shared responsibility. The key
aspects that our customer would still be “owning”
were platform management and onboarding.
Additionally, the strategies around disaster
recovery, high availability, etc., were going to be a
“shared responsibility.” Hence, it was critical for us to
work closely with the cloud engineering team, so we
scheduled a one-hour initial discussion with them.

–– Next was the DBA team. The DBA team currently
owned the databases on-premise but was also responsible
for eventually owning any databases, data marts,
and data warehouses that would be set up on the cloud
as part of this program. We set up a one-hour meeting
with them too.

–– Next was the data governance team. One of the key
reasons to move into the cloud (apart from the obvious
reasons of low cost, easy maintenance, and limitless
storage and compute capacity) was to keep track of and
audit everything that was going on. We believed in a
“governance-first” approach, and our customer
believed in that too. They wanted to keep an audit and
lineage trail of everything that would be happening on
the cloud so that the data lake (lake house) did not
become a swamp. An easy and centralized governance
process would keep “things” in the data lake well
organized. Additionally, it would introduce data
discovery and search capabilities that would become a
crucial feature for building and establishing a data
marketplace and catalog to “shop for” all the data
(products) hosted on the data lake (lake house).

–– We also connected with the “business” users who were
the key stakeholders of the system. They were supposed
to use and consume data or analytics outcomes
from the platform. We had teams like data science,
business intelligence, C-suite executives, etc., who were
waiting to be onboarded onto the platform for different
reasons and rationales. We set up independent calls
with them to understand what “success” meant
for them.
–– Lastly, we wanted to quickly connect with our partner
teams. For example, the public cloud offering was from
AWS, and we wanted to connect with the AWS leads to
understand what was currently in discussion for this
implementation. Similarly, we connected with the
Collibra team that was providing the Collibra software
as an enterprise data catalog solution. As a consulting
company, we had partnerships with both
vendors, and hence it was critical for us to be in sync
with them.


With the key stakeholders identified and meetings set up, it was time
for business. Having dedicated sessions with each key member was critical
to get “buy-in” from each of them for the platform architecture (more on
this to follow in the coming chapters).

Understanding the Requirements from Multiple Stakeholders’ Viewpoints
In general, implementing a greenfield data lake has many technical and
business challenges. The following are a few challenges that we needed to
think through:

• Establishing a clear understanding of the customer
requirements for a data lake implementation can be a
challenge because of the complexity of the area.

• It can be difficult to determine exactly what data
is required, as well as how it should be stored and
retrieved.

• It is difficult to understand the customer’s desired
outcomes and how they will use the data lake.

• It can be challenging to ensure that the data lake
is secure and conforms to industry standards and
regulations.

• Connecting the data lake with other systems can be a
challenge because of the complexity of the integration.

• It can be difficult to determine the best way to
structure the data lake, as well as how to optimize it for
performance.

• It is difficult to ensure that the data lake is designed for
scalability so that it can accommodate future growth.

• Determining the most effective way to ingest data into
the data lake can be a challenge because of the volume
and variety of data sources.

• It can be difficult to ensure that the data is of high
quality and to determine how to monitor and maintain
the data within the data lake.

• Since the customer requirements will vary from one
organization to the next, it can be difficult to have an
accurate understanding of what is needed and build a
generalized solution.

• The customer’s security and privacy requirements
can be difficult to interpret, especially if
they are not adequately documented.

• Establishing the necessary data governance
frameworks and policies can be a challenge if there
is not sufficient detail regarding the customer’s
requirements.

• The customer’s desired access and usage policies
can be difficult to discern without
an appropriate level of detail in the customer’s
requirements.

• Establishing the necessary data quality requirements
can be a challenge if the customer’s requirements are
not clearly stated.

The following diagram represents how “success” means different
things to different stakeholders. This illustration depicts an example
of what it means for this particular customer. This is to ensure that we
address and keep each of these success criteria in mind as we move
ahead and start the platform design.
[Figure: what “success” means to each stakeholder]

1.0 Business was one key stakeholder. They had challenges running business insights over longer periods of data.
2.0 The CTO was another stakeholder who wanted a modern data platform on the cloud.
3.0 IT was another stakeholder who wanted a cloud-native solution to keep a clean architecture and minimize integration issues.
4.0 Security was a key stakeholder. They wanted the “right separation” of duties, to manage the “blast radius,” and to ensure proper controls.
5.0 The CISO was another stakeholder. They wanted the right data governance via classification and role-based access control.
6.0 The data engineering team were key stakeholders. They wanted a reusable, repeatable, and “low-code” solution for the entire data plumbing.
7.0 Data scientists were key stakeholders who wanted seamless data access with the capability to perform self-service without IT.
8.0 Business analysts were key stakeholders who wanted to analyze and build reports off a “single source of truth.”
9.0 The cloud engineering team were key stakeholders. They wanted the right “guardrails,” process controls, and operations management.

If we look closely, the first stakeholders are from the business side. For
them, the objective is outcome focused. The technology is secondary for
them as long as we continue delivering high-quality business insights in a
repeatable and predictable time frame.
Second are the stakeholders from the CTO’s office. They want to design
the platform (data lake) as a future-ready solution. For them it is important
to make the right technical decisions and adopt a cloud-first approach.
They want to focus on a modern data stack that centers around cloud-
native and software-as-a-service (SaaS) offerings.
Next, the customer’s IT organization is a key stakeholder. Their focus is
to incorporate technical solutions that are easy to maintain, cloud native,
and based on the principles of keeping the integrations minimal.
Next in line as a key stakeholder is the security office team. They
want to ensure that we design a system that has the right “separation of
concerns” and has the right security guardrails so that confidential and
personally identifiable information (PII) data can be safe and secure.

Next in line is the CISO’s team for whom the data access policies, data
governance and auditability, etc., are primary concerns. They want to
ensure that the data is available only to the right resources at the right time
through role-, tag-, and attribute-based access controls.
Next in line is the data engineering team who will eventually “own”
the applications and system for maintenance. For them it is important
that the data engineering solution built on the data lake has reusability,
extensibility, and customizability, and is based on a solid programming
framework and design that will be easy to manage and use in the long run.
Next in line is the data scientist community who needs the right access
to the data and right access to the tools to convert the data into insights.
They also want “self-service” as a capability where they have the right
permissions to work on ideas that can help the business get value.
Next in line is the business analyst community who want to be
onboarded into this new data lake platform as soon as possible with access
to a “single source of truth” so that they can start building the mission-
critical application that the business is waiting for.
Finally, the cloud engineering team is a key stakeholder. This team
wants the whole platform to be secure, controlled, user friendly, reliable,
and durable.
As you might have imagined by now, I will be using my experience to
explain the end-to-end process of designing and implementing a data lake
strategy in the following chapters.
This book will (in broad strokes) cover concepts such as how to
understand and document the business asks, define the security model,
define the organization structure, design and implement the data lake
from end to end, set up a production playground, and operationalize the
data lake. Finally, I will present some lessons learned from my experience.
Chapter 1 will focus on each of these points and how each resulted in
the design of a small part of the key problem (platform design) and how
little by little things fell into place for me and my team. Let’s get started.

CHAPTER 1

Understanding “the Ask”
Objective: Asking the Right Questions
In the introduction of the book, I set the stage for the project we’ll start
discussing in this chapter. When I took up the solution architect and
delivery lead role, I had no idea what vision my customer had, other than
a very general understanding of the final product my customer was after.
The intention was to build a modern, cloud-centric data and analytics
platform (called a lake house). So, at this point, it was important for me
and my team to ask the right questions, gather the requirements in detail,
and start peeling back the layers of the onion. In short, we needed to
understand “the ask.”
The first ask (for my team and me) was to be aligned to the customer’s
vision. To understand this vision, we set up a meeting with the VP of
engineering (the platform owner) to establish the direction of the project
and the key decisions that needed to be made.

© Nayanjyoti Paul 2023
N. Paul, Practical Implementation of a Data Lake,
https://doi.org/10.1007/978-1-4842-9735-3_1

The Recommendations
I used the following checklist as part of the vision alignment, and you can
use this for your project too. Also, be open to bringing your own questions
to the meeting based on your customer’s interests and their maturity.
• What are the migration path, modernization
techniques, enhancements, and cloud vendor that will
be used?

• What are the current challenges?

• Why is modernizing data platforms hard?

• What are the top five issues that we want to solve?

• What is available on-premise and on the cloud already?

• What meetings will be needed throughout the project?

• What common terms and jargon can we define?

My team and I started the first round of discussions with the key
customer stakeholders. We then understood the requirements better and
had a better appreciation of the direction our customer wanted to go in.
Each of the seven topics listed previously is detailed in the remainder of
the chapter.

Decide on the Migration Path, Modernization Techniques,
Enhancements, and the Cloud Vendor
After the usual greetings and formal introductions, we sat down to start
documenting the vision. We understood that the requirement was to build
a cloud-native and future-proof data and analytics platform, so the
high-level objective was very clear. The data lake design was
supposed to be sponsored by the business, and they had strict timelines to
ensure we could get 25 highly important reports ready. Most of the 25
reports were to be built on business logic after bringing in data from
the systems of record, but a few of them were to be powered by machine
learning predictive models. For us, that meant that the business had very
specific “success criteria” in mind, and as long as we could deliver on the
business promise (through a technical capability), we could deliver value.
Even though the outcome was business focused, the enabler was
technology. We wanted to design the architecture “right” so that we
could have a sustainable and adaptive platform for data and analytics for
the future.
We started asking specific questions focused on whether the customer
had already chosen a cloud partner. This was critical as we wanted to be
cloud-native and leverage the capabilities each cloud vendor provided. In
this case, the customer already had decided on AWS. Questions around
whether the requirement was to modernize an existing architecture,
migrate a similar technology, or enhance an existing setup were important
for us to understand. Table 1-1 provides a quick reference for each
question we asked and why it was important.
These questions can add value to any project during the initial
understanding phase. Feel free to use Table 1-1 as a baseline for
documenting the basic premises of the offering you are planning to deliver
for your customer.

Table 1-1. Assessment Questions

Question: What cloud platform to use?

Why was the question important? Each year cloud vendors introduce new
capabilities, features, and integrations. By being aligned to a cloud
vendor’s capabilities, we can understand the “out-of-box” offerings versus
gaps for that specific vendor. This also means a correct estimation for
time and cost based on the maturity of the vendor and the capabilities
they currently offer.

What was decided? The customer’s decision to go with AWS ensured (for
example) that we could leverage its ML capabilities on SageMaker, its
centralized RBAC and TBAC policies through Lake Formation, and many
more (more on those later).

Question: Do you want to implement a lift-and-shift, modernization, or
migration solution strategy?

Why was the question important? Each of these solutions needs separate
handling and enablement from a technical point of view.
For example, lift and shift should focus on a path of least resistance to
have the same capability available in the cloud. So, an Oracle system
on-premise can be deployed as an Oracle system on the cloud.
Migration is slightly different; for example, the same Oracle system can be
migrated to a Redshift system on the cloud, leveraging native cloud
capabilities but keeping the basics intact.
However, modernization can mean replacing an on-premise system like
Oracle with a data lake or a lake house architecture where we can enable
different personas such as data engineers, analysts, the BI team, and the
data science team to leverage the data in different ways and in different
forms to get value.

What was decided? The customer was very clear that they wanted a data
lake in the cloud, which meant they were ready to open up new
possibilities, new personas, new kinds of use cases, and new opportunities
for the whole organization.


Assess the Current Challenges

Even though the vision from the customer was for a new, modern data
platform, it is always important to understand why the customer has
decided to take that initiative now, including what challenges have become
important enough that they could not sustain the existing solution. Also,
documenting their current challenges provides a great way to evaluate
“success” and measure the outcomes. The following were some of the
critical challenges that were high priority for our customer in this example:

–– The current setup was costly. The software vendors for the
commercial off-the-shelf (COTS) products were charging a
license fee based on the number of machines. As the
organization was growing, so was their user base.

–– The current setup could not scale up based on the organization’s
needs, seasonality, user personas, etc.

–– As the data volume was growing, the current trend of analytics
was very slow and restrictive. There was no option for machine
learning, predictive modeling, unstructured data analysis, etc.

–– As the organization was gearing up for the future, they had
started investing in data scientists, data analysts, etc. The
organization had started recruiting new talent, and it was
important to build a platform that helped them bring in value.

–– Time to market was essential, and a process that can provide
“self-service” capabilities and quick prototyping features can
unlock a lot of capabilities for the customer.

–– They wanted to be future ready. Peer pressure is a huge
motivation. As other organizations in the same space were
adapting to the world of cloud-native and cloud-centric
solutions, it was important for our customer to not fall behind.


Some of these points were critical for the customer, and hence we ensured
that when we designed the solution, we considered the people who would
be using the platform and what capabilities the final platform should have.

Understand Why Modernizing Data Platforms Is Hard

Along with identifying the customer’s challenges and issues that they
were currently facing, it was important to have an open discussion on the
challenges other customers have faced (in similar domains) and what we
had learned through our experiences (lessons learned). Figure 1-1 provides
a quick reference for our past experience, which we thought would help
this current customer to see a pattern and help us avoid common gotchas.

[Figure 1-1 is a process diagram. Its phases, with the steps shown for each:

Identify Current Challenges: what are the current limitations; identify
technical and business issues; identify key stakeholders and owners.

Identify High Priority Items: identify key stakeholders and owners; schedule
meetings with key stakeholders; identify and create buckets of requirements;
assign prioritization to buckets; identify long- and short-term priorities.

Identify Key Stakeholders: identify stakeholders within business, CISO,
security, etc.; identify scope to ensure the outcome has business value;
document key technology decisions and document debts; identify critical
decision paths for each stakeholder.

Time and Effort for Project Plans: decide on pilot/POC scope and high-level
roadmap; identify project structure; divide the project scope between
business and technical requirements; identify project management style.

Start Project Implementation: create actionable work items and start
assigning work; set up daily whiteboarding sessions; get sign-off on
artifacts; document key decisions; build and test; set up deployment
strategy along with DevOps.

Continuous Monitoring: realign with key stakeholders; plan for phase 2 with
re-prioritization; plan for user training.]

Figure 1-1. High-level step-by-step process of organizing the project
through different phases

Along with the pointers in Figure 1-1 on what we should focus on while
delivering an enterprise-scale data platform solution, Figure 1-2 provides
guidelines for a target-state implementation as part of an end-to-end data
platform implementation.


[Figure 1-2 is a capability diagram with four rows:

Platform & Infrastructure: design of the right number of accounts and
organizational units; storage and zones based on data sensitivity; roles,
permissions, and access (interactive vs. automation); design of an
exploration zone for analytics and data science; the right governance and
team to manage the platform and manage operations.

Data Supply Chain: data grooming and scaling of data acquisition; metadata,
data classification, and integration with catalog systems, data marketplace,
etc.; rules engine for curation, transformation, enrichment, compliance,
etc.; data quality engine for checks and validation; pipeline scaling and
automation; policy-driven pipeline configuration.

Data & ML Products: fit-for-purpose data models that support business usage;
decision of the right technology for data products; ETL for data processing;
framework around model management and collaboration of the data science
team; ML operationalization, from identifying data to building models at
scale.

Operations & Management: integration with operations management and error
management; scheduling and automation around deploying pipelines (data and
ML); governance around approvals and promotion; cleanup, purge, and
archive.]

Figure 1-2. Holistic view of end-to-end capabilities needed for a
data strategy project (from a technical view)

At the minimum, we think that an enterprise-level implementation
should focus on four things: the platform and infrastructure, the data
supply chain, the data and ML product creation, and finally the operations
management. Within each of these four verticals, there are some specific
requirements and capabilities that need to be addressed.
For example, the platform and infrastructure should focus on the right
“account strategy,” data sensitivity issues, roles, permissions, zonal designs,
governance models, etc. The data supply chain should focus on pipeline
generation, scaling of data engineering processes, metadata management,
rules and the data quality engine, etc. The data and ML product creation
should focus on ETL, fit-for-purpose solutions, optimizations, etc. Finally,
the operations management should focus on scheduling, orchestration,
etc. Figure 1-2 is just a representation, but it provides a blueprint in
ensuring that we think through all these capabilities while designing and
implementing the end-to-end enterprise data platform.


Determine the Top Five Issues to Solve

This is similar to the previous point discussed. However, the key
differentiation is the process of collecting this information. To understand
the top five issues, we interviewed more than 50 key associates
and documented the top issues they faced. We collected and collated
the answers to our questions across different organization units based
on where the key stakeholders were aligned. The result of the interview
process was a list of common issues faced across the organization.
When we looked at the report, we found many repetitive and common
challenges. Those challenges surely impacted a lot of business units and
hence were high on the priority list for us. Here are a few examples of the
common challenges:

–– The customer needed a central data hub. Different business
units had independent silos of data repositories, which were
either stale or out of sync.

–– There was no single 360-degree view of data. Every business
unit could see only their own side of data.

–– Governance was nonexistent. There was no central catalog
view of organization-wide datasets.

–– Time to market was slow because of the limitations of the
technology.

–– The cost of management and maintenance was a
fundamental issue.

What we discovered from this process was aligned to what we
expected from the previous step, but seeing a repetitive pattern gave
us the confidence that we had been following the right path so far and
documenting the right issues. Figure 1-3 gives a pictorial view of this.


Figure 1-3. A pie chart of what we saw were the driving factors for
the need to build a data lake solution

Determine What Is Available On-Premise vs. on the Cloud

It is important to understand the current situation and assess customer
maturity before undertaking any journey. So, we next worked with our key
stakeholders to understand where they currently stood relative to their
vision. This would help us offer the right help and guidance to reach
the goal.
First, we sat down with the cloud security and infrastructure team to
understand if the customer had started any journey toward AWS (their
chosen platform). Next, we wanted to understand if any guidelines,
corporate policies, and/or best practices were documented. Table 1-2
summarizes what details we got from the team. (Use this as a guide for
your project, but ensure you have the customer document these, as they will
become the rules of the game.)


Table 1-2. Maturity Assessment Questionnaire

Question: Has the organization started the journey toward the cloud in
practice, or is it still a “paper exercise”?
Maturity assessment: In this case, the customer had a well-established
cloud engineering practice. However, they had not implemented any
large-scale implementation in the cloud. It had only a few small proofs of
concept for a smaller group within the organization.

Question: Does the organization have any standard security or cloud
practices documented already?
Maturity assessment: The customer had documentation around cloud policies
and best practices. However, the customer wanted us to review them, find
gaps, and propose a better approach for the future.

Question: Who are the personas (teams) with access to the current
on-premise data warehouse the customer is hosting? Is the intention of the
customer to onboard other personas in the new data platform (when ready),
and will this imply a different set of access policies and practices?
Maturity assessment: The customer wanted the platform to be built to be
future proof and ready for other organizational units to feel secure enough
with it to onboard their analytics workload. This meant that we had to think
beyond what the current on-premise systems provided in terms of role-based,
attribute-based, and domain-based access to data and build a solution that
would provide a separation of concerns for each team who would use the
platform and onboard their data for analytics.

Question: Have the consumption patterns changed? Are there new parties and
use cases that would be adopted on the new platform?
Maturity assessment: The simple answer was yes. A major focus was to onboard
the data science teams and enable them to build cutting-edge use cases to
help do predictive insights on data rather than reactive ones. Similarly, a
new kind of data analytics and BI teams would need instant and live access
to the data to build and refresh metrics for the business to help in quick
decision-making. Those personas and their set of use cases were completely
new and unknown and would surely need a different design approach.

Question: Do you want to be provider agnostic or multicloud (from a strategy
point)?
Maturity assessment: Most customers start with the idea of setting up a
cloud-based system targeting a specific cloud provider for partnership.
However, soon clients decide to have a multicloud strategy that is provider
agnostic. These decisions do not impact the solution strategy in the short
to medium run, but they do have implications in the long run. For this
customer, they did not have any preference about this, and we were supposed
to focus on the AWS-specific solution for now.

Create the Meetings Needed Throughout the Project

Implementing a large-scale project is always challenging. Typically when
we have sprint-based programs and each sprint is 2 weeks, it is important
to think ahead and plan for the upcoming tasks. So, we wanted to
identify important meetings and get them on the calendar. This included
identifying the priority and ordering of tasks and ensuring we got calendar
time from each stakeholder so that we did not have to wait for any
important decisions from our customers.
We enabled three workstreams. I ensured we had dedicated teams for
each of the three workstreams, and each had specific responsibility areas,
as listed in Table 1-3. You can use this table to plan ahead for important
meetings with the right stakeholders.

Table 1-3. High-Level Workstreams with Their Typical Responsibilities
for a Technical Data Lake Implementation

Workstream: Business analysis and grooming
Main responsibilities:
− Identify and prioritize source systems that need to be onboarded
into the new platform.
− Identify which datasets from which sources need to be priority 1.
− For each source, “groom” the dataset based on data profile, types
of data, type of interactions, frequency of loads, and special data
handling needs (version of data versus snapshot, etc.).
− For each dataset, document basic data quality checks, common
issues, common standardization needs, and common enrichment
needs required.
− From a consumption point of view, clearly document the ask,
expected business outcome, and samples of output.
− From a consumption point of view, clearly document the business
logic for converting source datasets into the expected outcome.

Workstream: Data security
Main responsibilities:
− Work with the data security teams, CISO teams, and cloud
engineering teams, and have a common understanding of how
many AWS accounts are needed, how many environments are
needed (dev/UAT/prod), how to separate out the concerns of “blast
radius,” how to manage data encryption, how to manage PII data,
how to implement network security on data onboarding and IAM
policies, etc.
− Identify and document processes to define how to onboard a new
source system and what access and security should be in place.
− Identify and document processes to define a user onboarding process
through AD integrations, IAM policies, and roles to be applied.
− Have separate capabilities between interactive access and
automated access and have different policies, services, and
guardrails for both types.
− Understand and document life-cycle policies and practices for
data and processes.
− Understand and document a role-based matrix of who will be
getting access to this new platform and what will be their access
privileges.
− Define and document a DR strategy (hot-hot, hot-cold, cold-cold,
etc.).
− Define and document how third-party tools will be authenticated
and how they will access data within the platform (temp
credentials, SSO, etc.).
− Define and document role-based, attribute-based, domain-based,
tag-based data access, and sharing needs.
− Define and document data consumption roles and policies, etc.

Workstream: Data engineering
Main responsibilities:
− Design and document architecture for building a cloud-native
and cloud-centric data lake strategy.
− Design a framework for a reusable and repeatable data
ingestion mechanism.
− Design and document ingestion patterns and processes based
on source types, source systems interactions, frequency (batch
versus streaming, etc.), data formats, and data types.
− Design and document a framework for data cleansing, data
quality assessment, and data validation and checks in an
automated and reusable way.
− Design and document a framework for data enrichment,
data standardization, data augmentation, and data curation
in a reusable and repeatable way.
− Design and document a framework to capture the metadata
of a business, operational, and technical nature and sync up
with a catalog of choice.
− Design and document a data reconciliation and audit balance
framework for validating data loaded into the system.
− Design and document a framework for building a
data-reconciliation process for versioned datasets that might
have changing dimensions.
− Design and document a framework for building a business
outcome (ETL) process in an automated and reusable way.
− Define and coordinate with other teams to understand existing
and “to be” engineering processes for DR strategy.
− Define and coordinate with other teams to understand and
engineer processes for the data access in an automated way.
− Design and coordinate with third-party tools for data catalog, data
governance, scheduling, monitoring, etc.
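
To make the "data reconciliation and audit balance" responsibility in
Table 1-3 more concrete, here is a minimal sketch in Python. It is an
illustration under assumptions, not the framework built on this project:
the function names profile and reconcile are hypothetical, and rows are
assumed to be simple dictionaries already read from the source and the
target. The idea is to compare a row count and an order-independent
checksum on both sides of a load.

```python
import hashlib
from typing import Iterable, Mapping


def profile(rows: Iterable[Mapping[str, object]]) -> dict:
    """Summarize a dataset as a row count plus an order-independent checksum."""
    count = 0
    total = 0
    for row in rows:
        count += 1
        # Canonical form: column/value pairs sorted by column name.
        canonical = repr(sorted((k, str(v)) for k, v in row.items()))
        row_hash = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        # Adding hashes modulo 2**256 makes the result independent of row order.
        total = (total + int(row_hash, 16)) % (1 << 256)
    return {"count": count, "checksum": format(total, "064x")}


def reconcile(source_rows, target_rows) -> dict:
    """Compare source and target profiles after a load (hypothetical helper)."""
    src, tgt = profile(source_rows), profile(target_rows)
    return {
        "source": src,
        "target": tgt,
        "count_match": src["count"] == tgt["count"],
        "checksum_match": src["checksum"] == tgt["checksum"],
    }


if __name__ == "__main__":
    source = [{"id": 1, "amt": 10}, {"id": 2, "amt": 20}]
    target = [{"id": 2, "amt": 20}, {"id": 1, "amt": 10}]  # same rows, new order
    print(reconcile(source, target))
```

In a real platform the counts and checksums would more likely be computed
by the processing engine itself, and the results written to an audit table
for each load.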


Define Common Terms and Jargon

Probably the single most important activity to kick off any project is
getting everyone on the same page. I have had challenges in my previous
projects where we did not have a chance to align on the common terms and
jargon. That always led to multiple issues and challenges in technical
discussions and the architecture process throughout the project.
Here are a few examples where we wanted to align on this project:

–– A common definition of data storage zones. Examples are raw
versus curated versus provisioned, or bronze versus silver versus
gold.

–– Clear responsibility and features for the zones. Examples include
what controls these zones should have versus what kind of data
and life-cycle policies should the zones have.

–– Common definitions for tenant versus hub versus spoke.

–– Common definitions for dev versus UAT versus prod versus
sandbox versus playground.

–– ETL versus ELT with regard to the cloud platform.

–– Common philosophy of loading data on-demand versus loading
all data and processing on an ad hoc basis.

–– Common philosophy for default personas and intended access
control to data within the data lake.

This was an important alignment: as a team, we not only interacted with
the customer for the first time but also made great progress in clearly
documenting what was to be delivered in the subsequent weeks.


Key Takeaways
To recap, we met with all the key stakeholders including our sponsor
for the data strategy work. We interviewed key personnel and identified
key areas (to prioritize), and we understood the current landscape and
maturity. We devised a strategy to work on three workstreams and defined
key meetings and whiteboard sessions for the next few weeks (putting
meetings on calendars for key personnel). Last but not least, we defined
common terms and presented what our focus would be and the possible
measure of success for this project.
Based on the series of discussions, our general goals for the next steps
were as follows:

Understand the customer’s requirements: The first step is to understand
the customer’s specific requirements and goals to develop a plan to achieve
them. This includes understanding the data sources, data types, data
volume, and other factors that may affect the design of the data lake.

Design the data lake architecture: After understanding the customer’s
requirements, the next step is to design the data lake architecture.
This includes selecting the appropriate storage technology, selecting the
data ingestion and transformation tools, and designing the data flow
and data management framework.

Develop the data lake: Once the architecture is designed, the team can
start to develop the data lake. This includes setting up the necessary
infrastructure, building the data ingestion and processing pipelines, and
managing the data lake.

Test and deploy the data lake: After the data lake is developed, it needs
to be tested and deployed. This includes testing the data lake to ensure it
meets the customer’s requirements and deploying it in a production
environment.

Monitor and optimize the data lake: Once the data lake is deployed, it’s
important to monitor its performance to ensure it’s meeting the
customer’s goals.

CHAPTER 2

Enabling the Security Model

Objective: Identifying the Security Considerations
My responsibility as part of the data security workstream was to define,
design, and implement a holistic security model for the data platform.
My fundamental objective was to work closely with the customer’s
security and cloud engineering teams and with the AWS team to define a
security blueprint that could help with the customer’s platform, data, and
application security considerations.
As we had already set up the important meetings ahead of time, we
started having initial one-on-one meetings with each of the key security
stakeholders (both internal and external) to document the key decision
points (as key design decisions [KDDs]) needed for designing the security
blueprints. We eventually brought all the teams together to agree on the
common solution and socialized the outcomes.
This approach ensured we did not waste everyone’s time and ensured we
had targeted questions for specific groups and tangible outcomes designed
and approved by each group.

© Nayanjyoti Paul 2023
N. Paul, Practical Implementation of a Data Lake,
https://doi.org/10.1007/978-1-4842-9735-3_2
Chapter 2 Enabling the Security Model

The Recommendations
I used the following key design decisions to come up with a blueprint and
ensured that those KDDs addressed the needs of each stakeholder. The
objectives of the internal and external stakeholders were different. For
example, the internal teams wanted a security blueprint that focused on
a separation of concerns, the right access and security controls, and tight
integration with enterprise security principles and policies, whereas the
external stakeholders asked us to focus on cloud-native and best-of-breed
technologies and the right service model to build the solution.
The following checklist was part of the vision alignment, and you can
use this for your project too as a template. Be open to asking your own
questions based on your customer’s interest and their maturity (in other
words, use this as a starting guide).

• PII columns: RBAC, ABAC features

• Central access control

• SAML vs. PING, etc.

• Strategy for data obfuscation

• GDPR and other data privacy

• Ownership of the platform, interaction with other
stakeholders (CISO, legal teams, etc.)

• Legal/contractual obligations on getting/connecting
data from a third party on the cloud

Each of these is detailed in the remainder of the chapter.


PII Columns: RBAC, ABAC Features

As we were bringing in data from third-party sources and external vendors,
the chance of bringing in sensitive data was high. On top of that, the data
was owned by different organizational units, which raised the question: is
it OK for a group of people to have access to certain PII data that is
specific to their organizational unit but cannot be accessed by other units?
The following challenges and requirements for PII column mapping came
from the customer:

• The customer needed a single source of truth for all their
data and analytical needs. Currently the data across the
organization was siloed, which was one of the major
reasons for this customer to venture into a data lake in
the cloud. Hence, it was important for the data strategy
to have open access to the datasets. However, the PII
columns should be treated differently. The access to PII
data should be on a “need-to-know” basis.

• Each dataset needs to be tagged with a classification
level, typically ranging from Confidential to Public.
Confidential-tagged datasets have different encryption
keys, and access to those datasets was on an on-demand
basis (not open for all).

• Each column has a sensitivity level (typically L001,
L002, etc.). These sensitivity levels should govern which
columns of which datasets can be accessed by default
versus which ones need special access.

• Datasets are organized as data domains (within business
domains). Some of these datasets should be handled
with the utmost care. For example, the HR team or
finance team can access salary information, but other
organizational/business units should not have access to it.

• Special roles and access grants should be allowed
for accessing sensitive data. As this is an open data
lake platform, personas such as data scientists or
data analysts can request access to certain sensitive
information based on business case and justification.
Policies and processes should be in place to enable
users and roles access to sensitive information for a
specific duration of time.

• Access permissions should also be controlled based on
the consumption and interaction pattern. For example,
automated access to data for processing might have full
access to all columns and datasets to ensure a quick
and repeatable way of building data transformation
and ETL jobs. However, the ad hoc and interactive
access should be restricted based on the role and
persona group the person/resource belongs to.

• Third-party tools that access data from the data lake
should also respect the access control and guardrails
defined in the data lake. They should impersonate
the user who needs access to data via the third-party
tools or have SSO integration for specific service
role–based access.
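
The "need-to-know" and time-bound access requirements above can be
illustrated with a small Python model. This is only a sketch under
assumptions: the role names, the default levels per role, and the Grant
class are hypothetical and are not the customer's actual policy. In the
project itself this kind of rule was enforced by the platform, not by
application code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Grant:
    """A time-boxed exception letting a role read one sensitivity level."""
    role: str
    level: str            # column sensitivity level, e.g. "L002"
    expires_at: datetime


# Hypothetical defaults: levels each role can read without a special grant.
DEFAULT_LEVELS = {
    "data_analyst": {"L001"},                 # interactive access, restricted
    "etl_service": {"L001", "L002", "L003"},  # automated access, broader
}


def can_read(role: str, level: str, grants: list, now: datetime) -> bool:
    """Return True if the role may read a column of the given level at `now`."""
    if level in DEFAULT_LEVELS.get(role, set()):
        return True
    # Otherwise look for an unexpired special-access grant for this role.
    return any(
        g.role == role and g.level == level and g.expires_at > now
        for g in grants
    )


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    week_grant = Grant("data_analyst", "L002", now + timedelta(days=7))
    print(can_read("data_analyst", "L002", [], now))            # no grant yet
    print(can_read("data_analyst", "L002", [week_grant], now))  # grant active
```

The design choice worth noting is that the grant carries its own expiry,
so access lapses without anyone having to remember to revoke it.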

Figure 2-1 provides a glimpse of the overall process that was followed
for this customer based on the AWS stack selected for the project. The idea
was to have a data strategy design (more to follow in the next chapters)
that organized the data into Raw (or Bronze), Curated (or Silver), and
Provisioned (or Gold) zones from both the Automated (ETL jobs, etc.) and
Playground (ad hoc or interactive) access perspectives. For the interactive
access process, the access control was defined at a granular level (tag,
sensitivity, and domain level) and was based on AWS Lake Formation.
Access to any “curated” datasets had to be granted through automated
policies registered while onboarding data in the data lake through the data
pipelines. All these automated access policies were stored centrally in the
Lake Formation service of AWS, and access to the data through any service
(like Athena, Redshift Spectrum, etc.) went through the Glue catalog
managed and governed by the Lake Formation service.

Figure 2-1. A high-level view (with AWS technology stack) for a
governed data lake

We started the security and access control design by taking baby steps
and handling the architecture on a use case by use case basis. We wanted
to have a baseline architecture first and then test our hypothesis by laying
out additional use cases and validating whether our architecture could
stand up to them.


The following were the measures we took and the guidelines we
followed to build the first iterations:

1. We created a multi-account AWS strategy to

maintain the blast radius. The multi-account
strategy can be based on three levels: ingestion
accounts, processing accounts, and consumption
accounts.

2. For the previous step, ingestion accounts were

isolated by source type such as external third-party
sources, internal on-premise sources (from within
organization), and cloud based (data loaded from
other existing AWS accounts).

3. For processing accounts, we kept a centralized

account but ensured those accounts did not store
any data. This processing account can assume one
of multiple service roles and process data.

4. The consumption accounts were more flexible.

We started by dividing AWS accounts based on
interactive access or automated access. However,
soon we had to expand automated access accounts
into multiple hub versus spoke architecture as
multiple organizational units wanted to own and
manage their own “data products.” Similarly, we
had to scale up interactive access into multiple AWS
accounts because of multiple independent teams
and their needs to have a “self-service” capability for
delivering business insights.

24
Chapter 2 Enabling the Security Model

5. Once we decided on the AWS account setup, we

tried to finalize the data encryption strategies. Each
account had a bunch of AWS KMS CMK keys. We
divided the keys into tier 1 to tier 3 keys. Based on
the sensitivity of the datasets identified, we pushed
data into independent buckets that had the default
CMK keys associated with them. The service roles
had access to those keys.

6. Once the encryption strategies were in place, we

ventured into role-based, domain-based, and tag-
based access control policies. Each dataset when
being onboarded into the data lake was associated
with three tags: business domain tags (like finance,
marketing, general, etc.), data sensitivity tags
(confidential, public, etc.), and column-level PII
tags (L001, L002, etc., where L001 meant no PII, and
L002 meant it has partial or entire PII information
such as date or birth along with full name). We spent
considerable time and effort discussing these with
business and the CISO to come up with the tags.
7. Once the tags were in place, we introduced AWS
Lake Formation. AWS Lake Formation is a service
that allows a central access control and governance
platform to enforce data access policies. Typically,
Lake Formation ensures that the client applications
(like AWS Athena, etc.) authenticates itself to access
any data in S3. The authentication process grants
temporary credentials based on the user’s role.
Internally, Lake Formation then returns only those
datasets, columns, etc., that the user has “grants” for.

25
Chapter 2 Enabling the Security Model

Hence, in our example, users who are from ROLE-A

that belongs to ORGANIZATION UNIT (or business
domain) B can query only those datasets that are
tagged for ORGANIZATION B usage (or tagged for
GENERAL usage). Additionally, the ROLE-A can
view only those columns of the mentioned datasets
that are tagged with either L001, L002, or L003 based
on the tags allowed for ROLE-A.

8. Once the tag and role-based access were set up,

we wrapped up the security and access control
based on the consumption pattern. In this case,
we focused only on AWS Redshift, and hence
we defined policies for Redshift access and data
sharing through IAM roles (more on Redshift in the
upcoming chapters). Redshift was used to register
data products that were owned by independent
domain/business organizations, and we ensured
the access control follows the same philosophy as
mentioned earlier.

9. Lastly, we enabled the Playground area as a logical
extension of the production setup. We enabled
guardrails and processes to access data and services
in the playground. This was mostly for data science
interactive access. Chapter 5 talks about enabling
the data science playground.
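
The tag logic described in steps 6 and 7 can be sketched in a few lines of plain Python. This is only an illustrative model of the decision rule, not Lake Formation's API: the `Dataset` and `Role` classes, the `visible_columns` function, and the sample datasets are hypothetical names introduced here. The sketch assumes a role may query a dataset when the dataset's business domain tag matches the role's domain or is "general," and may see a column only when that column's PII tag is in the role's allowed set. The text does not describe how the data sensitivity tag is evaluated, so the sketch records it without enforcing it.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    domain: str        # business domain tag, e.g. "finance" or "general"
    sensitivity: str   # data sensitivity tag (recorded, not evaluated here)
    column_pii: dict   # column name -> PII tag, e.g. "L001" (no PII)


@dataclass
class Role:
    name: str
    domain: str            # business domain the role belongs to
    allowed_pii_tags: set  # PII tags this role is allowed to see


def visible_columns(role: Role, dataset: Dataset) -> list:
    """Return the columns of `dataset` that `role` may query.

    An empty list means the dataset is outside the role's domain.
    """
    # Domain check: same domain, or a dataset tagged for general usage.
    if dataset.domain not in (role.domain, "general"):
        return []
    # Column check: keep only columns whose PII tag is allowed for the role.
    return [
        column
        for column, pii_tag in dataset.column_pii.items()
        if pii_tag in role.allowed_pii_tags
    ]


# Hypothetical sample data for illustration only.
payroll = Dataset(
    name="payroll",
    domain="finance",
    sensitivity="confidential",
    column_pii={"employee_id": "L001", "full_name": "L002", "salary": "L001"},
)
campaigns = Dataset(
    name="campaigns",
    domain="marketing",
    sensitivity="public",
    column_pii={"campaign_id": "L001"},
)
holidays = Dataset(
    name="holidays",
    domain="general",
    sensitivity="public",
    column_pii={"holiday_date": "L001"},
)

role_a = Role(name="ROLE-A", domain="finance", allowed_pii_tags={"L001"})

print(visible_columns(role_a, payroll))    # finance dataset, PII column hidden
print(visible_columns(role_a, campaigns))  # other domain, nothing visible
print(visible_columns(role_a, holidays))   # general dataset is visible
```

In the project itself, this evaluation is performed by Lake Formation based on the registered tags and grants; the sketch is only meant to make the rule easy to reason about when designing the tag scheme.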

Figure 2-2 shows how the overall Lake Formation setup might look
(from a high level).


Figure 2-2. How a central catalog and access control can be designed
for managing role-based access for interactive users

Central Access Control

Central access control is related (at least in this example) to the setup of
the Lake Formation (centralized access control) AWS account, as depicted
in Figure 2-3. Let’s dive into what this means and why we designed it
that way in this project.


[Figure 2-3 diagram: Ingestion Accounts (#1 third-party data, #2 on-premise production data, #3 other cloud accounts) feed the Raw Data Account (#4); the Orchestration Account (#5) processes data into the Query/Curated Account (#6); the Lake Formation Central Account (#7) manages all data catalogs; the Consumption Accounts are #8 (playground for interactive access) and #9 (purpose-driven, scheduled workloads). Together these form the Production Data Platform.]

Figure 2-3. A sample multi-account strategy for access control and
separation of concerns when designing an enterprise-ready data lake

Table 2-1 explains the choices made in this project.

Table 2-1. Accounts Needed When Designing a Multi-account
Enterprise Data Lake

Account Type: Ingestion account
Account Number: Account #1
Account Purpose:
• Only connect and access third-party data.
• No access by any users to this account. Only
ingestion jobs run in this account. No data is
saved here.
• Only ingestion-specific services are enabled.

Account Type: Ingestion account
Account Number: Account #2
Account Purpose:
• Only connect and access on-premise data. Data
is saved with specific tiered encryption keys.
• No access by any users to this account. Only
ingestion jobs run in this account. No data is
saved here.
• Only ingestion-specific services are enabled.

Account Type: Ingestion account
Account Number: Account #3
Account Purpose:
• Only connect and access other cloud data. Data
is saved with specific tiered encryption keys.
• No access by any users to this account.
Only ingestion jobs run in this account. No data
is saved here.
• Only ingestion-specific services are enabled.

Account Type: Raw data account
Account Number: Account #4
Account Purpose:
• Data is saved with specific tiered
encryption keys.
• No access by any users to this account.
• No jobs run in this account; only cross-account
access is provided for accounts #1, #2, and #3
to save data into this account and account #5 to
read from this account.

Account Type: Orchestration account
Account Number: Account #5
Account Purpose:
• No data is saved into this account.
• Only scheduled jobs run to clean up, enrich,
augment, and validate data from account #4 and
save to account #6.

Account Type: Query/curated account
Account Number: Account #6
Account Purpose:
• Data is saved with specific tiered encryption keys.
• This account provides persona-based access
to data.

Account Type: Lake Formation account
Account Number: Account #7
Account Purpose:
• Central security and audit account.
• All data catalogs and tables are registered here.
• Policies for role-based, tag-based, and domain-based
access are maintained here.
• Central account to grant permissions on who can
access which tables, columns, data, etc.
• Captures central audits.

Account Type: Playground account
Account Number: Account #8
Account Purpose:
• Enables interactive users to work with data.
• Data scientists, data engineers, etc., have
access to this account, and they get cross-account
access to account #6 based on the
policies and permissions defined in account #7.

Account Type: Purpose-driven account
Account Number: Account #9
Account Purpose:
• Account where final consumption-ready
datasets reside.
• All processes running here are scheduled and
have a business reason.
• No interactive or user-based access to this
account.
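
The cross-account data flows listed in Table 2-1 can be written down as a small lookup, which is handy for reviewing a proposed permission against the intended design. The following is a minimal sketch under stated assumptions: `ACCOUNT_TYPES`, `ALLOWED_FLOWS`, and `is_flow_allowed` are hypothetical names introduced here, and the set contains only the flows the table explicitly mentions. It does not model the Lake Formation policies in account #7 that further restrict what the playground account can read.

```python
# Account numbers and types, as listed in Table 2-1.
ACCOUNT_TYPES = {
    1: "ingestion (third-party data)",
    2: "ingestion (on-premise data)",
    3: "ingestion (other cloud data)",
    4: "raw data",
    5: "orchestration",
    6: "query/curated",
    7: "Lake Formation (central)",
    8: "playground",
    9: "purpose-driven",
}

# Flows explicitly described in the table: (acting account, target account, action).
ALLOWED_FLOWS = {
    (1, 4, "write"),  # ingestion accounts save into the raw data account
    (2, 4, "write"),
    (3, 4, "write"),
    (5, 4, "read"),   # orchestration reads raw data ...
    (5, 6, "write"),  # ... and saves cleaned data to the curated account
    (8, 6, "read"),   # playground reads curated data (subject to account #7 policies)
}


def is_flow_allowed(actor: int, target: int, action: str) -> bool:
    """Return True if the table describes this cross-account flow."""
    if actor not in ACCOUNT_TYPES or target not in ACCOUNT_TYPES:
        raise ValueError("unknown account number")
    return (actor, target, action) in ALLOWED_FLOWS


print(is_flow_allowed(1, 4, "write"))  # ingestion -> raw data
print(is_flow_allowed(8, 4, "read"))   # playground -> raw data is not described
```

Keeping the design in a reviewable form like this makes it easier to spot a permission request that bypasses the intended path, for example an interactive account reading raw data that may contain PII.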


Authentication and Authorization (SAML vs.
PING, etc.)
This section is important for two reasons. First, documenting the Active
Directory (AD) integration helps us map users and roles to capabilities
within the data lake, that is, what each user can and cannot do. The other (and
more important) part is the decision of who can see what data and how the
user’s role defines what domain/column-level data they can access.
Table 2-2 lists what we discussed with our customer to understand their
current approach and what kinds of roles we needed in order to implement
the access control process.
You can use Table 2-2 as a template and create similar documentation
of authentication and authorization policies for your project scope.


From Everand
Project to Product: How to Survive and Thrive in the Age of Digital Disruption with the Flow Framework
Mik Kersten
No ratings yet
Application Design: Key Principles For Data-Intensive App Systems
From Everand
Application Design: Key Principles For Data-Intensive App Systems
Rob Botwright
No ratings yet
Everyday Data Structures
From Everand
Everyday Data Structures
William Smith
No ratings yet
Instant Nokogiri
From Everand
Instant Nokogiri
Hunter Powers
No ratings yet
Executive's Guide to Cloud Computing
From Everand
Executive's Guide to Cloud Computing
Eric A. Marks
3.5/5 (1)
AI Product Manager's Handbook: Build, integrate, scale, and optimize products to grow as an AI product manager
From Everand
AI Product Manager's Handbook: Build, integrate, scale, and optimize products to grow as an AI product manager
Irene Bratsis
No ratings yet
Super Searchers on Competitive Intelligence: The Online and Offline Secrets of Top CI Researchers
From Everand
Super Searchers on Competitive Intelligence: The Online and Offline Secrets of Top CI Researchers
Margaret Metcalf Carr
4.5/5 (4)
Blockchain for Business with Hyperledger Fabric: A complete guide to enterprise blockchain implementation using Hyperledger Fabric
From Everand
Blockchain for Business with Hyperledger Fabric: A complete guide to enterprise blockchain implementation using Hyperledger Fabric
Nakul Shah
No ratings yet
Enterprise Data Science: Smarter Decisions with Big Data
From Everand
Enterprise Data Science: Smarter Decisions with Big Data
Vidhur Gupta
No ratings yet
Hands-On Machine Learning Recommender Systems with Apache Spark
From Everand
Hands-On Machine Learning Recommender Systems with Apache Spark
Ernesto Lee
No ratings yet
Practical Full Stack Machine Learning: A Guide to Build Reliable, Reusable, and Production-Ready Full Stack ML Solutions
From Everand
Practical Full Stack Machine Learning: A Guide to Build Reliable, Reusable, and Production-Ready Full Stack ML Solutions
Alok Kumar
No ratings yet
Open-Source Odyssey: Pioneering Data Engineering with AI Automation
From Everand
Open-Source Odyssey: Pioneering Data Engineering with AI Automation
Muthukrishnan Muthusubramanian
No ratings yet
Hands-On Network Forensics: Investigate network attacks and find evidence using common network forensic tools
From Everand
Hands-On Network Forensics: Investigate network attacks and find evidence using common network forensic tools
Nipun Jaswal
No ratings yet
Designing Machine Learning Systems with Python
From Everand
Designing Machine Learning Systems with Python
David Julian
No ratings yet
Unlocking Data with Generative AI and RAG: Enhance generative AI systems by integrating internal data with large language models using RAG
From Everand
Unlocking Data with Generative AI and RAG: Enhance generative AI systems by integrating internal data with large language models using RAG
Keith Bourne
No ratings yet
Monitoring Hadoop
From Everand
Monitoring Hadoop
Gurmukh Singh
No ratings yet
Mastering Symfony
From Everand
Mastering Symfony
Sohail Salehi
No ratings yet
On Top of the Cloud: How CIOs Leverage New Technologies to Drive Change and Build Value Across the Enterprise
From Everand
On Top of the Cloud: How CIOs Leverage New Technologies to Drive Change and Build Value Across the Enterprise
Hunter Muller
No ratings yet
Big Data Forensics – Learning Hadoop Investigations
From Everand
Big Data Forensics – Learning Hadoop Investigations
Joe Sremack
No ratings yet
Mess Dinner Etiquette V1 0
No ratings yet
Mess Dinner Etiquette V1 0
12 pages
Accounting
No ratings yet
Accounting
11 pages
Bot Act 2006
No ratings yet
Bot Act 2006
36 pages
Loan Management System C .Net Source Cod
No ratings yet
Loan Management System C .Net Source Cod
4 pages
Exclusive - Federal Agents Deliver Evidence Preservation Letter To Co
No ratings yet
Exclusive - Federal Agents Deliver Evidence Preservation Letter To Co
1 page
Big Cat Rescue Ruling in Federal Court
No ratings yet
Big Cat Rescue Ruling in Federal Court
19 pages
Chaitanya Resume
No ratings yet
Chaitanya Resume
6 pages
CLASS 10 ASSIGNMENT ch- R.D. & GST
No ratings yet
CLASS 10 ASSIGNMENT ch- R.D. & GST
2 pages
Regulation and Standard in Financial Techv1
No ratings yet
Regulation and Standard in Financial Techv1
65 pages
Siemens Gearless Drive Wincc Pictures
No ratings yet
Siemens Gearless Drive Wincc Pictures
27 pages
Unit-1 - EC&M-Intr To Civil Engg & Mechanics
No ratings yet
Unit-1 - EC&M-Intr To Civil Engg & Mechanics
46 pages
Round Dial Vs Glass Cockpit
No ratings yet
Round Dial Vs Glass Cockpit
3 pages
Datron c5 Brochure
No ratings yet
Datron c5 Brochure
2 pages
Surveying Assignment - Traverse - Chapter1
No ratings yet
Surveying Assignment - Traverse - Chapter1
2 pages
Demo 20062
No ratings yet
Demo 20062
31 pages
Ray Rae V JAY-Z
No ratings yet
Ray Rae V JAY-Z
14 pages
Unit 4 by Ravi Recent Trends
100% (2)
Unit 4 by Ravi Recent Trends
43 pages
Analysis of Free and Forced Vibration of A Cracked
No ratings yet
Analysis of Free and Forced Vibration of A Cracked
9 pages
Difference Between Spatial and Temporal Data Mining
No ratings yet
Difference Between Spatial and Temporal Data Mining
5 pages
MINI PROJECT SYNOPSI Raj
No ratings yet
MINI PROJECT SYNOPSI Raj
12 pages
Sinumerik Live Programming Dynamic 5 Axis Machining Directly in SINUMERIK Operate
No ratings yet
Sinumerik Live Programming Dynamic 5 Axis Machining Directly in SINUMERIK Operate
15 pages
Engine Remote Interface Installation and Programming Guide Dde & Mbe
No ratings yet
Engine Remote Interface Installation and Programming Guide Dde & Mbe
3 pages
Data Leakage Detection
100% (1)
Data Leakage Detection
81 pages
B SHIVANGI 1077 Labour Law Research Paper
No ratings yet
B SHIVANGI 1077 Labour Law Research Paper
13 pages
Coc Application Checklist
No ratings yet
Coc Application Checklist
1 page
The 80386 Microprocessors
No ratings yet
The 80386 Microprocessors
24 pages
Calculation of An Oil/air Cooler
No ratings yet
Calculation of An Oil/air Cooler
1 page
R16AMR ReleaseHighlights
No ratings yet
R16AMR ReleaseHighlights
27 pages
AFP V RTC
No ratings yet
AFP V RTC
1 page
Project Cost Course Guide Book
No ratings yet
Project Cost Course Guide Book
4 pages



ISBN-13 (pbk): 978-1-4842-9734-6 ISBN-13 (electronic): 978-1-4842-9735-3

Copyright © 2023 by Nayanjyoti Paul

About the Technical Reviewer

Chapter 1: Understanding “the Ask”

Chapter 2: Enabling the Security Model

Authentication and Authorization (SAML vs. PING, etc.)

Chapter 3: Enabling the Organizational Structure

Chapter 4: The Data Lake Setup

Chapter 5: Production Playground

Chapter 6: Production Operationalization

AWS Accounts Used for Delivery

After completing this book, you will understand how to implement a

What You Will Learn

• Understand the challenges associated with

• Explore the architectural patterns and processes used

• Design and implement data lake capabilities

• Associate business requirements with technical

Who This Book Is For

–– We needed to know the owner of this platform. This

–– We wanted to know the team that was handling all the

this customer and what considerations we needed to

–– Next we set up a call with the “cloud engineering” team.

–– Next was the DBA team. The DBA team currently

–– Next was the data governance team. One of the key

“governance-first” approach, and our customer

–– We also connected with the “business” users who were

Understanding the Requirements

• Establishing a clear understanding of the customer

• It can be difficult to determine exactly what data

• It is difficult to understand the customer’s desired

• Connecting the data lake with other systems can be a

• It can be difficult to determine the best way to

• It is difficult to ensure that the data lake is designed for

• Determining the most effective way to ingest data into

• It can be difficult to ensure that the data is of high

• Since the customer requirements will vary from one

• Understanding the customer’s security and privacy

• Establishing the necessary data governance

• Establishing the necessary data quality requirements

The following diagram represents how “success” means different

[Diagram: steps numbered 1.0 through 7.0; surviving labels: “Data Scientists were key”, “Data Engineering team were”]


• What are the current challenges?

• Why is modernizing data platforms hard?

• What are the top five issues that we want to solve?

• What is available on-premise and on the cloud already?

• What meetings will be needed throughout the project?

• What common terms and jargon can we define?

Decide on the Migration Path, Modernization

supposed to be sponsored by the business, and they had strict timelines to

Table 1-1. Assessment Questions

Questions | Why Was the Question Important? | What Was Decided?

Assess the Current Challenges

–– The current setup could not scale up based on the

–– As the data volume was growing, the current trend of analytics

–– As the organization was gearing up for the future, they had

–– Time to market was essential, and a process that can provide

–– They wanted to be future ready. Peer pressure is a huge

Understand Why Modernizing Data Platforms

[Diagram: surviving labels: “Identify key”, “Identify and create”, “Identify stakeholders”, “Identify scope to”, “Document key”, “Identify critical decision”, “Divide the project”]


Determine the Top Five Issues to Solve

Determine What Is Available On-Premise vs. on

Create the Meetings Needed Throughout

Business − Identify and prioritize source systems that need to be onboarded

Data − Design and document architecture for building a cloud-native

Define Common Terms and Jargon

PII Columns: RBAC, ABAC Features

Central Access Control