Practical
Implementation
of a Data Lake
Translating Customer Expectations
into Tangible Technical Goals
—
Nayanjyoti Paul
Practical Implementation of a Data Lake: Translating Customer
Expectations into Tangible Technical Goals
Nayanjyoti Paul
Edison, NJ, USA
Table of Contents
Preface
Introduction
Chapter 7: Miscellaneous
  Objective: Advice to Follow
  Recommendations
    Managing a Central Framework Along with Project-Specific Extensions
    Allowing Project Teams to Build “User-Defined Procedures” and
      Contribute to the Central Framework
    Advantages and Disadvantages of a Single vs. Multi-account Strategy
    Creating a New Organizational Unit AWS Account vs. Onboard Teams
      to a Central IT Managed AWS Account
    Considerations for Integrating with Schedulers
    Choosing a Data Warehouse Technology
    Managing Autoscaling
    Managing Disaster Recovery
Index
About the Author
Nayanjyoti Paul is an associate director and
chief Azure architect for GenAI and LLM
CoE for Accenture. He is the product owner
and creator of patented assets. Presently, he
leads multiple projects as a lead architect
around generative AI, large language models,
data analytics, and machine learning. Nayan
is a certified master technology architect,
certified data scientist, and certified Databricks
champion with additional AWS and Azure
certifications. He has been a speaker at conferences like Strata Conference,
DataWorks Summit, and AWS re:Invent. He also delivers guest lectures at
universities.
About the Technical Reviewer
Arunkumar is an architect with 20+ years of
experience in the IT industry. He has worked
with a wide variety of technologies in the
data, cloud, and AI spaces. He has experience
working in a variety of industries such as
banking, telecom, healthcare, and avionics.
As a lifelong learner, he enjoys taking on new
fields of study and challenging himself to
master the necessary skills and knowledge.
Preface
This book explains how to implement a data lake strategy, covering the
technical and business challenges architects commonly face. It also
illustrates how and why client requirements should drive architectural
decisions.
Drawing upon a specific case from my own experience, I begin with
the consideration from which all subsequent decisions should flow: what
does your customer need?
I also describe the importance of identifying key stakeholders and the
key points to focus on when starting a project. Next, I take you through
the business and technical requirements-gathering process and how to
translate customer expectations into tangible technical goals.
From there, you’ll gain insight into the security model that will allow
you to establish security and legal guardrails, as well as different aspects of
security from the end user’s perspective. You’ll learn which organizational
roles need to be onboarded into the data lake, their responsibilities,
the services they need access to, and how the hierarchy of escalations
should work.
Subsequent chapters explore how to divide your data lakes into zones,
organize data for security and access, manage data sensitivity, and use
techniques for data obfuscation. Audit and logging capabilities in the
data lake are also covered before a deep dive into designing data lakes to
handle multiple file formats and access patterns. The book concludes by
focusing on production operationalization and solutions to implement a
production setup.
Introduction
I landed at the airport and took an Uber to my customer’s office. I was
supposed to meet with the program manager on the customer side. After
the initial process and getting myself “checked in,” I entered the conference
room that was booked for our team. I knew most of the team from other
projects, but I was meeting a few of them for the first time. After the usual
greetings and a few of my colleagues congratulating me on my new role, I
was ready for the day to unfold.
This customer was a big organization, and there was a clear
“separation of concerns” from multiple teams. The schedule was set up,
and our first tasks were to get acquainted with the different organizational
units, identify the key stakeholders, and understand the stakeholders’
primary “asks.” It was important for my team to understand the key
organizational units and have one-on-one initial discussions. We needed
to connect with the following people and teams:
With the key stakeholders identified and meetings set up, it was time
for business. Having dedicated sessions with each key member was critical
to get “buy-in” from each of them for the platform architecture (more on
this to follow in the coming chapters).
If we look closely, the first stakeholders are from the business side. For
them, the objective is outcome focused. The technology is secondary for
them as long as we continue delivering high-quality business insights in a
repeatable and predictable time frame.
Second are the stakeholders from the CTO’s office. They want to design
the platform (data lake) as a future-ready solution. For them it is important
to make the right technical decisions and adopt a cloud-first approach.
They want to focus on a modern data stack that centers around cloud-
native and software-as-a-service (SaaS) offerings.
Next, the customer’s IT organization is a key stakeholder. Their focus is
to incorporate technical solutions that are easy to maintain, cloud native,
and based on the principles of keeping the integrations minimal.
Next in line as a key stakeholder is the security office team. They
want to ensure that we design a system that has the right “separation of
concerns” and has the right security guardrails so that confidential and
personally identifiable information (PII) remain safe and secure.
Next in line is the CISO’s team for whom the data access policies, data
governance and auditability, etc., are primary concerns. They want to
ensure that the data is available only to the right resources at the right time
through role-, tag-, and attribute-based access controls.
Next in line is the data engineering team who will eventually “own”
the applications and system for maintenance. For them it is important
that the data engineering solution built on the data lake has reusability,
extensibility, and customizability, and is based on a solid programming
framework and design that will be easy to manage and use in the long run.
Next in line is the data scientist community who needs the right access
to the data and right access to the tools to convert the data into insights.
They also want “self-service” as a capability where they have the right
permissions to work on ideas that can help the business get value.
Next in line is the business analyst community who want to be
onboarded into this new data lake platform as soon as possible with access
to a “single source of truth” so that they can start building the mission-
critical application that the business is waiting for.
Finally, the cloud engineering team is a key stakeholder. This team
wants the whole platform to be secure, controlled, user friendly, reliable,
and durable.
As you might have imagined by now, I will be using my experience to
explain the end-to-end process of designing and implementing a data lake
strategy in the following chapters.
This book will (in broad strokes) cover concepts such as how to
understand and document the business asks, define the security model,
define the organization structure, design and implement the data lake
from end to end, set up a production playground, and operationalize the
data lake. Finally, I will present some lessons learned from my experience.
Chapter 1 will focus on each of these points and how each resulted in
the design of a small part of the key problem (platform design) and how
little by little things fell into place for me and my team. Let’s get started.
CHAPTER 1
Understanding “the Ask”
Objective: Asking the Right Questions
In the introduction of the book, I set the stage for the project we’ll start
discussing in this chapter. When I took up the solution architect and
delivery lead role, I had no idea what vision my customer had, other than
a very general understanding of the final product my customer was after.
The intention was to build a modern, cloud-centric data and analytics
platform (called a lake house). So, at this point, it was important for me
and my team to ask the right questions, gather the requirements in detail,
and start peeling back the layers of the onion. In short, we needed to
understand “the ask.”
The first ask (for my team and me) was to be aligned to the customer’s
vision. To understand this vision, we set up a meeting with the VP of
engineering (the platform owner) to establish the direction of the project
and the key decisions that needed to be made.
The Recommendations
I used the following checklist as part of the vision alignment, and you can
use this for your project too. Also, be open to bringing your own questions
to the meeting based on your customer’s interests and their maturity.
• What are the migration path, modernization
techniques, enhancements, and cloud vendor that will
be used?
My team and I started the first round of discussions with the key
customer stakeholders. We then understood the requirements better and
had a better appreciation of the direction our customer wanted to go in.
Each of the seven topics listed previously is detailed in the remainder of
the chapter.
What cloud platform to use?

Each year cloud vendors introduce new capabilities, features, and
integrations. By being aligned to a cloud vendor’s capabilities, we can
understand the “out-of-box” offerings versus gaps for that specific vendor.
This also means a correct estimation of time and cost based on the maturity
of the vendor and the capabilities they currently offer.

The customer’s decision to go with AWS ensured (for example) that we could
leverage its ML capabilities on SageMaker, its centralized RBAC and TBAC
policies through Lake Formation, and many more (more on those later).
Do you want to implement a lift-and-shift, modernization, or migration
solution strategy?

Each of these solutions needs separate handling and enablement from a
technical point of view.

For example, lift and shift should focus on a path of least resistance to
have the same capability available in the cloud. So, an Oracle system
on-premise can be deployed as an Oracle system on the cloud.

Migration is slightly different; for example, the same Oracle system can be
migrated to a Redshift system on the cloud leveraging native cloud
capabilities but keeping the basics intact.

However, modernization can mean replacing an on-premise system like Oracle
with a data lake or a lake house architecture where we can enable different
personas such as data engineers, analysts, BI team, and the data science
team to leverage the data in different ways and with different forms to get
value.

The customer was very clear that they wanted a data lake in the cloud, which
meant they were ready to open up new possibilities, new personas, new kinds
of use cases, and new opportunities for the whole organization.
–– The current setup was costly. The software vendors for the
commercial off-the-shelf (COTS) products were charging a
license fee based on the number of machines. As the
organization was growing, so was their user base.
Some of these points were critical for the customer, and hence we ensured
that when we designed the solution, we considered the people who would
be using the platform and what capabilities the final platform should have.
[Figure: Identify Current Challenges. Labels: What are the current
limitations; Identify technical and business issues; Identify key
stakeholders and owners]
Along with the Figure 1-1 pointers on what we should focus on while
delivering an enterprise-scale data platform solution, Figure 1-2 provides
guidelines for a target-state implementation as part of an end-to-end data
platform implementation.
[Figure 1-2. Recoverable labels: Framework around ETL for data processing;
Fit-for-purpose data models that support business usage; Decision of right
technology for data products; ML operationalization: model management from
identifying data to building models at scale and collaboration of the data
science team; Data & ML products]
Figure 1-3. A pie chart of what we saw were the driving factors for
the need to build a data lake solution
Have the consumption patterns changed? Are there new parties and use cases
that would be adopted on the new platform?

The simple answer was yes. A major focus was to onboard the data science
teams and enable them to build cutting-edge use cases that deliver
predictive rather than reactive insights on data. Similarly, new kinds of
data analytics and BI teams would need instant and live access to the data
to build and refresh metrics for the business to help in quick
decision-making. Those personas and their set of use cases were completely
new and unknown and would surely need a different design approach.

Do you want to be provider agnostic or multicloud (from a strategy point)?

Most customers start with the idea of setting up a cloud-based system
targeting a specific cloud provider for partnership. However, clients soon
decide to have a multicloud strategy that is provider agnostic. These
decisions do not impact the solution strategy in the short to medium run,
but they do have implications in the long run. This customer did not have
any preference about this, and we were supposed to focus on the
AWS-specific solution for now.
identifying the priority and ordering of tasks and ensuring we got calendar
time from each stakeholder so that we did not have to wait for any
important decisions from our customers.
We enabled three workstreams. I ensured we had dedicated teams for
each of the three workstreams, and each had specific responsibility areas,
as listed in Table 1-3. You can use this table to plan ahead for important
meetings with the right stakeholders.
Data security
− Work with the data security teams, CISO teams, and cloud
engineering teams, and have a common understanding of how
many AWS accounts are needed, how many environments are
needed (dev/UAT/prod), how to separate out the concerns of “blast
radius,” how to manage data encryption, how to manage PII data,
how to implement network security on data onboarding and IAM
policies, etc.
− Identify and document processes to define how to onboard a new
source system and what access and security should be in place.
− Identify and document processes to define a user onboarding process
through AD integrations, IAM policies, and roles to be applied.
− Have separate capabilities between interactive access and
automated access and have different policies, services, and
guardrails for both types.
− Understand and document life-cycle policies and practices for
data and processes.
− Understand and document a role-based matrix of who will be
getting access to this new platform and what will be their access
privileges.
− Define and document a DR strategy (hot-hot, hot-cold, cold-cold,
etc.).
− Define and document how third-party tools will be authenticated
and how they will access data within the platform (temp
credentials, SSO, etc.).
− Define and document role-based, attribute-based, domain-based,
tag-based data access, and sharing needs.
− Define and document data consumption roles and policies, etc.
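The role-based matrix called for in the list above can be drafted as plain data well before any IAM policy exists, which makes it easy to review with the security and CISO teams. The following is a minimal sketch, not the customer's actual model: the role names, privilege labels, and the `is_allowed` helper are hypothetical, and the zone names borrow the Raw/Curated/Provisioned terminology used later in the book.

```python
# Hypothetical role-to-zone privilege matrix, for illustration only.
# Roles and privileges are assumptions; zone names follow the
# Raw / Curated / Provisioned layout described in Chapter 2.
ACCESS_MATRIX = {
    "data_engineer": {
        "raw": {"read", "write"},
        "curated": {"read", "write"},
        "provisioned": {"read"},
    },
    "data_scientist": {
        "curated": {"read"},
        "provisioned": {"read"},
    },
    "business_analyst": {
        "provisioned": {"read"},
    },
}


def is_allowed(role: str, zone: str, action: str) -> bool:
    """Return True only if the matrix explicitly grants the action.

    Unknown roles or zones fall through to an empty set, so the
    default is deny.
    """
    return action in ACCESS_MATRIX.get(role, {}).get(zone, set())


if __name__ == "__main__":
    for role in ACCESS_MATRIX:
        print(role, "can read raw:", is_allowed(role, "raw", "read"))
```

A matrix like this can later be translated into IAM roles and policies; keeping it as reviewable data first keeps the onboarding discussion independent of any particular service.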
Key Takeaways
To recap, we met with all the key stakeholders including our sponsor
for the data strategy work. We interviewed key personnel and identified
key areas (to prioritize), and we understood the current landscape and
maturity. We devised a strategy to work on three workstreams and defined
key meetings and whiteboard sessions for the next few weeks (putting
meetings on calendars for key personnel). Last but not least, we defined
common terms and presented what our focus would be and the possible
measure of success for this project.
Based on the series of discussions, our goals for the next steps were
as follows:
Test and deploy the data lake: After the data lake is
developed, it needs to be tested and deployed. This
includes testing the data lake to ensure it meets
the customer’s requirements and deploying it in a
production environment.
CHAPTER 2
Enabling the Security Model
The Recommendations
I used the following key design decisions (KDDs) to come up with a blueprint
and ensured that those KDDs addressed the needs of each stakeholder. The
objectives of the internal and external stakeholders were different. For
example, the internal teams wanted a security blueprint that focused on
a separation of concerns, the right access and security controls, and tight
integration with enterprise security principles and policies, whereas the
external stakeholders asked us to focus on cloud-native and best-of-breed
technologies and the right service model to build the solution.
The following checklist was part of the vision alignment, and you can
use this for your project too as a template. Be open to asking your own
questions based on your customer’s interest and their maturity (in other
words, use this as a starting guide).
Figure 2-1 provides a glimpse of the overall process that was followed
for this customer based on the AWS stack selected for the project. The idea
was to have a data strategy design (more to follow in the next chapters)
of organizing the structure of data into Raw (or Bronze), Curated (or
Silver), and Provisioned (or Gold) for the Automated (ETL jobs, etc.) and
Playground (ad hoc or interactive) access perspective. For the interactive
access process, the access control was defined at a granular level (tag,
sensitivity, and domain level) and was based on AWS Lake Formation
We started the security and access control design by taking baby steps
and handling the architecture on a use-case-by-use-case basis. We wanted
to have a baseline architecture first and then test our hypothesis by laying
out additional use cases and validating whether our architecture could
withstand them.
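The granular (tag, sensitivity, and domain level) access control described above can be illustrated without calling any AWS API. The sketch below is a toy model in plain Python, not Lake Formation itself: the `Table` and `Grant` types, the sensitivity ordering, and the `can_query` helper are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

# Assumed ordering from least to most sensitive; a real project would
# agree on these labels with the security office.
SENSITIVITY_ORDER = ["public", "confidential", "pii"]


@dataclass(frozen=True)
class Table:
    """A catalog entry tagged with a business domain and a sensitivity."""
    name: str
    domain: str
    sensitivity: str


@dataclass(frozen=True)
class Grant:
    """What an interactive user may see: allowed domains and a ceiling."""
    domains: frozenset
    max_sensitivity: str


def can_query(grant: Grant, table: Table) -> bool:
    """Allow access only when both the domain tag and the sensitivity tag match."""
    if table.domain not in grant.domains:
        return False
    table_level = SENSITIVITY_ORDER.index(table.sensitivity)
    ceiling = SENSITIVITY_ORDER.index(grant.max_sensitivity)
    return table_level <= ceiling


if __name__ == "__main__":
    analyst = Grant(domains=frozenset({"sales"}), max_sensitivity="confidential")
    orders = Table(name="orders", domain="sales", sensitivity="confidential")
    print("analyst can query orders:", can_query(analyst, orders))
```

Walking each new use case through a check like this is one way to test whether the baseline design holds up before committing it to actual policies.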
Figure 2-2 shows how the overall Lake Formation setup might look
(from a high level).
Figure 2-2. How a central catalog and access control can be designed
for managing role-based access for interactive users
[Figure: AWS account layout, with flow steps numbered 1.0 through 5.0.
Recoverable labels:
ACCOUNT #1: Ingestion account for 3rd-party data
ACCOUNT #2: Ingestion account for on-premise production data
ACCOUNT #3: Ingestion account for other cloud accounts
ACCOUNT #4: Account to only save data in RAW format (that will contain PII
and other sensitive data)
ACCOUNT #5: Account to process data into a common format through
enrichment, augmentation, data quality, validation, etc. No data is saved
here; only automated processes run here
ACCOUNT #6: Query/Curated account where data is clean, enriched, and
converted to a “single version of truth”
ACCOUNT #7: Lake Formation central (master account) that manages all data
catalogs globally
ACCOUNT #8: Playground account for interactive access
ACCOUNT #9: Purpose-driven account for scheduled workloads to build
business outcomes
Group labels: Consumption Accounts; Curated Account; Lake Formation Central
Account; Orchestration Account]