
15. How to Implement a Data Lake for Large Enterprises?

In this chapter we focus on implementing a Data Lake (DL) in the cloud and on the role a DL plays alongside a pre-existing Data Warehouse (DW) in supporting business decisions. In general, a data lake does not replace existing data warehouse applications; rather, the industry needs to modernize its data platform architecture to sustain and support growing consumer needs. Before examining each segment of the data lake, we will briefly review the basics of the data warehouse and its benefits.

15.1. What is a Data Warehouse?

A data warehouse (DW) is a centralized data storage system in which large volumes of information can be stored and analyzed to derive insights from data. Data in large enterprises comes from various sources (see Figure 15.1 (a)) such as transaction processing systems, master data applications, communication systems, customer interactions, and third-party systems. In recent years there has been a growing need to organize and archive this data for later analytical purposes.

Figure 15.1 (a)


Data is added to the data warehouse from various applications, including high-performance transactional systems that handle hundreds to millions of transactions on a regular cadence. Processing the data within those same systems becomes expensive and time consuming, and the heterogeneous forms of the data sources limit an organization's ability to make better decisions. A DW is often referred to as a processed data layer, where the business knows exactly what data is consumed and stored in the system. The use case is identified before the data is added to the system, the data model is designed before data moves into the warehouse storage layer, and key performance indicators are identified.

15.1.1. Roles of the Data Warehouse in Industries

In industries such as banking and financial services, healthcare, retail, e-commerce, agriculture, hospitality, and quick-service restaurants, the data warehouse plays a major role: it curates complex data, organizes data from various sources, enables a systematic approach to decision making, and provides a durable and reliable platform for processing large volumes of data in batch mode.

The major role of a data warehouse is to integrate corporate data sources and provide users with rich information to operationalize and improve business standards from the generated data. The DW is also the primary component for persisting the Sources of Record (SORs) of the various business modules in an organization, and it provides frameworks to store massive volumes of data efficiently.

 Business Intelligence tools such as SAP BOBJ, Tableau, QlikSense, and others use the data warehouse as the main source for presenting valuable insights. This makes it easy for business teams and data analysts to see a holistic view of their business performance and progress in a single place. The key performance indicators delivered to industry experts are clean, easy to access, accurate, and reliable data points, which makes impactful decisions easier to take.

 Data quality (DQ) is an integral part of a data warehouse; it lets users apply rules and pre-processing techniques to cleanse data before it is stored in the DW. DQ captures the accepted, rejected, and erroneous records being inserted into the DW, and the data team corrects the rejected records and data errors before they are added to the DW storage layer. The DQ process also helps teams understand data from multiple sources and analyze it to determine the final form of the data stored in the DW; this process is called data profiling. The data team identifies inconsistent data formats or layouts, such as valid/invalid values, date formats, address verification, and so on (a minimal rule sketch follows this list).

 Data Integration is an important capability that lets DW teams integrate similar data from various entities. It standardizes the data to form a meaningful, consistent reference for fields that are used differently across multiple systems in the same organization. It acts as the data movement layer from various systems into the DW while applying the standardization and cleansing techniques.
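To make the DQ idea concrete, here is a minimal sketch of the kind of accept/reject rule set described above, written in Python with pandas. The column names (order_id, order_date, country) and the reference list are hypothetical, not part of any specific product.

    import pandas as pd

    VALID_COUNTRIES = {"US", "GB", "IN"}  # hypothetical reference list

    def profile_and_validate(df: pd.DataFrame):
        """Split a batch into accepted and rejected rows using simple DQ rules."""
        # Rule 1: the key must be present and unique within the batch.
        ok = df["order_id"].notna() & ~df["order_id"].duplicated(keep="first")
        # Rule 2: the date must actually parse as a date.
        ok &= pd.to_datetime(df["order_date"], errors="coerce").notna()
        # Rule 3: the value must come from the agreed reference list.
        ok &= df["country"].isin(VALID_COUNTRIES)
        # Accepted rows flow into the DW; rejected rows go back to the data team.
        return df[ok], df[~ok]

In a real pipeline the rejected frame would be logged and reworked before re-insertion, exactly as the DQ process above describes.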

15.2. What is a Data Lake?

Over the last decade, data has no longer been limited to the structured data the business knows well. In fact, roughly 80% of the data that exists today was generated in less than a decade, and it is very important for every large organization to store, analyze, and make decisions on this unfamiliar data.

A data lake (DL) is a centralized repository for storing structured and unstructured data at petabyte scale or more. It allows users to store data raw, without first defining metadata or schema. A DL provides a unified way of gathering known and unknown data, and it enables users to run analytics, build dashboards, run parallel computation frameworks on big data, and perform real-time analytics on massive datasets.

Figure 15.2 (a)

15.3. Why do we need a Data Lake?

Organizations that have invested heavily in their data platform require secure, highly scalable, cost-efficient, and fault-tolerant solutions to ingest, store, and analyze massive datasets and obtain the best business value from their data. Enterprises that have implemented a data lake have outperformed peers without a data lake in their data strategy by over 10% in organic revenue growth. The new era of analytics leverages machine learning heavily over new data sources stored in the data lake, such as log files, click-streams, social media, and IoT device feeds.

Figure 15.3 (a)

Early prediction of business demand, Customer 360 analytics, behavioral analytics, and trend analysis are some popular use cases opened up by incorporating data lake solutions in large organizations. Empowering the data engineering team to design a cost-effective and standardized data layer improves solution delivery by 40% compared to legacy storage and data warehouse strategies. In the cloud, a data lake takes advantage of virtually unlimited storage elasticity and pay-per-use pricing, which lets the business build solutions instantly, whereas in legacy architectures extending resources and licenses can take months to fulfil the data engineering team's hardware and software procurement needs. A centralized data repository in the cloud also lets the security practice control and protect data with far more ease than the traditional approach. Adopting a shared server/storage model not only reduces the cost of implementing the data lake but also enables security to be tightly locked down per the organization's security policies. Data lakes in the cloud provide seamless integrations with many existing business applications and products, which makes it easy to connect and continue using the pre-existing tools in place.

15.4. Overview of the Data Lake in the Cloud

A data lake in the cloud is a game-changing, cost-effective, and scalable solution that is easy to start with and can provision any organization with a high-grade solution well suited to most existing data platforms. Building a data lake with the mindset of storing data without a predefined purpose opens up more ideas for the business, while still keeping the data organized for later use. A cloud data lake also brings the benefits of robust services such as big data compute applications, machine learning services, and a massively distributed storage layer that can hold petabytes of data and trillions of objects.

The term data lake evokes streams of water flowing from various places into a single lake, where the mixed water is stored in bulk and drawn on when needed. Likewise, a data lake integrates multiple data sources into a single standardized layer that stores structured and unstructured data formats so the data can be analyzed when needed. There are various options for moving data into and out of the data lake, and the variety of data sources and downstream applications in the cloud makes this easier to implement than in an on-premises architecture. Modern data architecture pairs fast compute frameworks such as Apache Spark and Hadoop with massively distributed file storage systems like HDFS. Distributed data storage enables data locality: computing where the data resides accelerates processing and leaves more room for in-memory computation techniques. Traditional data warehouse applications spend heavily on data transfers between the storage and processing layers, since data must be fetched to the system that performs the computations. A distributed storage and processing framework avoids this, and a data lake built on top of such a high-performing, optimized storage architecture is smooth to use and produces quick results for business users.
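As a brief illustration of the data locality and in-memory points above, the PySpark sketch below reads a dataset from HDFS and caches it so that repeated queries are served from executor memory. The path and column name are hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("locality-demo").getOrCreate()

    # Spark schedules tasks on the nodes holding each HDFS block (data locality),
    # and cache() keeps the parsed rows in executor memory for repeated queries.
    events = spark.read.parquet("hdfs:///lake/events/")  # hypothetical path
    events.cache()

    print(events.count())                              # first pass reads HDFS and fills the cache
    print(events.filter("status = 'error'").count())   # second pass is served from memory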

Cloud services are fully decoupled in a way that lets an organization choose services according to its needs; whichever services help achieve the desired results can be productionized in a matter of days. In recent years another advantage of building a data lake in the cloud is the evolution of hybrid and multi-cloud enablement, which lets an organization combine services from different cloud providers. Recent surveys show more organizations moving toward cloud and hybrid architectures to minimize procurement and maintenance overheads. The cloud also provides serverless capabilities for data lakes in the form of Code as a Service or Function as a Service, and auto scale-up/out and scale-down/in features let data teams increase or decrease capacity on the go without committing resources up front.

15.5. Key Considerations for Data Lake Architecture

Building a data lake is a pivotal step in shifting an organization toward modern data storage capabilities. During this process you need the right team, with solid experience in the forms of data your business handles and the skill set to craft the platform of your choice. Some expectations should also be set about what the new solution will be asked to do:


 Expect the data to come in many forms.
 The data will not be as clean as before.
 Advanced analytics may need more than one approach to a problem.
 Building quick solutions and encountering failures is normal.

Identify the applications you need to migrate to the new architecture and prioritize them according to your demands. The initial data lake you build should be simple enough to verify that the framework covers all your data aspects: add a basic data store feature and apply security and governance principles to the infrastructure, with an ingestion framework that handles structured and unstructured data and secures it in storage. Data protection at scale is a major consideration, since the volume, variety, and velocity of data will be greater than ever before. The approach to data cleaning, processing, aggregation, and redundancy reduction is another area to select carefully.

Advanced analytical tools and a machine learning workbench are essential elements when building new data lake solutions. Data lineage and metadata management should be made available so users can easily search for the data points stored in the data lake. The Source of Record for each data object needs to be identified, to ensure incoming data follows the agreed standard and that any changes, use cases, or conversions applied to the data sources are notified. Data security should be configured rigorously, with Single Sign-On and Multi-Factor Authentication (MFA) components enabled.


Figure 15.5 (a)

15.6. Phases of Data Lake Implementation

Implementing a data lake in a large organization requires multi-phase execution, and it is highly critical to whiteboard the end-to-end solution. Building on the key considerations above, we will look deeper into each segment. Before diving into the phases of the data lake, this section explains the components and design of a data lake in the cloud using Amazon Web Services, Google Cloud, and Azure.

15.6.1. Data Lake Architecture on Amazon Web Services

Amazon Simple Storage Service (S3) is an object store for storing and retrieving any volume of data from anywhere. AWS S3 offers a scalable, secure, durable, and highly available storage solution. S3 supports lifecycle policies, which let users define and select various pricing options based on storage and access requirements.
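As an illustration, the following boto3 sketch defines a lifecycle rule that tiers raw objects to cheaper storage classes as they age. The bucket name, prefix, and transition days are assumptions, not prescriptions.

    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket="my-data-lake",  # hypothetical bucket
        LifecycleConfiguration={
            "Rules": [{
                "ID": "tier-raw-data",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                # Move rarely accessed raw files to cheaper classes over time.
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
            }]
        },
    )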


Figure 15.6 (a) AWS Data Lake Architecture

AWS Lambda allows users to write code as functions and deploy it (Figure 15.6 (a) – Ingestion) without worrying about servers or infrastructure. Users pay only for the compute time consumed.
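A minimal sketch of such an ingestion function, assuming it is wired to S3 ObjectCreated events and that the curated/ prefix is our own naming convention:

    import boto3

    s3 = boto3.client("s3")

    def handler(event, context):
        """Triggered by S3 ObjectCreated events; copies raw files into a curated prefix."""
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            # Stage the new object for downstream processing.
            s3.copy_object(
                Bucket=bucket,
                CopySource={"Bucket": bucket, "Key": key},
                Key=f"curated/{key}",
            )
        return {"status": "ok"}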


AWS Elastic Compute Cloud (EC2) is a cloud service that offers compute instances based on the user's requirements. EC2's simple web interface helps users select and configure instances at their scale and spin them up within a few minutes.

AWS Elastic MapReduce (EMR) is a cloud big data platform (Figure 15.6 (a) – Compute Layer) that enables users to run and scale Apache Spark, Hive, HBase, and more. It also offers highly available clusters and auto-scaling policies that make the data platform more stable.
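For example, a transient EMR cluster running a single Spark step can be requested from code. This boto3 sketch is illustrative only; the cluster name, release label, instance sizes, and script location are assumptions.

    import boto3

    emr = boto3.client("emr")
    emr.run_job_flow(
        Name="datalake-batch",                     # hypothetical cluster name
        ReleaseLabel="emr-6.10.0",                 # assumed release
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceCount": 3,
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "KeepJobFlowAliveWhenNoSteps": False,  # terminate when the step finishes
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
        Steps=[{
            "Name": "daily-aggregation",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://my-data-lake/jobs/aggregate.py"],  # hypothetical script
            },
        }],
    )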

15.6.2. Data Lake Architecture on Google Cloud Platform

Cloud Storage is a unified, scalable, and highly durable object store for developers and enterprises. It allows users to store media, files, and application data.

Cloud Dataproc is a managed Spark and Hadoop service that allows users to perform batch processing, querying, streaming, and machine learning. Dataproc automation (see Figure 15.6 (b)) helps users create clusters quickly, manage them easily, and shut instances down when they are not in use.

BigQuery is a serverless, highly scalable, and cost-effective cloud data warehouse that can analyze petabytes of data using ANSI SQL. It works particularly well for real-time and predictive analytics.
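A small example of querying lake data with the BigQuery Python client; the project, dataset, and table names here are invented for illustration.

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # hypothetical project
    query = """
        SELECT device_id, AVG(temperature) AS avg_temp
        FROM `my-project.lake.iot_readings`  -- hypothetical table
        GROUP BY device_id
    """
    # The query runs serverlessly; we just iterate over the result rows.
    for row in client.query(query).result():
        print(row.device_id, row.avg_temp)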

Cloud Dataflow is a fully managed, unified streaming and batch data processing engine. It is serverless, with automatic provisioning and management of resources, and it is highly reliable and fault-tolerant by nature.


Figure 15.6 (b) Google Cloud Data Lake Architecture

Cloud Bigtable is a fully managed NoSQL database for large analytical and processing workloads. Organized data lake formats often require such a NoSQL store for personalization, digital content, and Internet of Things applications.

Cloud Datalab is mainly used for exploratory data analysis on Google Cloud, performing machine learning and transformations in languages such as Python and SQL from Jupyter notebooks.

Cloud Functions lets your code be deployed on the Google platform and executed when needed. Users pay as they use the resources, with no server procurement or management.


15.6.3. Azure Cloud Data Lake

The Azure data lake architecture below integrates heterogeneous sources such as click-stream data, sensor data, traditional data sources such as databases, and event-based real-time data pipelines. Azure supports Data Lake Storage (Figure 15.6 (c)) with the power of HDInsight as the processing framework, which extends the use of Spark and its core services.

Figure 15.6 (c) Azure Data Lake Architecture

15.7. What to load into your data lake?

Organizations claim to use a data lake approach to load and analyze data and content that would not go into a traditional data warehouse, such as web server logs, sensor logs, social media content, IoT feeds, or image files and associated metadata. Data lake analytics can therefore encompass any historical data or content from which you may be able to derive business insights. But a data lake can also play a key role in harvesting conventional structured data: data that you offload from your data warehouse in order to control costs and improve the warehouse's performance.

Another key strategy is offloading the traditional data warehouse into the data lake, with a data pipeline that moves the data using any standard Extract-Transform-Load (ETL) interface. ETL frameworks support data movement for full loads, change data capture, and slowly changing dimensions. Incremental loads are popular for large, growing datasets that are transactional in nature, as the sketch below illustrates.
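A minimal watermark-based incremental extract might look like the following. SQLite stands in for the source system, and the orders table and updated_at column are hypothetical.

    import sqlite3

    def incremental_extract(db_path: str, last_watermark: str):
        """Pull only rows changed since the previous run, then advance the watermark."""
        conn = sqlite3.connect(db_path)
        conn.row_factory = sqlite3.Row
        rows = conn.execute(
            "SELECT * FROM orders WHERE updated_at > ?",  # hypothetical table/column
            (last_watermark,),
        ).fetchall()
        conn.close()
        # The newest change we saw becomes the next run's starting point.
        new_watermark = max((row["updated_at"] for row in rows), default=last_watermark)
        return [dict(row) for row in rows], new_watermark

Each run then lands only the changed rows in the lake, rather than re-copying the full table.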

15.8. A Cloud Data Lake Journey

This section focuses on cloud technologies and the top providers in the market, as seen in Gartner's cloud infrastructure rankings: Amazon Web Services, Google Cloud, and Microsoft Azure. We discuss each phase along with the service offerings from the different providers.

15.8.1. Cloud Infrastructures

Building a data lake in the cloud brings many advantages; chiefly, fully managed services let an organization focus on its data needs rather than on maintaining physical hardware and licensing. Below are the important benefits of using cloud solutions for your data lake:

 Storage capacity: In the cloud you can start with a few small files, and the elasticity lets your data lake grow to exabyte size. This helps your organization focus on data strategy without worrying about storage servers.

 Cost efficiency: Cloud providers have various options for storing and processing your data applications, with pricing models such as pay-per-use, fixed standard pricing, and long-term commitments that can yield roughly 60% to 75% cost savings. Most providers offer multiple storage classes and pricing options, which enables companies to pay only for what they need instead of planning for an assumed cost and capacity, as is required when building a data lake on-premises.

 Central repository: A centralized location for all object stores and data access means the setup is the same for every team in the organization. This improves efficiency and frees engineers to focus on more critical items.

 Data security: All companies have a responsibility to protect their data; because data lakes are designed to store all types of data, including sensitive information like financial records and customer details, security becomes even more important. Cloud providers secure the data as defined by the shared responsibility model.

 Auto-scaling: Modern cloud services are designed to provide immediate scaling, so businesses don't have to worry about expanding capacity when necessary or paying for hardware they don't need. Auto-scaling can be horizontal (scale out/in) or vertical (scale up/down), based on business needs.

15.8.2. Data Lake Storage

In this section we look at the options available for data storage. Data collected from various sources comes in many kinds and types; most modern data applications have heterogeneous sources and varied veracity.

Data movement from an on-premises data warehouse into a cloud data lake takes different forms, such as lift and shift, database migration, and processed loads, depending on the application's needs and the business's priorities. The following data sources are common across cloud data lakes, with only the ingestion services and tools differing: databases; files (CSV, XLS, PDF, and logs); IoT device feeds; and application data. We will see various ways to capture data into the data lake using the top cloud providers and open-source engines.

 Google Cloud Platform - Storage

With Cloud Storage you can start with a few small files and grow your data lake to exabytes in size. Cloud Storage (see Figure 15.8 (a)) supports high-volume ingestion of new data and high-volume consumption of stored data in combination with other services such as Pub/Sub. Cloud Storage also promises 99.999999999% annual durability. Google Cloud provides various ingest options: Pub/Sub ingests real-time or near-real-time data into Google Cloud; Storage Transfer Service moves data from online sources or from on-premises locations such as a data centre into the cloud seamlessly and quickly; and the gsutil command-line tool is an option for one-time or scheduled file transfers into Cloud Storage.
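For scheduled or one-off programmatic loads, the Cloud Storage Python client can be used instead of gsutil. This sketch assumes a bucket named my-data-lake and an arbitrary object path; both are placeholders.

    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("my-data-lake")           # hypothetical bucket
    blob = bucket.blob("raw/orders/2024-01-01.csv")  # hypothetical object path
    blob.upload_from_filename("orders.csv")          # the local file lands in the lake as-is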

Figure 15.8 (a) GCP Storage


 Amazon Web Services

AWS S3 acts as the primary drop location for data lake solutions (see Figure 15.8 (b)); once a file is placed into a bucket (a folder in the cloud) by an ETL engine, there are various ways to process it. S3 provides 99.999999999% durability and 99.99% availability of objects over a given year, with effectively unlimited storage, so customers need not worry about growing data storage needs. Once data is placed inside an S3 bucket, it can trigger subsequent actions based on the type of data ingested, as the sketch below shows. Migrating a database can be done with the Database Migration Service (DMS), which helps migrate data quickly and securely with no downtime for the existing databases. DMS supports homogeneous migrations such as Oracle to Oracle or SQL Server to SQL Server, as well as heterogeneous migrations such as Oracle or Microsoft SQL Server to Amazon Aurora.
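Wiring an S3 bucket so that new objects trigger a downstream action (here, a Lambda function, as in Section 15.6.1) can be sketched as follows; the bucket name, function ARN, and prefix are assumptions.

    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_notification_configuration(
        Bucket="my-data-lake",  # hypothetical bucket
        NotificationConfiguration={
            "LambdaFunctionConfigurations": [{
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:ingest",  # hypothetical
                "Events": ["s3:ObjectCreated:*"],
                # Only objects under the raw/ prefix fire the trigger.
                "Filter": {"Key": {"FilterRules": [{"Name": "prefix", "Value": "raw/"}]}},
            }]
        },
    )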

Figure 15.8 (b) AWS Storage

 Microsoft Azure

Azure Storage is Microsoft's cloud storage solution: a massively scalable object store. It comes with various data services such as Azure Blobs, Files, Queues, Tables, and Disks (see Figure 15.8 (c)). Azure's Copy Data service offers data ingestion from 70+ data sources, on-premises or in the cloud. An easy, graphical-interface-driven ingestion process allows users to select thousands of tables and databases, and it automates the data pipeline instances based on the options the user has selected.
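Programmatic ingestion into Blob storage is equally simple. This sketch uses the azure-storage-blob Python client; the connection string, container, and blob name are placeholders.

    from azure.storage.blob import BlobServiceClient

    conn_str = "<storage-account-connection-string>"  # placeholder credential
    service = BlobServiceClient.from_connection_string(conn_str)
    container = service.get_container_client("raw")   # hypothetical container
    with open("orders.csv", "rb") as data:
        # The file lands in the lake unchanged, ready for later processing.
        container.upload_blob(name="orders/2024-01-01.csv", data=data)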

Figure 15.8 (c) Azure Storage

15.8.3. Data Transformation

Building a modern data platform requires flexible and efficient transformation tools, since the data lake brings the ability to store raw data with no oversight of its contents. In the traditional data warehouse we saw a high need for intermediate storage or databases such as data marts, whereas in a data lake there should be no excessive use of databases or pre-processing methods. The data lake architecture decouples this complexity and reduces cost by enabling stateful operations in memory, supporting all kinds of complex transformations and aggregations without any database. The schema-on-read approach is also formally referred to as Extract-Load-Transform (ELT), and it applies mainly to data lake platforms. Accessing data with no schema is a major challenge when selecting an ETL tool, and data lakes are typically used as repositories for raw data in structured or semi-structured formats.
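The PySpark sketch below illustrates the ELT pattern described above: raw JSON already sits in the lake, the schema is inferred on read, and the transformation runs in memory without any intermediate database. The paths and field names are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("elt-clickstream").getOrCreate()

    # Extract-Load already happened: raw JSON was dropped into the lake as-is.
    # Transform at read time: Spark infers the schema (schema-on-read).
    raw = spark.read.json("s3://my-data-lake/raw/clickstream/")  # hypothetical path

    daily = (
        raw.withColumn("event_date", F.to_date("event_ts"))     # hypothetical field
           .groupBy("event_date", "page")
           .count()
    )
    daily.write.mode("overwrite").parquet("s3://my-data-lake/curated/clickstream_daily/")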

15.8.4. Data Security

Organizations stepping into cloud and data platform solutions tend to build strong data governance and security strategies. In the current cloud industry, every provider focuses heavily on security layers, since most of the cost-effective and preferred cloud solutions run on shared hardware infrastructure. Below are some standards followed across the cloud data storage providers in the industry; a minimal encryption sketch follows the list.

 Cloud-native key management services
 Customer-owned private/public key management
 Encryption-based key management services
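As one concrete example of the first standard, AWS supports server-side encryption with a KMS-managed key at object-write time; the bucket and key alias below are hypothetical.

    import boto3

    s3 = boto3.client("s3")
    with open("customers.csv", "rb") as body:
        s3.put_object(
            Bucket="my-data-lake",             # hypothetical bucket
            Key="raw/pii/customers.csv",
            Body=body,
            ServerSideEncryption="aws:kms",    # encrypt with a KMS-managed key
            SSEKMSKeyId="alias/datalake-key",  # hypothetical customer-managed key alias
        )

The other providers offer equivalent mechanisms, such as Cloud KMS on Google Cloud and Key Vault on Azure.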

Conclusion

There are various options on the market for building a data lake solution, many of which can be operational in a matter of hours. Serverless and fully managed solution providers lead customer engagements with high availability and secure platform integrations.
