Chapter 15
In this chapter we will focus on the implementation of the Data Lake (DL) in the cloud and on how it helps organizations take decisions. In general, a data lake is not a replacement for existing data warehouse applications, but there is a high need in the industry for modernizing the data platform architecture to sustain and stabilize growing consumer needs. Before we examine each segment of the data lake, we will swing by the basics of the data warehouse and its benefits.
A data warehouse (DW) is a centralized data storage system where a large volume of information can be stored and analyzed to draw more insights from data. Data in large enterprises comes from various sources (see Figure 15.1 (a)) such as transactional processing systems, master data applications, communication systems, customer interactions, and third-party systems. In recent years there has been a growing need to organize and archive them at a regular cadence. Processing the data in the same source systems becomes very expensive and time consuming, and the heterogeneous forms of data sources limit organizations from making better decisions. A DW is often referred to as a processed data layer, where the business knows exactly what data is consumed and stored in the system. A use case is identified before the data is added to the system, the data model is designed before data moves into the data warehouse storage layer, and key performance indicators are identified.
For industries such as Banking and Financial Services and Healthcare, the data warehouse plays a major role in curating complex data, organizing data from various sources, and enabling a systematic approach to decision making, and it is durable and reliable for processing large volumes of data. The major role of the data warehouse is to integrate the corporate data sources to provide users with rich information to operationalize and improve business standards from the generated data. The DW is also the primary component in persisting the Source of Records.
Business Intelligence tools such as SAP BOBJ, Tableau, QlikSense, and others use the data warehouse application as the main source for presenting valuable insights. Because the key performance indicators delivered to industry experts are clean, easy to access, accurate, and reliable, business teams and data analysts find it easy to take impactful decisions.
Data quality (DQ) is an integral part of the data warehouse; it helps users apply rules and pre-processing techniques to cleanse the data before it gets stored in the DW system. DQ captures the accepted, rejected, and erroneous records being inserted into the DW, and the data team works on the rejected and erroneous records to correct them before they are added to the DW storage layer. The DQ process helps to understand the data from multiple sources and analyze it to determine the final form of the data stored in the DW; this process is called Data Profiling. The data team identifies inconsistent data formats or layouts, such as valid versus invalid data, across entities. This helps standardize the data to form a meaningful and consistent reference for fields that are used differently in multiple systems within the same organization. DQ also acts as a data movement tool from various systems into the DW while applying all the standardization and cleansing techniques.
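To make the accept/reject flow concrete, here is a minimal Python sketch of a rule-based quality gate; the field names (customer_id, email, amount) and the rules themselves are hypothetical, not drawn from any specific DQ product.

```python
import re

# Hypothetical DQ rules: each returns True when the record passes.
RULES = {
    "customer_id is present": lambda r: bool(r.get("customer_id")),
    "email looks valid": lambda r: re.match(r"[^@]+@[^@]+\.[^@]+", r.get("email", "")) is not None,
    "amount is non-negative": lambda r: r.get("amount", 0) >= 0,
}

def quality_gate(records):
    """Split incoming records into accepted and rejected sets,
    noting which rule each rejected record violated."""
    accepted, rejected = [], []
    for record in records:
        failures = [name for name, rule in RULES.items() if not rule(record)]
        if failures:
            rejected.append({"record": record, "failed_rules": failures})
        else:
            accepted.append(record)
    return accepted, rejected

# Only the first record would reach the DW storage layer; the second
# goes back to the data team with its failure reasons attached.
accepted, rejected = quality_gate([
    {"customer_id": "C1", "email": "a@b.com", "amount": 10.0},
    {"customer_id": "", "email": "not-an-email", "amount": -5},
])
```

In practice such rules would be driven by the profiling results rather than hard-coded, but the accepted/rejected split and the failure bookkeeping are the essence of the DQ gate described above.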
In the last decade, the nature of data has no longer been just the structured data that is well known to the business. In fact, about 80% of the data we have today was generated in less than a decade, and it is very important for every large organization to store, analyze, and make decisions on this unknown data.
A data lake (DL) is a centralized repository for storing structured and unstructured data at the scale of a petabyte or more. It allows users to store data in raw form without first defining metadata or schema. A DL provides a unified way of gathering known and unknown data, and it enables users to run analytics, build dashboards, and run computations in parallel.
Organizations that have invested heavily in their data platform require secure, highly scalable, cost-efficient, and fault-tolerant solutions to ingest, store, and analyze massive datasets and achieve the best business value from their data. Enterprises that have implemented a data lake outperform their peers by over 10% in organic revenue growth when compared with others that do not have a data lake in their data strategy. The new era of analytics heavily leverages machine learning over new data sources such as log files, click-streams, social media, and IoT devices stored in the data lake. Early prediction of business demand, Customer 360 analytics, behavioral analytics, and trend analysis are some popular use cases opened up by incorporating data lake solutions in large organizations. Empowering the data engineering team to design a cost-effective and standardized data layer helps to improve solution delivery by 40% when compared to legacy storage and data warehouse strategies.
In the cloud, the data lake takes advantage of endless storage elasticity and pay-per-use pricing, which helps the business build solutions instantly, whereas in legacy architectures the extension of resources and licenses takes months to fulfil the hardware and software procurement needs of the data engineering team. A centralized data repository in the cloud helps the security practice control and protect the data with much more ease than the traditional approach. Adoption of a shared server/storage model not only reduces the cost of implementing the data lake but also enables security to be tightly locked down as per the organization's security policies. Data lakes in the cloud provide seamless integrations with many existing business applications and products, which makes it easy to connect and continue existing workflows, and they enable any organization to start quickly and provision high-grade, well-suited solutions for most existing data platforms. Building a data lake with the awareness that data can be stored without a predefined purpose brings more ideas to the business, while still keeping the data organized for later usage. A data lake in the cloud comes with the many benefits of robust services such as big data compute applications, machine learning services, and a massively distributed storage layer that stores petabytes of data.
In general terms, a data lake is likened to streams of water flowing from various parts and finally collecting in a lake, which holds water of mixed properties that can be stored in mass and leveraged when needed. Likewise, the data lake provides users with the integration of multiple data sources into a unique, standardized layer that stores structured and unstructured data formats, where the data can be analysed when needed. Data movement in and out of the data lake has various options, and in the cloud the variety of data sources and downstream applications makes integration straightforward. Data lake architecture comes with super-fast compute frameworks such as Apache Spark/Hadoop and massively distributed file storage systems like HDFS. The concept of distributed data storage enables data locality: running the computation where the data resides accelerates processing and leaves more room for in-memory processing, and it reduces data transfers between the storage and processing layers, since otherwise the data has to be fetched, to some extent, to the system that performs the computations. Distributed storage and processing frameworks make this easier, and data lakes built on top of such well-performing, optimized storage architecture are smooth to use.
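As a minimal illustration of this compute-close-to-storage model, the PySpark sketch below reads Parquet files from a distributed store and aggregates them in parallel; the paths and column names are placeholders, not from any specific deployment.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Spark schedules tasks on the nodes that hold the data blocks,
# so most of the processing happens where the data resides.
spark = SparkSession.builder.appName("data-lake-aggregation").getOrCreate()

# Hypothetical lake path; could equally be s3a://, gs://, or abfss://.
events = spark.read.parquet("hdfs:///lake/raw/events/")

# The aggregation runs in parallel across partitions, largely in memory.
daily_counts = (
    events.groupBy(F.to_date("event_time").alias("event_date"))
          .agg(F.count("*").alias("events"))
)
daily_counts.write.mode("overwrite").parquet("hdfs:///lake/curated/daily_counts/")
```

Only the small aggregated result moves between nodes; the raw events are processed where they are stored, which is the locality benefit described above.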
Cloud services are fully decoupled in a way that enables an organization to choose services according to its needs. If a service helps the team achieve its results, it is easy to productionize the solution in a matter of days. In recent years, another advantage of building a data lake in the cloud has been the evolution of hybrid and multi-cloud enablement, which allows an organization to choose services from different cloud providers. According to the latest surveys, more organizations are moving towards cloud and hybrid architectures to minimize procurement and maintenance overheads. The cloud also provides various serverless capabilities for data lakes in the form of Code as a Service or Function as a Service. Auto scale-up/out and scale-down/in features help data teams increase or decrease usage on the go without committing to the required resources up front.
Building a data lake is a pivotal process of shifting the data platform towards modern data storage capabilities. During this process, it is important to have the right team, with good experience in digesting the forms of data your business handles and the right skill set to craft the platform of your choice. Some expectations should be understood up front; for example, advanced analytics might need more than one approach to a problem. Identify the applications that you need to focus on migrating to the new architecture, and prioritize them according to your demands. The initial data lake you build should be simple enough to verify that the framework covers all your data aspects by adding just a basic data store feature and enabling the security and governance principles on the infrastructure. Build an ingestion framework to handle structured and unstructured data and secure them in the storage. Data protection at scale is a major element to consider, since the volume, variety, and velocity are going to be greater than ever before. Selecting the right techniques for data cleaning, processing, aggregation, and redundancy reduction is equally important.
Advanced analytical tools and a machine learning workbench are very essential elements while building new data lake solutions. Data lineage and metadata management should be made available to users so they can easily search for the data points stored in the data lake. The Source of Record for each data object needs to be identified to make sure the data that comes in follows certain standards, and stakeholders should be notified of any changes, use cases, or conversions applied to the data sources. Data security should be configured in an advanced way, with a Single Sign-On feature and Multi-Factor Authentication (MFA) components enabled.
Implementing the data lake in a large organization needs multi-phase execution, and it is highly critical to whiteboard the end-to-end solution. As discussed in the key considerations section, we will focus deeper on each segment. Before diving into the phases of the data lake, this section explains the components and design of a data lake in the cloud using Amazon Web Services, Google Cloud, and Microsoft Azure.
Amazon Simple Storage Service (S3) is an object storage service to store and retrieve any volume of data from anywhere. AWS S3 offers users scalable, secure, durable, and highly available storage solutions. S3 has lifecycle policies that help users define and select various pricing options based on storage and access requirements.
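As a sketch of how this looks in practice, the boto3 snippet below drops an object into a landing bucket and attaches a lifecycle rule that moves aged data to cheaper storage classes; the bucket, key, and day thresholds are all placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Land a raw file in the lake's drop bucket (names are hypothetical).
s3.upload_file("daily_extract.csv", "my-data-lake-raw", "landing/daily_extract.csv")

# Lifecycle rule: move objects to infrequent-access storage after 30 days,
# then to Glacier after 365 days, to control storage cost over time.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake-raw",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-down-raw-data",
            "Status": "Enabled",
            "Filter": {"Prefix": "landing/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 365, "StorageClass": "GLACIER"},
            ],
        }]
    },
)
```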
AWS Lambda allows users to write code as functions and deploy them to AWS Lambda (Figure 15.6 (a) – Ingestion) without worrying about servers and infrastructure.
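A minimal sketch of such a function, assuming it is wired to S3 object-created notifications; the processing step is a placeholder for whatever the ingestion pipeline needs to do next.

```python
import json
import urllib.parse

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    """Invoked by S3 object-created notifications; one event may
    carry several records."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        obj = s3.get_object(Bucket=bucket, Key=key)
        body = obj["Body"].read()
        # Placeholder: validate/transform the payload, then hand it to
        # the next stage of the ingestion pipeline.
        print(json.dumps({"bucket": bucket, "key": key, "bytes": len(body)}))
    return {"status": "ok"}
```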
AWS Elastic Compute Cloud (EC2) is a cloud service that offers compute instances based on users' requirements. EC2's simple web interface helps users select and configure instances for their scale, and an instance spins up within a few minutes.
AWS Elastic MapReduce (EMR) is a cloud big data platform (Figure 15.6 (a) – Compute Layer) that enables users to run and scale Apache Spark, Hive, HBase, and so on. It also offers highly available clusters and auto-scaling policies to make the data platform more stable.
Cloud Storage is a unified, scalable, and highly durable object storage for developers and enterprises. It allows users to store media, files, and application data.
Cloud Dataproc is a managed Spark and Hadoop service that allows users to perform batch processing, querying, streaming, and machine learning. Dataproc (see Figure 15.6 (b)) automation helps users create clusters quickly, manage them easily, and turn them off when they are not needed.
BigQuery is a serverless, highly scalable, and cost-effective cloud data warehouse that can analyze petabytes of data using an ANSI SQL model, with strong results for real-time analytics.
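For illustration, a short standard-SQL query issued through the BigQuery Python client; the project, dataset, and table names are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses application-default credentials

# ANSI SQL over a hypothetical lake table.
query = """
    SELECT event_date, COUNT(*) AS events
    FROM `my_project.my_lake.events`
    GROUP BY event_date
    ORDER BY event_date
"""
for row in client.query(query).result():
    print(row.event_date, row.events)
```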
Cloud Dataflow is a fully managed, unified streaming and batch data processing engine. It is serverless, providing automatic provisioning and management of the resources.
Cloud Bigtable is a fully managed NoSQL database for large analytical and processing workloads. Organized data lake formats often require such a NoSQL store for personalization and similar workloads.
Cloud Datalab is a tool mainly used for exploratory data analysis on Google Cloud, to perform machine learning and transformations using languages such as Python and SQL.
Cloud Functions lets your code be deployed on the Google platform and executed when needed. Users pay as they use the resources, without any server procurement or management.
The architecture of a data lake on Azure, shown below, integrates heterogeneous sources such as click-stream data, sensor data, traditional data sources such as databases, and event-based real-time data pipelines. Azure supports Data Lake Storage (Figure 15.6 (c)) with the power of HDInsight as a high-performance processing framework, which extends the utilization of the platform.
Organizations claim to use a data lake approach to load and analyze data and content that would not go into a traditional data warehouse, such as web server logs, sensor logs, social media content, IoT feeds, or image files and associated metadata. Data lake analytics can therefore encompass any historical data or content from which you may be able to derive business insights. But a data lake can play a key role in harvesting conventional structured data as well, such as data that you offload from your data warehouse in order to control costs and improve the performance of the warehouse.
Data is moved into the data lake through a data pipeline using any standard Extract, Transform, and Load (ETL) interface. ETL frameworks support data movements for full data loads, change data capture, and slowly changing dimensions. Incremental loads are especially popular for any large and growing datasets that are transactional in nature.
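A common way to implement an incremental load is a watermark: remember the highest modification timestamp loaded so far and fetch only newer rows. A minimal sketch, assuming a source table with an updated_at column (table, columns, and timestamps are all hypothetical):

```python
import sqlite3  # stand-in for any source database driver

def incremental_load(conn, watermark):
    """Fetch only rows changed since the last successful load,
    using the max loaded timestamp as the watermark."""
    rows = conn.execute(
        "SELECT id, payload, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark

# Demo with an in-memory table; each pipeline run persists new_watermark
# (e.g., in a control table) and passes it to the next run, so only the
# delta moves into the lake.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, payload TEXT, updated_at TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 'a', '2024-01-01T00:00:00')")
rows, watermark = incremental_load(conn, "1970-01-01T00:00:00")
```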
This section has focused on cloud technologies and the top providers in the market, as seen in Gartner's rankings of cloud infrastructure providers: Amazon Web Services, Google Cloud, and Microsoft Azure. We discuss the phases along with the service offerings from the different providers.
Building a data lake in the cloud brings a lot of advantages; mainly, fully managed services allow an organization to focus on its data needs rather than the maintenance of physical hardware and licensing. Below are the important benefits of using cloud solutions for data lakes:
Storage capacity: In the cloud you can start with small files, and the storage provides the elasticity to grow your data lake to exabyte size. This helps your organization focus on data strategy without worrying about storage servers.
Cost efficiency: Cloud providers have various options for storing and processing your data applications, as well as various pricing options such as pay-per-use, fixed standard pricing, and long-term pricing, which can give 60% to 75% cost savings. Most service providers allow for multiple storage classes and pricing options. This enables companies to pay only for as much as they need, instead of planning for an assumed cost and capacity, which is required when procuring physical hardware.
Central repository: A centralized location for all object stores and data access means the setup is the same for every team in an organization, which improves consistency.
Data security: All companies have a responsibility to protect their data; with data lakes designed to store all types of data, including sensitive information, cloud providers secure them under a shared responsibility model.
Scalability: Teams avoid provisioning more than necessary or paying for hardware that they don't need. Auto-scaling can be done horizontally (scale out/in) or vertically (scale up/down) based on the business needs.
In this section we look at the options available for data storage. Data collected from various sources comes in many kinds and types, and most modern data applications have to handle them all. Data movements from an on-premises data warehouse into a cloud data lake come in different types: lift and shift, database migration, and processed loads, depending on the application's needs and the priority of the business. The following sources of data are common across cloud data lakes, and only the services and tools used to ingest the data differ: databases, files (CSV, XLS, PDFs, and logs), IoT device feeds, and application data. We will see various ways to capture data into the data lake from the top cloud providers and their services.
With Cloud Storage, you can start with a few small files and grow your data lake to exabytes in size. Cloud Storage (see Figure 15.8 (a)) supports high-volume ingestion of new data and high-volume consumption of stored data in combination with other services. Google Cloud provides various ingest options. Pub/Sub is an option for ingesting real-time or near-real-time data into Google Cloud. Storage Transfer Service offers seamless and quick movement of data from online sources or from on-premises locations, such as a data centre, into the cloud. gsutil, the Cloud Storage command-line tool, is an option if you want one-time or scheduled file transfers into Google Storage.
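As a sketch of the Pub/Sub path, the snippet below publishes a small event that a downstream subscriber (for example a Dataflow job) could land in Cloud Storage or BigQuery; the project and topic names are placeholders.

```python
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "lake-ingest")

# Messages are raw bytes; subscribers decode and route them into the lake.
event = {"device_id": "sensor-42", "temp_c": 21.5}
future = publisher.publish(topic_path, json.dumps(event).encode("utf-8"))
print("published message id:", future.result())
```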
AWS S3 acts as the primary drop location for data lake solutions (see Figure 15.8 (b)); once a file is placed into a bucket (a folder in the cloud) using an ETL engine, there are various ways to process it. S3 provides 99.999999999% durability and 99.99% availability of objects over a given year with endless storage, so customers need not worry about growing data storage needs. Once the data is placed inside an S3 bucket, it can trigger consecutive actions based on the type of data ingested. Migrating a database can be done using the Database Migration Service (DMS), which helps to migrate data quickly and securely, with no downtime for the existing databases. DMS supports homogeneous migrations like Oracle to Oracle or SQL Server to SQL Server, and also heterogeneous migrations like Oracle or Microsoft SQL Server to Amazon Aurora.
Microsoft Azure
The Azure Storage service is Microsoft's cloud storage solution; it is a massively scalable object store. Storage comes with various data services such as Azure Blobs, Files, Queues, Tables, and Disks (refer to Figure 15.8 (c)). The Copy Data service from Azure offers data ingestion from 70+ data sources on premises or in the cloud. An easy, graphical-user-interface-driven ingestion process allows users to select thousands of tables and databases, and it automates the data pipeline instances based on the options the user has selected.
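For a programmatic alternative to the GUI-driven Copy Data flow, here is a minimal upload through the azure-storage-blob SDK; the connection string, container, and blob names are placeholders.

```python
from azure.storage.blob import BlobServiceClient

# Placeholder connection string; in practice load it from configuration
# or use a managed identity instead.
service = BlobServiceClient.from_connection_string("<connection-string>")

# Drop a raw file into the lake's landing container.
blob = service.get_blob_client(container="lake-raw", blob="landing/daily_extract.csv")
with open("daily_extract.csv", "rb") as data:
    blob.upload_blob(data, overwrite=True)
```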
The next consideration is selecting tools to perform the data transformations, since the data lake brings the ability to store raw data with no oversight of the contents. In a traditional data warehouse we saw a high need for intermediate storage or databases such as data marts, whereas in a data lake there should be no excessive use of databases and pre-processing methods. Data lake architecture completely decouples this complexity and reduces cost. Accessing data with no schema is a major challenge in selecting any ETL tool, and data lakes are typically used as repositories for raw data in structured or semi-structured formats.
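This is the schema-on-read pattern: the raw files carry no declared schema, and the engine infers one at query time. A minimal PySpark sketch, with a hypothetical lake path and column names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# No schema was declared at ingestion time; Spark infers one by
# sampling the raw JSON when the data is read.
raw = spark.read.json("s3a://my-data-lake-raw/landing/events/")
raw.printSchema()

# Downstream jobs can still impose structure selectively.
curated = raw.select("event_id", "event_time", "payload")
```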
Organizations stepping into cloud and data platform solutions always tend to build strong data governance and security strategies. In the current cloud industry, every provider focuses heavily on security layers, since most cost-effective and preferred cloud solutions end up on shared hardware infrastructure. Common security standards are followed across all the cloud data storage providers in the industry.
Conclusion
There are various options available in the market for building a data lake solution, and they can be ready to operate in a matter of hours. Serverless and fully managed solution providers lead customer engagements with highly available and secure platform integrations.