
WHITEPAPER

Data Architecture Series


The Unified Data Fabric
Building a modern data fabric with Cloudera

Table of Contents

Abstract
Introduction
Why is a Data Fabric Architecture Needed
What is a Data Fabric?
    Definition
    Properties
How to build a Data Fabric with Cloudera
    Cloudera
Beyond the Data Fabric
About Cloudera

2 The Unified Data Fabric


Abstract
This whitepaper provides an introduction to the Data Fabric architecture. It explains what the Data Fabric is and why it was created, with particular attention to the challenges it addresses, offers a Cloudera-based reference architecture, and highlights two key areas in which the Fabric can be extended.
Version: 1.0
Author: Jean-Philippe Player

Introduction
In this section we briefly summarize why we wrote this whitepaper, who it is intended for, why they should read it, and recommendations for further reading.

Audience
This whitepaper was written for members of Architecture, Operations, Engineering and Business leaders of Enterprise Data Platform teams. It may also provide useful reading for Chief Data Officers (CDO) and Chief Information Officers (CIO) who want to establish or strengthen their understanding of the Data Fabric architecture, specifically as it applies to Cloudera’s products and services.

Purpose
The Data Fabric is one of three important emerging data architectures; the other two are Data Mesh and Data Lakehouse. Organizations need to clearly understand what each of them is, why they are important and how to implement them at scale, in a hybrid landscape. That is the goal of this short introductory whitepaper.

Recommended Reading
The recommended reading listed below is limited to only those sources that directly support this whitepaper. Reading the official Cloudera blog or subscribing via email will provide access to a stream of useful reading.
• Enterprise Data Fabric Enables DataOps | Forrester
• How a Big Data Fabric Can Transform Your Data Architecture
• Conquering hybrid and multi-cloud with big data fabric
• Cloudera Data Catalog overview
• Shared Data Experience (SDX) | Cloudera



Why is a Data Fabric Architecture Needed
Enterprises today have to contend with exponentially increasing volumes of batch and streaming data, comprising a variety of structured, unstructured, and semi-structured data types, and originating from an expanding number of disparate sources located on premises, in the cloud and at the edge.

At the same time, business users demand faster and easier access to reliable, trusted, up-to-date data to make accurate business decisions.

Traditional data approaches require spending a lot of time on manually preparing data, managing ingestion, standardizing data sets and orchestrating data movement between on premises and cloud environments. Finding the right data sets and making them available for analytics is often a convoluted process that further slows down business decisions. That is compounded by regulatory compliance and security controls that must be manually applied at every step of the data lifecycle, from ingestion to analytical applications.

The Data Fabric has emerged as a modern data architecture to overcome these challenges and support the needs of a hybrid, multi-cloud environment. The architecture focuses on making data readily available to business users wherever it resides, improving collaboration, enabling self-service, and leveraging automation to simplify data management and enforce the necessary compliance and security requirements.

What is a Data Fabric?

Definition
The Data Fabric offers a comprehensive approach to centrally manage, own, curate, secure and govern enterprise data across multiple clouds and on premises. Forrester coined the term and their definition of a Data Fabric is as follows:

“A Data Fabric orchestrates disparate data sources intelligently and securely in a self-service manner, leveraging data platforms such as data lakes, data warehouses, NoSQL, translytical, and others to deliver a unified, trusted, and comprehensive view of customer and business data across the enterprise to support applications and insights.”

Properties
A modern Data Fabric comprises multiple layers that work together to meet these needs:
1. Data management
2. Data ingestion and streaming
3. Data processing and persistence
4. Data orchestration
5. Data discovery
6. Global data access

The Data Management layer is the core layer of the architecture and provides the needed tools and interfaces for all the other layers. It provides the end-to-end data management capabilities that ensure the reliability, security and governance of data, and it is the main focus of this paper.
Figure 01: Enterprise Data Fabric Reference Architecture. Source: Enterprise Data Fabric Enables DataOps, Forrester, August 2, 2021
[Diagram: six AI/ML-assisted layers over cloud and on-premises data sources. (1) Data Management: metadata/catalog, data security, data governance, data quality, data lineage, policies. (2) Data Ingestion and Streaming: ingestion, streaming, data movement. (3) Data Processing and Persistence: data lake, NoSQL, EDS/BDW, data processing with Hadoop and Spark. (4) Data Orchestration: transformation, integration, cleansing. (5) Data Discovery: data modeling, preparation, curation, graph engine. (6) Global Data Access: global distributed platform, in-memory, embedded, self-service, APIs.]

How to build a Data Fabric with Cloudera
Cloudera has been built from the ground up for hybrid, multi-cloud data management in support of a Data Fabric architecture. In this section we provide an introduction to Cloudera, with a focus on the data management capabilities that enable the Data Fabric.

Cloudera
Cloudera is a true hybrid platform designed to provide the freedom to choose any cloud, any analytics, any data. Cloudera delivers faster and easier data management and data analytics for data anywhere, with optimal performance, scalability, and security. With Cloudera you get the value of Cloudera on premises and Cloudera on cloud for faster time to value and increased IT control.

Cloudera provides the freedom to securely move applications, data, and users bi-directionally between the data center and multiple data clouds, regardless of where your data lives. As a result, the platform is well placed to implement modern data architectures:

• A unified Data Fabric which centrally orchestrates disparate data sources intelligently and securely across multiple clouds and on premises.
• An open Data Lakehouse that enables multi-function analytics on both streaming and stored data in a cloud-native object store across hybrid multi-cloud.
• A scalable Data Mesh that helps eliminate data silos by distributing ownership to cross-functional teams while maintaining a common data infrastructure.

Figure 02: Cloudera
[Diagram; labels recoverable from the figure: Unified Data Fabric, Lakehouse, Data Engineering, Data Warehouse, Operational Database, Machine Learning; On Premises (Ozone) and Cloud (S3, ADLS, GCS, Azure); Customer Data, Advertising Data, Machine Data, Highly Proprietary Data; Amazon Athena, Amazon Redshift, Azure Synapse Analytics, BigQuery, Dataproc, Data Flow.]



Figure 03: Services Components of Cloudera on cloud
[Diagram: Cloudera on Cloud. Management Console, Data Catalog, Replication Manager, Workload Manager. Data Services: DF, DE, DW, OD, ML. Metadata | Security | Encryption | Control | Governance. Data Clusters (Cloudera Data Hub): NiFi*, Kafka, Flink, Spark, Impala, Hive, HBase, Phoenix. Object Storage: S3 | ADLS | GCS.]
*Flow Management rate card

Figure 03 provides a summary of the logical components that make up Cloudera in the Public Cloud. We’ll now explore how each of these components supports the Data Fabric.

Common Control Plane
The Cloudera Control Plane provides a ubiquitous service that is consistent and spans an organization’s deployment instances. The diagram above shows how a public cloud instance shares services such as governance with the private cloud instance. It goes further in supporting multiple cloud and multiple private cloud deployments. The Control Plane enables metadata, security, encryption and governance to be managed as a central but federated service. The fundamental building blocks are based on open source components and have an open and accessible API, which provides integration with a wider ecosystem of services and supports open standards and interoperability.

Data Catalog
The Cloudera Data Catalog sits within the Control Plane. This global catalog provides a searchable inventory of all the assets that are part of the Data Fabric, making data assets easily discoverable.

• Comprehensive: Support for all entities that make up the hybrid platform: Hive tables, Kafka topics, NiFi flows, HBase tables, Machine Learning models, etc. Each asset is displayed alongside its contextual metadata, such as schema, security policies, tags and classifications, profile, governance rules and business annotations.
• Discoverability: Single location to discover and search for data from all nodes of the Fabric.
• Governance: Built-in profiling to give insights into data quality and sensitivity, and a built-in classification engine that assigns security, compliance and policy-related attributes such as PII.



• Lineage: Automatic capture of lineage information helps understand where the data came from, how it is being used, and what impact changes would have. It can further be extended to propagate security policies across the entire Data Fabric, making it safer and easier to share data.
• Policy: Security, Compliance and Governance policies can be assigned to any data asset directly from the Catalog.
• Security: Complete audit log of all access and modifications made to data sets located anywhere in the fabric.
• Collaboration: Supports business annotations and metadata, curation and collaboration.

This addresses the requirements of the Data Management layer (1) of the Data Fabric, when deployed in conjunction with the Shared Data Experience (SDX).

Shared Data Experience (SDX)
Cloudera SDX combines enterprise-grade security, governance and management capabilities with shared metadata that is deployed locally in each node of the Data Fabric, and federated via the Control Plane. It provides a governance layer that is truly global, spanning control planes and deployment instances to assign ownership, capture audit and apply global policies across on premises deployments and public clouds.

• Metadata: establish information assets for increased usability, trust and value, leveraging all metadata (structural, operational, business and social).
• Security: granular, dynamic, role- and attribute-based security policies. Prevent and audit unauthorized access to sensitive or restricted data across the platform.
• Encryption: strong cryptography for data in motion and at rest, centralized authentication with single sign-on.
• Control: move data and workloads between deployments for optimum performance, cost and resilience, meeting ever-changing business needs.
• Governance: enterprise-grade auditing, lineage, and governance capabilities applied across the platform with rich extensibility for partner integrations.

Replication Manager
The Cloudera Replication Manager is designed to serve a number of use cases around cross-fabric data orchestration and replication: workload migration, cloud bursting, backup and disaster recovery, and replication in support of development and test systems. It supports full and incremental replication for all data storage types available in the fabric.

A key tenet of a Unified Data Fabric is having consistent security and governance controls across all fabric endpoints. Tightly integrated with SDX, Replication Manager supports that function [1] by moving policies with the data, replicating all associated metadata, classification tags, security policies, compliance rules and lineage information.

[1] Available from second half of 2022

Global Unified Security with SDX
SDX supports attribute-based policies through the use of tags, such as “PII”, that can be assigned to any data asset including individual columns of a table. The data access policy for PII data can be specified by a centralized team responsible for enterprise-wide rules, while the tag itself can be assigned by the creator of the data set, either manually or via automatic classification. The automatic capture of lineage information through the data pipeline enables tag inheritance, so the relevant policies propagate autonomously within a local node in the fabric.

Replication Manager is aware of these tags. As data is moved between environments, classification tags are also automatically propagated and assigned to the data. This enforces the appropriate policies globally across the nodes of the fabric, and provides unified policy management and compliance across all of the organization’s environments, while allowing business users self-service access to trusted data.
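The interplay of tags, lineage and replication described for SDX and Replication Manager can be illustrated with a small, self-contained Python sketch. It is a toy model and not Cloudera's API: the `Asset` class, the `POLICIES` table, the role names and the `replicate` helper are all hypothetical names introduced only for this illustration. It shows a centrally defined policy attached to a tag, a derived asset inheriting that tag through lineage, and a replicated copy keeping the tag in another environment.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Asset:
    """A data asset (table or column) in one node of the fabric (hypothetical model)."""
    name: str
    tags: set[str] = field(default_factory=set)
    parents: list[Asset] = field(default_factory=list)  # lineage: upstream assets

    def effective_tags(self) -> set[str]:
        """Own tags plus every tag inherited from upstream assets.

        Assumes the lineage graph is acyclic.
        """
        inherited = set(self.tags)
        for parent in self.parents:
            inherited |= parent.effective_tags()
        return inherited


# Hypothetical central policy table: tag -> roles allowed to read tagged data.
POLICIES = {"PII": {"privacy_officer"}}


def can_read(role: str, asset: Asset) -> bool:
    """Allow access only if the role satisfies the policy of every effective tag.

    Tags without a policy do not restrict access.
    """
    return all(role in POLICIES.get(tag, {role}) for tag in asset.effective_tags())


def replicate(asset: Asset) -> Asset:
    """Copy an asset to another environment, carrying its effective tags along."""
    return Asset(name=asset.name, tags=asset.effective_tags())


# The data set creator tags one column; the policy itself is defined centrally.
email = Asset("customers.email", tags={"PII"})
report = Asset("marketing.report", parents=[email])  # derived from the email column
copy = replicate(report)                             # moved to another environment

print(can_read("analyst", report))        # False: PII tag inherited through lineage
print(can_read("privacy_officer", copy))  # True: tag and policy follow the data
```

Because the policy is keyed on the tag rather than on the asset, the central team never has to enumerate individual tables; any asset that acquires the tag, by inheritance or by replication, is covered automatically.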



Figure 04: Key Cloudera Components that make up the Data Fabric architecture
[Diagram: Cloudera components per layer, over cloud and on-premises data sources. (1) Data Management: Data Catalog. (2) Data Ingestion and Streaming: Data Flow & Streaming, Data Engineering. (3) Data Processing and Persistence: Data Warehouse, Operational Database, Data Hub. (4) Data Orchestration: Streaming Analytics, Data Engineering. (5) Data Discovery: Data Catalog, Data Visualization. (6) Global Data Access: Data Warehouse, Operational Database, Data Visualization.]


Data Services
Layers 2 through 6 of the Data Fabric are addressed using Cloudera Data Services (see Figure 04):

• Data ingestion and streaming: provided by Cloudera DataFlow and Cloudera Data Engineering
• Data processing and persistence: provided by Cloudera Data Hub, Cloudera Data Warehouse and Cloudera Operational Database
• Data orchestration: provided by components embedded in Cloudera Data Engineering and Cloudera Streaming Analytics
• Data discovery: provided by Cloudera Data Visualization and the Cloudera Data Catalog
• Global data access: provided by Cloudera Data Warehouse, Cloudera Operational Database and Cloudera AI

Beyond the Data Fabric
The Data Fabric takes a centralized approach to all aspects of data management. As organizations scale, moving to a distributed model for managing the data domains can be beneficial. That model is encapsulated in the concept of the Data Mesh, which is the subject of a separate whitepaper in the series.
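The layer assignments listed under Data Services can also be written down as a small lookup table, which makes it easy to see which services contribute to more than one layer. The Python sketch below only restates the mapping given in this whitepaper (layer 1 is filled in from the Data Catalog and SDX sections); the names `FABRIC_LAYERS` and `layers_served_by` are illustrative and not part of any Cloudera product.

```python
# Layer number -> (layer name, Cloudera services named in this whitepaper).
FABRIC_LAYERS = {
    1: ("Data management", ["Cloudera Data Catalog", "Cloudera SDX"]),
    2: ("Data ingestion and streaming",
        ["Cloudera DataFlow", "Cloudera Data Engineering"]),
    3: ("Data processing and persistence",
        ["Cloudera Data Hub", "Cloudera Data Warehouse",
         "Cloudera Operational Database"]),
    4: ("Data orchestration",
        ["Cloudera Data Engineering", "Cloudera Streaming Analytics"]),
    5: ("Data discovery",
        ["Cloudera Data Visualization", "Cloudera Data Catalog"]),
    6: ("Global data access",
        ["Cloudera Data Warehouse", "Cloudera Operational Database",
         "Cloudera AI"]),
}


def layers_served_by(service: str) -> list[str]:
    """Return the names of the fabric layers a given service contributes to."""
    return [name for name, services in FABRIC_LAYERS.values() if service in services]


print(layers_served_by("Cloudera Data Engineering"))
# ['Data ingestion and streaming', 'Data orchestration']
```

Reading the table in this direction shows, for example, that Cloudera Data Engineering serves both ingestion and orchestration, while the Data Catalog appears in both the management and the discovery layers.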



About Cloudera
Cloudera is the only true hybrid platform for data,
analytics, and AI. With 100x more data under
management than other cloud-only vendors,
Cloudera empowers global enterprises to transform
data of all types, on any public or private cloud,
into valuable, trusted insights. Our open data
lakehouse delivers scalable and secure data
management with portable cloud-native
analytics, enabling customers to bring GenAI
models to their data while maintaining privacy
and ensuring responsible, reliable AI deployments.
The world’s largest brands in financial services,
insurance, media, manufacturing, and government
rely on Cloudera to be able to use their data
to solve the impossible — today and in the future.

To learn more, visit Cloudera.com and follow
us on LinkedIn and X. Cloudera and associated
marks are trademarks or registered trademarks
of Cloudera, Inc. All other company and
product names may be trademarks of their
respective owners.

Cloudera, Inc. | 5470 Great America Pkwy, Santa Clara, CA 95054 USA | cloudera.com

© 2025 Cloudera, Inc. All rights reserved. Cloudera and the Cloudera logo are trademarks or registered trademarks of Cloudera Inc. in the USA and other countries. All other
trademarks are the property of their respective companies. Information is subject to change without notice. WP_003_V2 December 23, 2024
