HPE Ezmeral Data Fabric Database-A00125063enw
Contents
Introduction ................................................................................................. 3
Advantages of HPE Ezmeral Data Fabric Database ......................................... 3
Database architecture .................................................................................... 4
HPE Ezmeral Data Fabric Data ...................................................................... 5
Tables and volumes ....................................................................................... 5
Table regions and containers ......................................................................... 5
Data models .................................................................................................. 7
HPE Ezmeral Data Fabric Database as a document database .......................... 7
HPE Ezmeral Data Fabric Database as a column-oriented database .............. 10
HPE Ezmeral Data Fabric Database and Hive Integration ............................. 14
Database security ....................................................................................... 14
Dynamic data masking for JSON document database ................................... 14
Predefined mask types ................................................................................ 15
Create a table .............................................................................................. 15
Conclusion .................................................................................................. 16
Technical white paper Page 3
HPE INTERNAL USE ONLY
Introduction
HPE Ezmeral Data Fabric is the industry’s first edge-to-cloud solution that ingests different data types, stores them in their native format, and processes them in place. It supports popular data and analytics APIs, which simplify data access across the enterprise for all analytics users. HPE Ezmeral Data Fabric Database is an enterprise-grade, high-performance NoSQL database management system that is built into the HPE Ezmeral Data Fabric platform. It requires no additional processes to manage, leverages the same architecture as the rest of the platform, and requires minimal additional configuration.
Advantages of HPE Ezmeral Data Fabric Database
The platform’s advantages span three audiences:
• Architect (integrated analytics): advanced analytics with SQL, advanced analytics with Spark, global distributed applications, and replication
• Developer (powerful APIs): high performance via secondary indexes, easy application development, extreme scale for CRUD operations, flexible data model, rich query, and storage data consistency
• Admin/DevSecOps (mission critical): high availability, no data loss, self-tuning, zero downtime, security, disaster recovery, and snapshots
From an administrator’s point-of-view, HPE Ezmeral Data Fabric Database provides the following capabilities:
• Minimal administration: Single namespace for files, tables, and streams; flexible schema that allows built-in data management and
protection; automatic splits and merges as data grows and shrinks; and easy bulk data loading
• Self-healing from hardware and software failures: Replicated state and data for instant recovery and automated replication of data
• Global low-latency replication: Multi-primary (that is, active to active) replication, which is important for disaster recovery, also reduces risk
of data loss, application failover, and faster data access
• High performance and low latency: Integrated system with fewer software layers, single hop to data, and no compactions with low I/O
amplification
• Fine-grained security: Access permissions can be granted on tables (as well as files and streams) at a granular level using Access Control Expressions (ACEs), which are designed for flexibility and ease of use
The HPE Ezmeral Data Fabric Database includes two NoSQL database models: a binary-style table and a document-style table.
• Key-value and columnar database with HBase API
– Supports Apache HBase tables and databases
– Provides a native implementation of the HBase API for optimized performance on the data fabric platform
• JSON document database based on the OJAI API
– Supports JSON documents as a native datastore
– Stores JSON documents in HPE Ezmeral Data Fabric Database JSON tables
– With HPE Ezmeral Data Fabric 7.0.0, all fields of JSON tables support dynamic data masking (DDM)
– The JSON database supports eight predefined dynamic data masks
Database architecture
HPE Ezmeral Data Fabric Database implements tables within the framework of the data fabric filesystem. It creates tables (both binary and
JSON tables) in logical units called volumes. See Figure 2.
Figure 2 shows four tables (Table A through Table D), each with row keys (A001, A002, A003), stored in volumes.
Volumes do not have a fixed size and they do not occupy disk space until the file system writes data to a container within the volume. A
large volume may contain anywhere from 50–100 million containers.
Tables are stored in containers and implemented within volumes, which provide the following capabilities:
• Multi-tenancy
• Snapshots
• Mirroring and replication
The following are the key advantages to storing table regions in containers:
• Cluster scalability
• High data availability
Cluster scalability
The Container Location Database (CLDB) tracks the information and location of tables (and files) through file system containers. Because this architecture keeps the CLDB small, it becomes practical to store tens of exabytes in a data fabric cluster, regardless of the number of tables and files.
The cluster’s CLDB tracks the location of containers, and CLDBs are updated only when a container is moved, a node fails, or a periodic block change report arrives. Therefore, the update rate, even for very large clusters, is relatively low. The data fabric filesystem does not have to query the CLDB often, so it can cache container locations for long periods.
Moreover, CLDBs are very small in comparison to Apache Hadoop NameNodes. NameNodes track metadata and block information for all
files and the locations for all blocks in every file as well. As blocks are typically 200 MB in size on average, the total number of objects that a
NameNode tracks is very large. CLDBs, however, track containers, which are much larger objects, so the size of the location information can
be 100x to 1000x smaller than the location information in a NameNode. CLDBs do not track information about tables and files.
High availability
Due to the way updates to table regions (also called tablets) are applied and replicated, data in table regions is instantly available. Tables and table regions are part of abstract entities called containers, which provide automatic replication of table regions (with a default of three copies) across the nodes of a cluster.
Containers are replicated to a configurable number of copies. These copies are distributed to different nodes in the same cluster as the original, or primary, container. The cluster CLDB determines the order in which the replicas are updated. Together, the replicas form a replication chain that is updated transactionally. When an update is applied to a region in the primary container (which is at the head of the replication chain), the update is applied serially to the replicas of that container in the chain. The update is complete only when all replicas in the chain are updated.
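The serial, all-or-nothing character of the replication chain described above can be sketched in a few lines of Python. The `Container` class here is purely illustrative and is not a data fabric internal type; it only models the idea that an update forwarded from the head of the chain completes when the tail has applied it.

```python
# Toy model of a transactional replication chain: an update applied to
# the primary container is forwarded serially down the chain, and the
# call returns only once every replica has applied it.

class Container:
    def __init__(self, name, next_in_chain=None):
        self.name = name
        self.next_in_chain = next_in_chain
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value
        # Forward serially; this returns only after the tail applied it.
        if self.next_in_chain is not None:
            self.next_in_chain.apply(key, value)

# Default replication factor of three: one primary, two replicas.
tail = Container("replica-2")
mid = Container("replica-1", next_in_chain=tail)
primary = Container("primary", next_in_chain=mid)

primary.apply("row-A001", "v1")
# The update is complete: all three copies agree.
assert primary.data == mid.data == tail.data == {"row-A001": "v1"}
```

In the real system the chain order is determined by the CLDB and updates are transactional across nodes; this sketch only captures the "complete when the whole chain has applied it" semantics.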
As a result of this architecture, when a hardware failure brings down a node, the regions served by that node are available instantly from one
of the other nodes that have the replicated data.
The HPE Ezmeral Data Fabric software can detect the exact point at which replicas diverge, even at a 2 GB per second update rate. The
software randomly picks any one of the three copies as the new primary, rolls back the other surviving replicas to the divergence point, and
then rolls forward to converge with the chosen primary. The HPE Ezmeral Data Fabric software can do this on the fly with very little impact
on normal operations. Since containers are contained in volumes, the automatic replication factor is set at the volume level.
Multi-tenancy
HPE Ezmeral Data Fabric Database tables are created in volumes, so restricting a volume also restricts the table data it holds. Restricting a volume to a subset of a cluster’s nodes allows isolation of sensitive data or applications, and the use of heterogeneous hardware in the cluster for specific workloads.
For example, data placement can be used to keep personally identifiable information (PII) on nodes that have encrypted drives, or to keep
HPE Ezmeral Data Fabric Database tables on nodes that have SSDs. Work environments can be isolated for different database users or
applications and HPE Ezmeral Data Fabric Database tables placed on specific hardware for better performance or load isolation.
Isolation of work environments for different database users or applications allows policies, quotas, and access privileges to be set for specific users and volumes. Multiple jobs with different requirements can run without conflict.
Figure 3 depicts a data fabric cluster storing table and file data. The cluster has three separate volumes mounted at directories /user/eng,
/user/mkt and /project/test. As shown, each directory contains both file data and table data, grouped together logically. Since each directory
maps to a different volume, data in each directory can have a different policy. For example, /user/eng has a disk-usage quota, while
/user/mkt is on a snapshot schedule. Furthermore, two directories, /user/mkt and /project/test are mirrored to locations outside the cluster,
providing read-only access to high-traffic data, including the tables in those volumes.
Snapshots
Since HPE Ezmeral Data Fabric Database tables are created in volumes, a volume snapshot can be used to capture the state of a volume’s
directories, tables, and files, at an exact point in time. Volume snapshots can be used for rollbacks, hot backups, model training, and real-time
data analysis management.
Hot backups
Backups of table data can be created on the fly for auditing or governance compliance.
Model training
Machine learning frameworks can use snapshots to enable a reproducible and auditable model training process. Snapshots allow the training
process to work against a preserved image of the training data from a precise moment in time. In most cases, the use of snapshots requires
no additional storage and snapshots are taken in less than one second.
Mirroring
Since HPE Ezmeral Data Fabric Database tables are created in volumes, volume mirroring enables automatic replication of differential data across clusters, either on a designated mirror schedule or through a one-time manual mirroring operation without defining a schedule. Consider mirroring volumes to create disaster recovery solutions for databases or to provide read-only access to data from multiple locations.
As HPE Ezmeral Data Fabric Database does not require RegionServers to be reconstructed, databases can be brought up on the mirrored
site if the active site goes down. Mirroring is a parallel operation, copying data directly from the nodes of one data fabric cluster to the nodes
in a remote data fabric cluster. The contents of the volume are mirrored, even if the files in the volume are being written to or deleted.
Data fabric captures only that data which has changed at the file-block level since the last data transfer. After the data differential is
identified, it is then compressed and transferred over the WAN to the recovery site, using very low network bandwidth. Finally, checksums
are used to ensure data integrity across the two clusters. There is no performance penalty on the cluster because of mirroring.
Replication
Replication can be combined with volume mirroring to automatically replicate differential data across clusters. Consider using replication for reliable data protection and uninterrupted access to data, and combine it with mirroring for data recovery.
Data replication can also be initiated specifically for high availability of data. The process copies volume data from one node to another, within and across clusters. In particular, streams and tables can be replicated through gateways on a record-by-record basis in real time within HPE Ezmeral Data Fabric Database.
Data models
HPE Ezmeral Data Fabric Database can be used as both a document database and a column-oriented database. As a document database, it stores JSON documents in HPE Ezmeral Data Fabric Database JSON tables. As a column-oriented database, it stores binary data in HPE Ezmeral Data Fabric Database binary tables.
HPE Ezmeral Data Fabric Database as a document database
In a JSON table, each row is a JSON document identified by its _id field.
Fields in the document can be identified by using field paths, for example, address.street:
{
  "_id": "ID001",
  "name": "Bob",
  "address": {
    "house": 123,
    "street": "Main",
    "phones": [
      { "mobile": "555-1234" },
      { "work": "+1-123-456-7890" }
    ]
  },
  "hobbies": ["badminton", "chess", "beaches"]
}
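The way a dotted field path names a nested field can be illustrated with a small Python sketch. The `resolve_path` helper below is hypothetical (it is not part of the OJAI API); it only shows the idea that each path segment steps one level deeper into the document.

```python
# Minimal illustration of field paths: a dotted path such as
# "address.street" names a nested field inside a JSON document.
# resolve_path is an illustrative helper, not an HPE or OJAI API.

doc = {
    "_id": "ID001",
    "name": "Bob",
    "address": {
        "house": 123,
        "street": "Main",
        "phones": [
            {"mobile": "555-1234"},
            {"work": "+1-123-456-7890"},
        ],
    },
    "hobbies": ["badminton", "chess", "beaches"],
}

def resolve_path(document, path):
    """Walk a dotted field path through nested dicts; list elements
    appear as numeric segments (e.g. "address.phones.0.mobile")."""
    node = document
    for segment in path.split("."):
        if isinstance(node, list):
            node = node[int(segment)]
        else:
            node = node[segment]
    return node

print(resolve_path(doc, "address.street"))         # -> Main
print(resolve_path(doc, "address.phones.1.work"))  # -> +1-123-456-7890
```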
Figure 6 summarizes the different code paths and the components involved for processing HPE Ezmeral Data Fabric Database JSON
queries.
A secondary index (also referred to as an index) is a special table that stores a subset of document fields from a JSON table. The index
orders its data on a set of fields, defined as the indexed fields. This contrasts with the JSON table that orders its data on the table primary
key (rowId or rowKey). With administrator privileges, one or more indexes on each JSON table can be created. After the indexes are created,
applications can leverage them to accelerate query response times. Secondary indexes can also contain additional fields known as included
fields (or sometimes covered fields) beyond those being indexed, so that many queries can be satisfied with a single read.
These indexes provide efficient access to a wider range of queries on data in HPE Ezmeral Data Fabric Database, allowing data to be queried efficiently through fields other than the primary key. This capability enables HPE Ezmeral Data Fabric Database to support a broader set of use cases. Applications that benefit include rich, interactive business applications and user-facing analytic applications.
Secondary indexes also enable business intelligence (BI) tools and ad hoc queries on operational data sets. Secondary indexes can be
created only on HPE Ezmeral Data Fabric Database JSON tables.
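The value of included (covered) fields can be seen in a small sketch: if the index carries the fields a query needs, the query never touches the primary table. The in-memory index layout below is purely illustrative and is not the on-disk format used by HPE Ezmeral Data Fabric Database; the table contents are invented for the example.

```python
# Sketch of a secondary index with an "included" (covered) field,
# answering a query in a single ordered scan of the index alone.

primary = {  # JSON table keyed by _id
    "ID001": {"name": "Bob",   "city": "Austin", "age": 33},
    "ID002": {"name": "Alice", "city": "Boston", "age": 41},
    "ID003": {"name": "Carol", "city": "Austin", "age": 29},
}

# Index ordered on "city" (the indexed field), carrying "name" as an
# included field so city+name queries are covered by the index.
index = sorted(
    (doc["city"], _id, {"name": doc["name"]}) for _id, doc in primary.items()
)

def names_in_city(city):
    # Satisfied entirely from the index: no reads of the primary table.
    return [covered["name"] for c, _id, covered in index if c == city]

print(names_in_city("Austin"))  # -> ['Bob', 'Carol']
```

Without the included field, each index hit would require a second read against the primary table to fetch `name`; covering the field trades a little index storage for one read per query.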
HPE Ezmeral Data Fabric Database as a column-oriented database
As a column-oriented database, HPE Ezmeral Data Fabric Database stores data in binary format and supports the Apache HBase API through a native implementation, so HBase applications can use HPE Ezmeral Data Fabric Database tables without modifying any code. HPE Ezmeral Data Fabric Database tables also:
• Use the HBase data model
• Allow large-scale applications managing columnar data
• Support binary compatibility with applications using standard HBase application APIs
• Identify data elements in binary tables by row key, with the columns in each row grouped into column families
Data is stored as a collection of key-value pairs where the key serves as a unique identifier. Typically, tables of the same type (in this case,
binary) are created in their volume.
In a binary table, each row key (for example, Key 001, Key 002, Key 003) maps to a set of columns (Column A, Column B, Column C).
HPE Ezmeral Data Fabric Database stores data as a nested series of maps. Each map consists of a set of key-value pairs, where the value
can be the key in another map. Keys are kept in strict lexicographical order: 1, 10, and 113 come before 2, 20, and 213.
This structure results in values with versions that can be accessed flexibly and quickly. Since HPE Ezmeral Data Fabric Database binary
tables are sparse, any of the column values for a given key can be null.
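Both properties of this data model, the lexicographical key ordering and the sparseness of rows, are easy to demonstrate. The snippet below is a plain-Python illustration, not data fabric code; the `cf:column` naming merely mimics the HBase-style convention.

```python
# Row keys are ordered lexicographically as strings, not numerically,
# so "10" sorts before "2":
keys = ["2", "113", "20", "1", "213", "10"]
print(sorted(keys))  # -> ['1', '10', '113', '2', '20', '213']

# Sparse rows: missing cells simply have no entry. Modeling a row as a
# dict keyed by "columnFamily:column", an absent key stands in for null
# and costs no storage.
row = {"cf1:colA": b"value", "cf1:colC": b"other"}
print(row.get("cf1:colB"))  # -> None
```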
Compression settings for individual column families can be specified, so that the settings that prioritize speed of access or efficient use of
disk space can be chosen, according to needs.
Be aware of the approximate number of rows in each column family. This property is called the column family’s cardinality. When column families in the same table have very disparate cardinalities, the sparser column family’s data can be spread out across multiple nodes, because the denser column family requires more region splits. Scans on the sparser column family can take longer due to this effect.
For example, consider a table that lists products across a small range of model numbers, but with a row for each unique serial number of every individual product manufactured within a given model. Such a table has a very large difference in cardinality between a column family that relates to the model number and a column family that relates to the serial number. Scans on the model number column family have to range across the cluster, because the frequent splits required by the comparatively large number of serial number rows spread the model number rows out across many regions on many nodes.
Column families
Example table
This example uses JSON notation for representational clarity, and the time stamps in it are arbitrarily assigned. Queries return the most recent time stamp by default. For example, a query for the value in "arbitrarySecondKey"/"secondColumnFamily:firstColumn" returns valueThree. Specifying a time stamp with the query "arbitrarySecondKey"/"secondColumnFamily:firstColumn"/11 returns valueSeven.
Table replication
Data in one table can be replicated to another table in the same cluster or in a separate cluster. This type of replication is in addition to the automatic replication that occurs with table regions within a volume. Changes such as puts and deletes can be replicated, as can entire tables, specific column families, and specific columns.
Data fabric binary tables can be replicated only to binary tables, and JSON tables can be replicated only to JSON tables.
• Tables from which data is replicated are called source tables. Tables to which the data is replicated are called replicas.
• Clusters from which data is replicated are called source clusters. Clusters to which data is replicated are called destination clusters.
A single cluster can be both a source cluster and a destination cluster, depending on the replication configuration in which the cluster
participates.
• Replication takes place between source and destination clusters. However, source clusters do not send data to nodes in the destination
cluster directly. The replication stream (the data being pushed to the replicas) is consumed by one or more data fabric gateways in the
destination cluster. The gateways receive the updates from the source cluster, batch them, and apply them to the replica tables. Multiple
gateways serve the purpose of both load balancing and failover.
Modes of replication
Table data can be replicated in one of two replication modes, specified per source-replica pair.
• Asynchronous replication
In this replication mode, HPE Ezmeral Data Fabric Database confirms to client applications that operations are complete after the
operations are performed on source tables. Updates are replicated in the background. Therefore, the latency of updates from client
applications is not affected by the time required for the network round trip between the source cluster and the destination cluster.
This type of replication is well-suited for clusters that are geographically separated in wide-area networks.
HPE Ezmeral Data Fabric Database can throttle the replication stream to minimize the impact of the replication process on incoming
operations during periods of heavy load. Throttling distributes disk reads and CPU usage more evenly over time, so that incoming operations
on a source table can be completed faster. Throttling is disabled by default. Asynchronous replication is the default replication mode.
• Synchronous replication
In this replication mode, HPE Ezmeral Data Fabric Database confirms to client applications that changes have been applied to a source
table only when these two conditions are true:
– The change was sent to all the container copies in the local cluster.
– The change was sent to a gateway in the destination cluster. This operation takes place only after the first. Puts are not sent to
gateways until after they are sent to all container copies in the cluster where the source table is located.
If a gateway fails, the source detects this and resends operations to the gateway when it is restarted, or a new gateway is brought online. Due
to the confirmations that HPE Ezmeral Data Fabric Database receives on source clusters, synchronous replication is especially well-suited for
creating a backup of data for disaster recovery.
When the latency of a replication stream is high, HPE Ezmeral Data Fabric Database switches to asynchronous replication temporarily so that client applications are not blocked indefinitely. After the latency is sufficiently reduced, it switches back to synchronous replication. The same switching occurs when a gateway fails: HPE Ezmeral Data Fabric Database does not resume synchronous replication until a new gateway is established or the failed gateway is restarted.
• One-to-many replication
A single source table can replicate to multiple replicas.
• Replication loops
When three or more tables need to be kept in sync, primary-secondary replication between pairs of them can be set up to form a
replication loop. Operations on a table are propagated to the other clusters in the loop, but there is no attempt to reapply the operations
at the originating table. This is because the operations are tagged with a universally unique identifier (UUID) that identifies the table
where the operations originated.
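The UUID-tagging mechanism that stops a replication loop can be sketched as follows. The `Table` class and its wiring are illustrative only, not data fabric internals; the point is that an operation carries the UUID of its originating table and is dropped, rather than re-applied and re-propagated, when it arrives back there.

```python
# Sketch of loop prevention in a replication ring: each operation is
# tagged with the UUID of the table where it originated, and a table
# drops any operation carrying its own UUID.

import uuid

class Table:
    def __init__(self):
        self.uuid = uuid.uuid4()
        self.rows = {}
        self.downstream = None  # next table in the replication loop

    def put(self, key, value, origin=None):
        if origin == self.uuid:
            return  # operation came back around the loop; stop here
        self.rows[key] = value
        origin = origin or self.uuid  # tag with the originating table
        if self.downstream is not None:
            self.downstream.put(key, value, origin=origin)

# Three tables wired into a loop: a -> b -> c -> a
a, b, c = Table(), Table(), Table()
a.downstream, b.downstream, c.downstream = b, c, a

a.put("row1", "x")  # propagates to b and c, then stops back at a
assert a.rows == b.rows == c.rows == {"row1": "x"}
```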
• Primary-secondary replication in two directions
Primary-secondary replication configurations can be combined to replicate data between clusters. Two clusters engaged in replication can
each act as a source cluster and a destination cluster.
Multi-primary replication
In this replication topology, there are two primary-secondary relationships, with each table playing both the primary and secondary roles.
Client applications update both tables, and each table replicates updates to the other.
All updates from a source table arrive at a replica after having been authenticated at a gateway. Therefore, ACEs on the replica that control
permissions for updates to column families and columns are irrelevant; gateways have the implicit authority to update replicas. If one of the
tables goes offline, client applications can be directed to the other table. When the offline table comes back online, replication between the
two tables resumes automatically. When both tables are in sync again, client applications can be redirected back to the original table.
HPE Ezmeral Data Fabric Database and Hive integration
HPE Ezmeral Data Fabric Database binary tables can be created from Hive, and both Hive and HPE Ezmeral Data Fabric Database can access them. Hive queries can be run on HPE Ezmeral Data Fabric Database binary tables, existing binary tables can be made accessible to both Hive and HPE Ezmeral Data Fabric Database, and Hive queries can be run on those tables as well. Utilizing Hive allows the widest standard SQL access against the NoSQL tables stored in HPE Ezmeral Data Fabric Database.
HPE EEP
An HPE Ezmeral Ecosystem Pack (EEP) provides a set of ecosystem components that work together on one or more data fabric cluster
versions. Each HPE EEP contains only one version of an ecosystem component. Hadoop ecosystem components within an HPE EEP
undergo extensive interoperability testing to validate that the components can work together. Hive is included in HPE EEP.
Database security
HPE Ezmeral Data Fabric Database provides column family, column, and field-level ACEs, as well as policy-based security, which allows security policies to be created that control access to information. ACEs and security policies provide an all-or-nothing approach: either the data for the column or field is returned, or it is not.
As a typical example, consider the credit card industry. The application that prints receipts for credit card purchases does not need the full
credit card numbers but only needs the last four digits of the credit card number to identify the credit card being used. However, in the same
organization, the full credit card number should be available for payment processing. With ACEs and policies, it is possible to either get the
credit card number or not. ACEs and/or policies cannot be used to return only the last four digits of the credit card number.
DDM offers the solution. It’s easy to use and backward compatible with existing applications. DDM applies the masking rules to query results,
with no modifications required to existing queries. Its disadvantage is that it is not a fully secure solution for sensitive fields; it does not
prevent users from connecting to the database and running exhaustive queries that expose pieces of sensitive data. Therefore, DDM can be
viewed as a complementary solution to other database security features, such as auditing, encryption, and row/column-level security. The
maximum number of supported DDMs is 128. There are eight predefined DDMs supported on the JSON database.
The following table describes the predefined masks that DDM supports:
• mrddm_redact (Binary, Boolean, Byte, Int, Long, Short, String, Float, Double, Time, Time stamp, Date, Array): For strings, masks all alphabetic characters with “x” and all numeric characters with “0”. For other data types, the mask replaces all values with the equivalent of 0 for that data type.
• mrddm_last4 (String, Array): Displays only the last four characters and replaces everything else with *. This can be used in a wide number of applications, including credit card numbers, passport information, and social security numbers.
• mrddm_first4 (String, Array): Displays only the first four characters. This is similar to the last-four data mask, but shows the first characters instead.
• mrddm_first6last4 (String, Array): Displays only the first six characters and the last four characters. This is similar to the last-four data mask format.
• mrddm_email (String in email format, Array): Displays the first two characters and the last two characters of the user name, the first character of the domain, and the whole top-level domain. For instance, [email protected] is masked to ex***le@h**.com.
• mrddm_hash (String, Array): Displays the hash of the data. This is useful for verifying whether two cells match, without revealing the pattern or the length of the data.
• mrddm_date (Time stamp, Date): Displays a generic date for all date fields but shows the correct year. This mask makes all months and days of the month into the value one.
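To make the masking behavior concrete, here are illustrative Python reimplementations of two of the mask styles described above. The real mrddm_* masks are applied server-side by HPE Ezmeral Data Fabric Database; these helpers only mimic the documented behavior for strings and are not the product's implementation.

```python
# Illustrative, client-side mimics of two predefined DDM mask styles.

def mrddm_last4(value: str) -> str:
    """Show only the last four characters; replace the rest with '*'."""
    return "*" * max(len(value) - 4, 0) + value[-4:]

def mrddm_redact(value: str) -> str:
    """Replace alphabetic characters with 'x' and digits with '0';
    other characters (spaces, dashes) pass through unchanged."""
    return "".join(
        "x" if ch.isalpha() else "0" if ch.isdigit() else ch for ch in value
    )

print(mrddm_last4("4111111111111111"))  # -> ************1111
print(mrddm_redact("Card 4111-22"))     # -> xxxx 0000-00
```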
Create a table
Both binary and JSON tables can be created using HPE Ezmeral Data Fabric Management Control System (MCS). Figure 11 illustrates the
options available when creating a new table with the MCS UI.
Conclusion
HPE Ezmeral Data Fabric Database is an enterprise-grade, high-performance NoSQL database management system that supports both binary and JSON formats. It is fully integrated into the HPE Ezmeral Data Fabric platform, along with File Store, Objects, and Streams, and provides a unified layer that fosters enterprise-wide data observability and lifecycle management across traditional and cloud-native apps without refactoring data or applications.
Learn more at
hpe.com/datafabric
© Copyright 2022 Hewlett Packard Enterprise Development LP. The information contained herein is subject to change without
notice. The only warranties for Hewlett Packard Enterprise products and services are set forth in the express warranty statements
accompanying such products and services. Nothing herein should be construed as constituting an additional warranty.
Hewlett Packard Enterprise shall not be liable for technical or editorial errors or omissions contained herein.
This document contains confidential and/or legally privileged information. It is intended for Hewlett Packard Enterprise Internal
Use only. If you are not an intended recipient as identified on the front cover of this document, you are strictly prohibited from
reviewing, redistributing, disseminating, or in any other way using or relying on the contents of this document.
Linux is the registered trademark of Linus Torvalds in the U.S. and other countries. Windows is either a registered trademark or
trademark of Microsoft Corporation in the United States and/or other countries. All third-party marks are property of their
respective owners.
a00125063ENW, Rev. 1