
Technical white paper
HPE INTERNAL USE ONLY

HPE Ezmeral Data Fabric Database

Contents
Introduction
Advantages of HPE Ezmeral Data Fabric Database
Database architecture
HPE Ezmeral Data Fabric Database
Tables and volumes
Table regions and containers
Data models
HPE Ezmeral Data Fabric Database as a document database
HPE Ezmeral Data Fabric Database as a column-oriented database
HPE Ezmeral Data Fabric Database and Hive Integration
Database security
Dynamic data masking for JSON document database
Predefined mask types
Create a table
Conclusion

Introduction
HPE Ezmeral Data Fabric is the industry’s first edge-to-cloud solution that ingests different data types, stores them in their native format, and processes them in place. It supports popular data and analytics APIs, which simplify data access across the enterprise for all analytics users. HPE Ezmeral Data Fabric Database is an enterprise-grade, high-performance NoSQL database management system that is built into the HPE Ezmeral Data Fabric platform. It requires no additional processes to manage, leverages the same architecture as the rest of the platform, and requires minimal additional configuration.

Figure 1 summarizes the advantages of HPE Ezmeral Data Fabric Database by role:
• Architect (Integrated analytics): advanced analytics with SQL, advanced analytics with Spark, globally distributed applications, and replication
• Developer (Powerful APIs): high performance via secondary indexes, easy application development, extreme scale for CRUD operations, flexible data model, rich query, and storage data consistency
• Admin/DevSecOps (Mission critical): high availability, no data loss, self-tuning, zero downtime, security, and disaster recovery with snapshots

Figure 1. Advantages of HPE Ezmeral Data Fabric Database

Advantages of HPE Ezmeral Data Fabric Database


• Integrated analytics with SQL: The HPE Ezmeral Data Fabric Database integration with Apache Drill provides a low-latency, distributed SQL query engine for large-scale data sets, including structured and semi-structured, nested data.
• Operational analytics: HPE Ezmeral Data Fabric Database can run in the same cluster as Apache Hadoop and Apache Spark, allowing
immediate analysis or processing of live, interactive data. This also helps eliminate data silos to speed the data-to-action cycle, providing a
more efficient data architecture.
• Global distribution of applications: Application access to HPE Ezmeral Data Fabric Database tables is distributable on a global scale.
• Flexible data model: HPE Ezmeral Data Fabric Database functions as both a document database and a column-oriented database. As a document database, it stores JSON documents in JSON tables; as a column-oriented database, it stores binary files in binary tables.

From an administrator’s point-of-view, HPE Ezmeral Data Fabric Database provides the following capabilities:
• Minimal administration: Single namespace for files, tables, and streams; flexible schema that allows built-in data management and
protection; automatic splits and merges as data grows and shrinks; and easy bulk data loading
• Self-healing from hardware and software failures: Replicated state and data for instant recovery and automated replication of data
• Global low-latency replication: Multi-primary (that is, active-active) replication, which is important for disaster recovery, reduces the risk of data loss and enables application failover and faster data access
• High performance and low latency: Integrated system with fewer software layers, a single hop to data, and no compactions, resulting in low I/O amplification
• Fine-grained security: Access permissions can be granted on tables (as well as files and streams) at a granular level using Access Control Expressions (ACEs), which are designed for flexibility and ease of use

The HPE Ezmeral Data Fabric Database includes two NoSQL database models: a binary-style table and a document-style table.
• Key-value and columnar database with HBase API
– Supports Apache HBase tables and databases
– Provides a native implementation of the HBase API for optimized performance on the data fabric platform
• JSON document database based on the OJAI API
– Supports JSON documents as a native datastore
– Stores JSON documents in HPE Ezmeral Data Fabric Database JSON tables
– With HPE Ezmeral Data Fabric 7.0.0, all fields of JSON tables support dynamic data masking (DDM)
– The JSON database supports eight predefined dynamic data masks

Database architecture
HPE Ezmeral Data Fabric Database implements tables within the framework of the data fabric filesystem. It creates tables (both binary and
JSON tables) in logical units called volumes. See Figure 2.

The figure shows a data fabric file system volume containing four tables (Table A through Table D), each holding rows identified by row keys (A001, A002, A003).

Figure 2. Database architecture

The architecture has the following advantages:


• It reduces process overhead because it has no extra layers to pass through when performing operations. HPE Ezmeral Data Fabric Database, like several other NoSQL databases, is a log-based database. It runs inside the data fabric filesystem process, which enables it to read from and write to disks directly. In contrast, other NoSQL databases must communicate with a separate process to perform disk reads and writes. HPE Ezmeral Data Fabric Database eliminates extra process hops, duplicate caching, and needless abstractions, thereby optimizing I/O operations on the data.
• It minimizes compaction delays because it avoids I/O storms when it merges logged operations with structures on disk. As a log-based database, HPE Ezmeral Data Fabric Database must write logged operations to disk. It stores table regions (also called tablets), and the smaller structures within them, partially as b-trees. Together with write-ahead logs (WAL), these b-trees comprise log-structured merge trees. The WALs for the smaller structures within regions are periodically restructured by rolling merge operations on the b-trees. Because HPE Ezmeral Data Fabric Database performs these merges at small scales, applications running against it see no significant effects on latency while the merges take place.

HPE Ezmeral Data Fabric Database


HPE Ezmeral Data Fabric Database tables are implemented directly in the data fabric file system, which allows it to leverage the same
architecture as the rest of the platform and results in minimal additional management.
• HPE Ezmeral Data Fabric Database tables are created in logical units called volumes.
• The tables are sharded by implementing table regions (also called tablets).
• Table regions are stored in abstract entities called data containers.
• Data containers belong to file system volumes.

Tables and volumes


Volumes are management entities that logically organize a cluster’s data; they can be used to enforce disk usage limits, set replication levels, define snapshots and mirrors, and establish ownership and accountability.

Volumes do not have a fixed size and they do not occupy disk space until the file system writes data to a container within the volume. A
large volume may contain anywhere from 50–100 million containers.

Tables are stored in containers and implemented in volumes, and provide the following capabilities:
• Multi-tenancy
• Snapshots
• Mirroring and replication

Table regions and containers


Each region of a table, along with its corresponding WAL files, b-trees, and other associated structures, is stored in one container. Each
container (which can be from 16 to 32 GB in size) can store more than one region (which by default is 4096 MB in size). The recommended
practice is to use the default size for a region and allow it to be split automatically. Massive regions can affect synchronization of containers
and load balancing across a cluster. Smaller regions spread data better across more nodes. Since a container always belongs to exactly one
volume, that container’s replicas all belong to the same volume.

The following are the key advantages to storing table regions in containers:
• Cluster scalability
• High data availability

Cluster scalability
The Container Location Database (CLDB) tracks the information and location of tables (and files) through file system containers. Because this architecture keeps the CLDB small, it becomes practical to store tens of exabytes in a data fabric cluster, regardless of the number of tables and files.

The cluster’s CLDB tracks the location of containers and is updated only when a container is moved, a node fails, or a periodic block change report arrives. Therefore, the update rate, even for very large clusters, is relatively low. The data fabric filesystem does not have to query the CLDB often, so it can cache container locations for very long times.

Moreover, CLDBs are very small in comparison to Apache Hadoop NameNodes. A NameNode tracks metadata and block information for all files, as well as the locations of all blocks in every file. Because blocks are typically 200 MB in size on average, the total number of objects that a NameNode tracks is very large. CLDBs, however, track containers, which are much larger objects, so the size of the location information can be 100x to 1000x smaller than the location information in a NameNode. CLDBs do not track information about individual tables and files.

High availability
Due to the way updates to table regions (also called tablets) are applied and replicated, data in table regions is instantly available. Tables
and table regions are part of abstract entities called containers that provide the automatic replication of table regions (with a default of
three) across the nodes of a cluster.

Containers are replicated to a configurable number of copies. These copies are distributed to different nodes in the same cluster as the
original or primary container. The cluster CLDB determines the order in which the replicas are updated. Together, the replicas form a
replication chain that is updated transactionally. When an update is applied to a region (also called a tablet) in the primary container (which is
at the head of a replication chain), the update is applied serially to the replicas of that container in the chain. The update is complete only
when all replicas in the chain are updated.

As a result of this architecture, when a hardware failure brings down a node, the regions served by that node are available instantly from one
of the other nodes that have the replicated data.

The HPE Ezmeral Data Fabric software can detect the exact point at which replicas diverge, even at a 2 GB per second update rate. The
software randomly picks any one of the three copies as the new primary, rolls back the other surviving replicas to the divergence point, and
then rolls forward to converge with the chosen primary. The HPE Ezmeral Data Fabric software can do this on the fly with very little impact
on normal operations. Since containers are contained in volumes, the automatic replication factor is set at the volume level.

Multi-tenancy
HPE Ezmeral Data Fabric Database tables are created in volumes. When the volume is restricted, so is the table data. Restricting a volume to a subset of a cluster’s nodes allows isolation of sensitive data or applications and the use of heterogeneous hardware in the cluster for specific workloads.

For example, data placement can be used to keep personally identifiable information (PII) on nodes that have encrypted drives, or to keep
HPE Ezmeral Data Fabric Database tables on nodes that have SSDs. Work environments can be isolated for different database users or
applications and HPE Ezmeral Data Fabric Database tables placed on specific hardware for better performance or load isolation.

Isolating work environments for different database users or applications lets policies, quotas, and access privileges be set for specific users and volumes. Multiple jobs with different requirements can run without conflict.

Figure 3 depicts a data fabric cluster storing table and file data. The cluster has three separate volumes mounted at directories /user/eng,
/user/mkt and /project/test. As shown, each directory contains both file data and table data, grouped together logically. Since each directory
maps to a different volume, data in each directory can have a different policy. For example, /user/eng has a disk-usage quota, while
/user/mkt is on a snapshot schedule. Furthermore, two directories, /user/mkt and /project/test are mirrored to locations outside the cluster,
providing read-only access to high-traffic data, including the tables in those volumes.

Figure 3. Data fabric cluster storing table and file data

Snapshots
Since HPE Ezmeral Data Fabric Database tables are created in volumes, a volume snapshot can be used to capture the state of a volume’s
directories, tables, and files, at an exact point in time. Volume snapshots can be used for rollbacks, hot backups, model training, and real-time
data analysis management.

Rollback from errors


Application errors or inadvertent user errors can mistakenly delete data or modify data in an unexpected way. With volume snapshots, the
HPE Ezmeral Data Fabric Database tables can be rolled back to a known, well-defined state.

Hot backups
Backups of table data can be created on the fly for auditing or governance compliance.

Model training
Machine learning frameworks can use snapshots to enable a reproducible and auditable model training process. Snapshots allow the training
process to work against a preserved image of the training data from a precise moment in time. In most cases, the use of snapshots requires
no additional storage and snapshots are taken in less than one second.

Managing real-time data analysis


By using snapshots, query engines such as Apache Drill can produce precise, point-in-time summaries of data sources subject to constant updates, such as sensor data or social media streams. Using a snapshot of the HPE Ezmeral Data Fabric Database data for such analyses allows very precise comparisons across multiple ever-changing data sources without the need to stop real-time data ingestion.

Mirroring
Since HPE Ezmeral Data Fabric Database tables are created in volumes, volume mirroring enables automatic replication of differential data across clusters. Mirroring can run on a designated schedule or as a one-time manual operation without a defined schedule. Consider mirroring volumes to create disaster recovery solutions for databases or to provide read-only access to data from multiple locations.

As HPE Ezmeral Data Fabric Database does not require RegionServers to be reconstructed, databases can be brought up on the mirrored
site if the active site goes down. Mirroring is a parallel operation, copying data directly from the nodes of one data fabric cluster to the nodes
in a remote data fabric cluster. The contents of the volume are mirrored, even if the files in the volume are being written to or deleted.
Data fabric captures only that data which has changed at the file-block level since the last data transfer. After the data differential is
identified, it is then compressed and transferred over the WAN to the recovery site, using very low network bandwidth. Finally, checksums
are used to ensure data integrity across the two clusters. There is no performance penalty on the cluster because of mirroring.

Replication
Automatically replicating differential data across clusters is possible when coupling this feature with volume mirroring. Consider using replication for reliable data protection and uninterrupted access to data, and combining it with mirroring for data recovery.

Data replication can be initiated specifically to provide high availability of data. The process involves copying volume data from one node to another, within and across clusters. Specifically, streams and tables can be replicated through gateways on a record-by-record basis, in real time, within HPE Ezmeral Data Fabric Database.

Data models
HPE Ezmeral Data Fabric Database can be used as both a document database and a column-oriented database. As a document database, it stores JSON documents in HPE Ezmeral Data Fabric Database JSON tables. As a column-oriented database, it stores binary files in HPE Ezmeral Data Fabric Database binary tables.

HPE Ezmeral Data Fabric Database as a document database


HPE Ezmeral Data Fabric Database supports JSON documents as a native datastore. A JSON document is a tree of fields. These JSON
documents are stored in HPE Ezmeral Data Fabric Database tables.
• HPE Ezmeral Data Fabric Database JSON tables use the Open JSON Application Interface (OJAI) data model and support the OJAI API.
• Documents are in JSON format; HPE Ezmeral Data Fabric Database stores them in an efficient binary encoding, rather than plain
ASCII text.
• With JSON tables, each document has a unique key (_id).

JSON table
{“_id … }
{“_id … }
{“_id … }

Figure 4. JSON table format



Fields in the document can be identified by using field paths, for example, address.street:

{
  "_id": "ID001",
  "name": "Bob",
  "address": {
    "house": 123,
    "street": "Main",
    "phones": [
      { "mobile": "555-1234" },
      { "work": "+1-123-456-7890" }
    ]
  },
  "hobbies": ["badminton", "chess", "beaches"]
}

Figure 5. JSON table example
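As a minimal illustration of field paths, the following Java sketch uses the OJAI library to parse the sample document and read nested fields and array elements by path. It assumes only that the OJAI client library is on the classpath; the expected output is shown in comments.

```java
import org.ojai.Document;
import org.ojai.json.Json;

public class FieldPathExample {
    public static void main(String[] args) {
        // Parse the sample document shown in Figure 5.
        Document doc = Json.newDocument(
            "{\"_id\":\"ID001\",\"name\":\"Bob\","
          + "\"address\":{\"house\":123,\"street\":\"Main\","
          + "\"phones\":[{\"mobile\":\"555-1234\"},{\"work\":\"+1-123-456-7890\"}]},"
          + "\"hobbies\":[\"badminton\",\"chess\",\"beaches\"]}");

        // Field paths address nested fields and array elements directly.
        System.out.println(doc.getString("address.street"));            // Main
        System.out.println(doc.getString("address.phones[0].mobile"));  // 555-1234
        System.out.println(doc.getString("hobbies[1]"));                // chess
        System.out.println(doc.getValue("address.house"));              // 123 (as an OJAI Value)
    }
}
```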

With JSON document support, the following can be done:


• Store data that is hierarchical and nested, and that evolves over time.
• Read and write individual document fields, subsets of fields, or whole documents from and to a disk. To update individual fields or subsets
of fields, there is no need to read entire documents, modify them, and then write the modified documents to disk.
• Build applications with the HPE Ezmeral Data Fabric Database JSON API library, which is an implementation of OJAI. This API library makes it easy to manage complex, evolving, hierarchical data: more data types than the standard JSON types can be used, complex queries can be created, and JSON table documents can be accessed without connection or configuration objects. This allows large-scale applications to manage JSON documents. (A minimal sketch follows this list.)
• Filter query results within HPE Ezmeral Data Fabric Database before results are returned to client applications.
• Run client applications on Linux®, OS X, and Windows systems.
• Perform complex data analysis on JSON data with Apache Drill or other analytical tools in real time without having to copy data to
another cluster.
• Scale data to span thousands of nodes.
• Control read and write access to single fields and subsets of fields within a JSON table by using ACEs.
• Manage the disk layout of single fields and subdocuments within JSON tables.
• Use secondary indexes to improve query performance.
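A minimal sketch of this programming model follows, assuming a configured data fabric client and a hypothetical JSON table at /user/eng/customers: obtain an OJAI Connection and DocumentStore, insert a document keyed by _id, and read it back by _id without scanning.

```java
import org.ojai.Document;
import org.ojai.store.Connection;
import org.ojai.store.DocumentStore;
import org.ojai.store.DriverManager;

public class OjaiCrudExample {
    public static void main(String[] args) {
        // "ojai:mapr:" is the standard OJAI connection string for the data fabric driver.
        Connection connection = DriverManager.getConnection("ojai:mapr:");
        // The table path is hypothetical; any existing JSON table path works here.
        DocumentStore store = connection.getStore("/user/eng/customers");

        // Write a whole document, keyed by _id.
        Document doc = connection.newDocument()
                .set("_id", "ID001")
                .set("name", "Bob")
                .set("address.house", 123)
                .set("address.street", "Main");
        store.insertOrReplace(doc);

        // Read a single field of the document back by its _id; no scan is needed.
        Document found = store.findById("ID001");
        System.out.println(found.getString("address.street"));  // Main

        store.close();
        connection.close();
    }
}
```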

OJAI Distributed Query Service


OJAI queries either directly access HPE Ezmeral Data Fabric Database JSON or leverage the OJAI Distributed Query Service. The latter
provides distributed query support for HPE Ezmeral Data Fabric Database JSON, powered by Apache Drill. The data fabric client
automatically determines whether OJAI queries benefit from using the OJAI Distributed Query Service when the service is available. This
section describes the architecture, including the code paths and components involved. It also discusses queries that originate from Drill SQL,
which leverage the full functionality of Drill.

Figure 6 summarizes the different code paths and the components involved for processing HPE Ezmeral Data Fabric Database JSON
queries.

The figure shows BI/SQL tools and OJAI applications connecting through the data fabric client. Analytical SQL queries are handled by Drill; OJAI queries that require distributed query support are handled by the OJAI Distributed Query Service (powered by Apache Drill); and simple OJAI queries go directly to HPE Ezmeral Data Fabric Database JSON, which stores the JSON documents and indexes.

Figure 6. OJAI Distributed Query Service

Data fabric automatically chooses the code path to use.


Secondary indexes
HPE Ezmeral Data Fabric Database JSON natively supports secondary indexes on fields in JSON tables. Indexes provide flexible,
high-performance access to data stored in the database.

A secondary index (also referred to as an index) is a special table that stores a subset of document fields from a JSON table. The index
orders its data on a set of fields, defined as the indexed fields. This contrasts with the JSON table that orders its data on the table primary
key (rowId or rowKey). With administrator privileges, one or more indexes on each JSON table can be created. After the indexes are created,
applications can leverage them to accelerate query response times. Secondary indexes can also contain additional fields known as included
fields (or sometimes covered fields) beyond those being indexed, so that many queries can be satisfied with a single read.

These indexes provide efficient access to a wider range of queries on data in HPE Ezmeral Data Fabric Database. They allow applications to efficiently query data through fields other than the primary key. This capability enables HPE Ezmeral Data Fabric Database to support a broader set of use cases. Applications that benefit include rich, interactive business applications and user-facing analytic applications. Secondary indexes also enable business intelligence (BI) tools and ad hoc queries on operational data sets. Secondary indexes can be created only on HPE Ezmeral Data Fabric Database JSON tables.
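Index creation itself is an administrative operation and is not shown here. As a hedged sketch, assume a secondary index exists on address.street with name as an included (covered) field on the hypothetical table /user/eng/customers; the query below is ordinary OJAI code on a non-primary-key field, and the data fabric transparently decides whether the index can satisfy both the condition and the projection.

```java
import org.ojai.Document;
import org.ojai.DocumentStream;
import org.ojai.store.Connection;
import org.ojai.store.DocumentStore;
import org.ojai.store.DriverManager;
import org.ojai.store.Query;
import org.ojai.store.QueryCondition;

public class IndexedQueryExample {
    public static void main(String[] args) {
        Connection connection = DriverManager.getConnection("ojai:mapr:");
        DocumentStore store = connection.getStore("/user/eng/customers");

        // Filter on a non-primary-key field and project a small set of fields.
        QueryCondition condition = connection.newCondition()
                .is("address.street", QueryCondition.Op.EQUAL, "Main")
                .build();
        Query query = connection.newQuery()
                .select("_id", "name", "address.street")
                .where(condition)
                .build();

        // If a suitable secondary index exists, it can serve this query without
        // touching the primary table; the application code does not change either way.
        try (DocumentStream results = store.find(query)) {
            for (Document d : results) {
                System.out.println(d.asJsonString());
            }
        }

        store.close();
        connection.close();
    }
}
```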

HPE Ezmeral Data Fabric Database as a column-oriented database


HPE Ezmeral Data Fabric Database supports column-oriented databases as a native datastore. These database tables are conceptually
identical to Apache HBase tables.

As a column-oriented database, HPE Ezmeral Data Fabric Database stores data in binary format and provides a native implementation of the Apache HBase API. HBase applications can use HPE Ezmeral Data Fabric Database tables without modifying any code. HPE Ezmeral Data Fabric Database binary tables also:
• Use the HBase data model
• Allow large-scale applications to manage columnar data
• Support binary compatibility with applications using standard HBase APIs
• Identify data elements by row keys, column families, and columns within each row
Data is stored as a collection of key-value pairs, where the key serves as a unique identifier. Typically, tables of the same type (in this case, binary) are created in their own volume.

Binary table
Key 001 Column A Column B Column C
Key 002 Column A Column B Column C
Key 003 Column A Column B Column C

Figure 7. Column-oriented database

HPE Ezmeral Data Fabric Database stores data as a nested series of maps. Each map consists of a set of key-value pairs, where the value
can be the key in another map. Keys are kept in strict lexicographical order: 1, 10, and 113 come before 2, 20, and 213.

In descending order of granularity, the elements of a binary table are:


• Key—Identifies the rows in a table. In HPE Ezmeral Data Fabric Database, the maximum supported size of a row key is 64 KB. However, the recommended practice is to keep it lower than a few hundred bytes.
• Row—Spans one or more column families and columns. In HPE Ezmeral Data Fabric Database, the maximum supported size of a row is 2 GB. However, the recommended practice is to keep the size under 2 MB. In general, the database performs better with many small rows rather than with fewer large rows.
• Column family—A key associated with a set of columns. The user can specify this association according to the individual use case, creating sets of columns. A column family can contain an arbitrary number of columns. HPE Ezmeral Data Fabric Database binary tables support up to 64 column families.
• Column—A key associated with a series of time stamps that define when the value in that column was updated.
• Time stamp—Specifies when the data was written to a column.
• Value—The data written to that column at the specific time stamp.

This structure results in values with versions that can be accessed flexibly and quickly. Since HPE Ezmeral Data Fabric Database binary
tables are sparse, any of the column values for a given key can be null.

Column families in binary tables


Scanning an entire table for matches can be very performance intensive. Column families enable grouping of related sets of data and restrict
queries to a defined subset, leading to better performance. When a column family is designed, consider what kinds of queries are going to be
used the most often and group the columns accordingly.

Compression settings for individual column families can be specified, so that the settings that prioritize speed of access or efficient use of
disk space can be chosen, according to needs.

Be aware of the approximate number of rows in the column families. This property is called the column family’s cardinality. When column families in the same table have very disparate cardinalities, the sparser column family’s data can be spread out across multiple nodes, because the denser column family requires more splits. Scans on the sparser column family can take longer due to this effect.

For example, consider a table that lists products across a small range of model numbers, but with a row for the unique serial number of each individual product manufactured within a given model. Such a table has a very large difference in cardinality between a column family that relates to the model number and a column family that relates to the serial number. Scans on the model number column family have to range across the cluster, since the frequent splits required by the comparatively large number of serial number rows spread the model number rows out across many regions on many nodes.

Row key (Customer Id)   Customer: Name   Customer: City      Sales: Product   Sales: Amount
101                     Jon White        Los Angeles, CA     Chairs           $400.00
102                     Jane Brown       Atlanta, GA         Lamps            $200.00
103                     Bill Green       Pittsburgh, PA      Desk             $500.00
104                     Jack Black       St. Louis, MO       Bed              $1600.00

Figure 8. Column-oriented database showing column families (the Customer and Sales column families group the Name/City and Product/Amount columns)
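As a hedged illustration of how these elements are addressed programmatically, the following Java sketch uses the standard HBase client API, which HPE Ezmeral Data Fabric Database binary tables support, to write and read the first row of the table above. The path-style table name /user/mkt/sales and the column names are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class BinaryTableExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // With HPE Ezmeral Data Fabric Database, the table name is a filesystem path.
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("/user/mkt/sales"))) {

            // Row key 101; one column in each of the two column families.
            Put put = new Put(Bytes.toBytes("101"));
            put.addColumn(Bytes.toBytes("customer"), Bytes.toBytes("name"), Bytes.toBytes("Jon White"));
            put.addColumn(Bytes.toBytes("sales"), Bytes.toBytes("product"), Bytes.toBytes("Chairs"));
            table.put(put);

            // Read back the latest version of customer:name for row 101.
            Result result = table.get(new Get(Bytes.toBytes("101")));
            byte[] name = result.getValue(Bytes.toBytes("customer"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));  // Jon White
        }
    }
}
```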



Example table
This example uses JSON notation for representational clarity. In this example, time stamps are arbitrarily assigned.

Queries return the value with the most recent time stamp by default. For example, a query for the value in "arbitrarySecondKey"/"secondColumnFamily:firstColumn" returns valueThree. Specifying a time stamp with the query "arbitrarySecondKey"/"secondColumnFamily:firstColumn"/11 returns valueSeven.

Figure 9. Column-oriented database with column families represented in JSON Notation
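The two reads described above could be expressed with the HBase API roughly as follows; this is a hedged sketch, with the row key, column family, and column names taken from the example and the table path hypothetical. setTimeStamp is the HBase 1.x-style method name (newer clients spell it setTimestamp).

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionedReadExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("/user/eng/exampleTable"))) {

            byte[] row = Bytes.toBytes("arbitrarySecondKey");
            byte[] family = Bytes.toBytes("secondColumnFamily");
            byte[] qualifier = Bytes.toBytes("firstColumn");

            // Default read: returns the value with the most recent time stamp (valueThree in the example).
            byte[] latest = table.get(new Get(row).addColumn(family, qualifier)).getValue(family, qualifier);
            System.out.println(Bytes.toString(latest));

            // Pinned read: returns the value written at time stamp 11 (valueSeven in the example).
            Get at11 = new Get(row).addColumn(family, qualifier);
            at11.setTimeStamp(11L);
            byte[] pinned = table.get(at11).getValue(family, qualifier);
            System.out.println(Bytes.toString(pinned));
        }
    }
}
```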

Table replication
Data in one table can be replicated to another table in the same cluster or in a separate cluster. This type of replication is in addition to the automatic replication that occurs with table regions within a volume. Changes such as puts and deletes can be replicated for entire tables, specific column families, or specific columns.

Data fabric binary tables can be replicated only to binary tables, and JSON tables can be replicated only to JSON tables.
• Tables from which data is replicated are called source tables. Tables to which the data is replicated are called replicas.
• Clusters from which data is replicated are called source clusters. Clusters to which data is replicated are called destination clusters.
A single cluster can be both a source cluster and a destination cluster, depending on the replication configuration in which the cluster
participates.
• Replication takes place between source and destination clusters. However, source clusters do not send data to nodes in the destination
cluster directly. The replication stream (the data being pushed to the replicas) is consumed by one or more data fabric gateways in the
destination cluster. The gateways receive the updates from the source cluster, batch them, and apply them to the replica tables. Multiple
gateways serve the purpose of both load balancing and failover.

The figure shows a source cluster replicating a customer table through one or more gateways to a replica customer table in the destination cluster.

Figure 10. Table replication

Modes of replication
Table data can be replicated in one of two replication modes. The mode is specified per source-replica pair.
• Asynchronous replication
In this replication mode, HPE Ezmeral Data Fabric Database confirms to client applications that operations are complete after the
operations are performed on source tables. Updates are replicated in the background. Therefore, the latency of updates from client
applications is not affected by the time required for the network round trip between the source cluster and the destination cluster.
This type of replication is well-suited for clusters that are geographically separated in wide-area networks.
HPE Ezmeral Data Fabric Database can throttle the replication stream to minimize the impact of the replication process on incoming
operations during periods of heavy load. Throttling distributes disk reads and CPU usage more evenly over time, so that incoming operations
on a source table can be completed faster. Throttling is disabled by default. Asynchronous replication is the default replication mode.
• Synchronous replication
In this replication mode, HPE Ezmeral Data Fabric Database confirms to client applications that changes have been applied to a source
table only when these two conditions are true:
– The change was sent to all the container copies in the local cluster.
– The change was sent to a gateway in the destination cluster. This operation takes place only after the first. Puts are not sent to
gateways until after they are sent to all container copies in the cluster where the source table is located.

If a gateway fails, the source detects this and resends operations to the gateway when it is restarted, or a new gateway is brought online. Due
to the confirmations that HPE Ezmeral Data Fabric Database receives on source clusters, synchronous replication is especially well-suited for
creating a backup of data for disaster recovery.

When the latency of a replication stream is high, HPE Ezmeral Data Fabric Database switches to asynchronous replication temporarily so
that client applications are not blocked indefinitely. After the latency is sufficiently reduced, it switches back to synchronous replication. The
same switching occurs when a gateway fails and HPE Ezmeral Data Fabric Database does not resume synchronous replication until a new
gateway is established or the failed gateway is restarted.

Supported replication topologies


There are two basic topologies that can be used for replication scenarios: primary-secondary replication, from which several more complicated topologies can be constructed, and multi-primary replication.
Primary-secondary replication
In this topology, replication is one way from source tables to replicas. The replicas can be in a remote cluster or in the cluster where the
source tables are located. Several topologies are possible for primary-secondary replication:
• Replication from one source table to one or more replica tables:
In this topology, updates on a source table are replicated to one or more replicas, but updates to the replicas are not replicated back to
the source table.
• Many-to-one replication
Multiple source tables can replicate to a single replica.

• One-to-many replication
A single source table can replicate to multiple replicas.
• Replication loops
When three or more tables need to be kept in sync, primary-secondary replication between pairs of them can be set up to form a
replication loop. Operations on a table are propagated to the other clusters in the loop, but there is no attempt to reapply the operations
at the originating table. This is because the operations are tagged with a universally unique identifier (UUID) that identifies the table
where the operations originated.
• Primary-secondary replication in two directions
Primary-secondary replication configurations can be combined to replicate data between clusters. Two clusters engaged in replication can
each act as a source cluster and a destination cluster.

Multi-primary replication
In this replication topology, there are two primary-secondary relationships, with each table playing both the primary and secondary roles.
Client applications update both tables, and each table replicates updates to the other.

All updates from a source table arrive at a replica after having been authenticated at a gateway. Therefore, ACEs on the replica that control
permissions for updates to column families and columns are irrelevant; gateways have the implicit authority to update replicas. If one of the
tables goes offline, client applications can be directed to the other table. When the offline table comes back online, replication between the
two tables resumes automatically. When both tables are in sync again, client applications can be redirected back to the original table.

HPE Ezmeral Data Fabric Database and Hive Integration


Hive
Apache Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad hoc queries, and the analysis of large data
sets stored in Hadoop-compatible file systems, such as HPE Ezmeral Data Fabric. Hive provides a mechanism to project structure onto this
data and queries the data using a SQL-like language called HiveQL. At the same time, this language also allows traditional map/reduce
programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

HPE Ezmeral Data Fabric Database binary tables can be created from Hive and accessed by both Hive and HBase applications. Hive queries can be run on HPE Ezmeral Data Fabric Database binary tables, existing binary tables can be converted into Hive-backed HPE Ezmeral Data Fabric Database tables, and Hive queries can be run on those tables as well. Utilizing Hive allows for the widest standard SQL access against the NoSQL tables stored in HPE Ezmeral Data Fabric Database.

HPE EEP
An HPE Ezmeral Ecosystem Pack (EEP) provides a set of ecosystem components that work together on one or more data fabric cluster
versions. Each HPE EEP contains only one version of an ecosystem component. Hadoop ecosystem components within an HPE EEP
undergo extensive interoperability testing to validate that the components can work together. Hive is included in HPE EEP.

Database security
HPE Ezmeral Data Fabric Database has column family, column, and field-level ACEs, as well as policy-based security, which allows security policies to be created that control access to information. ACEs and security policies provide an all-or-nothing approach: either the data for the column or field is returned or it is not.

Dynamic data masking for JSON document database


Dynamic data masking (DDM) is a feature that allows masking of sensitive information when retrieving data. It is the ability to apply a variety
of data masks in real-time on customer-designated fields in database queries, to hide sensitive data depending on who is accessing the data.
This feature is suitable for PII or General Data Protection Regulation (GDPR) use cases.

As a typical example, consider the credit card industry. The application that prints receipts for credit card purchases does not need the full
credit card numbers but only needs the last four digits of the credit card number to identify the credit card being used. However, in the same
organization, the full credit card number should be available for payment processing. With ACEs and policies, it is possible to either get the
credit card number or not. ACEs and/or policies cannot be used to return only the last four digits of the credit card number.

DDM offers the solution. It’s easy to use and backward compatible with existing applications. DDM applies the masking rules to query results,
with no modifications required to existing queries. Its disadvantage is that it is not a fully secure solution for sensitive fields; it does not
prevent users from connecting to the database and running exhaustive queries that expose pieces of sensitive data. Therefore, DDM can be
viewed as a complementary solution to other database security features, such as auditing, encryption, and row/column-level security. The
maximum number of supported DDMs is 128. There are eight predefined DDMs supported on the JSON database.

Predefined mask types


Any predefined data mask can be applied to arrays whose elements are of the data types allowed for that mask. Each value inside the array is masked individually with the mask assigned to the field; values of an unsupported type, or documents nested inside the array, are not masked.

The following table describes the predefined masks that DDM supports:

• mrddm_redact: Masks all alphabetic characters with “x” and all numeric characters with “0” for strings. For other data types, the mask replaces all values with whatever is equivalent to 0 for that data type. Supported data types: Binary, Boolean, Byte, Int, Long, Short, String, Float, Double, Time, Time stamp, Date, Array.
• mrddm_last4: Displays only the last four characters and replaces everything else with *. This can be used in a wide number of applications, including credit card numbers, passport information, and social security numbers. Supported data types: String, Array.
• mrddm_first4: Displays only the first four characters. This is similar to the last-four data mask, but shows the first characters instead. Supported data types: String, Array.
• mrddm_first6last4: Displays only the first six characters and the last four characters. This is similar to the last-four data mask format. Supported data types: String, Array.
• mrddm_email: Displays the first two characters and the last two characters of the user name, the first character of the domain, and the whole top-level domain. For instance, example@hpe.com is masked to ex***le@h**.com. Supported data types: String (in email format), Array.
• mrddm_hash: Displays the hash of the data. This is useful for verifying whether two cells match, but it does not show the pattern or the length of the data. Supported data types: String, Array.
• mrddm_date: Displays a generic date for all date fields but shows the correct year. This mask sets all months and days of the month to the value one. Supported data types: Time stamp, Date.
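To make the masking behavior concrete, here is an illustrative Java sketch that approximates the documented behavior of mrddm_redact and mrddm_last4 for String values. This is not the product implementation; in HPE Ezmeral Data Fabric Database the masks are applied server-side to query results once a mask is associated with a field.

```java
public final class MaskSketch {

    // mrddm_redact: alphabetic characters become 'x', numeric characters become '0'.
    static String redact(String value) {
        StringBuilder out = new StringBuilder(value.length());
        for (char c : value.toCharArray()) {
            if (Character.isLetter(c)) {
                out.append('x');
            } else if (Character.isDigit(c)) {
                out.append('0');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // mrddm_last4: keep the last four characters, replace everything else with '*'.
    static String last4(String value) {
        int keepFrom = Math.max(0, value.length() - 4);
        StringBuilder out = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            out.append(i < keepFrom ? '*' : value.charAt(i));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(redact("Card 4111-1111"));    // xxxx 0000-0000
        System.out.println(last4("4111111111111111"));   // ************1111
    }
}
```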

Create a table
Both binary and JSON tables can be created using HPE Ezmeral Data Fabric Management Control System (MCS). Figure 11 illustrates the
options available when creating a new table with the MCS UI.

Figure 11. Create a table using the UI
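Binary tables can also be created programmatically through the HBase Admin API, which HPE Ezmeral Data Fabric Database binary tables support. The hedged sketch below uses the classic HBase 1.x-style descriptors; the path-style table name and the column family names are hypothetical. (JSON tables are typically created through MCS or data fabric command-line tooling, not shown here.)

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CreateBinaryTableExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // The path-style table name is resolved inside a data fabric volume.
            HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("/user/eng/sales"));
            desc.addFamily(new HColumnDescriptor("customer"));
            desc.addFamily(new HColumnDescriptor("sales"));
            admin.createTable(desc);
        }
    }
}
```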



Conclusion
HPE Ezmeral Data Fabric Database is an enterprise-grade, high-performance NoSQL database management system that supports both binary and JSON formats. It is fully integrated into the HPE Ezmeral Data Fabric platform, along with File Store, Objects, and Streams, and provides a unified layer that fosters enterprise-wide data observability and lifecycle management across traditional and cloud-native apps without refactoring data or applications.

Learn more at
hpe.com/datafabric

© Copyright 2022 Hewlett Packard Enterprise Development LP. The information contained herein is subject to change without
notice. The only warranties for Hewlett Packard Enterprise products and services are set forth in the express warranty statements
accompanying such products and services. Nothing herein should be construed as constituting an additional warranty.
Hewlett Packard Enterprise shall not be liable for technical or editorial errors or omissions contained herein.

This document contains confidential and/or legally privileged information. It is intended for Hewlett Packard Enterprise Internal
Use only. If you are not an intended recipient as identified on the front cover of this document, you are strictly prohibited from
reviewing, redistributing, disseminating, or in any other way using or relying on the contents of this document.

Linux is the registered trademark of Linus Torvalds in the U.S. and other countries. Windows is either a registered trademark or
trademark of Microsoft Corporation in the United States and/or other countries. All third-party marks are property of their
respective owners.

a00125063ENW, Rev. 1
