Cloudera Administration
Important Notice
© 2010-2021 Cloudera, Inc. All rights reserved.
Hadoop and the Hadoop elephant logo are trademarks of the Apache Software
Foundation. All other trademarks, registered trademarks, product names and company
names or logos mentioned in this document are the property of their respective owners.
Reference to any products, services, processes or other information, by trade name,
trademark, manufacturer, supplier or otherwise does not constitute or imply
endorsement, sponsorship or recommendation thereof by us.
Complying with all applicable copyright laws is the responsibility of the user. Without
limiting the rights under copyright, no part of this document may be reproduced, stored
in or introduced into a retrieval system, or transmitted in any form or by any means
(electronic, mechanical, photocopying, recording, or otherwise), or for any purpose,
without the express written permission of Cloudera.
The information in this document is subject to change without notice. Cloudera shall
not be liable for any damages resulting from technical errors or omissions which may
be present in this document, or from use of this document.
Cloudera, Inc.
395 Page Mill Road
Palo Alto, CA 94306
[email protected]
US: 1-888-789-1488
Intl: 1-650-362-0488
www.cloudera.com
Release Information
Cloudera Manager.................................................................................................15
Learn about Cloudera Manager ........................................................................................................................17
Installing and Upgrading....................................................................................................................................17
Managing CDH using Cloudera Manager...........................................................................................................17
Monitoring CDH using Cloudera Manager.........................................................................................................17
Managing CDH using the Cloudera Manager API...............................................................................................17
Cloudera Manager Admin Console....................................................................................................................18
Starting and Logging into the Admin Console......................................................................................................................20
Cloudera Manager Admin Console Home Page...................................................................................................................20
Displaying Cloudera Manager Documentation....................................................................................................................25
Automatic Logout................................................................................................................................................................25
Cloudera Manager Frequently Asked Questions................................................................................................26
General Questions................................................................................................................................................................26
Cloudera Manager API.......................................................................................................................................28
Backing Up and Restoring the Cloudera Manager Configuration .......................................................................................30
Using the Cloudera Manager API for Cluster Automation...................................................................................................31
Cloudera Manager Administration.....................................................................................................................33
Starting, Stopping, and Restarting the Cloudera Manager Server.......................................................................................33
Configuring Cloudera Manager Server Ports.......................................................................................................................33
Moving the Cloudera Manager Server to a New Host.........................................................................................................34
Migrating from the Cloudera Manager Embedded PostgreSQL Database Server to an External PostgreSQL Database...........35
Migrating from the Cloudera Manager External PostgreSQL Database Server to a MySQL/Oracle Database Server........41
Managing the Cloudera Manager Server Log......................................................................................................................44
Cloudera Manager Agents...................................................................................................................................................44
Configuring Network Settings..............................................................................................................................................50
Managing Licenses..............................................................................................................................................................50
Sending Usage and Diagnostic Data to Cloudera................................................................................................................57
Exporting and Importing Cloudera Manager Configuration................................................................................................61
Backing Up Cloudera Manager............................................................................................................................................61
Other Cloudera Manager Tasks and Settings.......................................................................................................................66
Cloudera Management Service............................................................................................................................................67
Extending Cloudera Manager.............................................................................................................................72
Cluster Configuration Overview..............................................................................73
Modifying Configuration Properties Using Cloudera Manager..........................................................................74
Changing the Configuration of a Service or Role Instance...................................................................................................74
Restarting Services and Instances after Configuration Changes..........................................................................................78
Suppressing Configuration and Parameter Validation Warnings........................................................................................78
Autoconfiguration..............................................................................................................................................79
Autoconfiguration................................................................................................................................................................80
Role-Host Placement............................................................................................................................................................87
Custom Configuration........................................................................................................................................88
Stale Configurations...........................................................................................................................................91
Client Configuration Files...................................................................................................................................93
How Client Configurations are Deployed.............................................................................................................................93
Downloading Client Configuration Files...............................................................................................................................94
Manually Redeploying Client Configuration Files................................................................................................................94
Viewing and Reverting Configuration Changes..................................................................................................94
Viewing Configuration Changes...........................................................................................................................................94
Reverting Configuration Changes........................................................................................................................................95
Exporting and Importing Cloudera Manager Configuration...............................................................................95
Cloudera Manager Configuration Properties Reference....................................................................................96
Managing Clusters..................................................................................................97
Adding and Deleting Clusters.............................................................................................................................97
Adding a Cluster Using New Hosts.......................................................................................................................................97
Adding a Cluster Using Currently Managed Hosts.............................................................................................................101
Deleting a Cluster...............................................................................................................................................................104
Starting, Stopping, Refreshing, and Restarting a Cluster..................................................................................104
Pausing a Cluster in AWS..................................................................................................................................106
Shutting Down and Starting Up the Cluster.......................................................................................................................106
Considerations after Restart..............................................................................................................................................107
Renaming a Cluster..........................................................................................................................................108
Cluster-Wide Configuration..............................................................................................................................108
Virtual Private Clusters and Cloudera SDX.......................................................................................................108
Overview............................................................................................................................................................................109
Advantages of Separating Compute and Data Resources..................................................................................................109
Architecture.......................................................................................................................................................................109
Performance Trade Offs.....................................................................................................................................................111
Using Virtual Private Clusters in Your Applications............................................................................................................111
Adding a Compute Cluster and Data Context....................................................................................................................111
Improvements for Virtual Private Clusters in CDH 6.3.......................................................................................................112
Compatibility Considerations for Virtual Private Clusters .................................................................................................113
Tutorial: Using Impala, Hive and Hue with Virtual Private Clusters...................................................................................116
Networking Considerations for Virtual Private Clusters.....................................................................................................127
Managing Services...........................................................................................................................................132
Managing the HBase Service.............................................................................................................................................132
Managing HDFS.................................................................................................................................................................132
Managing Apache Hive in CDH..........................................................................................................................................170
Managing Hue...................................................................................................................................................................170
Managing Impala..............................................................................................................................................................174
Managing Key-Value Store Indexer....................................................................................................................................185
Managing Kudu.................................................................................................................................................................186
Managing Solr...................................................................................................................................................................187
Managing Spark.................................................................................................................................................................192
Managing the Sqoop 1 Client.............................................................................................................................................193
Managing YARN (MRv2) and MapReduce (MRv1).............................................................................................................196
Managing ZooKeeper.........................................................................................................................................................218
Configuring Services to Use the GPL Extras Parcel.............................................................................................................221
Managing Hosts...................................................................................................223
The Status Tab..................................................................................................................................................223
The Configuration Tab......................................................................................................................................224
The Roles and Disks Overview Tabs..................................................................................................................224
The Templates Tab............................................................................................................................................224
The Parcels Tab.................................................................................................................................................224
Viewing Host Details........................................................................................................................................224
Status.................................................................................................................................................................................225
Processes...........................................................................................................................................................................226
Resources...........................................................................................................................................................................226
Commands.........................................................................................................................................................................227
Configuration.....................................................................................................................................................................227
Components.......................................................................................................................................................................227
Audits.................................................................................................................................................................................227
Charts Library.....................................................................................................................................................................227
Using the Host Inspector..................................................................................................................................228
Running the Host Inspector................................................................................................................................................228
Viewing Past Host Inspector Results..................................................................................................................................228
Adding a Host to the Cluster............................................................................................................................228
Using the Add Hosts Wizard to Add Hosts.........................................................................................................................229
Adding a Host by Installing the Packages Using Your Own Method..................................................................................235
Specifying Racks for Hosts................................................................................................................................235
Host Templates.................................................................................................................................................236
Creating a Host Template..................................................................................................................................................236
Editing a Host Template.....................................................................................................................................................236
Applying a Host Template to a Host...................................................................................................................................237
Performing Maintenance on a Cluster Host.....................................................................................................237
Decommissioning Hosts.....................................................................................................................................................237
Recommissioning Hosts.....................................................................................................................................................239
Stopping All the Roles on a Host........................................................................................................................................239
Starting All the Roles on a Host..........................................................................................................................................239
Tuning and Troubleshooting Host Decommissioning.........................................................................................................239
Maintenance Mode............................................................................................................................................................242
Changing Hostnames........................................................................................................................................245
Deleting Hosts..................................................................................................................................................247
Moving a Host Between Clusters.....................................................................................................................248
Managing Services...............................................................................................249
Adding a Service...............................................................................................................................................249
Comparing Configurations for a Service Between Clusters..............................................................................250
Add-on Services................................................................................................................................................251
Custom Service Descriptor Files.........................................................................................................................................251
Installing an Add-on Service...............................................................................................................................................251
Adding an Add-on Service..................................................................................................................................................253
Uninstalling an Add-on Service..........................................................................................................................................253
Starting, Stopping, and Restarting Services.....................................................................................................253
Starting and Stopping Services..........................................................................................................................................253
Restarting a Service...........................................................................................................................................................254
Rolling Restart..................................................................................................................................................254
Aborting a Pending Command.........................................................................................................................257
Deleting Services..............................................................................................................................................257
Renaming a Service..........................................................................................................................................258
Configuring Maximum File Descriptors............................................................................................................258
Exposing Hadoop Metrics to Graphite.............................................................................................................258
Configure Hadoop Metrics for Graphite Using Cloudera Manager....................................................................................259
Graphite Configuration Settings Per Daemon....................................................................................................................260
Exposing Hadoop Metrics to Ganglia...............................................................................................................261
Configure Hadoop Metrics for Ganglia Using Cloudera Manager.....................................................................................261
Ganglia Configuration Settings Per Daemon.....................................................................................................................263
Managing Roles....................................................................................................265
Role Instances..................................................................................................................................................265
Role Groups......................................................................................................................................................268
Creating a Role Group........................................................................................................................................................268
Managing Role Groups......................................................................................................................................................269
Performance Management...................................................................................401
Optimizing Performance in CDH.......................................................................................................................401
Choosing and Configuring Data Compression..................................................................................................405
Configuring Data Compression..........................................................................................................................................406
Tuning the Solr Server......................................................................................................................................406
Setting Java System Properties for Solr..............................................................................................................................406
Tuning to Complete During Setup......................................................................................................................................406
General Tuning...................................................................................................................................................................407
Other Resources.................................................................................................................................................................413
Tuning Apache Spark Applications...................................................................................................................413
Tuning Spark Shuffle Operations........................................................................................................................................413
Reducing the Size of Data Structures.................................................................................................................................419
Choosing Data Formats.....................................................................................................................................................420
Tuning YARN.....................................................................................................................................................420
Overview............................................................................................................................................................................420
Cluster Configuration.........................................................................................................................................................424
YARN Configuration...........................................................................................................................................................425
MapReduce Configuration.................................................................................................................................................426
Step 7: MapReduce Configuration.....................................................................................................................................426
Step 7A: MapReduce Sanity Checking................................................................................................................................427
Continuous Scheduling.......................................................................................................................................................427
Configuring Your Cluster In Cloudera Manager.................................................................................................................428
Tuning JVM Garbage Collection.......................................................................................................................428
Resource Management........................................................................................431
Cloudera Manager Resource Management.....................................................................................................431
Static Service Pools...........................................................................................................................................432
Linux Control Groups (cgroups)..........................................................................................................................................433
Dynamic Resource Pools..................................................................................................................................437
Managing Dynamic Resource Pools...................................................................................................................................437
YARN Pool Status and Configuration Options....................................................................................................................442
Defining Configuration Sets...............................................................................................................................................444
Scheduling Configuration Sets...........................................................................................................................................445
Assigning Applications and Queries to Resource Pools......................................................................................................446
YARN (MRv2) and MapReduce (MRv1) Schedulers..........................................................................................449
Configuring the Fair Scheduler...........................................................................................................................................450
Enabling and Disabling Fair Scheduler Preemption...........................................................................................................453
Data Storage for Monitoring Data....................................................................................................................454
Configuring Service Monitor Data Storage........................................................................................................................454
Configuring Host Monitor Data Storage............................................................................................................................454
Viewing Host and Service Monitor Data Storage...............................................................................................................454
Data Granularity and Time-Series Metric Data..................................................................................................................455
Moving Monitoring Data on an Active Cluster...................................................................................................................455
Host Monitor and Service Monitor Memory Configuration...............................................................................................455
Disabling Metric rollup ......................................................................................................................................................457
Cluster Utilization Reports...............................................................................................................................457
Configuring the Cluster Utilization Report.........................................................................................................................458
Using the Cluster Utilization Report to Manage Resources...............................................................................................460
Downloading Cluster Utilization Reports Using the Cloudera Manager API......................................................................466
Creating a Custom Cluster Utilization Report....................................................................................................................466
High Availability...................................................................................................478
HDFS High Availability......................................................................................................................................478
Introduction to HDFS High Availability...............................................................................................................................479
Configuring Hardware for HDFS HA...................................................................................................................................480
Enabling HDFS HA..............................................................................................................................................................481
Disabling and Redeploying HDFS HA..................................................................................................................................485
Configuring Other CDH Components to Use HDFS HA.......................................................................................................485
Administering an HDFS High Availability Cluster...............................................................................................................486
Changing a Nameservice Name for Highly Available HDFS Using Cloudera Manager......................................................487
MapReduce (MRv1) and YARN (MRv2) High Availability..................................................................................488
YARN (MRv2) ResourceManager High Availability.............................................................................................................488
Work Preserving Recovery for YARN Components.............................................................................................................490
MapReduce (MRv1) JobTracker High Availability..............................................................................................................492
Cloudera Navigator Key Trustee Server High Availability.................................................................................493
Configuring Key Trustee Server High Availability Using Cloudera Manager......................................................................493
Recovering a Key Trustee Server........................................................................................................................................494
Enabling Key Trustee KMS High Availability.....................................................................................................494
Enabling Navigator HSM KMS High Availability................................................................................................496
HSM KMS High Availability Backup and Recovery.............................................................................................................496
High Availability for Other CDH Components...................................................................................................497
HBase High Availability......................................................................................................................................................497
Oozie High Availability.......................................................................................................................................................502
Search High Availability.....................................................................................................................................................503
Navigator Data Management in a High Availability Environment....................................................................505
Configuring Cloudera Manager for High Availability With a Load Balancer.....................................................506
Introduction to Cloudera Manager Deployment Architecture...........................................................................................507
Prerequisites for Setting up Cloudera Manager High Availability......................................................................................508
Cloudera Manager Failover Protection..............................................................................................................................509
High-Level Steps to Configure Cloudera Manager High Availability .................................................................................510
Database High Availability Configuration..........................................................................................................................536
TLS and Kerberos Configuration for Cloudera Manager High Availability.........................................................................536
Backing Up Databases..........................................................................................652
Backing Up PostgreSQL Databases...................................................................................................................652
Backing Up MariaDB Databases.......................................................................................................................653
Backing Up MySQL Databases..........................................................................................................................653
Backing Up Oracle Databases...........................................................................................................................653
Database Vendor Resources.............................................................................................................................653
Cloudera Manager
Cloudera Manager is an end-to-end application for managing CDH clusters. Cloudera Manager provides granular visibility
into and control over every part of the CDH cluster—empowering operators to improve performance, enhance quality
of service, increase compliance, and reduce administrative costs. With Cloudera Manager, you can easily deploy and
centrally operate the complete CDH stack and other managed services. The application automates the installation
process, reducing deployment time from weeks to minutes; gives you a cluster-wide, real-time view of hosts and
services running; provides a single, central console to enact configuration changes across your cluster; and incorporates
a full range of reporting and diagnostic tools to help you optimize performance and utilization. Cloudera Manager also
provides an API you can use to automate cluster operations.
The following topics from the core Cloudera Enterprise documentation library can help you understand Cloudera
Manager concepts and how to use Cloudera Manager to manage, monitor, and upgrade your deployment. They are
listed by broad category:
• Quick Start
• Cloudera Manager API tutorial
• Cloudera Manager REST API documentation
• Python Client (deprecated)
• Python Client (Swagger-based)
• Java Client (Swagger-based)
• Java SDK Reference
• Using the Cloudera Manager API for Cluster Automation on page 31
Cloudera Manager Admin Console
The Cloudera Manager Admin Console top navigation bar provides the following tabs and menus:
• Clusters > cluster_name
– Services - Display individual services, and the Cloudera Management Service. In these pages you can:
– View the status and other details of a service instance or the role instances associated with the service
– Make configuration changes to a service instance, a role, or a specific role instance
– Add and delete a service or role
– Stop, start, or restart a service or role.
– View the commands that have been run for a service or a role
– View an audit event history
– Deploy and download client configurations
– Decommission and recommission role instances
– Enter or exit maintenance mode
– Perform actions unique to a specific type of service. For example:
– Enable HDFS high availability or NameNode federation
– Run the HDFS Balancer
– Create HBase, Hive, and Sqoop directories
– Cloudera Manager Management Service - Manage and monitor the Cloudera Manager Management Service.
This includes the following roles: Activity Monitor, Alert Publisher, Event Server, Host Monitor, Navigator
Audit Server, Navigator Metadata Server, Reports Manager, and Service Monitor.
– Cloudera Navigator - Opens the Cloudera Navigator user interface.
– Hosts - Displays the hosts in the cluster.
– Reports - Create reports about HDFS, MapReduce, YARN, and Impala usage; browse HDFS files; and
manage quotas for HDFS directories.
– Utilization Report - Opens the Cluster Utilization Report, which displays aggregated utilization information
for YARN and Impala jobs.
– MapReduce_service_name Jobs - Query information about MapReduce jobs running on your cluster.
– YARN_service_name Applications - Query information about YARN applications running on your cluster.
– Impala_service_name Queries - Query information about Impala queries running on your cluster.
– Dynamic Resource Pools - Manage dynamic allocation of cluster resources to YARN and Impala services by
specifying the relative weights of named pools.
– Static Service Pools - Manage static allocation of cluster resources to HBase, HDFS, Impala, MapReduce, and
YARN services.
• Hosts - Display the hosts managed by Cloudera Manager.
• Support - Provides links to support resources and documentation, including:
– Installation Guide
– API Documentation
– Release Notes
– About - Version number and build details of Cloudera Manager and the current date and time stamp of the
Cloudera Manager Server.
• Logged-in User Menu - The currently logged-in user. The subcommands are:
– Change Password - Change the password of the currently logged in user.
– Logout
Note: You can configure the Cloudera Manager Admin Console to automatically log out a user after
a configurable period of time. See Automatic Logout on page 25.
Status
The Status tab contains:
• Clusters - The clusters being managed by Cloudera Manager. Each cluster is displayed either in summary form or
in full form depending on the configuration of the Administration > Settings > Other > Maximum Cluster Count
Shown In Full property. When the number of clusters exceeds the value of the property, only cluster summary
information displays.
– Summary Form - A list of links to cluster status pages. Click Customize to jump to the Administration >
Settings > Other > Maximum Cluster Count Shown In Full property.
– Full Form - A separate section for each cluster containing a link to the cluster status page and a table containing
links to the Hosts page and the status pages of the services running in the cluster.
Each service row in the table has a menu of actions that you select by clicking the actions menu icon, and
can display one or more of the following indicators:
Health issue - Indicates that the service has at least one health issue. Click the indicator to display the
Health Issues pop-up dialog box. By default only Bad health test results are shown in the dialog box. To
display Concerning health test results, click the Also show n concerning issue(s) link. Click a health test
result link to display the Status page containing details about the health test result.
Configuration issue - Indicates that the service has at least one configuration issue. The indicator shows
the number of configuration issues at the highest severity level. If there are configuration errors, the
indicator is red. If there are no errors but configuration warnings exist, then the indicator is yellow. No
indicator is shown if there are no configuration notifications. Click the indicator to display the Configuration
Issues pop-up dialog box. By default only notifications at the Error severity level, grouped by service name,
are shown in the dialog box. To display Warning notifications, click the Also show n warning(s) link. Click
the message associated with an error or warning to be taken to the configuration property for which the
notification has been issued, where you can address the issue. See Managing Services on page 249.
Restart Needed or Refresh Needed (configuration modified) - Indicates that at least one of the service's
roles is running with a configuration that does not match the current configuration settings in Cloudera
Manager. Click the indicator to display the Stale Configurations on page 91 page. To bring the cluster
up-to-date, click the Refresh or Restart button on the Stale Configurations page or follow the instructions
in Refreshing a Cluster on page 105, Restarting a Cluster on page 106, or Restarting Services and Instances
after Configuration Changes on page 78.
Client configuration redeployment required - Indicates that the client configuration for a service should
be redeployed. Click the indicator to display the Stale Configurations on page 91 page. To bring the cluster
up-to-date, click the Deploy Client Configuration button on the Stale Configurations page or follow the
instructions in Manually Redeploying Client Configuration Files on page 94.
– Cloudera Management Service - A table containing a link to the Cloudera Management Service. The Cloudera
Management Service has a menu of actions that you select by clicking the actions menu icon.
– Charts - A set of charts (dashboard) that summarize resource utilization (IO, CPU usage) and processing
metrics.
Click a line, stack area, scatter, or bar chart to expand it into a full-page view with a legend for the individual
charted entities as well as more fine-grained axis divisions.
By default the time scale of a dashboard is 30 minutes. To change the time scale, click a duration link
at the top-right of the dashboard.
To set the dashboard type, click and select one of the following:
• Custom - displays a custom dashboard.
• Default - displays a default dashboard.
• Reset - resets the custom dashboard to the predefined set of charts, discarding any customizations.
All Recent Commands
Displays all commands run recently across the clusters. A badge indicates how many recent commands are
still running. Click a command link to display details about the command and child commands. See also Viewing
Running and Recent Commands on page 301.
Automatic Logout
For security purposes, Cloudera Manager automatically logs out a user session after 30 minutes. You can change this
session logout period.
To configure the timeout period:
1. Click Administration > Settings.
2. Click Category > Security.
3. Edit the Session Timeout property.
4. Enter a Reason for change, and then click Save Changes to commit the changes.
When the timeout is one minute from triggering, Cloudera Manager displays a warning message to the user. If the
user does not click the mouse or press a key, the user is logged out of the session and a message indicating the
automatic logout appears.
General Questions
What are the differences between the Cloudera Express and the Cloudera Enterprise versions of Cloudera Manager?
Cloudera Express includes a free version of Cloudera Manager. The Cloudera Enterprise version of Cloudera Manager
provides additional functionality. Both the Cloudera Express and Cloudera Enterprise versions automate the installation,
configuration, and monitoring of CDH on an entire cluster. See the data sheet at Cloudera Enterprise Datasheet for a
comparison of the two versions.
The Cloudera Enterprise version of Cloudera Manager is available as part of the Cloudera Enterprise subscription
offering, and requires a license. You can also choose a Cloudera Enterprise Trial that is valid for 60 days.
If you are not an existing Cloudera customer, contact Cloudera Sales using this form or call 866-843-7207 to obtain a
Cloudera Enterprise license. If you are already a Cloudera customer and you need to upgrade from Cloudera Express
to Cloudera Enterprise, contact Cloudera Support to obtain a license.
Where are CDH libraries located when I distribute CDH using parcels?
With parcel software distribution, the path to the CDH libraries is /opt/cloudera/parcels/CDH/lib/ instead of
the usual /usr/lib/.
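For scripts that must work with either layout, a minimal sketch of the check (the fallback logic and the variable name are illustrative, not part of any Cloudera tool):
# Pick the CDH library directory depending on whether parcels or packages are installed.
if [ -d /opt/cloudera/parcels/CDH/lib ]; then
  CDH_LIB_DIR=/opt/cloudera/parcels/CDH/lib    # parcel-based layout
else
  CDH_LIB_DIR=/usr/lib                         # package-based layout
fi
echo "Using CDH libraries under ${CDH_LIB_DIR}"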
What upgrade paths are available for Cloudera Manager, and what's involved?
For instructions about upgrading, see Upgrading Cloudera Manager.
Do worker hosts need access to the Cloudera public repositories for an install with Cloudera Manager?
You can perform an installation or upgrade using the parcel format. When using parcels, only the Cloudera Manager
Server requires access to the Cloudera public repositories; distribution of the parcels to worker hosts is done between
the Cloudera Manager Server and the worker hosts. See Parcels for more information. If you want to install using the
traditional packages, hosts only require access to the installation files.
For both parcels and packages, it is also possible to create local repositories that serve these files to the hosts that are
being upgraded. If you have established local repositories, no access to the Cloudera public repository is required. For
more information, see Configuring a Local Package Repository.
Can I use the service monitoring features of Cloudera Manager without the Cloudera Management Service?
No. To understand the desired state of the system, Cloudera Manager requires the global configuration that the
Cloudera Management Service roles gather and provide. The Cloudera Manager Agent doubles as both the agent for
supervision and for monitoring.
Can I run the Cloudera Management Service and the Hadoop services on the host where the Cloudera Manager Server
is running?
Yes. This is especially common in deployments that have a small number of hosts.
Important: To enable Kerberos (SPNEGO) authentication for the Cloudera Manager Admin Console
and API, you must first enable Kerberos for cluster services.
With SPNEGO enabled, the Swagger-based Java and Python SDKs, as well as the older deprecated Java
SDK, can still authenticate using HTTP Basic Authentication. The older deprecated Python SDK cannot.
Do not enable SPNEGO if you are relying on the deprecated Python client for any operations.
If you have already enabled Kerberos authentication, you can enable it for the API by doing the following:
1. If you have not already done so, enable Kerberos for cluster services.
2. Navigate to Administration > Settings.
3. Enter SPNEGO in the Search field.
4. Check the box labeled Enable SPNEGO/Kerberos Authentication for the Admin Console and API. Leave the other
SPNEGO settings blank to allow Cloudera Manager to automatically generate the principal and keytab.
5. Click Save Changes.
6. Restart Cloudera Manager Server.
You can access the Cloudera Manager Swagger API user interface from the Cloudera Manager Admin Console. Go to
Support > API Explorer to open Swagger.
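The same API endpoints can also be exercised with curl from the command line. A hedged sketch, assuming the default port 7180, the v32 API version used elsewhere on this page, and an admin/admin account; the Kerberos principal is a placeholder:
# HTTP Basic Authentication (available when SPNEGO is not required for the call):
curl -u admin:admin "http://cm_server_host:7180/api/v32/clusters"

# Kerberos/SPNEGO authentication (requires a valid ticket and a curl build with GSS support):
kinit admin@EXAMPLE.COM   # placeholder principal
curl --negotiate -u : "http://cm_server_host:7180/api/v32/clusters"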
http://cm_server_host:7180/api/v32/clusters/clusterName/services/serviceName/roles
http://cm_server_host:7180/api/v32/clusters/clusterName/services/serviceName/roles/roleName/process
http://cm_server_host:7180/api/v32/clusters/clusterName/services/serviceName/roles/roleName/process/
configFiles/configFileName
For example:
http://cm_server_host:7180/api/v32/clusters/Cluster%201/services/OOZIE-1/roles/
OOZIE-1-OOZIE_SERVER-e121641328fcb107999f2b5fd856880d/process/configFiles/oozie-site.xml
http://cm_server_host:7180/api/v32/clusters/Cluster%201/services/service_name/config?view=FULL
Search the results for the display name of the desired property. For example, a search for the display name HDFS
Service Environment Advanced Configuration Snippet (Safety Valve) shows that the corresponding property name
is hdfs_service_env_safety_valve:
{
"name" : "hdfs_service_env_safety_valve",
"required" : false,
"displayName" : "HDFS Service Environment Advanced Configuration Snippet (Safety
Valve)",
"description" : "For advanced use only, key/value pairs (one on each line) to be
inserted into a role's
environment. Applies to configurations of all roles in this service except client
configuration.",
"relatedName" : "",
"validationState" : "OK"
}
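The lookup can also be scripted. A hedged sketch using curl and grep, assuming admin/admin credentials and a service named HDFS-1 in Cluster 1 (both names are placeholders):
# Fetch the full service configuration; in the pretty-printed output, the API property
# name ("name" field) appears a couple of lines above the matching display name.
curl -s -u admin:admin \
  "http://cm_server_host:7180/api/v32/clusters/Cluster%201/services/HDFS-1/config?view=FULL" \
  | grep -B 2 '"displayName" : "HDFS Service Environment Advanced Configuration Snippet'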
Similar to finding service properties, you can also find host properties. First, get the host IDs for a cluster with the URL:
http://cm_server_host:7180/api/v32/hosts
{
"hostId" : "2c2e951c-aaf2-4780-a69f-0382181f1821",
"ipAddress" : "10.30.195.116",
"hostname" : "cm_server_host",
"rackId" : "/default",
"hostUrl" :
"http://cm_server_host:7180/cmf/hostRedirect/2c2e951c-adf2-4780-a69f-0382181f1821",
"maintenanceMode" : false,
"maintenanceOwners" : [ ],
"commissionState" : "COMMISSIONED",
"numCores" : 4,
"totalPhysMemBytes" : 10371174400
}
Then obtain the host properties by including one of the returned host IDs in the URL:
http://cm_server_host:7180/api/v32/hosts/2c2e951c-aaf2-4780-a69f-0382181f1821?view=FULL
Where:
• admin_uname is a username with either the Full Administrator or Cluster Administrator role.
• admin_pass is the password for the admin_uname username.
• cm_server_host is the hostname of the Cloudera Manager server.
• path_to_file is the path to the file where you want to save the configuration.
Important: If you configure this redaction, you cannot use the exported configuration to restore the
configuration of your cluster, because the redacted information is missing.
To omit sensitive information from the exported configuration, add the following property to the Cloudera Manager
Server JVM options:
-Dcom.cloudera.api.redaction=true
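One way to apply the option, sketched under the assumption of a package-based installation where the Server's JVM options are read from /etc/default/cloudera-scm-server:
# /etc/default/cloudera-scm-server (assumed location of the Server's JVM options)
# Append the redaction flag to the existing CMF_JAVA_OPTS value.
export CMF_JAVA_OPTS="${CMF_JAVA_OPTS} -Dcom.cloudera.api.redaction=true"
After saving the change, restart the Cloudera Manager Server for the option to take effect.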
Important: This feature requires a Cloudera Enterprise license. It is not available in Cloudera Express.
See Managing Licenses on page 50 for more information.
Using a previously saved JSON document that contains the Cloudera Manager configuration data, you can restore that
configuration to a running cluster.
1. Using the Cloudera Manager Administration Console, stop all running services in your cluster:
a. On the Home > Status tab, click the menu to the right of the cluster name and select Stop.
Warning: If you do not stop the cluster before making this API call, the API call will stop all cluster
services before running the job. Any running jobs and data are lost.
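The restore itself is a single API call described by the placeholders below; a hedged sketch, assuming the /cm/deployment endpoint and its deleteCurrentDeployment option:
# Replace the current Cloudera Manager deployment with the configuration saved in the JSON file.
curl -H "Content-Type: application/json" \
  --upload-file path_to_file \
  -u admin_uname:admin_pass \
  "http://cm_server_host:7180/api/v32/cm/deployment?deleteCurrentDeployment=true"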
Where:
• admin_uname is a username with either the Full Administrator or Cluster Administrator role.
• admin_pass is the password for the admin_uname username.
• cm_server_host is the hostname of the Cloudera Manager server.
• path_to_file is the path to the file containing the JSON configuration.
Using the Cloudera Manager API for Cluster Automation
Note: This page contains references to CDH 5 components or features that have been removed from
CDH 6. These references are only applicable if you are managing a CDH 5 cluster with Cloudera Manager
6. For more information, see Deprecated Items.
One of the complexities of Apache Hadoop is the need to deploy clusters of servers, potentially on a regular basis. If
you maintain hundreds of test and development clusters in different configurations, this process can be complex and
cumbersome if not automated.
Using the Cloudera Manager API, you can script cluster setup end to end: for example, you can enable
HDFS HA, turn on Kerberos security and generate keytabs, and customize service directories and ports. Every
configuration available in Cloudera Manager is exposed in the API.
The API also provides access to management functions:
• Obtaining logs and monitoring the system
• Starting and stopping services
• Polling cluster events
• Creating a disaster recovery replication schedule
For example, you can use the API to retrieve logs from HDFS, HBase, or any other service, without knowing the log
locations. You can also stop any service with no additional steps.
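Stopping a service, for instance, is a single POST to its stop command endpoint; a hedged sketch with placeholder cluster and service names:
# Stop the service named serviceName in the cluster named clusterName.
curl -X POST -u admin:admin \
  "http://cm_server_host:7180/api/v32/clusters/clusterName/services/serviceName/commands/stop"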
Typical use cases for the Cloudera Manager API for cluster automation include:
• OEM and hardware partners that deliver Hadoop-in-a-box appliances use the API to set up CDH and Cloudera
Manager on bare metal in the factory.
• Automated deployment of new clusters, using a combination of Puppet and the Cloudera Manager API. Puppet
does the OS-level provisioning and installs the software. The Cloudera Manager API sets up the Hadoop services
and configures the cluster.
• Integrating the API with reporting and alerting infrastructure. An external script can poll the API for health and
metrics information, as well as the stream of events and alerts, to feed into a custom dashboard.
To use the Cloudera Manager Java API client from a Maven project, add the Cloudera repository and the cloudera-manager-api dependency to your pom.xml:
<project>
  <repositories>
    <repository>
      <id>cdh.repo</id>
      <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
      <name>Cloudera Repository</name>
    </repository>
    …
  </repositories>
  <dependencies>
    <dependency>
      <groupId>com.cloudera.api</groupId>
      <artifactId>cloudera-manager-api</artifactId>
      <version>4.6.2</version> <!-- Set to the version of Cloudera Manager you use -->
    </dependency>
    …
  </dependencies>
  ...
</project>
The Java client works like a proxy. It hides from the caller any details about REST, HTTP, and JSON. The entry point is
a handle to the root of the API, obtained from a client builder (see the Java example below). From the root, you can
traverse down to all other resources. (It is called "v32" because that is the current Cloudera Manager API version, but
the same builder can also return a root for an earlier version of the API.) The tree view shows some key resources and
supported operations:
• RootResourceV32
– ClustersResourceV32 - host membership, start cluster
– ServicesResourceV32 - configuration, get metrics, HA, service commands
// Entry point: a handle to the root of the API (host and credentials shown here are placeholders)
RootResourceV32 apiRoot = new ClouderaManagerClientBuilder()
    .withHost("cm_server_host").withUsernamePassword("admin", "admin")
    .build().getRootV32();
// List of clusters
ApiClusterList clusters = apiRoot.getClustersResource().readClusters(DataView.SUMMARY);
for (ApiCluster cluster : clusters) {
  LOG.info("{}: {}", cluster.getName(), cluster.getVersion());
}
Python Example
You can see an example of automation with Python at the following link: Python example. The example contains
information on the requirements and steps to automate a cluster deployment.
You can stop (for example, to perform maintenance on its host) or restart the Cloudera Manager Server without
affecting the other services running on your cluster. Statistics data used by activity monitoring and service monitoring
will continue to be collected during the time the server is down.
To stop the Cloudera Manager Server:
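The stop command depends on your platform's init system; for example:
sudo systemctl stop cloudera-scm-server    # RHEL/CentOS 7, SLES 12, Debian 8, Ubuntu 16.04 and higher
sudo service cloudera-scm-server stop      # older SysV-init platforms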
• HTTP Port for Admin Console - Specify the HTTP port to use to access the Server using the Admin Console.
• HTTPS Port for Admin Console - Specify the HTTPS port to use to access the Server using the Admin Console.
• Agent Port to connect to Server - Specify the port for Agents to use to connect to the Server.
Important:
• The Cloudera Manager version on the destination host must match the version on the source
host.
• Do not install the other components, such as CDH and databases.
3. Copy the entire content of /var/lib/cloudera-scm-server/ on the old host to that same path on the new
host. Ensure you preserve permissions and all file content.
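One way to copy the directory while preserving ownership and permissions is rsync over ssh (a sketch; new_cm_host is a placeholder for the destination host):
sudo rsync -a /var/lib/cloudera-scm-server/ root@new_cm_host:/var/lib/cloudera-scm-server/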
4. If the database server is not available:
a. Install the database packages on the host that will host the restored database. This could be the same host
on which you have just installed Cloudera Manager or it could be a different host. If you used the embedded
PostgreSQL database, install the PostgreSQL package as described in Managing the Embedded PostgreSQL
Database. If you used an external MySQL, PostgreSQL, or Oracle database, reinstall following the instructions
in Step 4: Install and Configure Databases.
b. Restore the backed up databases to the new database installation.
5. Update /etc/cloudera-scm-server/db.properties with the database name, database instance name,
username, and password.
6. Do the following on all cluster hosts:
a. In /etc/cloudera-scm-agent/config.ini, update the server_host property to the new hostname.
b. If you are replacing the Cloudera Manager database with a new database, and you are not using a backup of
the original Cloudera Manager database, delete the /var/lib/cloudera-scm-agent/cm_guid file.
c. Restart the agent using the following command:
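For example, depending on the init system:
sudo systemctl restart cloudera-scm-agent   # systemd platforms
sudo service cloudera-scm-agent restart     # SysV-init platforms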
7. Stop the Cloudera Manager server on the source host by running the following command:
8. Copy any Custom Service Descriptor files for add-on services to the configured directory on the new Cloudera
Manager host. The directory path is configured by going to Administration > Settings and editing the Local
Descriptor Repository Path property. The default value is /opt/cloudera/csd. See Add-on Services on page
251.
9. Start the Cloudera Manager Server on the new (destination) host. Cloudera Manager should resume functioning
as it did before the failure. Because you restored the database from the backup, the server should accept the
running state of the Agents, meaning it will not terminate any running processes.
The process is similar with secure clusters, though files in /etc/cloudera-scm-server must be restored in addition
to the database. See Cloudera Security.
Migrating from the Cloudera Manager Embedded PostgreSQL Database Server to an External PostgreSQL
Database
Cloudera Manager provides an embedded PostgreSQL database server for demonstration and proof of concept
deployments when creating a cluster. To remind users that this embedded database is not suitable for production,
Cloudera Manager displays the banner text: "You are running Cloudera Manager in non-production mode, which uses
an embedded PostgreSQL database. Switch to using a supported external database before moving into production."
If, however, you have already used the embedded database, and you are unable to redeploy a fresh cluster, then you
must migrate to an external PostgreSQL database.
Note: This procedure does not describe how to migrate to a database server other than PostgreSQL.
Moving databases from one database server to a different type of database server is a complex process
that requires modification of the schema and matching the data in the database tables to the new
schema. It is strongly recommended that you engage with Cloudera Professional Services if you wish
to perform a migration to an external database server other than PostgreSQL.
Prerequisites
Before migrating the Cloudera Manager embedded PostgreSQL database to an external PostgreSQL database, ensure
that your setup meets the following conditions:
• The external PostgreSQL database server is running.
• The database server is configured to accept remote connections.
• The database server is configured to accept user logins using md5.
• No one has manually created any databases in the external database server for roles that will be migrated.
Note: To view a list of databases in the external database server (requires default superuser
permission):
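For example, on the external database host (a sketch, assuming the default postgres superuser):
sudo -u postgres psql -l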
Important: Only perform the steps in Configuring and Starting the PostgreSQL Server. Do not proceed
with the creation of databases as described in the subsequent section.
For large clusters, Cloudera recommends running your database server on a dedicated host. Engage Cloudera Professional
Services or a certified database administrator to correctly tune your external database server.
head -1 /var/lib/cloudera-scm-server-db/data/generated_password.txt
2. Make a list of all services that are using the embedded database server, and remove from the list any services that
are not using it. The scm database must remain in your list. Use the following table as a guide:
3. Verify which roles are using the embedded database. Roles using the embedded database server always use port
7432 (the default port for the embedded database) on the Cloudera Manager Server host.
For Cloudera Management Services:
a. Select Cloudera Management Service > Configuration, and type "7432" in the Search field.
b. Confirm that the hostname for the services being used is the same hostname used by the Cloudera Manager
Server.
Note:
If any of the following fields contain the value "7432", then the service is using the embedded
database:
• Activity Monitor
• Navigator Audit Server
• Navigator Metadata Server
• Reports Manager
Note: Do not add the postgres, template0, or template1 databases to your list. These are
used only by the PostgreSQL server.
List of databases
        Name        |       Owner        | Encoding |  Collate   |   Ctype    | Access privileges
--------------------+--------------------+----------+------------+------------+-------------------
 amon               | amon               | UTF8     | en_US.UTF8 | en_US.UTF8 |
 hive               | hive               | UTF8     | en_US.UTF8 | en_US.UTF8 |
 hue                | hue                | UTF8     | en_US.UTF8 | en_US.UTF8 |
 nav                | nav                | UTF8     | en_US.UTF8 | en_US.UTF8 |
 navms              | navms              | UTF8     | en_US.UTF8 | en_US.UTF8 |
 oozie_oozie_server | oozie_oozie_server | UTF8     | en_US.UTF8 | en_US.UTF8 |
 postgres           | cloudera-scm       | UTF8     | en_US.UTF8 | en_US.UTF8 |
 rman               | rman               | UTF8     | en_US.UTF8 | en_US.UTF8 |
 scm                | scm                | UTF8     | en_US.UTF8 | en_US.UTF8 |
 sentry             | sentry             | UTF8     | en_US.UTF8 | en_US.UTF8 |
 template0          | cloudera-scm       | UTF8     | en_US.UTF8 | en_US.UTF8 | =c/"cloudera-scm"
 template1          | cloudera-scm       | UTF8     | en_US.UTF8 | en_US.UTF8 | =c/"cloudera-scm"
(12 rows)
You should now have a list of all roles and database names that use the embedded database server, and are ready to
proceed with the migration of databases from the embedded database server to the external PostgreSQL database
server.
Migrate Databases from the Embedded Database Server to the External PostgreSQL Database Server
While performing this procedure, ensure that the Cloudera Manager Agents remain running on all hosts. Unless
otherwise specified, when prompted for a password use the cloudera-scm password.
Note: After completing this migration, you cannot delete the cloudera-scm postgres superuser
unless you remove the access privileges for the migrated databases. Minimally, you should change
the cloudera-scm postgres superuser password.
1. In Cloudera Manager, stop the cluster services identified as using the embedded database server (see Identify
Roles that Use the Embedded Database Server on page 36). Refer to Starting, Stopping, and Restarting Services
on page 253 for details about how to stop cluster services. Be sure to stop the Cloudera Management Service as
well. Also be sure to stop any services with dependencies on these services. The remaining CDH services will
continue to run without downtime.
Note: If you do not stop the services from within Cloudera Manager before stopping Cloudera
Manager Server from the command line, they will continue to run and maintain a network
connection to the embedded database server. If this occurs, then the embedded database server
will ignore any command line stop commands (Step 2) and require that you manually kill the
process, which in turn causes the services to crash instead of stopping cleanly.
2. Navigate to Hosts > All Hosts, and make note of the number of roles assigned to hosts. Also take note whether
or not they are in a commissioned state. You will need this information later to validate that your scm database
was migrated correctly.
3. Stop the Cloudera Manager Server. To stop the server:
4. Obtain and save the embedded database superuser password (you will need this password in subsequent steps)
from the generated_password.txt file:
head -1 /var/lib/cloudera-scm-server-db/data/generated_password.txt
5. Export the PostgreSQL user roles from the embedded database server to ensure the correct users, permissions,
and passwords are preserved for database access. Passwords are exported as an md5sum and are not visible in
plain text. To export the database user roles (you will need the cloudera-scm user password):
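A sketch of such an export, assuming the embedded database listens on its default port 7432 on the Cloudera Manager Server host:
pg_dumpall -h localhost -p 7432 -U cloudera-scm -v --roles-only -f /var/tmp/cloudera_user_roles.sql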
6. Edit /var/tmp/cloudera_user_roles.sql to remove any CREATE ROLE and ALTER ROLE commands for
databases not in your list. Leave the entries for cloudera-scm untouched, because this user role is used during
the database import.
7. Export the data from each of the databases on your list you created in Identify Roles that Use the Embedded
Database Server on page 36:
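A sketch of exporting one database; repeat for each database on your list, adjusting the database and file names (port 7432 is the embedded server's default):
pg_dump -F custom -h localhost -p 7432 -U cloudera-scm -f /var/tmp/scm_db_backup.dump scm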
Password:
cp /etc/cloudera-scm-server/db.properties /etc/cloudera-scm-server/db.properties.embedded
10. Copy the file /var/tmp/cloudera_user_roles.sql and the database dump files from the embedded database
server host to /var/tmp on the external database server host:
cd /var/tmp
scp cloudera_user_roles.sql *.dump <user>@<postgres-server>:/var/tmp
11. Import the PostgreSQL user roles into the external database server.
The external PostgreSQL database server superuser password is required to import the user roles. If the superuser
role has been changed, you will be prompted for the username and password.
Note: Only run the command that applies to your context; do not execute both commands.
For example:
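A sketch of the two variants (run only the one that matches how you access the external database server; db_host is a placeholder):
# Variant 1: run locally as the postgres OS user on the external database host
sudo -u postgres psql -f /var/tmp/cloudera_user_roles.sql postgres
# Variant 2: connect over the network as the database superuser
psql -h db_host -p 5432 -U postgres -f /var/tmp/cloudera_user_roles.sql postgres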
12. Import the Cloudera Manager database on the external server. First copy the database dump files from the
Cloudera Manager Server host to your external PostgreSQL database server, and then import the database data:
Note: To successfully run the pg_restore command, there must be an existing database on
the database server to complete the connection; the existing database will not be modified. If
the -d <existing-database> option is not included, then the pg_restore command will
fail.
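A sketch of the import, run against the external database server; the dump file name matches the earlier export example, and -C creates the database named in the dump while connecting through the existing postgres database:
sudo -u postgres pg_restore -C -d postgres /var/tmp/scm_db_backup.dump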
13. Update the Cloudera Manager Server database configuration file to use the external database server. Edit the
/etc/cloudera-scm-server/db.properties file as follows:
a. Update the com.cloudera.cmf.db.host value with the hostname and port number of the external
database server.
b. Change the com.cloudera.cmf.db.setupType value from "EMBEDDED" to "EXTERNAL".
14. Start the Cloudera Manager Server and confirm it is working:
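For example, on a systemd platform, start the server and watch its log until the web server reports that it has started; the Admin Console then listens on port 7180 by default:
sudo systemctl start cloudera-scm-server
sudo tail -f /var/log/cloudera-scm-server/cloudera-scm-server.log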
Note that if you start the Cloudera Manager GUI at this point, it may take up to five minutes after executing the
start command before it becomes available.
In Cloudera Manager Server, navigate to Hosts > All Hosts and confirm the number of roles assigned to hosts (this
number should match what you found in Step 2); also confirm that they are in a commissioned state that matches
what you observed in Step 2.
15. Update the role configurations to use the external database hostname and port number. Only perform this task
for services where the database has been migrated.
For Cloudera Management Services:
1. Select Cloudera Management Service > Configuration, and type "7432" in the Search field.
2. Change any database hostname properties from the embedded database to the external database hostname
and port number.
3. Click Save Changes.
For the Oozie Service:
a. Select Oozie service > Configuration, and type "7432" in the Search field.
b. Change any database hostname properties from the embedded database to the external database hostname
and port number.
c. Click Save Changes.
For Hive, Hue, and Sentry Services:
1. Select the specific service > Configuration, and type "database host" in the Search field.
2. Change the hostname from the embedded database name to the external database hostname.
3. Click Save Changes.
16. Start the Cloudera Management Service and confirm that all management services are up and no health tests
are failing.
17. Start all Services via the Cloudera Manager web UI. This should start all services that were stopped for the database
migration. Confirm that all services are up and no health tests are failing.
18. On the embedded database server host, remove the embedded PostgreSQL database server:
a. Make a backup of the /var/lib/cloudera-scm-server-db/data directory:
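For example (the archive name is illustrative):
sudo tar -czvf /var/tmp/embedded_db_data_backup.tgz /var/lib/cloudera-scm-server-db/data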
For Debian/Ubuntu:
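For example (the package name cloudera-manager-server-db-2 is an assumption based on common Cloudera Manager packaging; confirm it for your release, and use the yum equivalent on RHEL/CentOS):
sudo apt-get remove cloudera-manager-server-db-2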
Migrating from the Cloudera Manager External PostgreSQL Database Server to a MySQL/Oracle Database
Server
Cloudera Manager provides an embedded PostgreSQL database server for demonstration and proof of concept
deployments when creating a cluster. To remind users that this embedded database is not suitable for production,
Cloudera Manager displays the banner text: "You are running Cloudera Manager in non-production mode, which uses
an embedded PostgreSQL database. Switch to using a supported external database before moving into production."
If you have already used the embedded database, and you are unable to redeploy a fresh cluster, then you must migrate
to an external PostgreSQL database. For details about how to migrate from an embedded PostgreSQL database to an
external PostgreSQL database, refer to Migrating from the Cloudera Manager Embedded PostgreSQL Database Server
to an External PostgreSQL Database on page 35.
Note: You can migrate to an external MySQL or Oracle database only after successfully migrating
from the embedded PostgreSQL database server to the external PostgreSQL database server.
Prerequisites
Before migrating from the Cloudera Manager external PostgreSQL database to an external MySQL/Oracle database,
ensure that your setup meets the following conditions:
• Your deployment uses Cloudera Manager 5.15.0 or later on supported platforms.
• You must have a valid Cloudera Manager Enterprise license.
• If Cloudera Manager is secured, then you must import Kerberos account manager credentials and regenerate
them. For details, refer to Step 3: Add the Credentials for the Principal to the Cluster.
• You must have a destination host installed with the supported database of choice (MySQL or Oracle). For details
about installing and configuring MySQL for Cloudera, see Install and Configure MySQL for Cloudera Software. For
details about installing and configuring Oracle for Cloudera, see Install and Configure Oracle Database for Cloudera
Software.
• You have configured the target database hosts and made them available.
• You have planned for cluster downtime during the migration process.
• You have a plan to follow service specific database migration instructions for services other than Cloudera Manager.
Refer to the appropriate service migration documentation for your cluster setup (for example, Migrate the Hue
Database).
• No one has manually created any databases in the external database server for roles that will be migrated.
• All health issues with your cluster are resolved.
For large clusters, Cloudera recommends running your database server on a dedicated host. Engage Cloudera Professional
Services or a certified database administrator to correctly tune your external database server.
Migrate from the Cloudera Manager External PostgreSQL Database Server to a MySQL/Oracle Database Server
To migrate from the Cloudera Manager external PostgreSQL database server to a MySQL or Oracle database server:
1. Migrate from the embedded PostgreSQL database server to an external PostgreSQL database server as described
in Migrating from the Cloudera Manager Embedded PostgreSQL Database Server to an External PostgreSQL
Database on page 35.
Important:
Migrating directly from the Cloudera Manager embedded PostgreSQL to a MySQL/Oracle database is not supported.
You must first migrate from the Cloudera Manager embedded PostgreSQL database server to the external
PostgreSQL database server. After performing this migration, you can use this procedure to migrate from the
external PostgreSQL database server to MySQL/Oracle database servers.
2. Export your Cloudera Manager Configuration. First, get the latest supported API version:
Note:
If you have Cloudera Manager with TLS for the Admin Console enabled, retrieve the certificate
file and use curl with the --cacert option:
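For example, a sketch of retrieving the latest supported API version (add --cacert as noted above if TLS is enabled); the configuration itself is then exported from the /api/vNN/cm/deployment endpoint as described earlier:
curl -u admin_uname:admin_pass "http://cm_server_host:7180/api/version"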
3. Preserve the Cloudera Manager server UUID. On the PostgreSQL database host, export it from the scm database:
sudo -u postgres psql -qtAX scm -c "select GUID from CM_VERSION" > uuid
Then move the UUID file to the Cloudera Manager server's /etc/cloudera-scm-server directory.
4. Stop the cluster and the Cloudera Management Services. For details on how to stop services, see Starting, Stopping,
and Restarting Services on page 253.
5. Stop the Cloudera Manager Server:
Note:
For RHEL/CentOS 7, use the systemctl option instead:
6. Prepare the target database for Cloudera Manager. For details, refer to Install and Configure MySQL for Cloudera
Software or Install and Configure Oracle Database for Cloudera Software.
7. The process directory (/var/run/cloudera-scm-agent/process/) must be cleaned out on all hosts that have Agents
running on them. The Agent cleans out this directory when the host is rebooted. If a reboot is not a viable option,
use one of the following options to accomplish the same task.
Note: This "hard restart" works for all supported platforms except SLES 12.
Alternatively, run the following command to view the start options available on your platform:
ls -la /var/run/cloudera-scm-agent/process/
mv /var/run/cloudera-scm-agent /var/run/cloudera-scm-agent-BU
The agent will recreate the directory. Delete the backed up copy after confirming that the migration was successful.
8. Start the Cloudera Manager server:
9. Log in to Cloudera Manager. Exit the installation wizard by clicking the product logo in the upper-left corner to stop
the wizard and return to the Cloudera Manager home page.
10. Upgrade the Cloudera Manager Enterprise License by navigating to Administration > Licenses and installing a
valid Cloudera Manager license.
11. Restore the Cloudera Manager configuration:
12. Start the following: Cloudera Management Service, Host Monitor, and Services Monitor. Verify that all the services
in the Cloudera Management Service started and are Healthy.
13. Select the Home > Status tab for the cluster(s) that you previously stopped, and in the Actions dropdown, select
Start.
Note: You can also view the Cloudera Manager Server log at
/var/log/cloudera-scm-server/cloudera-scm-server.log on the Server host.
2. Set the CMF_VAR environment variable in /etc/default/cloudera-scm-server to the new parent directory:
export CMF_VAR=/opt
3. Create log/cloudera-scm-server and run directories in the new parent directory and set the owner and
group of all directories to cloudera-scm. For example, if the new parent directory is /opt/, do the following:
sudo su
cd /opt
mkdir log
chown cloudera-scm:cloudera-scm log
mkdir /opt/log/cloudera-scm-server
chown cloudera-scm:cloudera-scm log/cloudera-scm-server
mkdir run
chown cloudera-scm:cloudera-scm run
In a Cloudera Manager managed cluster, you can only start or stop role instance processes using Cloudera Manager.
Cloudera Manager uses an open source process management tool called supervisord, which starts processes, redirects
log files, notifies of process failures, sets the effective user ID of the calling process to the correct user, and so on.
Cloudera Manager supports automatically restarting a crashed process. It also flags a role instance with a bad health
flag if its process crashes repeatedly right after start-up.
The Agent is started by init.d at start-up. It, in turn, contacts the Cloudera Manager Server and determines which
processes should be running. The Agent is monitored as part of Cloudera Manager's host monitoring. If the Agent stops
heartbeating, the host is marked as having bad health.
One of the Agent's main responsibilities is to start and stop processes. When the Agent detects a new process from
the Server heartbeat, the Agent creates a directory for it in /var/run/cloudera-scm-agent and unpacks the
configuration. It then contacts supervisord, which starts the process.
cm_processes
To enable Cloudera Manager to run scripts in subdirectories of /var/run/cloudera-scm-agent, (because /var/run
is mounted noexec in many Linux distributions), Cloudera Manager mounts a tmpfs, named cm_processes, for
process subdirectories.
A tmpfs defaults to a maximum size of 50% of physical RAM, but this space is not allocated until it is used, and tmpfs
is paged out to swap if there is memory pressure.
The lifecycle actions of cm_processes can be described by the following statements:
• Created when the Agent starts up for the first time with a new supervisord process.
• If it already exists without noexec, reused when the Agent is started using start and not recreated.
• Remounted if Agent is started using clean_restart.
• Unmounting and remounting cleans out the contents (since it is mounted as a tmpfs).
• Unmounted when the host is rebooted.
• Not unmounted when the Agent is stopped.
Starting Agents
To start or restart the Agent, the supervisord process, and all managed service processes, use one of the following
commands:
• Start
• Restart
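On systemd-based platforms these actions typically look like the following (a sketch; older SysV-init platforms use the service wrapper, for example sudo service cloudera-scm-agent start):
sudo systemctl start cloudera-scm-agent
sudo systemctl restart cloudera-scm-agent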
Warning: The hard_stop and hard_restart commands kill all running managed service processes
on the host(s) where the command is run.
Note: The procedures in this section require you to stop all roles on the host. If it is not possible to
stop all roles immediately, you must do so within 60 days of the hard stop or hard restart.
To stop or restart Agents, the supervisord process, and all managed service processes, use one of the following
commands:
• Hard Stop
1. Stop all roles running on the host. See Stopping All the Roles on a Host on page 239.
If it is not possible to stop all roles immediately, you must do so within 60 days of the hard stop.
2. Run the following command:
RHEL 7, SLES 12, Debian 8, Ubuntu 16.04 and higher
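On these systemd-based platforms, a hard stop is typically performed by stopping the supervisord unit directly; the unit name below is an assumption based on common Cloudera Manager 6 packaging, and on SysV-init platforms the equivalent is sudo service cloudera-scm-agent hard_stop:
sudo systemctl stop cloudera-scm-supervisord.service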
• Hard Restart
1. Stop all roles running on the host. See Stopping All the Roles on a Host on page 239.
If it is not possible to stop all roles immediately, you must do so within 60 days of the hard restart.
2. Run the following command:
RHEL 7, SLES 12, Debian 8, Ubuntu 16.04 and higher
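Similarly, a hard restart on these platforms typically stops the supervisord unit and then starts the Agent again (the unit name is an assumption as above; on SysV-init platforms use sudo service cloudera-scm-agent hard_restart):
sudo systemctl stop cloudera-scm-supervisord.service
sudo systemctl start cloudera-scm-agent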
• Send Agent Heartbeat Every - The interval in seconds between each heartbeat that is sent from Cloudera Manager
Agents to the Cloudera Manager Server. Default: 15 sec.
• Set health status to Concerning if the Agent heartbeats fail - The number of missed consecutive heartbeats after
which a Concerning health status is assigned to that Agent. Default: 5.
• Set health status to Bad if the Agent heartbeats fail - The number of missed consecutive heartbeats after which a
Bad health status is assigned to that Agent. Default: 10.
Important: If you modify the parcel directory location, make sure that all hosts use the same location.
Using different locations on different hosts can cause unexpected problems.
Default: ext2,ext3,ext4.
• log_file - The path to the Agent log file. If the Agent is being started using the init.d script,
/var/log/cloudera-scm-agent/cloudera-scm-agent.out will also have a small amount of output (from before
logging is initialized). Default: /var/log/cloudera-scm-agent/cloudera-scm-agent.log.
• max_collection_wait_seconds - Maximum time to wait for all metric collectors to finish collecting data.
Default: 10 sec.
• supervisord_port - The supervisord port. A change takes effect the next time supervisord is restarted (not
when the Agent is restarted). Default: 19001.
• [JDBC] cloudera_mysql_connector_jar, cloudera_oracle_connector_jar, cloudera_postgresql_jdbc_jar - Location
of JDBC drivers. See Step 4: Install and Configure Databases. Defaults:
– MySQL - /usr/share/java/mysql-connector-java.jar
– Oracle - /usr/share/java/oracle-connector-java.jar
– PostgreSQL - /usr/share/cmf/lib/postgresql-version-build.jdbc4.jar
log_file=/opt/log/cloudera-scm-agent/cloudera-scm-agent.log
2. Create log/cloudera-scm-agent directories and set the owner and group to cloudera-scm. For example, if
the log is stored in /opt/log/cloudera-scm-agent, do the following:
sudo su
cd /opt
mkdir log
chown cloudera-scm:cloudera-scm log
mkdir /opt/log/cloudera-scm-agent
chown cloudera-scm:cloudera-scm log/cloudera-scm-agent
Managing Licenses
Minimum Required Role: Full Administrator
When you install Cloudera Manager, you can select among the following editions: Cloudera Express (no license required),
a 60-day Cloudera Enterprise trial license, or Cloudera Enterprise (which requires a license). To obtain a Cloudera
Enterprise license, fill in this form or call 866-843-7207.
Note: Cloudera Express is discontinued as of version 6.3.3. Upgrades to Cloudera Manager or CDH
6.3.3 and higher are not supported when running Cloudera Express. Downgrading from Cloudera
Enterprise license to Cloudera Express license is no longer supported in Cloudera Manager 6.3.3.
• Cloudera Express includes a free license that provides access to CDH, Cloudera's Apache Hadoop distribution, and
a subset of cluster management features available with Cloudera Manager, for up to 100 total CDH hosts.
Note the following:
• Cloudera Manager will not allow you to add hosts to a CDH 6.x cluster if the total number of hosts across all
CDH 6.x clusters will exceed 100.
• Cloudera Manager will not allow you to upgrade any cluster to CDH 6.x if the total number of managed CDH 6.x
cluster hosts will exceed 100. If an upgrade from Cloudera Manager 6.0 to 6.1 fails due to this limitation, you
must downgrade Cloudera Manager to version 6.0, remove some hosts so that the number of hosts is less
than 100, then retry the upgrade.
Note: If you downgrade from Cloudera Enterprise to Cloudera Express and the number of managed
hosts exceeds 100, Cloudera Manager will disable all cluster management commands except for
commands used to stop a cluster. You will not be able to restart or otherwise use clusters while
the total number of hosts exceeds 100. Use the Cloudera Manager Admin Console to remove
some hosts so that the number of hosts is less than 100.
• Cloudera Enterprise
Cloudera Enterprise is available on a subscription basis in five editions, each designed around how you use the
platform:
• Essentials Edition provides superior support and advanced management for core Apache Hadoop.
• Data Science and Engineering Edition for programmatic data preparation and predictive modeling.
• Operational Database Edition for online applications with real-time serving needs.
• Data Warehouse Edition for BI and SQL analytics.
• Enterprise Data Hub Edition provides for complete use of the platform.
All editions are available in your environment of choice: cloud, on-premise, or a hybrid deployment. For more
information, see the Cloudera Enterprise Data Sheet.
License Expiration
Before a license expires, the Cloudera Manager Admin Console displays a message that indicates the number of days
left on a license, starting at 60 days before expiration and counting down to 30, 14, and 0 days. Contact Cloudera
Support before expiration to receive an updated license.
When a Cloudera Enterprise license expires, the following occurs:
• Cloudera Enterprise Trial - Enterprise features are disabled.
• Cloudera Enterprise - All services will continue to run as-is. The Cloudera Manager Admin Console will be disabled,
meaning that you will not be able to view or modify clusters. Key Trustee KMS and the Key Trustee server will
continue to function on the cluster, but you cannot change any configurations or add any services. On license
expiration, the only action you can take in the Cloudera Manager Admin Console is to upload a new, valid license.
Trial Licenses
You can use a trial license only once; when the 60-day trial period expires or you have ended the trial, you cannot
restart the trial. With the trial license, you can upgrade to a Cloudera Enterprise license or downgrade to an express
license.
When a trial ends, enterprise features immediately become unavailable. However, data or configurations associated
with the disabled functions are not deleted, and become available again once you install a Cloudera Enterprise license.
Note: Trial licenses are not available for any of the Cloudera encryption products.
• IP addresses
• Rack name
3. Select a host and click OK.
4. When you are finished with the assignments, click Continue.
5. Choose the database type:
• Keep the default setting of Use Embedded Database to have Cloudera Manager create and configure required
databases. Record the auto-generated passwords.
• Select Use Custom Databases to specify the external database host and enter the database type, database
name, username, and password for the custom database.
• If you are adding the Oozie service, you can change your Oozie configuration to control when data is purged
to improve performance, reduce database disk usage, improve upgrade performance, or to keep the history
for a longer period of time. See Configuring Oozie Data Purge Settings Using Cloudera Manager.
6. Click Test Connection to confirm that Cloudera Manager can communicate with the database using the information
you have supplied. If the test succeeds in all cases, click Continue; otherwise, check and correct the information
you have provided for the database and then try the test again. (For some servers, if you are using the embedded
database, you will see a message saying the database will be created at a later step in the installation process.)
The Cluster Setup Review Changes screen displays.
7. Review the configuration changes to be applied. Confirm the settings entered for file system paths. The file paths
required vary based on the services to be installed. If you chose to add the Sqoop service, indicate whether to use
the default Derby database or the embedded PostgreSQL database. If the latter, type the database name, host,
and user credentials that you specified when you created the database.
Warning: Do not place DataNode data directories on NAS devices. When resizing an NAS, block
replicas can be deleted, which will result in reports of missing blocks.
If you want to use Cloudera Navigator auditing, add and start the Cloudera Navigator Audit Server role as described
in Adding Cloudera Navigator Roles on page 70. For information on Cloudera Navigator, see Cloudera Navigator Data Management Overview.
• IP addresses
• Rack name
8. When you are satisfied with the assignments, click Continue.
9. Choose the database type:
• Keep the default setting of Use Embedded Database to have Cloudera Manager create and configure required
databases. Record the auto-generated passwords.
• Select Use Custom Databases to specify the external database host and enter the database type, database
name, username, and password for the custom database.
• If you are adding the Oozie service, you can change your Oozie configuration to control when data is purged
to improve performance, reduce database disk usage, improve upgrade performance, or to keep the history
for a longer period of time. See Configuring Oozie Data Purge Settings Using Cloudera Manager.
10. Click Test Connection to confirm that Cloudera Manager can communicate with the database using the information
you have supplied. If the test succeeds in all cases, click Continue; otherwise, check and correct the information
you have provided for the database and then try the test again. (For some servers, if you are using the embedded
database, you will see a message saying the database will be created at a later step in the installation process.)
The Cluster Setup Review Changes screen displays.
11. Review the configuration changes to be applied. Confirm the settings entered for file system paths. The file paths
required vary based on the services to be installed. If you chose to add the Sqoop service, indicate whether to use
the default Derby database or the embedded PostgreSQL database. If the latter, type the database name, host,
and user credentials that you specified when you created the database.
Warning: Do not place DataNode data directories on NAS devices. When resizing an NAS, block
replicas can be deleted, which will result in reports of missing blocks.
If you want to use Cloudera Navigator auditing, add and start the Cloudera Navigator Audit Server role as described
in Adding Cloudera Navigator Roles on page 70. For information on Cloudera Navigator, see Cloudera Navigator Data Management Overview.
If you want to use the Cloudera Navigator Metadata Server, add its role following the instructions in Adding Cloudera
Navigator Roles on page 70.
Renewing a License
1. Download the license file and save it locally.
2. In Cloudera Manager, go to the Home page.
3. Select Administration > License.
4. Click Upload License.
5. Browse to the license file you downloaded.
6. Click Upload.
Cloudera Manager requires a restart for the new license to take effect.
3. To set the day and time of day that the collection will be performed, click Scheduled Diagnostic Data Collection
Time and specify the date and time in the pop-up control.
4. Enter a Reason for change, and then click Save Changes to commit the changes.
You can see the current setting of the data collection frequency by viewing Support > Scheduled Diagnostics: in the
main navigation bar.
Important: This feature requires a Cloudera Enterprise license. It is not available in Cloudera Express.
See Managing Licenses on page 50 for more information.
Disabling the Automatic Sending of Diagnostic Data from a Manually Triggered Collection
If you do not want data automatically sent to Cloudera after manually triggering data collection, you can disable this
feature. The data you collect will be saved and can be downloaded for sending to Cloudera Support at a later time.
1. Select Administration > Settings.
2. Under the Support category, uncheck the box for Send Diagnostic Data to Cloudera Automatically.
3. Enter a Reason for change, and then click Save Changes to commit the changes.
• Attach the bundle to the SFDC case. Do not rename the bundle as this can cause a delay in processing
the bundle.
ssh my_cloudera_manager_server_host
cat /etc/cloudera-scm-server/db.properties
For example:
...
com.cloudera.cmf.db.type=...
com.cloudera.cmf.db.host=database_hostname:database_port
com.cloudera.cmf.db.name=scm
com.cloudera.cmf.db.user=scm
com.cloudera.cmf.db.password=SOME_PASSWORD
3. Collect information (host name, port number, database name, user name and password) for the following databases.
• Reports Manager
• Navigator Audit Server
• Navigator Metadata Server
• Activity Monitor
You can find the database information by using the Cloudera Manager Admin Console. Go to Clusters > Cloudera
Management Service > Configuration and select the Database category. You may need to contact your database
administrator to obtain the passwords.
4. Find the hosts where the Service Monitor, Host Monitor, and Event Server roles are running. Go to Clusters >
Cloudera Management Service > Instances and note which hosts are running these roles.
5. Identify the location of the Cloudera Navigator Metadata Server storage directory:
a. Go to Clusters > Cloudera Management Service > Instances.
b. Click the Configuration tab.
c. Select Scope > Navigator Metadata Server.
d. The Navigator Metadata Server Storage Dir property stores the location of the directory.
6. Ensure that Navigator Metadata Server Java heap is large enough to complete the upgrade. You can estimate the
amount of heap needed from the number of elements and relations stored in the Solr storage directory.
a. Go to Clusters > Cloudera Management Service > Instances.
b. In the list of instances, click Navigator Metadata Server.
c. Select Log Files > Role Log File.
d. Search the log file for solr core nav_elements and note the number of element documents.
e. Search the log file for solr core nav_relations and note the number of relation documents.
f. Multiply the total number of documents by 200 bytes per document and add to it a baseline of 2 GB:
For example, if you had 68813088 elements and 78813930 relations, the recommended Java heap size is ~30
GB:
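The arithmetic behind this example, using the rule above (200 bytes per document plus a 2 GB baseline):
(68813088 elements + 78813930 relations) * 200 bytes = 29,525,403,600 bytes
29,525,403,600 bytes + 2 GB (2,147,483,648 bytes) = 31,672,887,248 bytes, or roughly 30 GB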
g. Set the heap value in the Java Heap Size of Navigator Metadata Server in Bytes property in Clusters > Cloudera
Management Service > Configuration.
Note: Commands are provided below to backup various files and directories used by Cloudera Manager
Agents. If you have configured custom paths for any of these, substitute those paths in the commands.
The commands also provide destination paths to store the backups, defined by the environment
variable CM_BACKUP_DIR, which is used in all the backup commands. You may change these destination
paths in the command as needed for your deployment.
The tar commands in the steps below may return the following message. It is safe to ignore this
message:
Note: Commands are provided below to backup various files and directories used by Cloudera Manager
Agents. If you have configured custom paths for any of these, substitute those paths in the commands.
The commands also provide destination paths to store the backups. You may change these destination
paths in the command as needed for your deployment.
1. On the host where the Service Monitor role is configured to run, backup the following directory:
2. On the host where the Host Monitor role is configured to run, backup the following directory:
3. On the host where the Event Server role is configured to run, back up the following directory:
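A sketch of these backups, assuming the default storage directories for the Service Monitor (/var/lib/cloudera-service-monitor), Host Monitor (/var/lib/cloudera-host-monitor), and Event Server (/var/lib/cloudera-scm-eventserver); substitute your own paths if you have customized them:
sudo tar -czf /var/tmp/service_monitor_backup.tgz /var/lib/cloudera-service-monitor
sudo tar -czf /var/tmp/host_monitor_backup.tgz /var/lib/cloudera-host-monitor
sudo tar -czf /var/tmp/event_server_backup.tgz /var/lib/cloudera-scm-eventserver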
2. Make sure a purge task has run recently to clear stale and deleted entities.
• You can see when the last purge tasks were run in the Cloudera Navigator console (From the Cloudera Manager
Admin console, go to Clusters > Cloudera Navigator. Select Administration > Purge Settings.)
• If a purge hasn't run recently, run it by editing the Purge schedule on the same page.
• Set the purge process options to clear out as much of the backlog of data as you can tolerate for your upgraded
system. See Managing Metadata Storage with Purge.
3. Stop the Navigator Metadata Server.
a. Go to Clusters > Cloudera Management Service > Instances.
b. Select Navigator Metadata Server.
c. Click Actions for Selected > Stop.
4. Back up the Cloudera Navigator Solr storage directory.
5. If you are using an Oracle database for audit, in SQL*Plus, ensure that the following additional privileges are set:
ssh my_cloudera_manager_server_host
Note: If the db.properties file does not contain a port number, omit the port number
parameter from the above command.
PostgreSQL/Embedded
Oracle
Work with your database administrator to ensure databases are properly backed up.
For more information about backing up databases, see Backing Up Databases on page 652.
2. Back up All other Cloudera Manager databases - Use the database information that you collected in a previous
step. You may need to contact your database administrator to obtain the passwords.
These databases can include the following:
• Reports Manager
• Navigator Audit Server
• Navigator Metadata Server
• Activity Monitor (Only used for MapReduce 1 monitoring).
Run the command that matches your database type to back up the databases, replacing placeholders with the actual
values:
MySQL
PostgreSQL/Embedded
Oracle
Work with your database administrator to ensure databases are properly backed up.
Note: Commands are provided below to backup various files and directories used by Cloudera Manager
Agents. If you have configured custom paths for any of these, substitute those paths in the commands.
The commands also provide destination paths to store the backups, defined by the environment
variable CM_BACKUP_DIR, which is used in all the backup commands. You may change these destination
paths in the command as needed for your deployment.
The tar commands in the steps below may return the following message. It is safe to ignore this
message:
ssh my_cloudera_manager_server_host
ssh my_cloudera_manager_server_host
Starting cloudera-scm-server: [ OK ]
Settings
The Settings page provides a number of categories as follows:
• Performance - Set the Cloudera Manager Agent heartbeat interval. See Configuring Agent Heartbeat and Health
Status Options on page 47.
• Advanced - Enable API debugging and other advanced options.
• Monitoring - Set Agent health status parameters. For configuration instructions, see Configuring Cloudera Manager
Agents on page 47.
• Security - Set TLS encryption settings to enable TLS encryption between the Cloudera Manager Server, Agents,
and clients. For configuration instructions, see Manually Configuring TLS Encryption for Cloudera Manager. You
can also:
– Set the realm for Kerberos security and point to a custom keytab retrieval script. For configuration instructions,
see Cloudera Security.
– Specify session timeout and a "Remember Me" option.
• Ports and Addresses - Set ports for the Cloudera Manager Admin Console and Server. For configuration instructions,
see Configuring Cloudera Manager Server Ports on page 33.
• Other
– Enable Cloudera usage data collection. For configuration instructions, see Managing Anonymous Usage Data
Collection on page 57.
– Set a custom header color and banner text for the Admin console.
– Set an "Information Assurance Policy" statement – this statement will be presented to every user before they
are allowed to access the login dialog box. The user must click "I Agree" in order to proceed to the login dialog
box.
– Disable/enable the auto-search for the Events panel at the bottom of a page.
• Support
– Configure diagnostic data collection properties. See Diagnostic Data Collection on page 58.
– Configure how to access Cloudera Manager help files.
• External Authentication - Specify the configuration to use LDAP, Active Directory, or an external program for
authentication. See Configuring External Authentication and Authorization for Cloudera Manager for instructions.
• Parcels - Configure settings for parcels, including the location of remote repositories that should be made available
for download, and other settings such as the frequency with which Cloudera Manager will check for new parcels,
limits on the number of downloads or concurrent distribution uploads. See Parcels for more information.
• Network - Configure proxy server settings. See Configuring Network Settings on page 50.
• Custom Service Descriptors - Configure custom service descriptor properties for Add-on Services on page 251.
Alerts
See Managing Alerts on page 349.
Users
See Cloudera Manager User Accounts.
Kerberos
See Enabling Kerberos Authentication for CDH.
License
See Managing Licenses on page 50.
Peers
See Designating a Replication Source on page 543.
• Service Monitor - collects health and metric information about services and activity information from the YARN
and Impala services
• Event Server - aggregates relevant Hadoop events and makes them available for alerting and searching
• Alert Publisher - generates and delivers alerts for certain types of events
• Reports Manager - generates reports that provide an historical view into disk utilization by user, user group, and
directory, processing activities by user and YARN pool, and HBase tables and namespaces. This role is not added
in Cloudera Express.
Cloudera Manager manages each role separately, instead of as part of the Cloudera Manager Server, for scalability
(for example, on large deployments it's useful to put the monitor roles on their own hosts) and isolation.
In addition, for certain editions of the Cloudera Enterprise license, the Cloudera Management Service provides the
Navigator Audit Server and Navigator Metadata Server roles for Cloudera Navigator.
2. Click Start to confirm. The Command Details window shows the progress of starting the roles.
3. When Command completed with n/n successful subcommands appears, the task is complete. Click Close.
2. Click Stop to confirm. The Command Details window shows the progress of stopping the roles.
3. When Command completed with n/n successful subcommands appears, the task is complete. Click Close.
• IP addresses
• Rack name
Click the View By Host button for an overview of the role assignment by hostname ranges.
• Select Use Custom Databases to specify the external database host and enter the database type, database
name, username, and password for the custom database.
• If you are adding the Oozie service, you can change your Oozie configuration to control when data is purged
to improve performance, reduce database disk usage, improve upgrade performance, or to keep the history
for a longer period of time. See Configuring Oozie Data Purge Settings Using Cloudera Manager.
7. Click Test Connection to confirm that Cloudera Manager can communicate with the database using the information
you have supplied. If the test succeeds in all cases, click Continue; otherwise, check and correct the information
you have provided for the database and then try the test again. (For some servers, if you are using the embedded
database, you will see a message saying the database will be created at a later step in the installation process.)
The Cluster Setup Review Changes screen displays.
8. Click Finish.
5. Check the checkboxes next to the Navigator Audit Server and Navigator Metadata Server roles.
6. Select Actions for Selected > Delete. Click Delete to confirm the deletion.
In contrast, the HDFS role instances (for example, NameNode and DataNode) obtain their configurations from a private
per-process directory, under /var/run/cloudera-scm-agent/process/unique-process-name. Giving each process
its own private execution and configuration environment allows Cloudera Manager to control each process
independently. For example, here are the contents of an example 879-hdfs-NAMENODE process directory:
$ tree -a /var/run/cloudera-scm-agent/process/879-hdfs-NAMENODE/
/var/run/cloudera-scm-agent/process/879-hdfs-NAMENODE/
cloudera_manager_agent_fencer.py
cloudera_manager_agent_fencer_secret_key.txt
cloudera-monitor.properties
core-site.xml
dfs_hosts_allow.txt
dfs_hosts_exclude.txt
event-filter-rules.json
hadoop-metrics2.properties
hdfs.keytab
hdfs-site.xml
log4j.properties
logs
stderr.log
stdout.log
topology.map
topology.py
• If the property allows a list of values, click the icon to the right of the edit field to add an additional field.
An example of this is the HDFS DataNode Data Directory property, which can have a comma-delimited list of
directories as its value. To remove an item from such a list, click the icon to the right of the field you want
to remove.
Many configuration properties have different values that are configured by multiple role groups. (See Role Groups
on page 268).
To edit configuration values for multiple role groups:
1. Go to the property. For example, the configuration panel for the Heap Dump Directory property displays the
DataNode Default Group (a role group), and a link that says ... and 6 others.
2. Click the ... and 6 others link to display all of the role groups:
3. Click the Show fewer link to collapse the list of role groups.
If you edit the single value for this property, Cloudera Manager applies the value to all role groups. To edit
the values for one or more of these role groups individually, click Edit Individual Values. Individual fields
display where you can edit the values for each role group. For example:
5. Click Save Changes to commit the changes. You can add a note that is included with the change in the Configuration
History. This changes the setting for the role group, and applies to all role instances associated with that role
group. Depending on the change you made, you may need to restart the service or roles associated with the
configuration you just changed. Or, you may need to redeploy your client configuration for the service. You should
see a message to that effect at the top of the Configuration page, and services will display an outdated configuration
(Restart Needed), (Refresh Needed), or outdated client configuration indicator. Click the indicator to display
the Stale Configurations on page 91 page.
icon. The default value is inserted and the icon turns into an Undo icon.
Explicitly setting a configuration to the same value as its default (inherited value) has the same effect as using the
icon.
There is no mechanism for resetting to an autoconfigured value. However, you can use the configuration history and
rollback feature to revert any configuration changes.
Autoconfiguration
Note: This page contains references to CDH 5 components or features that have been removed from
CDH 6. These references are only applicable if you are managing a CDH 5 cluster with Cloudera Manager
6. For more information, see Deprecated Items.
In some of these wizards, Cloudera Manager uses a set of rules to automatically configure certain settings to best suit
the characteristics of the deployment. For example, the number of hosts in the deployment drives the memory
requirements for certain monitoring daemons: the more hosts, the more memory is needed. Additionally, wizards that
are tasked with creating new roles will use a similar set of rules to determine an ideal host placement for those roles.
Scope
The following table shows, for each wizard, the scope of entities it affects during autoconfiguration and role-host
placement.
Certain autoconfiguration rules are unscoped, that is, they configure settings belonging to entities that aren't necessarily
the entities under the wizard's scope. These exceptions are explicitly listed.
Autoconfiguration
Cloudera Manager employs several different rules to drive automatic configuration, with some variation from wizard
to wizard. These rules range from the simple to the complex.
Configuration Scope
One of the points of complexity in autoconfiguration is configuration scope. The configuration hierarchy as it applies
to services is as follows: configurations may be modified at the service level (affecting every role in the service), role
group level (affecting every role instance in the group), or role level (affecting one role instance). A configuration found
in a lower level takes precedence over a configuration found in a higher level.
With the exception of the Static Service Pools and the Import MapReduce wizards, all Cloudera Manager wizards follow
a basic pattern:
1. Every role in scope is moved into its own, new, role group.
2. This role group is the receptacle for the role's "idealized" configuration. Much of this configuration is driven by
properties of the role's host, which can vary from role to role.
3. Once autoconfiguration is complete, new role groups with common configurations are merged.
4. The end result is a smaller set of role groups, each with an "idealized" configuration for some subset of the roles
in scope. A subset can have any number of roles; perhaps all of them, perhaps just one, and so on.
The Static Service Pools and Import MapReduce wizards configure role groups directly and do not perform any merging.
• Solr Servers
• Spark Standalone Workers
• Accumulo Tablet Servers
• Add-on services
YARN
yarn.nodemanager.resource.cpu-vcores - For each NodeManager role group, set to number of cores,
including hyperthreads, on one NodeManager member's host * service percentage chosen in
wizard.
All Services
Cgroup cpu.shares - For each role group that supports cpu.shares, set to max(20, (service percentage
chosen in wizard) * 20).
Cgroup blkio.weight - For each role group that supports blkio.weight, set to max(100, (service percentage
chosen in wizard) * 10).
Data Directories
Several autoconfiguration rules work with data directories, and there's a common sub-rule used by all such rules to
determine, out of all the mountpoints present on a host, which are appropriate for data. The subrule works as follows:
• The initial set of mountpoints for a host includes all those that are disk-backed. Network-backed mountpoints are
excluded.
• Mountpoints beginning with /boot, /cdrom, /usr, /tmp, /home, or /dev are excluded.
• Mountpoints beginning with /media are excluded, unless the backing device's name contains /xvd somewhere
in it.
• Mountpoints beginning with /var are excluded, unless they are /var or /var/lib.
• The largest mount point (in terms of total space, not available space) is determined.
• Other mountpoints with less than 1% total space of the largest are excluded.
• Mountpoints beginning with /var or equal to / are excluded unless they’re the largest mount point.
• Remaining mountpoints are sorted lexicographically and retained for future use.
Memory
The rules used to autoconfigure memory reservations are perhaps the most complicated rules employed by Cloudera
Manager. When configuring memory, Cloudera Manager must take into consideration which roles are likely to enjoy
more memory, and must not over commit hosts if at all possible. To that end, it needs to consider each host as an
entire unit, partitioning its available RAM into segments, one segment for each role. To make matters worse, some
roles have more than one memory segment. For example, a Solr server has two memory segments: a JVM heap used
for most memory allocation, and a JVM direct memory pool used for HDFS block caching. Here is the overall flow during
memory autoconfiguration:
1. The set of participants includes every host under scope as well as every {role, memory segment} pair on those
hosts. Some roles are under scope while others are not.
2. For each {role, segment} pair where the role is under scope, a rule is run to determine four different values for
that pair:
• Minimum memory configuration. Cloudera Manager must satisfy this minimum, possibly over-committing
the host if necessary.
• Minimum memory consumption. Like the above, but possibly scaled to account for inherent overhead. For
example, JVM memory values are multiplied by 1.3 to arrive at their consumption value.
• Ideal memory configuration. If RAM permits, Cloudera Manager will provide the pair with all of this memory.
• Ideal memory consumption. Like the above, but scaled if necessary.
3. For each {role, segment} pair where the role is not under scope, a rule is run to determine that pair's existing
memory consumption. Cloudera Manager will not configure this segment but will take it into consideration by
setting the pair's "minimum" and "ideal" to the memory consumption value.
4. For each host, the following steps are taken:
a. 20% of the host's available RAM is subtracted and reserved for the OS.
b. sum(minimum_consumption) and sum(ideal_consumption) are calculated.
c. An "availability ratio" is built by comparing the two sums against the host's available RAM.
a. If RAM < sum(minimum) ratio = 0
b. If RAM >= sum(ideal) ratio = 1
d. If the host has more available memory than the total of the ideal memory for all roles assigned to the host,
each role is assigned its ideal memory and autoconfiguration is finished.
e. Cloudera Manager assigns all available host memory by setting each {role, segment} pair to the same
consumption value, except in cases where that value is below the minimum memory or above the ideal
memory for that pair. In that case, it is set to the minimum memory or the ideal memory as appropriate. This
ensures that pairs with low ideal memory requirements are completely satisfied before pairs with higher
ideal memory requirements.
5. The {role, segment} pair is set with the value from the previous step. In the Static Service Pools wizard, the role
group is set just once (as opposed to each role).
6. Custom post-configuration rules are run.
Customization rules are applied in steps 2, 3, and 6. In step 2, there's a generic rule for most cases, as well as a series
of custom rules for certain {role, segment} pairs. Likewise, there's a generic rule to calculate memory consumption in
step 3, as well as some custom consumption functions for certain {role, segment} pairs. A simplified sketch of the
per-host assignment in step 4 appears below.
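The following Python sketch illustrates the per-host assignment in step 4 under simplifying assumptions: reserve 20% of RAM for the OS, give every {role, segment} pair its ideal value when possible, and otherwise "water-fill" a common level that is clamped to each pair's minimum and ideal. The water-level search and the data structures are illustrative; Cloudera Manager's actual implementation is internal and may differ.

def assign_host_memory(host_ram_bytes, pairs):
    # pairs: list of dicts with 'min' and 'ideal' consumption values in bytes.
    available = host_ram_bytes * 0.8                      # 20% reserved for the OS
    if available >= sum(p["ideal"] for p in pairs):
        return [p["ideal"] for p in pairs]                # every pair gets its ideal memory

    def total_at(level):
        # Each pair receives the common level, clamped to its [min, ideal] range.
        return sum(min(max(level, p["min"]), p["ideal"]) for p in pairs)

    lo, hi = 0.0, max(p["ideal"] for p in pairs)
    for _ in range(60):                                   # binary search for the common level
        mid = (lo + hi) / 2
        if total_at(mid) > available:
            hi = mid
        else:
            lo = mid
    return [min(max(lo, p["min"]), p["ideal"]) for p in pairs]

# Example: a 32 GB host (25.6 GB available) with three segments
GB = 1 << 30
pairs = [{"min": 1 * GB, "ideal": 4 * GB},
         {"min": 2 * GB, "ideal": 16 * GB},
         {"min": 4 * GB, "ideal": 24 * GB}]
print([round(x / GB, 2) for x in assign_host_memory(32 * GB, pairs)])
# approximately [4.0, 10.8, 10.8]: the pair with the lowest ideal is fully satisfied first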
Step 2 Generic Rule Excluding Static Service Pools Wizard
For every {role, segment} pair where the segment defines a default value, the pair's minimum is set to the segment's
minimum value (or 0 if undefined), and the ideal is set to the segment's default value.
Step 2 Custom Rules Excluding Static Service Pools Wizard
HDFS
For the NameNode and Secondary NameNode JVM heaps, the minimum is 50 MB and the ideal is max(4 GB,
sum_over_all(DataNode mountpoints’ available space) / 0.000008).
MapReduce
For the JobTracker JVM heap, the minimum is 50 MB and the ideal is max(1 GB, round((1 GB * 2.3717181092
* ln(number of TaskTrackers in MapReduce service)) - 2.6019933306)). If the number of TaskTrackers
<= 5, the ideal is 1 GB.
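As an illustrative check of the JobTracker heap formula (not Cloudera Manager code; the rounding here is approximate):

import math

def jobtracker_ideal_heap_gb(num_tasktrackers):
    # Ideal JobTracker heap in GB, per the rule above.
    if num_tasktrackers <= 5:
        return 1.0
    return max(1.0, round(2.3717181092 * math.log(num_tasktrackers) - 2.6019933306, 2))

print(jobtracker_ideal_heap_gb(5))     # 1.0
print(jobtracker_ideal_heap_gb(20))    # roughly 4.5
print(jobtracker_ideal_heap_gb(100))   # roughly 8.3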
For the mapper JVM heaps, the minimum is 1 and the ideal is the number of cores, including hyperthreads, on the
TaskTracker host. Memory consumption is scaled by mapred_child_java_opts_max_heap (the size of a task's
heap).
For the reducer JVM heaps, the minimum is 1 and the ideal is (number of cores, including hyperthreads,
on the TaskTracker host) / 2. Memory consumption is scaled by mapred_child_java_opts_max_heap
(the size of a task's heap).
HBase
For the memory total allowed for HBase RegionServer JVM heap, the minimum is 50 MB and the ideal is
min(31 GB, (total RAM on RegionServer host) * 0.64).
YARN
For the memory total allowed for containers, the minimum is 1 GB and the ideal is (total RAM on NodeManager
host) * 0.64.
Hue
With the exception of the Beeswax Server (only in CDH 4), Hue roles do not have memory limits. Therefore, Cloudera
Manager treats them as roles that consume a fixed amount of memory by setting their minimum and ideal consumption
values, but not their configuration values. The two consumption values are set to 256 MB.
Impala
With the exception of the Impala daemon, Impala roles do not have memory limits. Therefore, Cloudera Manager
treats them as roles that consume a fixed amount of memory by setting their minimum/ideal consumption values, but
not their configuration values. The two consumption values are set to 150 MB for the Catalog Server and 64 MB for
the StateStore.
For the Impala Daemon memory limit, the minimum is 256 MB and the ideal is (total RAM on daemon host) *
0.64.
Solr
For the Solr Server JVM heap, the minimum is 50 MB and the ideal is min(64 GB, (total RAM on Solr Server
host) * 0.64) / 2.6. For the Solr Server JVM direct memory segment, the minimum is 256 MB and the ideal is
min(64 GB, (total RAM on Solr Server host) * 0.64) / 2.
Additionally, there's a generic rule to handle cgroup.memory_limit_in_bytes, which is unused by Cloudera services
but is available for add-on services. Its behavior varies depending on whether the role in question has segments or
not.
With Segments
The minimum is min(cgroup.memory_limit_in_bytes_min (if it exists, otherwise 0), sum_over_all(segment
minimum consumption)), and the ideal is the sum of all segment ideal consumptions.
Without Segments
The minimum is cgroup.memory_limit_in_bytes_min (if exists) or 0, and the ideal is (total RAM on
role's host * 0.8 * service percentage chosen in wizard).
YARN
For the memory total allowed for containers, the minimum is 1 GB and the ideal is min(8 GB, (total RAM on
NodeManager host) * 0.8 * service percentage chosen in wizard).
Impala
For the Impala Daemon memory limit, the minimum is 256 MB and the ideal is ((total RAM on Daemon host)
* 0.8 * service percentage chosen in wizard).
MapReduce
• Mapper JVM heaps - the minimum is 1 and the ideal is (number of cores, including hyperthreads, on the TaskTracker
host * service percentage chosen in wizard). Memory consumption is scaled by
mapred_child_java_opts_max_heap (the size of a given task's heap).
• Reducer JVM heaps - the minimum is 1 and the ideal is (number of cores, including hyperthreads on the TaskTracker
host * service percentage chosen in wizard) / 2. Memory consumption is scaled by
mapred_child_java_opts_max_heap (the size of a given task's heap).
Impala
For the Impala Daemon, the memory consumption is 0 if YARN Service for Resource Management is set. If the memory
limit is defined but not -1, its value is used verbatim. If it's defined but -1, the consumption is equal to the total RAM
on the Daemon host. If it is undefined, the consumption is (total RAM * 0.8).
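A minimal sketch of the Impala Daemon consumption rule above, assuming a simple representation of the configured memory limit (the names are hypothetical, not Cloudera Manager internals):

def impala_daemon_consumption(total_ram, mem_limit=None, yarn_managed=False):
    # mem_limit: configured Impala Daemon memory limit in bytes, None if undefined,
    # or -1 if explicitly set to -1.
    if yarn_managed:                  # YARN Service for Resource Management is set
        return 0
    if mem_limit is None:             # undefined: 80% of the Daemon host's RAM
        return total_ram * 0.8
    if mem_limit == -1:               # -1: the total RAM on the Daemon host
        return total_ram
    return mem_limit                  # otherwise the configured value is used verbatim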
MapReduce
See Step 3 Custom Rules for Static Service Pools Wizard on page 84.
Solr
For the Solr Server JVM direct memory segment, the configured value is used verbatim as the consumption, provided
solr.hdfs.blockcache.enable and solr.hdfs.blockcache.direct.memory.allocation are both true.
Otherwise, the consumption is 0.
HDFS
• NameNode JVM heaps are equalized. For every pair of NameNodes in an HDFS service with different heap sizes,
the larger heap size is reset to the smaller one.
• JournalNode JVM heaps are equalized. For every pair of JournalNodes in an HDFS service with different heap sizes,
the larger heap size is reset to the smaller one.
• NameNode and Secondary NameNode JVM heaps are equalized. For every {NameNode, Secondary NameNode}
pair in an HDFS service with different heap sizes, the larger heap size is reset to the smaller one.
HBase
Master JVM heaps are equalized. For every pair of Masters in an HBase service with different heap sizes, the larger
heap size is reset to the smaller one.
Hive
Hive on Spark rules apply only when Hive depends on YARN. The following rules are applied (a simplified sketch of
these calculations appears after this list):
• Spark executor cores - Set to 4, 5, or 6. The value that results in the fewest "wasted" cores across the cluster is
used, where the number of cores wasted per host is the remainder of
yarn.nodemanager.resource.cpu-vcores / spark.executor.cores. In case of a tie, use the larger
value of Spark executor cores. If no host on the cluster has 4 or more cores, the value is set to the smallest value of
yarn.nodemanager.resource.cpu-vcores on the cluster.
• Spark executor memory - 85% of Spark executor memory allocated to spark.executor.memory and 15%
allocated to spark.yarn.executor.memoryOverhead. The total memory is the YARN container memory split
evenly between the maximum number of executors that can run on a host. This is
yarn.nodemanager.resource.memory-mb / floor(yarn.nodemanager.resource.cpu-vcores /
spark.executor.cores). When the memory or vcores vary across hosts in the cluster, choose the smallest
calculated value for Spark executor memory.
• Spark driver memory - 90% of Spark driver memory allocated to spark.driver.memory and 10% allocated to
spark.yarn.driver.memoryOverhead. The total memory is based on the lowest value of
yarn.nodemanager.resource.memory-mb across the cluster.
• Total memory is:
– 12 GB when yarn.nodemanager.resource.memory-mb > 50 GB.
– 4 GB when yarn.nodemanager.resource.memory-mb is between 12 GB and 50 GB.
– 1 GB when yarn.nodemanager.resource.memory-mb is between 1 GB and 12 GB.
– 256 MB when yarn.nodemanager.resource.memory-mb < 1 GB.
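The following Python sketch illustrates these Hive on Spark calculations under simplifying assumptions. The function names, rounding, and the guard against zero executors are illustrative; this is not Cloudera Manager code.

def pick_executor_cores(host_vcores_list):
    # Choose 4, 5, or 6 executor cores so the fewest vcores are "wasted" cluster-wide;
    # ties go to the larger value. If no host has 4 or more vcores, use the smallest
    # yarn.nodemanager.resource.cpu-vcores value on the cluster.
    if max(host_vcores_list) < 4:
        return min(host_vcores_list)
    best = None
    for cores in (4, 5, 6):
        wasted = sum(v % cores for v in host_vcores_list)
        if best is None or wasted < best[0] or (wasted == best[0] and cores > best[1]):
            best = (wasted, cores)
    return best[1]

def executor_memory(nm_memory_mb, nm_vcores, executor_cores):
    # Split the per-executor share of container memory 85% / 15% between
    # spark.executor.memory and spark.yarn.executor.memoryOverhead.
    executors_per_host = max(1, nm_vcores // executor_cores)   # guard is a simplification
    per_executor = nm_memory_mb / executors_per_host
    return {"spark.executor.memory": 0.85 * per_executor,
            "spark.yarn.executor.memoryOverhead": 0.15 * per_executor}

def driver_memory(min_nm_memory_mb):
    # Tiered total driver memory (in MB), split 90% / 10% between
    # spark.driver.memory and spark.yarn.driver.memoryOverhead.
    gb = 1024
    if min_nm_memory_mb > 50 * gb:
        total = 12 * gb
    elif min_nm_memory_mb >= 12 * gb:
        total = 4 * gb
    elif min_nm_memory_mb >= 1 * gb:
        total = 1 * gb
    else:
        total = 256
    return {"spark.driver.memory": 0.9 * total,
            "spark.yarn.driver.memoryOverhead": 0.1 * total}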
Impala
If an Impala service has YARN Service for Resource Management set, every Impala Daemon memory limit is set to the
value of (yarn.nodemanager.resource.memory-mb * 1 GB) if there's a YARN NodeManager co-located with the
Impala Daemon.
MapReduce
JobTracker JVM heaps are equalized. For every pair of JobTrackers in a MapReduce service with different heap sizes,
the larger heap size is reset to the smaller one.
Oozie
Oozie Server JVM heaps are equalized. For every pair of Oozie Servers in an Oozie service with different heap sizes,
the larger heap size is reset to the smaller one.
YARN
ResourceManager JVM heaps are equalized. For every pair of ResourceManagers in a YARN service with different heap
sizes, the larger heap size is reset to the smaller one.
ZooKeeper
ZooKeeper Server JVM heaps are equalized. For every pair of servers in a ZooKeeper service with different heap sizes,
the larger heap size is reset to the smaller one.
General Rules
HBase
• hbase.replication - For each HBase service, set to true if there's a Key-Value Store Indexer service in the
cluster. This rule is unscoped; it can fire even if the HBase service is not under scope.
• replication.replicationsource.implementation - For each HBase service, set to
com.ngdata.sep.impl.SepReplicationSource if there's a Key-Value Store Indexer service in the cluster. This rule
is unscoped; it can fire even if the HBase service is not under scope.
HDFS
• dfs.datanode.du.reserved - For each DataNode, set to min((total space of DataNode host largest
mountpoint) / 10, 10 GB).
• dfs.namenode.name.dir - For each NameNode, set to the first two mountpoints on the NameNode host with
/dfs/nn appended.
• dfs.namenode.checkpoint.dir - For each Secondary NameNode, set to the first mountpoint on the Secondary
NameNode host with /dfs/snn appended.
• dfs.datanode.data.dir - For each DataNode, set to all the mountpoints on the host with /dfs/dn appended.
• dfs.journalnode.edits.dir - For each JournalNode, set to the first mountpoint on the JournalNode host
with /dfs/jn appended.
• dfs.datanode.failed.volumes.tolerated - For each DataNode, set to (number of mountpoints on DataNode
host) / 2.
• dfs.namenode.service.handler.count and dfs.namenode.handler.count - For each NameNode, set
to ln(number of DataNodes in this HDFS service) * 20.
• dfs.datanode.hdfs-blocks-metadata.enabled - For each HDFS service, set to true if there's an Impala
service in the cluster. This rule is unscoped; it can fire even if the HDFS service is not under scope.
• dfs.client.read.shortcircuit - For each HDFS service, set to true if there's an Impala service in the cluster.
This rule is unscoped; it can fire even if the HDFS service is not under scope.
• dfs.datanode.data.dir.perm - For each DataNode, set to 755 if there's an Impala service in the cluster and
the cluster isn’t Kerberized. This rule is unscoped; it can fire even if the HDFS service is not under scope.
• fs.trash.interval - For each HDFS service, set to 1.
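A few of the HDFS rules above can be expressed as simple calculations. The following Python sketch is illustrative only; the function names are hypothetical and this is not Cloudera Manager code.

import math

GB = 1 << 30

def du_reserved(largest_mountpoint_total_bytes):
    # dfs.datanode.du.reserved = min(largest mountpoint total space / 10, 10 GB)
    return min(largest_mountpoint_total_bytes / 10, 10 * GB)

def namenode_handler_count(num_datanodes):
    # dfs.namenode.handler.count = ln(number of DataNodes) * 20
    return int(math.log(num_datanodes) * 20)

def failed_volumes_tolerated(num_mountpoints):
    # dfs.datanode.failed.volumes.tolerated = (number of mountpoints) / 2
    return num_mountpoints // 2

print(du_reserved(4 * 1024 * GB) / GB)   # 10.0 (capped at 10 GB)
print(namenode_handler_count(50))        # 78
print(failed_volumes_tolerated(12))      # 6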
Hue
• WebHDFS dependency - For each Hue service, set to either the first HttpFS role in the cluster, or, if there are
none, the first NameNode in the cluster.
• HBase Thrift Server dependency - For each Hue service in a CDH 4.4 or higher cluster, set to the first HBase Thrift
Server in the cluster.
Impala
For each Impala service, set Enable Audit Collection and Enable Lineage Collection to true if there's a Cloudera
Management Service with Navigator Audit Server and Navigator Metadata Server roles. This rule is unscoped; it can
fire even if the Impala service is not under scope.
MapReduce
• mapred.local.dir - For each JobTracker, set to the first mountpoint on the JobTracker host with /mapred/jt
appended.
• mapred.local.dir - For each TaskTracker, set to all the mountpoints on the host with /mapred/local
appended.
• mapred.reduce.tasks - For each MapReduce service, set to max(1, sum_over_all(TaskTracker number
of reduce tasks (determined via mapred.tasktracker.reduce.tasks.maximum for that
TaskTracker, which is configured separately)) / 2).
• mapred.job.tracker.handler.count - For each JobTracker, set to max(10, ln(number of TaskTrackers
in this MapReduce service) * 20).
• mapred.submit.replication - If there's an HDFS service in the cluster, for each MapReduce service, set to
max(min(number of DataNodes in the HDFS service, value of HDFS Replication Factor),
sqrt(number of DataNodes in the HDFS service)).
• mapred.tasktracker.instrumentation - If there's a management service, for each MapReduce service, set
to org.apache.hadoop.mapred.TaskTrackerCmonInst. This rule is unscoped; it can fire even if the MapReduce
service is not under scope.
YARN
• yarn.nodemanager.local-dirs - For each NodeManager, set to all the mountpoints on the NodeManager
host with /yarn/nm appended.
• yarn.nodemanager.resource.cpu-vcores - For each NodeManager, set to the number of cores (including
hyperthreads) on the NodeManager host.
• mapred.reduce.tasks - For each YARN service, set to max(1,sum_over_all(NodeManager number of
cores, including hyperthreads) / 2).
• yarn.resourcemanager.nodemanagers.heartbeat-interval-ms - For each NodeManager, set to max(100,
10 * (number of NodeManagers in this YARN service)).
• yarn.scheduler.maximum-allocation-vcores - For each ResourceManager, set to
max_over_all(NodeManager number of vcores (determined via
yarn.nodemanager.resource.cpu-vcores for that NodeManager, which is configured
separately)).
• yarn.scheduler.maximum-allocation-mb - For each ResourceManager, set to max_over_all(NodeManager
amount of RAM (determined via yarn.nodemanager.resource.memory-mb for that NodeManager,
which is configured separately)).
• mapreduce.client.submit.file.replication - If there's an HDFS service in the cluster, for each YARN
service, set to max(min(number of DataNodes in the HDFS service, value of HDFS Replication
Factor), sqrt(number of DataNodes in the HDFS service)).
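The submit-replication rule above (used for both mapred.submit.replication and mapreduce.client.submit.file.replication) can be checked with a short sketch. The function name and the default replication factor of 3 are assumptions for illustration.

import math

def submit_replication(num_datanodes, hdfs_replication_factor=3):
    # max(min(number of DataNodes, HDFS Replication Factor), sqrt(number of DataNodes))
    return int(max(min(num_datanodes, hdfs_replication_factor),
                   math.sqrt(num_datanodes)))

print(submit_replication(2))     # 2
print(submit_replication(100))   # 10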
All Services
If a service dependency is unset, and a service with the desired type exists in the cluster, set the service dependency
to the first such target service. Applies to all service dependencies except YARN Service for Resource Management.
Applies only to the Installation and Add Cluster wizards.
Role-Host Placement
Cloudera Manager employs the same role-host placement rule regardless of wizard. The set of hosts considered
depends on the scope. If the scope is a cluster, all hosts in the cluster are included. If a service, all hosts in the service's
cluster are included. If the Cloudera Management Service, all hosts in the deployment are included. The rules are as
follows:
1. The hosts are sorted from most to least physical RAM. Ties are broken by sorting on hostname (ascending) followed
by host identifier (ascending).
2. The overall number of hosts is used to determine which arrangement to use. These arrangements are hard-coded,
each dictating, for a given "master" role type, which index (or indexes) into the sorted host list from step 1 to use.
Custom Configuration
Minimum Required Role: Configurator (also provided by Cluster Administrator, Full Administrator)
Cloudera Manager exposes properties that allow you to insert custom configuration text into XML configuration,
property, and text files, or into an environment. The naming convention for these properties is: XXX Advanced
Configuration Snippet (Safety Valve) for YYY or XXX YYY Advanced Configuration Snippet (Safety Valve), where XXX
is a service or role and YYY is the target.
The values you enter into a configuration snippet must conform to the syntax of the target. For an XML configuration
file, the configuration snippet must contain valid XML property definitions. For a properties file, the configuration
snippet must contain valid property definitions. Some files simply require a list of host addresses.
The configuration snippet mechanism is intended for use in cases where there is a configuration setting that is not
exposed as a configuration property in Cloudera Manager. Configuration snippets generally override normal configuration.
Contact Cloudera Support if you are required to use a configuration snippet that is not explicitly documented.
Service-wide configuration snippets apply to all roles in the service; a configuration snippet for a role group applies to
all instances of the role associated with that role group.
Server and client configurations have separate configuration snippets. In general, after changing a server configuration
snippet you must restart the server, and after changing a client configuration snippet you must redeploy the client
configuration. Sometimes you can refresh instead of restart. In some cases, however, you must restart a dependent
server after changing a client configuration. For example, changing a MapReduce client configuration marks the
dependent Hive server as stale, and the Hive server must be restarted. The Admin Console displays an indicator when a server must
be restarted. In addition, the All Configuration Issues tab on the Home page indicates the actions you must perform
to resolve stale configurations.
Metrics
Set properties to configure Hadoop metrics in a hadoop-metrics.properties or hadoop-metrics2.properties
file.
Syntax:
key1=value1
key2=value2
For example:
*.sink.foo.class=org.apache.hadoop.metrics2.sink.FileSink
namenode.sink.foo.filename=/tmp/namenode-metrics.out
secondarynamenode.sink.foo.filename=/tmp/secondarynamenode-metrics.out
Whitelists and blacklists
Specify a list of host addresses that are allowed or disallowed from accessing a service.
Syntax:
host1.domain1 host2.domain2
2. Specify the snippet properties. If the snippet is an XML file, you have the option to use a snippet editor (the default)
or an XML text field:
• Snippet editor
1. Click the + icon to add a property. Enter the property name, value, and optional description. To indicate that
the property value cannot be overridden by another configuration, select the Final checkbox.
• XML text field - Enter the property name, value, and optional description as XML elements:
<property>
<name>name</name>
<value>property_value</value>
<final>final_value</final>
</property>
Stale Configurations
Minimum Required Role: Configurator (also provided by Cluster Administrator, Full Administrator)
The Stale Configurations page provides differential views of changes made in a cluster. For any configuration change,
the page contains entries of all affected attributes. For example, the following File entry shows the change to the file
hdfs-site.xml when you update the property controlling how much disk space is reserved for non-HDFS use on
each DataNode:
To display the entities affected by a change, click the Show button at the right of the entry. The following dialog box
shows that three DataNodes were affected by the disk space change:
Attribute Categories
The categories of attributes include:
• Environment - represents environment variables set for the role. For example, the following entry shows the
change to the environment that occurs when you update the heap memory configuration of the
SecondaryNameNode.
Client Configuration Files
Minimum Required Role: Configurator (also provided by Cluster Administrator, Full Administrator)
To allow clients to use the HBase, HDFS, Hive, MapReduce, and YARN services, Cloudera Manager creates zip archives
of the configuration files containing the service properties. The zip archive is referred to as a client configuration file.
Each archive contains the set of configuration files needed to access the service: for example, the MapReduce client
configuration file contains copies of core-site.xml, hadoop-env.sh, hdfs-site.xml, log4j.properties, and
mapred-site.xml.
Client configuration files are generated automatically by Cloudera Manager based on the services and roles you have
installed and Cloudera Manager deploys these configurations automatically when you install your cluster, add a service
on a host, or add a gateway role on a host. Specifically, for each host that has a service role instance installed, and for
each host that is configured as a gateway role for that service, the deploy function downloads the configuration zip
file, unzips it into the appropriate configuration directory, and uses the Linux alternatives mechanism to set a given,
configurable priority level. If you are installing on a system that happens to have pre-existing alternatives, then it is
possible another alternative may have higher priority and will continue to be used. The alternatives priority of the
Cloudera Manager client configuration is configurable under the Gateway scope of the Configuration tab for the
appropriate service.
You can also manually distribute client configuration files to the clients of a service.
The main circumstance that may require a redeployment of the client configuration files is when you have modified a
configuration. In this case you will typically see a message instructing you to redeploy your client configurations. The
affected services also display an icon. Click the icon to display the Stale Configurations on page 91 page.
To download client configuration files from the Home page:
1. On the Home > Status tab, click the drop-down arrow to the right of the cluster name and select View Client
Configuration URLs. A pop-up window with links to the configuration files for the services you have installed
displays.
2. Click a link or save the link URL and download the file using wget or curl.
Note: If you are deploying client configurations on a host that has multiple services installed, some
of the same configuration files, though with different configurations, will be installed in the conf
directories for each service. Cloudera Manager uses the priority parameter in the alternatives
--install command to ensure that the correct configuration directory is made active based on the
combination of services on that host. The priority order is YARN > MapReduce > HDFS. The priority
can be configured under the Gateway sections of the Configuration tab for the appropriate service.
To deploy client configuration files:
1. On the Home > Status tab, click the drop-down arrow to the right of the cluster name and select Deploy Client
Configuration.
2. Click Deploy Client Configuration.
Important: This feature requires a Cloudera Enterprise license. It is not available in Cloudera Express.
See Managing Licenses on page 50 for more information.
Whenever you change and save a set of configuration settings for a service or role instance or a host, Cloudera Manager
saves a revision of the previous settings and the name of the user who made the changes. You can then view past
revisions of the configuration settings, and, if desired, roll back the settings to a previous state.
• By default, or if you click Show All, a list of all revisions is shown. If you are viewing a service or role instance,
all service/role group related revisions are shown. If you are viewing a host or all hosts, all host/all hosts
related revisions are shown.
• To list only the configuration revisions that were done in a particular time period, use the Time Range Selector
to select a time range. Then, click Show within the Selected Time Range.
3. Click the Details... link. The Revision Details dialog box displays.
Important: This feature can only be used to revert changes to configuration values. You cannot use
this feature to:
• Revert NameNode high availability. You must perform this action by explicitly disabling high
availability.
• Disable Kerberos security.
• Revert role group actions (creating, deleting, or moving membership among groups). You must
perform these actions explicitly in the Role Groups on page 268 feature.
Managing Clusters
Cloudera Manager can manage multiple clusters; however, each cluster can be associated with only a single Cloudera
Manager Server or Cloudera Manager HA pair. Once you have successfully installed your first cluster, you can add
additional clusters, running the same or a different version of CDH. You can then manage each cluster and its services
independently.
On the Home > Status tab you can access many cluster-wide actions by clicking the drop-down arrow to the right of
the cluster name: add a service, start, stop, restart, deploy client configurations, enable Kerberos, and perform cluster
refresh, rename, upgrade, and maintenance mode actions.
Important: As of February 1, 2021, all downloads of CDH and Cloudera Manager require a username
and password and use a modified URL. You must use the modified URL, including the username and
password when downloading the repository contents described below. You may need to upgrade
Cloudera Manager to a newer version that uses the modified URLs.
This can affect new installations, upgrades, adding new hosts to a cluster, downloading a new parcel,
and adding a cluster.
For more information, see Updating an existing CDH/Cloudera Manager deployment to access
downloads with authentication.
Cloudera Manager can manage multiple clusters. Furthermore, the clusters do not need to run the same major version
of CDH. Starting with Cloudera Enterprise 6.2, you can create dedicated compute clusters with access to data in a base
cluster. For more information, see Virtual Private Clusters and Cloudera SDX on page 108.
Cluster Basics
The Cluster Basics page allows you to specify the Cluster Name and select the Cluster Type:
• Regular Cluster: A Regular Cluster contains storage nodes, compute nodes, and other services such as metadata
and security collocated in a single cluster.
• Compute Cluster: A Compute Cluster consists of only compute nodes. To connect to existing storage, metadata
or security services, you must first choose or create a Data Context on a Base Cluster.
If you are performing a new installation, Regular Cluster is the only option. You cannot add a compute cluster if you
do not have an existing base cluster.
For more information on regular and compute clusters, and data contexts, see Virtual Private Clusters and Cloudera
SDX on page 108.
If you are adding a compute cluster to an existing base cluster, click Choose Data Context... to create or select a Data
Context.
After selecting a cluster type and data context (if applicable), enter a cluster name and then click Continue.
Setup Auto-TLS
The Setup Auto-TLS page provides instructions for initializing the certificate manager for auto-TLS if you have not done
so already. If you already initialized the certificate manager in Step 3: Install Cloudera Manager Server, the wizard
displays a message indicating that auto-TLS has been initialized. Click Continue to proceed with the installation.
If you have not already initialized the certificate manager, and you want to enable auto-TLS, follow the instructions
provided on the page before continuing. When you reload the page as instructed, you are redirected to
https://<server_host>:7183, and a security warning is displayed. You might need to indicate that you trust the
certificate, or click to proceed to the Cloudera Manager Server host. You might also be required to log in again and
re-complete the previous steps in the wizard.
For more information, see Configuring TLS Encryption for Cloudera Manager and CDH Using Auto-TLS.
If you do not want to enable auto-TLS at this time, click Continue to proceed.
Specify Hosts
Note: This section covers the procedure for new hosts only. For existing managed hosts, see Adding
a Cluster Using Currently Managed Hosts on page 101.
Choose which hosts will run CDH and other managed services.
1. To enable Cloudera Manager to automatically discover hosts on which to install CDH and managed services, enter
the cluster hostnames or IP addresses in the Hostnames field. You can specify hostname and IP address ranges
as follows:
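For example (illustrative values), the range expression 10.1.1.[1-4] matches the addresses 10.1.1.1 through 10.1.1.4, and host[1-3].example.com matches host1.example.com, host2.example.com, and host3.example.com.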
Important: Unqualified hostnames (short names) must be unique in a Cloudera Manager instance.
For example, you cannot have both host01.example.com and host01.standby.example.com
managed by the same Cloudera Manager Server.
You can specify multiple addresses and address ranges by separating them with commas, semicolons, tabs, or
blank spaces, or by placing them on separate lines. Use this technique to make searches more specific instead of
scanning overly wide ranges. By default, only hosts that respond to SSH are selected for inclusion in your cluster.
You can enter an address range that spans unused addresses and then clear the nonexistent hosts later in the
procedure, but wider ranges require more time to scan.
2. Click Search. If there are a large number of hosts on your cluster, wait a few moments to allow them to be
discovered and shown in the wizard. If the search is taking too long, you can stop the scan by clicking Abort Scan.
You can modify the search pattern and repeat the search as many times as you need until you see all of the
expected hosts.
Note: Cloudera Manager scans hosts by checking for network connectivity. If there are some
hosts where you want to install services that are not shown in the list, make sure you have network
connectivity between the Cloudera Manager Server host and those hosts, and that firewalls and
SE Linux are not blocking access.
3. Verify that the number of hosts shown matches the number of hosts where you want to install services. Clear
host entries that do not exist or where you do not want to install services.
4. Click Continue.
The Select Repository screen displays.
Select Repository
Important: You cannot install software using both parcels and packages in the same cluster.
The Select Repository page allows you to specify repositories for Cloudera Manager Agent and CDH and other software.
In the Cloudera Manager Agent section:
1. Select either Public Cloudera Repository or Custom Repository for the Cloudera Manager Agent software.
2. If you select Custom Repository, do not include the operating system-specific paths in the URL. For instructions
on setting up a custom repository, see Configuring a Local Package Repository.
In the CDH and other software section:
1. Select the repository type to use for the installation. In the Install Method section select one of the following:
• Use Parcels (Recommended)
A parcel is a binary distribution format containing the program files, along with additional metadata used by
Cloudera Manager. Parcels are required for rolling upgrades. For more information, see Parcels.
• Use Packages
A package is a standard binary distribution format that contains compiled code and meta-information such
as a package description, version, and dependencies. Packages are installed using your operating system
package manager.
2. Select the version of CDH to install. For compute clusters using parcels, the supported CDH versions display
(Supported) next to the parcel name. For compute clusters using packages, you must make sure that you have
installed a supported CDH version on all compute cluster hosts.
a. If you selected Use Parcels and you do not see the version you want to install, click the More Options button
to add the repository URL for your version. Repository URLs for CDH 6 versions are documented in CDH 6
Download Information. After adding the repository, click Save Changes and wait a few seconds for the version
to appear. If your Cloudera Manager host uses an HTTP proxy, click the Proxy Settings button to configure
your proxy.
Note: Cloudera Manager only displays CDH versions it can support. If an available CDH
version is too new for your Cloudera Manager version, it is not displayed.
b. If you selected Use Packages, and the version you want to install is not listed, you can select Custom Repository
to specify a repository that contains the desired version. Repository URLs for CDH 6 versions are documented
in CDH 6 Download Information.
3. If you selected Use Parcels, specify any Additional Parcels you want to install. If you are installing CDH 6, do not
select the KAFKA, KUDU, or SPARK parcels, because they are included in CDH 6.
4. Click Continue.
The Accept JDK License page displays.
Note: Cloudera, Inc. acquired Oracle JDK software under the Oracle Binary Code License Agreement.
Pursuant to Item D(v)(a) of the SUPPLEMENTAL LICENSE TERMS of the Oracle Binary Code License
Agreement, use of JDK software is governed by the terms of the Oracle Binary Code License Agreement.
By installing the JDK software, you agree to be bound by these terms. If you do not wish to be bound
by these terms, then do not install the Oracle JDK.
To allow Cloudera Manager to automatically install the Oracle JDK on cluster hosts, read the JDK license and check the
box labeled Install Oracle Java SE Development Kit (JDK8) if you accept the terms. If you installed your own Oracle
JDK version in Step 2: Install Java Development Kit, leave the box unchecked.
If you allow Cloudera Manager to install the JDK, a second checkbox appears, labeled Install Java Unlimited Strength
Encryption Policy Files. These policy files are required to enable AES-256 encryption in JDK versions lower than 1.8u161.
JDK 1.8u161 and higher enable unlimited strength encryption by default, and do not require policy files.
After reading the license terms and checking the applicable boxes, click Continue.
Install Agents
The Install Agents page displays the progress of the installation. You can click on the Details link for any host to view
the installation log. If the installation is stalled, you can click the Abort Installation button to cancel the installation
and then view the installation logs to troubleshoot the problem.
If the installation fails on any hosts, you can click the Retry Failed Hosts button to retry all failed hosts, or you can click the
Retry link on a specific host.
If you selected the option to manually install agents, see Manually Install Cloudera Manager Agent Packages for the
procedure and then continue with the next steps on this page.
After installing the Cloudera Manager Agent on all hosts, click Continue.
If you are using parcels, the Install Parcels page displays. If you chose to install using packages, the Detecting CDH
Versions page displays.
Install Parcels
If you selected parcels for the installation method, the Install Parcels page reports the installation progress of the
parcels you selected earlier. After the parcels are downloaded, progress bars appear representing each cluster host.
You can click on an individual progress bar for details about that host.
After the installation is complete, click Continue.
The Inspect Cluster page displays.
Inspect Cluster
The Inspect Cluster page provides a tool for inspecting network performance as well as the Host Inspector to search
for common configuration problems. Cloudera recommends that you run the inspectors sequentially:
1. Run the Inspect Network Performance tool. You can click Advanced Options to customize some ping parameters.
2. After the network inspector completes, click Show Inspector Results to view the results in a new tab.
3. Address any reported issues, and click Run Again (if applicable).
4. Click Inspect Hosts to run the Host Inspector utility.
5. After the host inspector completes, click Show Inspector Results to view the results in a new tab.
6. Address any reported issues, and click Run Again (if applicable).
If the reported issues cannot be resolved in a timely manner and you want to abandon the cluster creation wizard to
address them, select the radio button labeled Quit the wizard and Cloudera Manager will delete the temporarily
created cluster, and then click Continue.
Otherwise, after addressing any identified problems, select the radio button labeled I understand the risks, let me
continue with cluster creation, and then click Continue.
This completes the Add Cluster - Installation wizard and launches the Add Cluster - Configuration wizard. For further
instructions, see Step 7: Set Up a Cluster Using the Wizard in the installation guide.
Adding a Cluster Using Currently Managed Hosts
Important: This section covers the procedure for creating a cluster from existing managed hosts only.
For instructions using new (currently unmanaged) hosts, see Adding a Cluster Using New Hosts on
page 97.
Before continuing, make sure that the managed hosts have the desired CDH version packages
pre-installed.
On the Cloudera Manager Home page, click the Add drop-down button at the top right, or the Clusters drop-down
button at the top left, and then click Add Cluster. This launches the Add Cluster - Installation wizard, which allows you
to create either a regular cluster or a compute cluster .
You can also launch the wizard by selecting Add Compute Cluster from the drop-down menu next to the cluster name.
Launching the wizard from there skips the Welcome page and restricts the wizard to creating only a compute cluster.
The following sections guide you through each page of the wizard:
Cluster Basics
The Cluster Basics page allows you to specify the Cluster Name and select the Cluster Type:
• Regular Cluster: A Regular Cluster contains storage nodes, compute nodes, and other services such as metadata
and security collocated in a single cluster.
• Compute Cluster: A Compute Cluster consists of only compute nodes. To connect to existing storage, metadata
or security services, you must first choose or create a Data Context on a Base Cluster.
For more information on regular and compute clusters, and data contexts, see Virtual Private Clusters and Cloudera
SDX on page 108.
If you are adding a compute cluster to an existing base cluster, click Choose Data Context... to create or select a Data
Context.
After selecting a cluster type and data context (if applicable), enter a cluster name and then click Continue.
Setup Auto-TLS
The Setup Auto-TLS page provides instructions for initializing the certificate manager for auto-TLS if you have not done
so already. If you already initialized the certificate manager in Step 3: Install Cloudera Manager Server, the wizard
displays a message indicating that auto-TLS has been initialized. Click Continue to proceed with the installation.
If you have not already initialized the certificate manager, and you want to enable auto-TLS, follow the instructions
provided on the page before continuing. When you reload the page as instructed, you are redirected to
https://<server_host>:7183, and a security warning is displayed. You might need to indicate that you trust the
certificate, or click to proceed to the Cloudera Manager Server host. You might also be required to log in again and
re-complete the previous steps in the wizard.
For more information, see Configuring TLS Encryption for Cloudera Manager and CDH Using Auto-TLS.
If you do not want to enable auto-TLS at this time, click Continue to proceed.
Specify Hosts
Note: This section covers the procedure for creating a cluster from existing managed hosts only. For
instructions using new (currently unmanaged) hosts, see Adding a Cluster Using New Hosts on page
97.
Select the hosts for your cluster by clicking the Currently Managed Hosts tab. This tab does not appear if you have no
unassigned managed hosts. You cannot select a mixture of new hosts and currently managed hosts.
If you are installing CDH and other services using packages instead of parcels, make sure that you have manually
installed the CDH packages on each host before continuing.
Select the hosts you want to add to the cluster, and then click Continue.
Select Repository
Important: You cannot install software using both parcels and packages in the same cluster.
The Select Repository page allows you to specify repositories for Cloudera Manager Agent and CDH and other software.
If you are installing on currently managed hosts, the Cloudera Manager Agent section is not displayed.
In the Cloudera Manager Agent section:
1. Select either Public Cloudera Repository or Custom Repository for the Cloudera Manager Agent software.
2. If you select Custom Repository, do not include the operating system-specific paths in the URL. For instructions
on setting up a custom repository, see Configuring a Local Package Repository.
In the CDH and other software section:
1. Select the repository type to use for the installation. In the Install Method section select one of the following:
• Use Parcels (Recommended)
A parcel is a binary distribution format containing the program files, along with additional metadata used by
Cloudera Manager. Parcels are required for rolling upgrades. For more information, see Parcels.
• Use Packages
A package is a standard binary distribution format that contains compiled code and meta-information such
as a package description, version, and dependencies. Packages are installed using your operating system
package manager.
If you select Use Packages, make sure that you have manually installed the CDH packages on each host before
continuing.
2. Select the version of CDH to install. For compute clusters using parcels, the supported CDH versions display
(Supported) next to the parcel name. For compute clusters using packages, you must make sure that you have
installed a supported CDH version on all compute cluster hosts.
If you selected Use Parcels and you do not see the version you want to install, click the More Options button to
add the repository URL for your version. Repository URLs for CDH 6 versions are documented in CDH 6 Download
Information. After adding the repository, click Save Changes and wait a few seconds for the version to appear. If
your Cloudera Manager host uses an HTTP proxy, click the Proxy Settings button to configure your proxy.
Note: Cloudera Manager only displays CDH versions it can support. If an available CDH version
is too new for your Cloudera Manager version, it is not displayed.
3. If you selected Use Parcels, specify any Additional Parcels you want to install. If you are installing CDH 6, do not
select the KAFKA, KUDU, or SPARK parcels, because they are included in CDH 6.
4. Click Continue.
If you are using parcels, the Install Parcels page displays. If you chose to install using packages, the Detecting CDH
Versions page displays.
Install Parcels
If you selected parcels for the installation method, the Install Parcels page reports the installation progress of the
parcels you selected earlier. After the parcels are downloaded, progress bars appear representing each cluster host.
You can click on an individual progress bar for details about that host.
After the installation is complete, click Continue.
The Inspect Cluster page displays.
Inspect Cluster
The Inspect Cluster page provides a tool for inspecting network performance as well as the Host Inspector to search
for common configuration problems. Cloudera recommends that you run the inspectors sequentially:
1. Run the Inspect Network Performance tool. For compute clusters, you can test the network performance between
the compute cluster and its base cluster, as well as within the compute cluster itself.
You can also click Advanced Options to customize some ping parameters.
2. After the network inspector completes, click Show Inspector Results to view the results in a new tab.
3. Address any reported issues, and click Run Again (if applicable).
4. Click Inspect Hosts to run the Host Inspector utility.
5. After the host inspector completes, click Show Inspector Results to view the results in a new tab.
6. Address any reported issues, and click Run Again (if applicable).
If the reported issues cannot be resolved in a timely manner and you want to abandon the cluster creation wizard to
address them, select the radio button labeled Quit the wizard and Cloudera Manager will delete the temporarily
created cluster, and then click Continue.
Otherwise, after addressing any identified problems, select the radio button labeled I understand the risks, let me
continue with cluster creation, and then click Continue.
This completes the Add Cluster - Installation wizard and launches the Add Cluster - Configuration wizard. For further
instructions, see Step 7: Set Up a Cluster Using the Wizard in the installation guide.
Deleting a Cluster
To delete a cluster:
1. Stop the cluster.
2. On the Home > Status tab, click the drop-down arrow to the right of the cluster name and select Delete.
Starting a Cluster
1. On the Home > Status tab, click the drop-down arrow to the right of the cluster name and select Start.
Note: The cluster-level Start action starts only CDH and other product services (Impala, Cloudera
Search). It does not start the Cloudera Management Service. You must start the Cloudera Management
Service separately if it is not already running.
Stopping a Cluster
1. On the Home > Status tab, click the drop-down arrow to the right of the cluster name and select Stop.
Note: The cluster-level Stop action does not stop the Cloudera Management Service. You must stop
the Cloudera Management Service separately.
Refreshing a Cluster
Runs a cluster refresh action to bring the configuration up to date without restarting all services. For example, certain
masters (for example NameNode and ResourceManager) have some configuration files (for example,
fair-scheduler.xml, mapred_hosts_allow.txt, topology.map) that can be refreshed. If anything changes in
those files then a refresh can be used to update them in the master. Here is a summary of the operations performed
in a refresh action:
Restarting a Cluster
1. On the Home > Status tab, click the drop-down arrow to the right of the cluster name and select Restart.
Pausing a Cluster in AWS
Important: Pausing a cluster requires using EBS volumes for all storage, both on management and
worker nodes. Data stored on ephemeral disks will be lost after EC2 instances are stopped.
Shutdown procedure
To pause the cluster, take the following steps:
1. Navigate to the Cloudera Manager web UI.
2. Stop the cluster.
a. On the Home > Status tab, click the drop-down arrow to the right of the cluster name and select Stop.
Startup procedure
To restart the cluster after a pause, the steps are reversed:
1. In AWS, start all cluster EC2 instances.
2. Navigate to the Cloudera Manager UI.
3. Start the Cloudera Management Service.
a. On the Home > Status tab, click the drop-down arrow to the right of Cloudera Management Service and select Start.
More information
For more information about stopping the Cloudera Management Service, see Stopping the Cloudera Management
Service in the Cloudera Enterprise documentation.
For more information about restarting the Cloudera Management Service, see Restarting the Cloudera Management
Service in the Cloudera Enterprise documentation.
For more information about starting and stopping a cluster in Cloudera Manager, see Starting, Stopping, Refreshing,
and Restarting a Cluster on page 104 in the Cloudera Enterprise documentation.
For more information about stopping and starting EC2 instances, see Stop and Start Your Instance in the AWS
documentation.
Renaming a Cluster
Minimum Required Role: Full Administrator
1. On the Home > Status tab, click the drop-down arrow to the right of the cluster name and select Rename Cluster.
Cluster-Wide Configuration
Minimum Required Role: Configurator (also provided by Cluster Administrator, Full Administrator)
To make configuration changes that apply to an entire cluster, do one of the following to open the configuration page:
• all clusters
1. Select Configuration and then select one of the following classes of properties:
• Advanced Configuration Snippets
• Databases
• Disk Space Thresholds
• Local Data Directories
• Local Data Files
• Log Directories
• Navigator Settings
• Non-default Values - properties whose value differs from the default value
• Non-uniform Values - properties whose values are not uniform across the cluster or clusters
• Port Configurations
• Service Dependencies
You can also select Configuration Issues to view a list of configuration issues for all clusters.
• specific cluster
1. On the Home page, click a cluster name.
2. Select Configuration and then select one of the classes of properties listed above.
You can also apply the following filters to limit the displayed properties:
• Enter a search term in the Search box to search for properties by name or description.
• Expand the Status filter to select options that limit the displayed properties to those with errors or warnings,
properties that have been edited, properties with non-default values, or properties with overrides. Select All to
remove any filtering by Status.
• Expand the Scope filter to display a list of service types. Expand a service type heading to filter on Service-Wide
configurations for a specific service instance or select one of the default role groups listed under each service
type. Select All to remove any filtering by Scope.
• Expand the Category filter to filter using a sub-grouping of properties. Select All to remove any filtering by Category.
Overview
A Virtual Private Cluster uses the Cloudera Shared Data Experience (SDX) to simplify deployment of both on-premise
and cloud-based applications and enable workloads running in different clusters to securely and flexibly share data.
This architecture provides many advantages for deploying workloads and sharing data among applications, including
a shared catalog, unified security, consistent governance, and data lifecycle management.
In traditional CDH deployments, a Regular cluster contains storage nodes, compute nodes, and other services such as
metadata services and security services that are collocated in a single cluster. This traditional architecture provides
many advantages where computational services such as Impala and YARN can access collocated data sources such as
HDFS or Hive.
With Virtual Private Clusters and the SDX framework, a new type of cluster is available in Cloudera Manager 6.2 and
higher, called a Compute cluster. A Compute cluster runs computational services such as Impala, Hive Execution Service,
Spark, or YARN but you configure these services to access data hosted in another Regular CDH cluster, called the Base
cluster. Using this architecture, you can separate compute and storage resources in a variety of ways to flexibly maximize
resources.
Architecture
A Compute cluster is configured with compute resources such as YARN, Spark, Hive Execution, or Impala. Workloads
running on these clusters access data by connecting to a Data Context for the Base cluster. A Data Context is a connector
to a Regular cluster that is designated as the Base cluster. The Data Context defines the data, metadata, and security
services deployed in the Base cluster that are required to access the data. Both the Compute cluster and Base cluster
are managed by the same instance of Cloudera Manager. A Base cluster must have an HDFS service deployed and can
contain any other CDH services -- but only HDFS, Hive, Sentry, Amazon S3, and Microsoft ADLS can be shared using
the data context.
A Compute cluster requires an HDFS service to store temporary files used in multi-stage MapReduce jobs. In addition,
the following services may be deployed as needed:
• Hive Execution Service (This service supplies the HiveServer2 role only.)
• Hue
• Impala
• Kafka
• Spark 2
• Oozie (required by Hue; deploy it only when Hue is present)
• YARN
• HDFS (required)
The functionality of a Virtual Private cluster is a subset of the functionality available in Regular clusters, and the versions
of CDH that you can use are limited. For more information, see Compatibility Considerations for Virtual Private Clusters
on page 113.
For details on networking requirements and topology, see Networking Considerations for Virtual Private Clusters on
page 127.
4. If you already have a Data Context defined, select it from the drop-down list.
5. To create new Data Context:
1. Select Create Data Context from the drop-down list.
The Create Data Context dialog box displays.
2. Enter a unique name for the Data Context.
3. Select the Base cluster from the drop-down list.
4. Select the Data Services, Metadata Services and Security Services you want to expose in the Data Context.
You can choose the following:
• HDFS (required)
• Amazon S3 (must be configured on the base cluster)
• Microsoft ADLS (must be configured on the base cluster)
• Hive Metadata Service
• Sentry
5. Click Create.
The Cluster Basics page displays your selections.
6. Click Continue.
7. Continue with the next steps in the Add Cluster Wizard to specify hosts and credentials, and install the Agent and
CDH software.
The Select Repository screen will examine the CDH version of the base cluster and recommend a supported version.
Cloudera recommends that your Base and Compute clusters each run the same version of CDH. The Wizard will
offer the option to choose other versions, but these combinations have not been tested and are not supported
for production use.
8. On the Select Services screen, choose any of the pre-configured combinations of services listed on this page, or
you can select Custom Services and choose the services you want to install. The following services can be installed
on a Compute cluster:
• Hive Execution Service (This service supplies the HiveServer2 role only.)
• Hue
• Impala
• Kafka
• Spark 2
• Oozie (required by Hue; deploy it only when Hue is present)
• YARN
• HDFS (required)
9. If you have enabled Kerberos authentication on the Base cluster, you must also enable Kerberos on the Compute
cluster.
Impala Improvements
Impala Data Caching
When the storage for an Impala cluster is not co-located with the Impala executor nodes (for example, S3 or remote
HDFS DataNodes in a multi-cluster configuration), reads from remote storage can have higher latency or become
network bound for scan-heavy queries. Starting in CDH 6.3.0, Impala implements a data cache backed by the local
storage of the executors, allowing remote data to be cached locally. The fixed-capacity cache transparently caches
HDFS blocks read from remote storage and uses an LRU eviction policy. If sized correctly, query latency is on par with
a local HDFS reads configuration. The cache does not persist across restarts and is disabled by default.
See Impala Remote Data Cache for information and steps to enable the remote data cache.
Automatic refresh/invalidation of tables on CatalogD
When multiple Impala compute clusters are deployed that all talk to the same Hive Metastore (HMS), data inserted
from one Impala cluster may not be visible to another Impala cluster without issuing a manual refresh or invalidate
command. Starting in CDH 6.3.0, Impala generates INSERT events when data is inserted into tables. All Impala clusters
fetch these INSERT events from the HMS at regular intervals and automatically refresh the affected tables as needed.
Note that this behavior applies only when --hms_event_polling_interval_s is set to a non-zero value.
See Metadata Management.
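For example, setting --hms_event_polling_interval_s=2 (an illustrative value; choose an interval appropriate for your workload), typically as an Impala Catalog Server startup option, causes events to be fetched from the HMS roughly every 2 seconds, while the default of 0 leaves polling disabled.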
Hive Improvements
The scratch directory for temporary tables is now stored on HDFS in the Compute cluster. This helps avoid contention
on the Base cluster HDFS when multiple Hive / HiveServer2 compute clusters are set up and temporary tables are used
during Hive queries.
Hue Improvements
The Hue file browser now points to the Base cluster HDFS instead of the Compute cluster’s local HDFS, so Hue users
on Compute clusters can browse the Base cluster HDFS.
Licensing Requirements
A valid Cloudera Enterprise license is required to use Virtual Private Clusters. You cannot access this feature with a
Trial or Cloudera Express license.
CDH Components
• SOLR – Not supported on Compute clusters.
• Kudu – Not supported on Compute clusters.
Storage locations on HiveMetastore using Kudu are not available to Impala on Compute clusters.
• HDFS
– A “local” HDFS service is required on Compute clusters as persistent space for temporary data; the intent is to
use it for Hive temporary query data, and it is also recommended for multi-stage Spark ETL jobs.
– Cloudera recommends a minimum storage of 1TB per host, configured as the HDFS DataNode storage directory.
– The Base cluster must have an HDFS service.
• Backup and Disaster Recovery (BDR) – not supported when the source cluster is a Compute cluster and the target
cluster is running a version of Cloudera Manager lower than 6.2.
• YARN and MapReduce
– If both MapReduce (MR1, deprecated in CM 6) and YARN (MR2) are available on the Base cluster, dependent services in the Compute cluster, such as the Hive Execution Service, use MR1 because of the way service dependencies are handled in Cloudera Manager. To use YARN, update the configuration so that these Compute cluster services depend on YARN (MR2) before using YARN in your applications.
• Impala
– Ingesting new data or metadata into the Base cluster affecting the Hive Metastore requires running the
INVALIDATE METADATA or REFRESH METADATA statement on each Compute cluster with Impala installed,
prior to use.
– Consistency in Impala is guaranteed by table-level locks in the Catalog Service (catalogd). With multiple
Compute clusters, accessing the same table or data from multiple clusters through multiple catalogds can
lead to problems. For example, a query can fail when files are removed, or incorrect data can end up in the metadata when a refresh running on one cluster picks up only half of the data being ingested from another cluster.
To avoid consistency problems, your Impala clusters should operate on mutually exclusive sets of tables and
data.
• Hue
– Only one Hue service instance is supported on a Compute cluster.
– The Hue service on Compute clusters does not share user-specific query history with the Hue service on other Compute clusters or with Hue services on the Base cluster.
– Hue examples may not install correctly due to differing permissions for creating tables and inserting data.
You can work around this problem by deleting the sample tables and then re-adding them.
– If you add Hue to a Compute cluster after the cluster has already been created, you will need to manually
configure any dependencies on other services (such as Hive, Hive Execution Service, and Impala) in the
Compute cluster.
• Hive Execution Service
The newly introduced “Hive execution service” is only supported on Compute clusters, and is not supported on
Base or Regular clusters.
To enable Hue to run Hive queries on a Compute cluster, you must install the Hive Execution Service on the
Compute cluster. See Add Hive Execution Service on Compute Cluster for Hue.
Security
• KMS
– Base cluster:
– Hadoop KMS is not supported.
– KeyTrustee KMS is supported on Base cluster.
– Compute cluster: No KMS of any type is supported.
• Authentication/User directory
– Users should be configured identically on the hosts of the Compute and Base clusters, as if the hosts were part of the same cluster. This includes Linux local users, LDAP, Active Directory, or other third-party user directory integrations.
• Kerberos
– If a Base cluster has Kerberos installed, then the Compute and Base clusters must both use Kerberos in the
same Kerberos realm. The Cloudera Manager Admin Console helps facilitate this configuration in the cluster
creation process to ensure compatible Kerberos configuration.
• TLS
– If the Base cluster has TLS configured for cluster services, Compute cluster services must also have TLS
configured for services that access those Base cluster services.
– Cloudera strongly recommends enabling Auto-TLS to ensure that services on Base and Compute clusters
uniformly use TLS for communication.
– If you have configured TLS, but are not using Auto-TLS, note the following:
– You must create an identical configuration on the Compute cluster hosts before you add them to a
Compute cluster using Cloudera Manager. Copy all files located in the directories specified by the following
configuration properties, from the Base clusters to the Compute cluster hosts:
– hadoop.security.group.mapping.ldap.ssl.keystore
– ssl.server.keystore.location
– ssl.client.truststore.location
– Cloudera Manager will copy the following configurations to the compute cluster when you create the
Compute cluster.
– hadoop.security.group.mapping.ldap.use.ssl
– hadoop.security.group.mapping.ldap.ssl.keystore
– hadoop.security.group.mapping.ldap.ssl.keystore.password
– hadoop.ssl.enabled
– ssl.server.keystore.location
– ssl.server.keystore.password
– ssl.server.keystore.keypassword
– ssl.client.truststore.location
– ssl.client.truststore.password
Networking
Workloads running on Compute clusters communicate heavily with hosts on the Base cluster. Customers should have network monitoring in place for networking hardware (such as top-of-rack switches and spine/leaf routers) to track and adjust bandwidth between the racks that host Compute and Base cluster nodes.
You can also use the Cloudera Manager Network Performance Inspector to evaluate your network.
For more information on how to set up networking, see Networking Considerations for Virtual Private Clusters on page
127.
Altus Director
Altus Director does not support running Compute clusters and cannot be used to create Compute clusters.
Tutorial: Using Impala, Hive and Hue with Virtual Private Clusters
Workflows:
Set Up an Environment
Set up your environment with Compute and Base clusters as follows: (See Adding a Compute Cluster and Data Context
on page 111.)
1. Create clusters where the Cloudera Manager and CDH version match, for example both are 6.2.0. The clusters
must use Kerberos and TLS.
2. If the Base cluster has Sentry, make sure the user executing cross-cluster queries is added to the correct role that has all the necessary privileges to create and insert data into tables (more in workflow #3).
3. Configure a Regular cluster called Cluster 1 to be used as a Base cluster. This cluster must have high availability
enabled.
4. Create two Compute clusters called Compute 1 and Compute 2.
5. Configure services for the three clusters as shown below:
This tutorial uses a kerberized environment with TLS, so you must kinit the user first. If you want to add a new user,
see Step 6: Get or Create a Kerberos Principal for Each User Account and Enabling Sentry Authorization for Impala for
documentation to create and add the user to the Kerberos principal and the required Linux groups.
1. Identify a host running the Impala Daemon on which to launch impala-shell. In the Cloudera Manager Admin Console, go to Cloudera Manager > Compute Cluster 1 > IMPALA-1 > Instances.
2. Note the hostname of a host that is running the Impala Daemon and open an ssh session to that host.
ssh <hostname>
[vpc_host-cqdbms-2.tut.myco.com:25003] default>
7. Verify that the table has been created on the Base cluster HDFS
Query: insert into table test_table values (2018, 'France'), (2014, 'Germany'), (2010,
'Spain'), (2006, 'Italy')
Query progress can be monitored at:
https://vpc_host-cqdbms-2.tut.myco.com:25000/query_plan?query_id=334fc3bd7e421cce:540f171500000000
| year | winner |
+------+---------+
| 2018 | France |
| 2014 | Germany |
| 2010 | Spain |
| 2006 | Italy |
+------+---------+
9. Log in using ssh to the host running HiveServer2 on the Compute cluster. You can find a host in the Cloudera
Manager Admin Console by going to Clusters > Compute 2 > Hive Execution Service > Instances.
10. Because this is a Kerberized environment, kinit the user:
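For example (a sketch; the user principal is a placeholder for an account that exists in your Kerberos realm):
kinit <username>@VPC.CLOUDERA.COM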
• Auto-TLS is enabled:
/CA_STANDARD/truststore.jks;trustStorePassword=cloudera;principal=hive/<HiveServer2 Host
URL>@VPC.CLOUDERA.COM'
12. Access the tables created through Impala in the previous section:
INFO : Compiling
command(queryId=hive_20190309192547_09146fd4-58b9-4f60-ad40-c9de3f98d470); Time taken:
0.987 seconds
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing
command(queryId=hive_20190309192547_09146fd4-58b9-4f60-ad40-c9de3f98d470); Time taken:
0.041 seconds
INFO : OK
+----------------+
| database_name |
+----------------+
| default |
| test_data |
+----------------+
INFO : Compiling
command(queryId=hive_20190309192621_701914ad-0417-4639-9209-335a63818b82): select * from
test_table
command(queryId=hive_20190309192621_701914ad-0417-4639-9209-335a63818b82); Time taken:
0.38 seconds
+------------------+--------------------+
| test_table.year | test_table.winner |
+------------------+--------------------+
| 2018 | France |
| 2014 | Germany |
| 2010 | Spain |
| 2006 | Italy |
+------------------+--------------------+
INFO : Compiling
INFO : Executing
command(queryId=hive_20190309192705_218b79aa-aa94-4102-95ab-a1d4bc7a0381): insert into
test_table values (2002, 'Brazil')
WARN :
INFO : Query ID = hive_20190309192705_218b79aa-aa94-4102-95ab-a1d4bc7a0381
INFO : Total jobs = 3
INFO : Launching Job 1 out of 3
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : Submitting tokens for job: job_1552095496593_0001
INFO : The url to track the job:
https://vpc_host-nnwznq-1.tut.myco.com:8090/proxy/application_1552095496593_0001/
INFO : Starting Job = job_1552095496593_0001, Tracking URL =
https://vpc_host-nnwznq-1.tut.myco.com:8090/proxy/application_1552095496593_0001/
hdfs://ns1/user/hive/warehouse/test_data.db/test_table/.hive-staging_hive_2019-03-09_19-27-05_193_3963732700280111926-1/-ext-10000
from
hdfs://ns1/user/hive/warehouse/test_data.db/test_table/.hive-staging_hive_2019-03-09_19-27-05_193_3963732700280111926-1/-ext-10002
INFO : Starting task [Stage-0:MOVE] in serial mode
INFO : Loading data to table test_data.test_table from
hdfs://ns1/user/hive/warehouse/test_data.db/test_table/.hive-staging_hive_2019-03-09_19-27-05_193_3963732700280111926-1/-ext-10000
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 1 Cumulative CPU: 2.4 sec HDFS Read: 4113 HDFS Write:
88 HDFS EC Read: 0 SUCCESS
INFO : Total MapReduce CPU Time Spent: 2 seconds 400 msec
INFO : Completed executing
command(queryId=hive_20190309192705_218b79aa-aa94-4102-95ab-a1d4bc7a0381); Time taken:
31.097 seconds
INFO : OK
1 row affected (31.434 seconds)
14. Verify and track the Yarn job submitted by the Hive Execution Service using the Cloudera Manager Admin Console
by going to Clusters > Compute 2 > YARN 2 > Applications.
YARN Job:
1. Open the Cloudera Manager Admin Console and view the HDFS hierarchy on the Base cluster HDFS service by
opening the File Browser: Cluster 1 > HDFS-1 > File Browser.
All the logs pertaining to Compute clusters are under the “mc” directory.
This Base cluster has 2 Compute clusters associated with it, Compute 1 and Compute 2.
Each Compute cluster (based on its ID) gets a folder under this directory, so folder 2 belongs to Compute 1 and 3
belongs to Compute 2. The ID of the cluster can be identified from the URL used to access the cluster. Click on
Compute 1 in the CM Cluster view and inspect the URL.
http://quasar-wfrgnj-1.vpc.cloudera.com:7180/cmf/clusters/2/status
The ID is the segment following /clusters in the URL. This is also the subfolder name under the /mc folder.
This is the directory where all the logs for services in Compute 1 are stored.
2. Navigate to the file browser of a Compute cluster.
Note that folder 2 which is dedicated for the Compute 1 cluster is not visible to the Compute 2 cluster.
Navigating to folders below this hierarchy, you can see the folders created for services present on the Compute
2 cluster.
The user will also need to be created and added to the group on all the hosts of the Base cluster.
4. kinit the user:
5. Start spark-shell:
scala> tableTestData.show()
+----+-------+
|year| winner|
+----+-------+
|2002| Brazil|
|2018| France|
|2014|Germany|
|2010| Spain|
|2006| Italy|
|1998| France|
+----+-------+
7. Verify and track the queries in the Yarn service application on the Compute cluster:
Note: For the purposes of this guide, we are going to use the following terms:
• North-South (NS) traffic patterns indicate network traffic between Compute and Storage tiers.
• East-West (EW) traffic patterns indicate internal network traffic within the storage or compute
clusters.
• Factor in that the total network bandwidth is shared between EW and NS traffic patterns, so for 1 Gbps of NS traffic, we should plan for 1 Gbps of EW traffic as well.
• Network oversubscription between the compute and storage tiers is 1:1.
• The backend comprises 5 nodes (VMs), each with 8 SATA drives.
– This implies that the ideal IO throughput (for planning) of the entire backend would be ~4 GB/s, which is ~800 MB/s per node, the equivalent of ~32 Gbps of network throughput for the cluster and ~7 Gbps per node for the NS traffic pattern.
– Factoring in EW + NS, we would need ~14 Gbps per node to handle the 800 MB/s IO throughput per node.
Assume a Compute to Storage node ratio of 4:1. This ratio varies depending on in-depth sizing; a more definitive sizing is predicated on a good understanding of the workloads intended to run on the infrastructure.
The two tables below illustrate the sizing exercise through a scenario that involves a storage cluster of 50 nodes and, following the 4:1 assumption, 200 compute nodes.
• 50 Storage nodes
• 200 Compute nodes (4:1 ratio)
Tier | Per node IO (MB/s) | Per node NS network (Mbps) | Per node EW (Mbps) | Num of Nodes | Cluster IO (MB/s) | Cluster NS network (Mbps) | NS oversubscription
HDFS | 400 | 3200 | 3200 | 50 | 20000 | 160000 | 1
Compute Node | 100 | 800 | 800 | 200 | 20000 | 160000 | 1
Compute Node | 100 | 800 | 800 | 200 | 10000 | 80000 | 2
Compute Node | 100 | 800 | 800 | 200 | 6667 | 53333 | 3
Compute Node | 100 | 800 | 800 | 200 | 5000 | 40000 | 4
Tier | Per node IO (MB/s) | Per node NS network (Mbps) | Per node EW (Mbps) | Num of Nodes | Hypervisor Oversubscription
Compute Hypervisor | 600 | 4800 | 4800 | 33 | 6
Compute Hypervisor | 400 | 3200 | 3200 | 50 | 4
The tables above provide the means to ascertain the hardware capacity needed for each tier of the private cloud, given different consolidation ratios and different throughput requirements.
Physical Network Topology
The best network topology for Hadoop clusters is spine-leaf: each rack of hardware has its own leaf switch and each leaf is connected to every spine switch. Ideally there is no oversubscription between spine and leaf, which results in full line rate between any two nodes in the cluster.
The choice of switches, bandwidth and so on would of course be predicated on the calculations in the previous section.
If the Storage nodes and the Workload nodes happen to reside in clearly separate racks, then it is necessary to ensure
that between the workload rack switches and the Storage rack switches, there is at least as much uplink bandwidth
as the theoretical max offered by the Storage cluster. In other words, the sum of the bandwidth of all workload-only
racks needs to be equivalent to the sum of the bandwidth of all storage-only racks.
For instance, taking the example from the previous section, we should have at least 60 Gbps uplink between the Storage
cluster and the workload cluster nodes.
If we are to build the desired network topology based on the sizing exercise from above, we would need the following.
Assuming all the nodes are 2 RU in form factor, we would require 5 x 42 RU racks to house this entire set up.
If we place the nodes from each layer as evenly distributed across the 5 Racks as we can, we would end up with following
configuration.
This implies that we would have to choose ToR (top-of-rack) switches with at least 20 x 25 Gbps ports and 8 x 100 Gbps uplink ports, and spine switches with at least 22 x 100 Gbps ports.
Using eight 100 Gbps uplinks from the leaf switches results in an almost 1:1 (1.125:1) oversubscription ratio between the leaves (up to 20 x 25 Gbps ports) and the spine (4 x 100 Gbps per spine switch).
Mixing the Workload and Storage nodes in the way shown below helps localize some traffic to each leaf and thereby reduces the pressure of N-S traffic patterns (between the Workload and Storage clusters).
Note: For the sake of clarity, the spine switches are shown outside the racks.
The following diagram illustrates the logical topology at a virtual machine level.
The Storage E-W, Compute N-S, and Compute E-W components shown above are not separate networks. They are the same network with different traffic patterns, which have been broken out to clearly denote those patterns.
Managing Services
The following sections cover the configuration and management of individual CDH and other services that have specific
and unique requirements or options.
Managing HDFS
This section contains configuration tasks for the HDFS service. For information on configuring HDFS for high availability,
see HDFS High Availability on page 478.
Data Durability
This page contains the following topics:
Overview
Data durability describes how resilient data is to loss. When data is stored in HDFS, CDH provides two options for data
durability. You can use replication, which HDFS was originally built on, or Erasure Coding (EC). The comparisons between
EC and replication on this page use a replication factor of 3 (three copies of data are maintained) since that is the
default.
Replication
HDFS creates two additional copies of each piece of data, resulting in three instances in total. These copies are stored on separate DataNodes to guard against data loss when a node is unreachable. When the data stored on a node is lost or inaccessible, it is replicated from one of the other nodes to a new node so that there are always multiple copies. The number of replicas is configurable, but the default is three. Cloudera recommends keeping the replication factor at a minimum of three when you have three or more DataNodes. A lower replication factor makes the data more vulnerable to DataNode failures because there are fewer copies spread across fewer DataNodes.
When data is written to an HDFS cluster that uses replication, additional copies of the data are automatically created.
No additional steps are required.
Replication supports all data processing engines that CDH supports.
Erasure Coding (EC)
EC is an alternative to replication. When an HDFS cluster uses EC, no additional direct copies of the data are generated.
Instead, data is striped into blocks and encoded to generate parity blocks. If there are any missing or corrupt blocks,
HDFS uses the remaining data and parity blocks to reconstruct the missing pieces in the background. This process
provides a similar level of data durability to 3x replication but at a lower storage cost.
Additionally, EC is applied when data is written. This means that to use EC, you must first create a directory and
configure it for EC. Then, you can either replicate existing data or write new data into this directory.
EC supports the following data processing engines:
• Hive
• MapReduce
• Spark
With both data durability schemes, replication and EC, recovery happens in the background and requires no direct
input from a user.
Replication or EC can be the only data durability scheme on a cluster. Alternatively, you can create a hybrid deployment
where replication and EC are both used. This decision should be based on the temperature of the data (how often the
data is accessed) stored in HDFS. Hot data, data that is accessed frequently, should use replication. Cold data, data
that is accessed less frequently, can take advantage of EC's storage savings. See Comparing Replication and Erasure
Coding on page 134 for more information.
For information about how to enable EC, see Enabling Erasure Coding on page 136.
Understanding Erasure Coding Policies
The EC policy determines how data is encoded and decoded. An EC policy is made up of the following parts:
codec-number of data blocks-number of parity blocks-cell size.
• Codec: The erasure codec that the policy uses. CDH currently supports Reed-Solomon (RS).
• Number of Data Blocks: The number of data blocks per stripe. The higher this number, the more nodes that need
to be read when accessing data because HDFS attempts to distribute the blocks evenly across DataNodes.
• Number of Parity Blocks: The number of parity blocks per stripe. Even if a file does not use up all the data blocks
available to it, the number of parity blocks will always be the total number listed in the policy.
• Cell Size: The size of one basic unit of striped data.
For example, a RS-6-3-1024k policy has the following attributes:
• Codec: Reed-Solomon
• Number of Data Blocks: 6
• Number of Parity Blocks: 3
• Cell Size: 1024k
The sum of the number of data blocks and parity blocks is the data-stripe width. When you make hardware plans for
your cluster, the number of racks should at least equal the stripe width in order for the data to be resistant to rack
failures.
The following image compares the data durability and storage efficiency of different RS codecs and replication:
Storage efficiency is the ratio of data blocks to total blocks as represented by the following formula: data blocks / (data
blocks + parity blocks)
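For example, applying this formula to the RS-6-3 policy and to 3x replication:
• RS-6-3: 6 / (6 + 3) = 67% storage efficiency
• 3x replication: 1 / 3 = 33% storage efficiency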
Comparing Replication and Erasure Coding
Consider the following factors when you examine which data protection scheme to use:
Data Temperature
Data temperature refers to how often data is accessed. EC works best with cold data that is accessed and modified
infrequently. There is no data locality, and all reads are remote. Replication is more suitable for hot data, data that
is accessed and modified frequently because data locality is a part of replication.
I/O Cost
EC has higher I/O costs than replication for the following reasons:
• EC spreads data across nodes and racks, which means reading and writing data comes at a higher network
cost.
• A parity block is generated when data is written, thus impacting write speed. This can be slower than writing
to a file when the replication factor is one but is faster than writing two or more replicas.
• If data is missing or corrupt, a DataNode reads the remaining data and parity blocks in order to reconstruct
the data. This process requires CPU and network resources.
Cloudera recommends at least a 10 Gb/s (10 GbE) network connection if you want to use EC.
Storage Cost
EC has a lower storage overhead than replication because multiple copies of data are not maintained. Instead, a
number of parity blocks are generated based on the EC policy. For the same amount of data, EC will store fewer
blocks than 3x replication in most cases. For example, with Reed-Solomon (6,3), HDFS stores three parity blocks for each set of six data blocks, nine blocks in total. With 3x replication, HDFS stores 18 blocks for every six data blocks: the six original blocks plus two replicas of each. The case where 3x replication requires fewer blocks is when data is stored in small files.
File Size
Erasure coding works best with larger files. The total number of blocks is determined by data blocks + parity blocks,
which is the data-stripe width discussed earlier.
128 MB is the default block size. With RS (6,3), each block group can hold up to (128 MB * 6) = 768 MB of data.
Inside each block group, there will be 9 total blocks, 6 data blocks, each holding up to 128 MB, and 3 parity blocks.
This is why EC works best with larger files. For a chunk of data less than the block size, replication writes one data
block to three DataNodes; EC, on the other hand, still needs to stripe the data to data blocks and calculate parity
blocks. This leads to a situation where erasure coded files will generate more blocks than replication because of
the parity blocks required.
The figure below shows how replication (with a replication factor of 3) compares to EC based on the number of
blocks relative to file size. For example, a 128 MB file only requires three blocks because each file fills one block,
and three total blocks are needed because of the two additional copies of data that replication maintains. As the
file size goes up though, the number of blocks required for data durability with replication goes up.
The number of blocks required for EC remains static up to the 768 MB threshold for RS (6,3).
Supported Engines
Replication supports all data processing engines that CDH supports.
EC supports the following data processing engines: Hive, MapReduce, and Spark.
Unsupported Features
The XOR codec for EC is not supported. Additionally, certain HDFS functions are not supported with EC: hflush,
hsync, concat, setReplication, truncate, and append. For more information, see Erasure Coding Limitations and HDFS Erasure Coding Limitations.
Best Practices for Rack and Node Setup for EC
When setting up a cluster to take advantage of EC, consider the number of racks and nodes in your setup.
Ideally, the number of racks exceeds the data-stripe width to account for downtime and outages. If there are fewer
racks than the data-stripe width, HDFS spreads data blocks across multiple nodes to maintain fault tolerance at the
node level. When distributing blocks to racks, HDFS attempts to distribute the blocks evenly across all racks. Because
of this behavior, Cloudera recommends setting up each rack with a similar number of DataNodes. Otherwise, racks
with fewer DataNodes may be filled up faster than racks with more DataNodes.
To achieve node-level fault tolerance, the number of nodes needs to equal the data-stripe width. For example, in order
for a RS-6-3-1024k policy to be node failure tolerant, you need at least nine nodes. Note that the write will fail if the
number of DataNodes is less than the policy's number of data blocks. The write will succeed but show a warning
message if the number of DataNodes is less than the policy's number of data blocks + parity blocks. For example, with
RS(6,3), if there are six to eight DataNodes, the write will succeed but show a warning message. If there are less than
six DataNodes, the write fails.
For rack-level fault tolerance, there must be enough racks that each rack contains at most the same number of blocks as there are erasure coding parity blocks. The minimum number of racks is calculated as (data-stripe width) / (number of parity blocks). For example, with RS(6,3), the minimum number of racks is (6+3)/3 = 3. Cloudera recommends
at least nine racks, which leads to one data or parity block on each rack. Ideally, the number of racks exceeds the sum
of the number of data and parity blocks.
If you decide to use three racks, spread the nine nodes evenly across three racks. The data and parity blocks, when
distributed evenly, lead to the following placement on the racks:
• Rack 1 with three nodes: Three blocks
• Rack 2 with three nodes: Three blocks
• Rack 3 with three nodes: Three blocks
A policy with a wide data-stripe width like RS-6-3-1024k comes with a tradeoff though. Data must be read from six
blocks, increasing the overall network load. Choose your EC policy based on your network settings and expected storage
efficiency. Note, the larger the cluster and colder the data, the more appropriate it is to use EC policies with large
data-stripe widths. Larger data-stripe widths have the benefit of a better storage efficiency.
Enabling Erasure Coding
This page contains the following topics:
Before You Begin
Before you enable Erasure Coding (EC), perform the following tasks:
• Note the limitations for EC.
• Verify that the clusters run CDH 6.1.0 or higher.
• Determine which EC policy you want to use: Understanding Erasure Coding Policies on page 133
• Determine if you want to use EC for existing data or new data. If you want to use EC for existing data, you need
to replicate that data with distcp or BDR.
• Verify that your cluster setup meets the rack and node requirements described in Best Practices for Rack and
Node Setup for EC on page 135.
Limitations
EC does not support the following:
• XOR codecs
• Certain HDFS functions: hflush, hsync, concat, setReplication, truncate and append. For more information, see
Erasure Coding Limitations.
Using Erasure Coding for Existing Data
The procedure described in this section explains how to use EC for existing data. To customize or tune EC, see Advanced
Erasure Coding Configuration on page 138.
To use EC for existing data, complete the following procedure:
1. Create a new directory or choose an existing directory.
hdfs ec -listPolicies
4. Set the EC policy for the directory you want to use with the following command:
• path. Required. Specify the HDFS directory you want to apply the EC policy to.
• policy. Optional. The EC policy you want to use for the directory you specified. If you do not provide this
parameter, the EC policy you specified in the Default Policy when Setting Erasure Coding setting from Cloudera
Manager is used.
5. Copy the data to the directory you set an EC policy for. You can use the distcp tool or Cloudera Manager's Backup and Disaster Recovery process. A combined example of steps 4 and 5 follows this procedure.
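A combined sketch of steps 4 and 5, assuming the RS-6-3-1024k policy and hypothetical /user/warehouse/ec_data and /user/warehouse/plain_data paths:
# Step 4: apply the EC policy to the destination directory.
hdfs ec -setPolicy -path /user/warehouse/ec_data -policy RS-6-3-1024k
# Step 5: copy the existing data into the erasure-coded directory.
hadoop distcp /user/warehouse/plain_data /user/warehouse/ec_data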
Using Erasure Coding for New Data
The procedure described in this section explains how to use EC with new data. To customize or tune EC, see Advanced
Erasure Coding Configuration on page 138.
To use EC for new data, complete the following steps:
1. Create a new directory or choose an existing directory.
2. View the supported EC policies with the following command:
hdfs ec -listPolicies
4. Set the EC policy for the directory you want to use with the following command:
• path. Required. Specify the HDFS directory you want to apply the EC policy to.
• policy. Optional. The EC policy you want to use for the directory you specified. If you do not provide this
parameter, the EC policy you specified in the Fallback Erasure Coding Policy setting from Cloudera Manager
is used.
5. Set the destination for the data to the directory you enabled EC for. No action beyond that is required. When data is written to the directory, it is erasure coded based on the policy you set.
Using ISA-L
Intel Intelligent Storage Acceleration Library (ISA-L) is an open-source collection of optimized low-level functions used for storage applications. The library can improve EC performance when the Reed-Solomon (RS) codecs are used. ISA-L is packaged and shipped with CDH and is enabled by default.
You can verify that it is being used by running the following command:
hadoop checknative
The command returns the libraries that the cluster uses. ISA-L should be listed as one of the enabled libraries. The
following example shows what the command might return:
Note: You must disable erasure coding in small clusters to prevent potential data loss. If the
cluster falls back to an erasure coding policy that requires more DataNodes or racks than are
available, data loss can occur. In this situation, the Erasure Coding Policy Verification Test
returns "Concerning" health. To prevent the issue, set the fallback erasure coding policy to
No Default Erasure Coding Policy, and disable all erasure coding policies that are enabled on
the cluster.
NameNodes
NameNodes maintain the namespace tree for HDFS and a mapping of file blocks to DataNodes where the data is stored.
A simple HDFS cluster can have only one primary NameNode, supported by a secondary NameNode that periodically
compresses the NameNode edits log file that contains a list of HDFS metadata modifications. This reduces the amount
of disk space consumed by the log file on the NameNode, which also reduces the restart time for the primary NameNode.
A high availability cluster contains two NameNodes: active and standby.
Formatting the NameNode and Creating the /tmp Directory
Minimum Required Role: Cluster Administrator (also provided by Full Administrator)
When you add an HDFS service, the wizard automatically formats the NameNode and creates the /tmp directory on
HDFS. If you quit the wizard or it does not finish, you can format the NameNode and create the /tmp directory outside
the wizard by doing these steps:
1. Stop the HDFS service if it is running. See Starting, Stopping, and Restarting Services on page 253.
2. Click the Instances tab.
3. Click the NameNode role instance.
4. Select Actions > Format.
5. Start the HDFS service.
6. Select Actions > Create /tmp Directory.
Backing Up and Restoring HDFS Metadata
Backing Up HDFS Metadata Using Cloudera Manager
HDFS metadata backups can be used to restore a NameNode when both NameNode roles have failed. In addition,
Cloudera recommends backing up HDFS metadata before a major upgrade.
Minimum Required Role: Cluster Administrator (also provided by Full Administrator)
This backup method requires you to shut down the cluster.
1. Note the active NameNode.
2. Stop the cluster. It is particularly important that the NameNode role process is not running so that you can make
a consistent backup.
3. Go to the HDFS service.
4. Click the Configuration tab.
5. In the Search field, search for "NameNode Data Directories" and note the value.
6. On the active NameNode host, back up the directory listed in the NameNode Data Directories property. If more
than one is listed, make a backup of one directory, because each directory is a complete copy. For example, if the
NameNode data directory is /data/dfs/nn, do the following as root:
# cd /data/dfs/nn
# tar -cvf /root/nn_backup_data.tar .
/dfs/nn/current
./
./VERSION
./edits_0000000000000000001-0000000000000008777
./edits_0000000000000008778-0000000000000009337
./edits_0000000000000009338-0000000000000009897
./edits_0000000000000009898-0000000000000010463
./edits_0000000000000010464-0000000000000011023
<snip>
./edits_0000000000000063396-0000000000000063958
./edits_0000000000000063959-0000000000000064522
./edits_0000000000000064523-0000000000000065091
./edits_0000000000000065092-0000000000000065648
./edits_inprogress_0000000000000065649
./fsimage_0000000000000065091
./fsimage_0000000000000065091.md5
./fsimage_0000000000000065648
./fsimage_0000000000000065648.md5
./seen_txid
If a file with the extension lock exists in the NameNode data directory, the NameNode most likely is still running.
Repeat the steps, beginning with shutting down the NameNode role.
Restoring HDFS Metadata From a Backup Using Cloudera Manager
The following process assumes a scenario where both NameNode hosts have failed and you must restore from a
backup.
1. Remove the NameNode, JournalNode, and Failover Controller roles from the HDFS service.
2. Add the host on which the NameNode role will run.
3. Create the NameNode data directory, ensuring that the permissions, ownership, and group are set correctly.
4. Copy the backed up files to the NameNode data directory.
5. Add the NameNode role to the host.
6. Add the Secondary NameNode role to another host.
7. Enable high availability. If not all roles are started after the wizard completes, restart the HDFS service. Upon
startup, the NameNode reads the fsimage file and loads it into memory. If the JournalNodes are up and running
and there are edit files present, any edits newer than the fsimage are applied.
Moving NameNode Roles
This section describes two procedures for moving NameNode roles. Both procedures require cluster downtime. If high availability is enabled for the NameNode, you can use a Cloudera Manager wizard to automate the migration process. Otherwise, you must manually delete and add the NameNode role to a new host.
After moving a NameNode, if you have a Hive or Impala service, perform the steps in NameNode Post-Migration Steps
on page 142.
Moving Highly Available NameNode, Failover Controller, and JournalNode Roles Using the Migrate Roles Wizard
Minimum Required Role: Cluster Administrator (also provided by Full Administrator)
The Migrate Roles wizard allows you to move roles of a highly available HDFS service from one host to another. You
can use it to move NameNode, JournalNode, and Failover Controller roles.
• If CDH and HDFS metadata was recently upgraded, and the metadata upgrade was not finalized, finalize the
metadata upgrade.
• IP addresses
• Rack name
Select the checkboxes next to the desired host. The list of available roles to migrate displays. Clear any roles you
do not want to migrate. When migrating a NameNode, the co-located Failover Controller must be migrated as
well.
6. Click the Destination Host text field and specify the host to which the roles will be migrated. On destination hosts,
indicate whether to delete data in the NameNode data directories and JournalNode edits directory. If you choose
not to delete data and such role data exists, the Migrate Roles command will not complete successfully.
7. Acknowledge that the migration process incurs service downtime by selecting the Yes, I am ready to restart the
cluster now checkbox.
8. Click Continue. The Command Progress screen displays listing each step in the migration process.
9. When the migration completes, click Finish.
Moving a NameNode to a Different Host Using Cloudera Manager
Minimum Required Role: Cluster Administrator (also provided by Full Administrator)
1. If the host to which you want to move the NameNode is not in the cluster, follow the instructions in Adding a Host
to the Cluster on page 228 to add the host.
2. Stop all cluster services.
3. Make a backup of the dfs.name.dir directories on the existing NameNode host. Make sure you back up the
fsimage and edits files. They should be the same across all of the directories specified by the dfs.name.dir
property.
4. Copy the files you backed up from dfs.name.dir directories on the old NameNode host to the host where you
want to run the NameNode.
5. Go to the HDFS service.
6. Click the Instances tab.
7. Select the checkbox next to the NameNode role instance and then click the Delete button. Click Delete again to
confirm.
8. In the Review configuration changes page that appears, click Skip.
9. Click Add Role Instances to add a NameNode role instance.
10. Select the host where you want to run the NameNode and then click Continue.
11. Specify the location of the dfs.name.dir directories where you copied the data on the new host, and then click
Accept Changes.
12. Start cluster services. After the HDFS service has started, Cloudera Manager distributes the new configuration
files to the DataNodes, which will be configured with the IP address of the new NameNode host.
NameNode Post-Migration Steps
After moving a NameNode, if you have a Hive or Impala service, perform the following steps:
1. Go to the Hive service.
2. Stop the Hive service.
3. Select Actions > Update Hive Metastore NameNodes.
4. If you have an Impala service, restart the Impala service or run an INVALIDATE METADATA query.
Sizing NameNode Heap Memory
Each workload has a unique byte-distribution profile. Some workloads can use the default JVM settings for heap
memory and garbage collection, but others require tuning. This topic provides guidance on sizing your NameNode JVM
if the dynamic heap settings cause a bottleneck.
All Hadoop processes run on a Java Virtual Machine (JVM). The number of JVMs depends on your deployment mode:
• Local (or standalone) mode - There are no daemons and everything runs on a single JVM.
• Pseudo-distributed mode - Each daemon (such as the NameNode daemon) runs on its own JVM on a single host.
• Distributed mode - Each daemon runs on its own JVM across a cluster of hosts.
The legacy NameNode configuration is one active (and primary) NameNode for the entire namespace and one Secondary
NameNode for checkpoints (but not failover). The recommended high-availability configuration replaces the Secondary
NameNode with a Standby NameNode that prevents a single point of failure. Each NameNode uses its own JVM.
Environment Variables
HADOOP_HEAPSIZE sets the JVM heap size for all Hadoop project servers such as HDFS, YARN, and MapReduce.
HADOOP_HEAPSIZE is an integer passed to the JVM as the maximum memory (Xmx) argument. For example:
HADOOP_HEAPSIZE=1024
HADOOP_NAMENODE_OPTS is specific to the NameNode and sets all JVM flags, which must be specified.
HADOOP_NAMENODE_OPTS overrides the HADOOP_HEAPSIZE Xmx value for the NameNode. For example:
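An illustrative setting (the heap sizing and any garbage collection flags depend on your environment):
HADOOP_NAMENODE_OPTS="-Xms1024m -Xmx1024m"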
You can monitor heap memory usage in the following ways:
• NameNode Web UI: Scroll down to the Summary and look for "Heap Memory used."
• Command line: Generate a heap dump.
Important: The NameNode keeps the entire namespace image in memory. The Secondary NameNode,
on its own JVM, does the same when creating an image checkpoint.
On average, each file consumes 1.5 blocks of storage. That is, the average file is split into two block files—one that
consumes the entire allocated block size and a second that consumes half of that. On the NameNode, this same average
file requires three namespace objects—one file inode and two blocks.
Disk Space versus Namespace
The CDH default block size (dfs.blocksize) is set to 128 MB. Each namespace object on the NameNode consumes
approximately 150 bytes.
On DataNodes, data files are measured by disk space consumed—the actual data length—and not necessarily the full
block size. For example, a file that is 192 MB consumes 192 MB of disk space and not some integral multiple of the
block size. Using the default block size of 128 MB, a file of 192 MB is split into two block files, one 128 MB file and one
64 MB file. On the NameNode, namespace objects are measured by the number of files and blocks. The same 192 MB
file is represented by three namespace objects (1 file inode + 2 blocks) and consumes approximately 450 bytes of
memory.
Large files split into fewer blocks generally consume less memory than small files that generate many blocks. One data
file of 128 MB is represented by two namespace objects on the NameNode (1 file inode + 1 block) and consumes
approximately 300 bytes of memory. By contrast, 128 files of 1 MB each are represented by 256 namespace objects
(128 file inodes + 128 blocks) and consume approximately 38,400 bytes. The optimal split size, then, is some integral
multiple of the block size, for memory management as well as data locality optimization.
By default, Cloudera Manager allocates a maximum heap space of 1 GB for every million blocks (but never less than 1
GB). How much memory you actually need depends on your workload, especially on the number of files, directories,
and blocks generated in each namespace. If all of your files are split at the block size, you could allocate 1 GB for every
million files. But given the historical average of 1.5 blocks per file (2 block objects), a more conservative estimate is 1
GB of memory for every million blocks.
Important: Cloudera recommends 1 GB of NameNode heap space per million blocks to account for
the namespace objects, necessary bookkeeping data structures, and the remote procedure call (RPC)
workload. In practice, your heap requirements will likely be less than this conservative estimate.
Replication
The default block replication factor (dfs.replication) is three. Replication affects disk space but not memory consumption.
Replication changes the amount of storage required for each block but not the number of blocks. If one block file on
a DataNode, represented by one block on the NameNode, is replicated three times, the number of block files is tripled
but not the number of blocks that represent them.
With replication off, one file of 192 MB consumes 192 MB of disk space and approximately 450 bytes of memory. If
you have one million of these files, or 192 TB of data, you need 192 TB of disk space and, without considering the RPC
workload, 450 MB of memory: (1 million inodes + 2 million blocks) * 150 bytes. With default replication on, you need
576 TB of disk space (192 TB * 3), but the memory usage stays the same, 450 MB. When you account for bookkeeping
and RPCs, and follow the recommendation of 1 GB of heap memory for every million blocks, a much safer estimate
for this scenario is 2 GB of memory (with or without replication).
Examples
Example 1: Estimating NameNode Heap Memory Used
Alice, Bob, and Carl each have 1 GB (1024 MB) of data on disk, but sliced into differently sized files. Alice and Bob have files that are some integral multiple of the block size and require the least memory. Carl does not, and fills the heap with unnecessary namespace objects.
Alice: 1 x 1024 MB file
• 1 file inode
• 8 blocks (1024 MB / 128 MB)
Total = 9 objects * 150 bytes = 1,350 bytes of heap memory
Bob: 8 x 128 MB files
• 8 file inodes
• 8 blocks
Total = 16 objects * 150 bytes = 2,400 bytes of heap memory
Carl: 1,024 x 1 MB files
• 1,024 file inodes
• 1,024 blocks
Total = 2,048 objects * 150 bytes = 307,200 bytes of heap memory
Example 2: Estimating NameNode Heap Memory Needed
In this example, memory is estimated by considering the capacity of a cluster. Values are rounded. Both clusters
physically store 4800 TB, or approximately 36 million block files (at the default block size). Replication determines how
many namespace blocks represent these block files.
Cluster A: 200 hosts of 24 TB each = 4800 TB.
• Blocksize=128 MB, Replication=1
• Cluster capacity in MB: 200 * 24,000,000 MB = 4,800,000,000 MB (4800 TB)
• Disk space needed per block: 128 MB per block * 1 = 128 MB storage per block
• Cluster capacity in blocks: 4,800,000,000 MB / 128 MB = 36,000,000 blocks
At capacity, with the recommended allocation of 1 GB of memory per million blocks, Cluster A needs 36 GB of maximum
heap space.
Cluster B: 200 hosts of 24 TB each = 4800 TB.
• Blocksize=128 MB, Replication=3
• Cluster capacity in MB: 200 * 24,000,000 MB = 4,800,000,000 MB (4800 TB)
• Disk space needed per block: 128 MB per block * 3 = 384 MB storage per block
• Cluster capacity in blocks: 4,800,000,000 MB / 384 MB = 12,000,000 blocks
At capacity, with the recommended allocation of 1 GB of memory per million blocks, Cluster B needs 12 GB of maximum
heap space.
Both Cluster A and Cluster B store the same number of block files. In Cluster A, however, each block file is unique and
represented by one block on the NameNode; in Cluster B, only one-third are unique and two-thirds are replicas.
Backing Up and Restoring NameNode Metadata
This topic describes the steps for backing up and restoring NameNode metadata.
1. Make a single backup of the VERSION file. This does not need to be backed up regularly as it does not change, but
it is important since it contains the clusterID, along with other details.
2. Use the following command to back up the NameNode metadata. It automatically determines the active NameNode,
retrieves the current fsimage, and places it in the defined backup_dir.
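A minimal sketch of such a backup, assuming the hdfs dfsadmin -fetchImage command and a hypothetical local backup directory:
# Download the most recent fsimage from the active NameNode into backup_dir.
hdfs dfsadmin -fetchImage /data/backups/namenode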
On startup, the NameNode process reads the fsimage file and commits it to memory. If the JournalNodes are up and
running, and there are edit files present, any edits newer than the fsimage are also applied. If the JournalNodes are
unavailable, it is possible to lose any data transferred in the interim.
DataNodes
DataNodes store data in a Hadoop cluster; DataNode is also the name of the daemon that manages the data. File data is replicated on multiple DataNodes for reliability and so that localized computation can be executed near the data. Within a cluster, DataNodes should be uniform. If they are not uniform, issues can occur. For example, DataNodes with less memory fill up more quickly than DataNodes with more memory, which can result in job failures.
Important: The default replication factor for HDFS is three. That is, three copies of data are maintained
at all times. Cloudera recommends that you do not configure a lower replication factor when you
have at least three DataNodes. A lower replication factor may lead to data loss.
Important: There must be at least as many DataNodes running as the replication factor or the
decommissioning process will not complete.
5. Once decommissioning is completed, stop the DataNode role. When asked to select the role instance to stop,
select the DataNode role instance.
6. Verify the integrity of the HDFS service:
a. Run the following command to identify any problems in the HDFS file system:
hdfs fsck /
b. Fix any errors reported by the fsck command. If required, create a Cloudera support case.
7. After all errors are resolved:
a. Remove the DataNode role.
b. Manually remove the DataNode data directories. You can determine the location of these directories by
examining the DataNode Data Directory property in the HDFS configuration. In Cloudera Manager, go to the
HDFS service, select the Configuration tab and search for the property.
Important: You must exercise caution when deleting files to ensure that you do not lose any important
data. Cloudera recommends that you contact support for further assistance for any guidance on
deleting files.
Underreplicated blocks: HDFS automatically attempts to fix this issue by replicating the underreplicated blocks to other DataNodes so that the replication factor is met. If the automatic replication does not work, you can run the HDFS Balancer to address the issue.
Misreplicated blocks: Run the hdfs fsck -replicate command to trigger the replication of misreplicated blocks.
This ensures that the blocks are correctly replicated across racks in the cluster.
Configuring Storage Directories for DataNodes
Adding and Removing Storage Directories Using Cloudera Manager
To apply this configuration property to other role groups as needed, edit the value for the appropriate role group.
See Modifying Configuration Properties Using Cloudera Manager on page 74.
5. Enter a Reason for change, and then click Save Changes to commit the changes.
6. Restart the DataNode.
Important: You must restart the DataNodes for heterogeneous storage configuration changes
to take effect.
By default a DataNode writes new block replicas to disk volumes solely on a round-robin basis. You can configure a
volume-choosing policy that causes the DataNode to take into account how much space is available on each volume
when deciding where to place a new replica.
You can configure:
• how much DataNode volumes are allowed to differ in terms of bytes of free disk space before they are considered
imbalanced, and
• what percentage of new block allocations will be sent to volumes with more available disk space than others.
Configuring Storage Balancing for DataNodes Using Cloudera Manager
Minimum Required Role: Configurator (also provided by Cluster Administrator, Full Administrator)
1. Go to the HDFS service.
2. Click the Configuration tab.
3. Select Scope > DataNode.
4. Select Category > Advanced.
5. Configure the following properties (you can use the Search box to locate the properties):
To apply this configuration property to other role groups as needed, edit the value for the appropriate role group.
See Modifying Configuration Properties Using Cloudera Manager on page 74.
6. Enter a Reason for change, and then click Save Changes to commit the changes.
7. Restart the role.
Performing Disk Hot Swap for DataNodes
This section describes how to replace HDFS disks without shutting down a DataNode. This is referred to as hot swap.
Warning: Change the value of this property only for the specific DataNode instance where
you are planning to hot swap the disk. Do not edit the role group value for this property.
Doing so will cause data loss.
2. Enter a Reason for change, and then click Save Changes to commit the changes.
3. Refresh the affected DataNode. Select Actions > Refresh DataNode configuration.
4. Remove the old disk and add the replacement disk.
5. Change the value of the DataNode Data Directory property to add back the directories that are mount points for
the disk you added.
6. Enter a Reason for change, and then click Save Changes to commit the changes.
7. Refresh the affected DataNode. Select Actions > Refresh DataNode configuration.
8. Run the hdfs fsck / command to validate the health of HDFS.
JournalNodes
High-availability clusters use JournalNodes to synchronize active and standby NameNodes. The active NameNode
writes to each JournalNode with changes, or "edits," to HDFS namespace metadata. During failover, the standby
NameNode applies all edits from the JournalNodes before promoting itself to the active state.
Moving the JournalNode Edits Directory
Moving the JournalNode Edits Directory for a Role Instance Using Cloudera Manager
To change the location of the edits directory for one JournalNode instance:
1. Reconfigure the JournalNode Edits Directory.
a. Go to the HDFS service in Cloudera Manager.
b. Click JournalNode under Status Summary.
c. Click the JournalNode link for the instance you are changing.
d. Click the Configuration tab.
e. Set dfs.journalnode.edits.dir to the path of the new jn directory.
f. Click Save Changes.
2. Move the location of the JournalNode (jn) directory at the command line:
a. Connect to the host of the JournalNode.
b. Copy the JournalNode (jn) directory to its new location with the -a option to preserve permissions:
cp -a /<old_path_to_jn_dir>/jn /<new_path_to_jn_dir>/jn
c. Rename the old jn directory to avoid confusion:
mv /<old_path_to_jn_dir>/jn /<old_path_to_jn_dir>/jn_to_delete
Important:
• The trash feature is enabled by default. Cloudera recommends that you enable it on all production
clusters.
• The trash feature works by default only for files and directories deleted using the Hadoop shell.
Files or directories deleted programmatically using other interfaces (WebHDFS or the Java APIs,
for example) are not moved to trash, even if trash is enabled, unless the program has implemented
a call to the trash functionality.
Users can bypass trash when deleting files using the shell by specifying the -skipTrash option
to the hadoop fs -rm -r command. This can be useful when it is necessary to delete files that
are too large for the user's quota.
When you create an encryption zone, for example /enc_zone, HDFS also creates the /enc_zone/.Trash/ sub-directory. Files deleted from /enc_zone are moved to /enc_zone/.Trash/<username>/Current/. After the checkpoint, the Current directory is renamed to the current timestamp, /enc_zone/.Trash/<username>/<timestamp>.
If you delete the entire encryption zone, it will be moved to the .Trash directory under the user's home directory,
/users/<username>/.Trash/Current/enc_zone. However, if the user's home directory is already part of an
encryption zone, then attempting to delete an encryption zone will fail because you cannot move or rename directories
across encryption zones.
If you have upgraded your cluster to CDH 5.7.1 (or higher), and you have an encryption zone that was created before
the upgrade, create the .Trash directory using the -provisionTrash option as follows:
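For example (a sketch assuming an encryption zone named /enc_zone):
hdfs crypto -provisionTrash -path /enc_zone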
In CDH 5.7.0, HDFS does not automatically create the .Trash directory when an encryption zone is created. However,
you can use the following commands to manually create the .Trash directory within an encryption zone. Make sure
you run the commands as an admin user.
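A sketch of the manual commands, assuming an encryption zone named /enc_zone:
# Create the .Trash directory inside the encryption zone.
hdfs dfs -mkdir /enc_zone/.Trash
# Apply the sticky, world-writable permissions typically used for trash directories.
hdfs dfs -chmod 1777 /enc_zone/.Trash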
Note: The trash interval is measured from the point at which the files are moved to trash, not
from the last time the files were modified.
To apply this configuration property to other role groups as needed, edit the value for the appropriate role group.
See Modifying Configuration Properties Using Cloudera Manager on page 74.
5. Enter a Reason for change, and then click Save Changes to commit the changes.
6. Restart all NameNodes.
4. Enter a Reason for Change, and then click Save Changes to save the property changes.
5. Restart the HDFS service.
HDFS Balancers
HDFS data might not always be distributed uniformly across DataNodes. One common reason is addition of new
DataNodes to an existing cluster. HDFS provides a balancer utility that analyzes block placement and balances data
across the DataNodes. The balancer moves blocks until the cluster is deemed to be balanced, which means that the
utilization of every DataNode (ratio of used space on the node to total capacity of the node) differs from the utilization
of the cluster (ratio of used space on the cluster to total capacity of the cluster) by no more than a given threshold
percentage. The balancer does not balance between individual volumes on a single DataNode.
Configuring and Running the HDFS Balancer Using Cloudera Manager
Minimum Required Role: Cluster Administrator (also provided by Full Administrator)
In Cloudera Manager, the HDFS balancer utility is implemented by the Balancer role. The Balancer role usually shows
a health of None on the HDFS Instances tab because it does not run continuously.
The Balancer role is normally added (by default) when the HDFS service is installed. If it has not been added, you must
add a Balancer role to rebalance HDFS and to see the Rebalance action.
The dfs.datanode.balance.max.concurrent.moves property limits the number of threads used for balancing block moves on a DataNode. Increasing the value allows the rebalancing process to complete more quickly; decreasing the value allows rebalancing to complete more slowly, but is less likely to compete for resources with other tasks on the DataNode. To use this property, you need to set the value on both the DataNode and the Balancer.
• To configure the DataNode:
– Go to the HDFS service.
– Click the Configuration tab.
– Search for DataNode Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml.
– Add the following code to the configuration field, for example, setting the value to 50.
<property>
<name>dfs.datanode.balance.max.concurrent.moves</name>
<value>50</value>
</property>
• To configure the Balancer, add the same property to the Balancer Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml:
<property>
<name>dfs.datanode.balance.max.concurrent.moves</name>
<value>50</value>
</property>
Property | Value for Running the Balancer in the Background | Value for Running the Balancer at Maximum Speed
DataNode: dfs.datanode.balance.bandwidthPerSec | 10 MB | 10 GB
Balancer: dfs.balancer.moverThreads | 1000 | 20000
Balancer: dfs.balancer.max-size-to-move | 10 GB | 100 GB
Balancer: dfs.balancer.getBlocks.min-block-size | 10 MB | 100 MB
Enabling WebHDFS
Minimum Required Role: Configurator (also provided by Cluster Administrator, Full Administrator)
To enable WebHDFS, proceed as follows:
1. Select the HDFS service.
2. Click the Configuration tab.
3. Select Scope > HDFS-1 (Service-Wide).
4. Select the Enable WebHDFS property.
5. Click the Save Changes button.
6. Restart the HDFS service.
WebHDFS uses the following prefix and URI format: webhdfs://<HOST>:<HTTP_PORT>/<PATH>
Secure WebHDFS uses the following prefix and URI format: swebhdfs://<HOST>:<HTTP_PORT>/<PATH>
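For example, assuming a NameNode web port of 50070 (the default in CDH 5; CDH 6 uses 9870) and an illustrative
hostname, you might list a directory over WebHDFS in either of the following ways:
hadoop fs -ls webhdfs://namenode.example.com:50070/user
curl -i "http://namenode.example.com:50070/webhdfs/v1/user?op=LISTSTATUS"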
You can find a full explanation of the WebHDFS API in the WebHDFS API documentation.
Adding HttpFS
Minimum Required Role: Cluster Administrator (also provided by Full Administrator)
Apache Hadoop HttpFS is a service that provides HTTP access to HDFS.
HttpFS has a REST HTTP API supporting all HDFS filesystem operations (both read and write).
Common HttpFS use cases are:
• Read and write data in HDFS using HTTP utilities (such as curl or wget) and HTTP libraries from languages other
than Java (such as Perl).
• Transfer data between HDFS clusters running different versions of Hadoop (overcoming RPC versioning issues),
for example using Hadoop DistCp.
• Accessing WebHDFS using the Namenode WebUI port (default port 50070). Access to all data hosts in the cluster
is required, because WebHDFS redirects clients to the datanode port (default 50075). If the cluster is behind a
firewall, and you use WebHDFS to read and write data to HDFS, then Cloudera recommends you use the HttpFS
server. The HttpFS server acts as a gateway. It is the only system that is allowed to send and receive data through
the firewall.
HttpFS supports Hadoop pseudo-authentication, HTTP SPNEGO Kerberos, and additional authentication mechanisms
using a plugin API. HttpFS also supports Hadoop proxy user functionality.
The webhdfs client file system implementation can access HttpFS using the Hadoop filesystem command (hadoop
fs), by using Hadoop DistCp, and from Java applications using the Hadoop file system Java API.
The HttpFS HTTP REST API is interoperable with the WebHDFS REST HTTP API.
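For example, assuming HttpFS is listening on its default port of 14000 and pseudo-authentication is in use, a client
could list a home directory with a curl command similar to the following (hostname and user are illustrative):
curl "http://httpfs01.example.com:14000/webhdfs/v1/user/alice?op=LISTSTATUS&user.name=alice"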
For more information about HttpFS, see Hadoop HDFS over HTTP.
The HttpFS role is required for Hue when you enable HDFS high availability.
3. Enter the hostname and port for the load balancer in the following format:
<hostname>:<port>
Note:
When you set this property, Cloudera Manager regenerates the keytabs for HttpFS roles. The principal
in these keytabs contains the load balancer hostname.
If there is a Hue service that depends on this HDFS service, the Hue service has the option to use the
load balancer as its HDFS Web Interface Role.
Important:
HDFS does not currently provide ACL support for an NFS gateway.
Before you start: You must have a working HDFS cluster and know the hostname and port that your NameNode exposes.
If you use parcels to install CDH, you do not need to install the FUSE packages.
To install hadoop-hdfs-fuse on Red Hat-compatible systems:
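For example, using yum (the package name shown is the one shipped with CDH):
sudo yum install hadoop-hdfs-fuse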
You now have everything you need to begin mounting HDFS on Linux.
To set up and test your mount point in a non-HA installation:
mkdir -p <mount_point>
hadoop-fuse-dfs dfs://<name_node_hostname>:<namenode_port> <mount_point>
To set up and test your mount point in an HA installation:
mkdir -p <mount_point>
hadoop-fuse-dfs dfs://<nameservice_id> <mount_point>
where nameservice_id is the value of fs.defaultFS. In this case the port defined for
dfs.namenode.rpc-address.[nameservice ID].[name node ID] is used automatically. See Enabling HDFS
HA on page 481 for more information about these properties.
You can now run operations as if they are on your mount point. Press Ctrl+C to end the fuse-dfs program, and
umount the partition if it is still mounted.
Note:
To find its configuration directory, hadoop-fuse-dfs uses the HADOOP_CONF_DIR configured at
the time the mount command is invoked.
umount <mount_point>
You can now add a permanent HDFS mount which persists through reboots.
To add a system mount:
1. Open /etc/fstab and add lines to the bottom similar to these:
For example:
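An illustrative entry for a non-HA installation (replace the hostname, port, and mount point with your own values):
hadoop-fuse-dfs#dfs://<name_node_hostname>:<namenode_port> <mount_point> fuse allow_other,usetrash,rw 2 0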
Note:
In an HA deployment, use the HDFS nameservice instead of the NameNode URI; that is, use the
value of dfs.nameservices in hdfs-site.xml.
mount <mount_point>
Your system is now configured to allow you to use the ls command and use that mount point as if it were a normal
system disk.
For more information, see the help for hadoop-fuse-dfs:
hadoop-fuse-dfs --help
• By default, the CDH package installation creates the /etc/default/hadoop-fuse file with a maximum heap
size of 128 MB. You might need to change the JVM minimum and maximum heap size for better performance.
For example:
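You might edit /etc/default/hadoop-fuse and set the JVM heap options along these lines (values are illustrative):
export LIBHDFS_OPTS="-Xms64m -Xmx256m"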
Be careful not to set the minimum to a higher value than the maximum.
In this architecture, the NameNode is responsible for coordinating all the DataNode off-heap caches in the cluster. The
NameNode periodically receives a "cache report" from each DataNode which describes all the blocks cached on a given
DataNode. The NameNode manages DataNode caches by piggybacking cache and uncache commands on the DataNode
heartbeat.
The NameNode queries its set of cache directives to determine which paths should be cached. Cache directives are
persistently stored in the fsimage and edit log, and can be added, removed, and modified using Java and command-line
APIs. The NameNode also stores a set of cache pools, which are administrative entities used to group cache directives
together for resource management and enforcing permissions.
The NameNode periodically rescans the namespace and active cache directives to determine which blocks need to
be cached or uncached and assigns caching to DataNodes. Rescans can also be triggered by user actions such as adding
or removing a cache directive or removing a cache pool.
Currently, blocks that are under construction, corrupt, or otherwise incomplete are not cached. If a cache directive
covers a symlink, the symlink target is not cached. Caching is currently done on a per-file basis (and not at the block-level).
Concepts
Cache Directive
A cache directive defines a path that should be cached. Paths can be either directories or files. Directories are cached
non-recursively, meaning only files in the first-level listing of the directory are cached.
Directives have parameters, such as the cache replication factor and expiration time. Replication factor specifies the
number of block replicas to cache. If multiple cache directives refer to the same file, the maximum cache replication
factor is applied. Expiration time is specified on the command line as a time-to-live (TTL), a relative expiration time
in the future. After a cache directive expires, it is no longer considered by the NameNode when making caching decisions.
Cache Pool
A cache pool is an administrative entity used to manage groups of cache directives. Cache pools have UNIX-like
permissions that restrict which users and groups have access to the pool. Write permissions allow users to add and
remove cache directives to the pool. Read permissions allow users to list the cache directives in a pool, as well as
additional metadata. Execute permissions are not used.
Cache pools are also used for resource management. Pools can enforce a maximum limit that restricts the aggregate
number of bytes that can be cached by directives in the pool. Normally, the sum of the pool limits roughly equals the
amount of aggregate memory reserved for HDFS caching on the cluster. Cache pools also track a number of statistics
to help cluster users determine what is and should be cached.
Pools also enforce a maximum time-to-live. This restricts the maximum expiration time of directives being added to
the pool.
cacheadmin Command-Line Interface
On the command-line, administrators and users can interact with cache pools and directives using the hdfs cacheadmin
subcommand. Cache directives are identified by a unique, non-repeating 64-bit integer ID. IDs are not reused even if
a cache directive is later removed. Cache pools are identified by a unique string name.
Cache Directive Commands
addDirective
Description: Add a new cache directive.
Usage: hdfs cacheadmin -addDirective -path <path> -pool <pool-name> [-force] [-replication
<replication>] [-ttl <time-to-live>]
time-to-live: Time period for which the directive is valid. Can be specified in seconds, minutes, hours, and days,
for example: 30m, 4h, 2d. The value never indicates a directive that never expires. If unspecified, the directive never
expires.
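For example, the following command caches the files under an illustrative directory with two cached replicas for 30
days, using a pool named reports_pool (the path and pool name are placeholders):
hdfs cacheadmin -addDirective -path /user/reports/daily -pool reports_pool -replication 2 -ttl 30d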
removeDirective
Description: Remove a cache directive.
Usage: hdfs cacheadmin -removeDirective <id>
Where, id: The id of the cache directive to remove. You must have write permission on the pool of the directive to
remove it. To see a list of PathBasedCache directive IDs, use the -listDirectives command.
removeDirectives
Description: Remove every cache directive with the specified path.
Usage: hdfs cacheadmin -removeDirectives <path>
Where, path: The path of the cache directives to remove. You must have write permission on the pool of the directive
to remove it.
listDirectives
Description: List PathBasedCache directives.
Usage: hdfs cacheadmin -listDirectives [-stats] [-path <path>] [-pool <pool>]
Where, path: List only PathBasedCache directives with this path. Note that if there is a PathBasedCache directive for
path in a cache pool that we do not have read access for, it will not be listed.
addPool
Description: Add a new cache pool.
Usage: hdfs cacheadmin -addPool <name> [-owner <owner>] [-group <group>] [-mode <mode>]
[-limit <limit>] [-maxTtl <maxTtl>]
group: Group of the pool. Defaults to the primary group name of the current user.
mode: UNIX-style permissions for the pool. Permissions are specified in octal, for example: 0755. By default, this is set
to 0755.
limit: The maximum number of bytes that can be cached by directives in this pool, in aggregate. By default, no limit
is set.
maxTtl: The maximum allowed time-to-live for directives being added to the pool. This can be specified in seconds,
minutes, hours, and days, for example: 120s, 30m, 4h, 2d. By default, no maximum is set. A value of never specifies
that there is no limit.
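For example, the following command creates an illustrative pool that can cache at most 10 GB in aggregate, with a
maximum directive time-to-live of 7 days:
hdfs cacheadmin -addPool reports_pool -mode 0755 -limit 10737418240 -maxTtl 7d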
modifyPool
Description: Modify the metadata of an existing cache pool.
Usage: hdfs cacheadmin -modifyPool <name> [-owner <owner>] [-group <group>] [-mode <mode>]
[-limit <limit>] [-maxTtl <maxTtl>]
maxTtl: The maximum allowed time-to-live for directives being added to the pool.
removePool
Description: Remove a cache pool. This also uncaches paths associated with the pool.
Usage: hdfs cacheadmin -removePool <name>
Where, name: Name of the cache pool to remove.
listPools
Description: Display information about one or more cache pools, for example: name, owner, group, permissions, and
so on.
Usage: hdfs cacheadmin -listPools [-stats] [<name>]
Where, name: If specified, list only the named cache pool.
stats: Display additional cache pool statistics.
help
Description: Get detailed help about a command.
Usage: hdfs cacheadmin -help <command-name>
Where, command-name: The command for which to get detailed help. If no command is specified, print detailed help
for all commands.
Configuration
Native Libraries
To lock block files into memory, the DataNode relies on native JNI code found in libhadoop.so. Be sure to enable
JNI if you are using HDFS centralized cache management.
Configuration Properties
Required
Be sure to configure the following in /etc/hadoop/conf/hdfs-site.xml:
• dfs.datanode.max.locked.memory: The maximum amount of memory a DataNode uses for caching (in bytes).
The "locked-in-memory size" ulimit (ulimit -l) of the DataNode user also needs to be increased to match this
parameter (see OS Limits). When setting this value, remember that you need space in memory for other things
as well, such as the DataNode and application JVM heaps and the operating system page cache.
Optional
The following properties are not required, but may be specified for tuning:
• dfs.namenode.path.based.cache.refresh.interval.ms: The NameNode uses this as the number of
milliseconds between subsequent path cache rescans. Each rescan calculates which blocks should be cached and
which DataNode containing a replica of each block should cache it. By default, this parameter is set to 30000, which is 30 seconds.
• dfs.datanode.fsdatasetcache.max.threads.per.volume: The DataNode uses this as the maximum
number of threads per volume to use for caching new data. By default, this parameter is set to 4.
• dfs.cachereport.intervalMsec: The DataNode uses this as the number of milliseconds between sending a
full report of its cache state to the NameNode. By default, this parameter is set to 10000, which is 10 seconds.
• dfs.namenode.path.based.cache.block.map.allocation.percent: The percentage of the Java heap
allocated to the cached blocks map. The cached blocks map is a hash map that uses chained hashing.
Smaller maps may be accessed more slowly if the number of cached blocks is large; larger maps consume
more memory. By default, this parameter is set to 0.25 percent.
OS Limits
If you get the error Cannot start datanode because the configured max locked memory size... is
more than the datanode's available RLIMIT_MEMLOCK ulimit, the operating system is imposing a lower
limit on the amount of memory that you can lock than what you have configured. To fix this, adjust the DataNode's
ulimit -l value. Usually, this value is configured in /etc/security/limits.conf, but this varies depending on your
operating system and distribution.
You have correctly configured this value when you can run ulimit -l from the shell and get back either a higher
value than what you have configured with dfs.datanode.max.locked.memory, or the string unlimited, indicating
that there is no limit. It is typical for ulimit -l to output the memory lock limit in KB, but
dfs.datanode.max.locked.memory must be specified in bytes.
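For example, you might check and raise the limit as follows on a Linux system (the exact file and the 1048576 KB
figure are illustrative and vary by distribution):
# Check the current locked-memory limit for the DataNode user; most shells report it in KB
ulimit -l
# Illustrative /etc/security/limits.conf entries granting the hdfs user roughly 1 GB of lockable memory (value in KB)
hdfs hard memlock 1048576
hdfs soft memlock 1048576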
To limit the groups to which users impersonated by a proxy user can belong, use hadoop.proxyuser.<proxy_user>.groups.
For example, to allow user alice to impersonate only users belonging to group_a and group_b:
<property>
<name>hadoop.proxyuser.alice.groups</name>
<value>group_a,group_b</value>
</property>
To limit the hosts from which impersonated connections are allowed, use hadoop.proxyuser.<proxy_user>.hosts.
For example, to allow user alice impersonated connections only from host_a and host_b:
<property>
<name>hadoop.proxyuser.alice.hosts</name>
<value>host_a,host_b</value>
</property>
If the configuration properties described are not present, impersonation is not allowed and connections will fail.
For looser restrictions, use a wildcard (*) to allow impersonation from any host and of any user. For example, to allow
user bob to impersonate any user belonging to any group, and from any host, set the properties as follows:
<property>
<name>hadoop.proxyuser.bob.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.bob.groups</name>
<value>*</value>
</property>
Hosts can also be specified as IP addresses or CIDR ranges, and the set of users who can be impersonated can be
restricted with hadoop.proxyuser.<proxy_user>.users. For example, to allow the user super to impersonate only user1
and user2, and only from hosts in the 10.222.0.0/16 subnet or from 10.113.221.221:
<property>
<name>hadoop.proxyuser.super.hosts</name>
<value>10.222.0.0/16,10.113.221.221</value>
</property>
<property>
<name>hadoop.proxyuser.super.users</name>
<value>user1,user2</value>
</property>
Warning: Dell EMC Isilon is supported only on CDH 6.3.1 and higher.
Dell EMC Isilon is a storage service with a distributed filesystem that can be used in place of HDFS to provide storage for
CDH services.
Note: This page contains references to CDH 5 components or features that have been removed from
CDH 6. These references are only applicable if you are managing a CDH 5 cluster with Cloudera Manager
6. For more information, see Deprecated Items.
Note: This documentation covers only the Cloudera Manager portion of using EMC Isilon storage
with CDH. For information about tasks performed on Isilon OneFS, see the Dell EMC Community's
Isilon Info Hub.
Supported Versions
For Cloudera and Isilon compatibility information, see the product compatibility matrix for Product Compatibility for
Dell EMC Isilon.
Differences Between Isilon HDFS and CDH HDFS
The following features of HDFS are not implemented with Isilon OneFS:
• HDFS caching
• HDFS encryption
• HDFS ACLs
Installing Cloudera Manager and CDH with Isilon
For instructions on configuring Isilon and installing Cloudera Manager and CDH with Isilon, see the following EMC
documentation:
• EMC Isilon OneFS with Hadoop and Cloudera for Kerberos Installation Guide (PDF)
• EMC Isilon OneFS with Cloudera Hadoop Installation Guide (PDF)
The typical use case for Impala and Isilon together is to use Isilon for the default filesystem, replacing HDFS entirely.
In this configuration, when you create a database, table, or partition, the data always resides on Isilon storage and you
do not need to specify any special LOCATION attribute. If you do specify a LOCATION attribute, its value refers to a
path within the Isilon filesystem. For example:
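A minimal sketch using impala-shell (the table name and the Isilon path are illustrative):
impala-shell -q "CREATE EXTERNAL TABLE web_logs (msg STRING) LOCATION '/ifs/data/web_logs'"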
Impala can write to, delete, and rename data files and database, table, and partition directories on Isilon storage.
Therefore, Impala statements such as CREATE TABLE, DROP TABLE, CREATE DATABASE, DROP DATABASE, ALTER
TABLE, and INSERT work the same with Isilon storage as with HDFS.
When the Impala spill-to-disk feature is activated by a query that approaches the memory limit, Impala writes all the
temporary data to a local (not Isilon) storage device. Because the I/O bandwidth for the temporary data depends on
the number of local disks, and clusters using Isilon storage might not have as many local disks attached, pay special
attention on Isilon-enabled clusters to any queries that use the spill-to-disk feature. Where practical, tune the queries
or allocate extra memory for Impala to avoid spilling. Although you can specify an Isilon storage device as the destination
for the temporary data for the spill-to-disk feature, that configuration is not recommended due to the need to transfer
the data both ways using remote I/O.
When tuning Impala queries on HDFS, you typically try to avoid any remote reads. When the data resides on Isilon
storage, all the I/O consists of remote reads. Do not be alarmed when you see non-zero numbers for remote read
measurements in query profile output. The benefit of the Impala and Isilon integration is primarily convenience of not
having to move or copy large volumes of data to HDFS, rather than raw query performance. You can increase the
performance of Impala I/O for Isilon systems by increasing the value for the --num_remote_hdfs_io_threads
configuration parameter, in the Cloudera Manager user interface for clusters using Cloudera Manager, or through the
--num_remote_hdfs_io_threads startup option for the impalad daemon on clusters not using Cloudera Manager.
For information about managing Isilon storage devices through Cloudera Manager, see .
Required Configurations
Specify the following configurations in Cloudera Manager on the Clusters > Isilon Service > Configuration tab:
• In the HDFS Client Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml and the
Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml properties for the Isilon service,
set the value of the dfs.client.file-block-storage-locations.timeout.millis property to 10000.
• In the Isilon Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml property for the Isilon
service, set the value of the hadoop.security.token.service.use_ip property to FALSE.
• If you see errors that reference the .Trash directory, make sure that the Use Trash property is selected.
Configuring Replication with Kerberos and Isilon
If you plan to use replication between clusters that use Isilon storage and that also have enabled Kerberos, do the
following:
1. Create a custom Kerberos Keytab and Kerberos principal that the replication jobs use to authenticate to storage
and other CDH services. See Authentication.
2. In Cloudera Manager, select Administration > Settings.
3. Search for and enter values for the following properties:
• Custom Kerberos Keytab Location – Enter the location of the Custom Kerberos Keytab.
• Custom Kerberos Principal Name – Enter the principal name to use for replication between secure clusters.
4. When you create a replication schedule, enter the Custom Kerberos Principal Name in the Run As Username
field. See Configuring Replication of HDFS Data on page 548 and Configuring Replication of Hive/Impala Data on
page 562.
5. Ensure that both the source and destination clusters have the same set of users and groups. When you set
ownership of files (or when maintaining ownership), if a user or group does not exist, the chown command fails
on Isilon. See Performance and Scalability Limitations on page 545
6. Cloudera recommends that you do not select the Replicate Impala Metadata option for Hive/Impala replication
schedules. If you need to use this feature, create a custom principal of the form hdfs/hostname@realm or
impala/hostname@realm.
7. Add the following property and value to the HDFS Service Advanced Configuration Snippet (Safety Valve) for
hdfs-site.xml and Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml properties:
hadoop.security.token.service.use_ip = false
If the replication MapReduce job fails with an error similar to the following:
Set the Isilon cluster-wide time-to-live setting to a higher value on the destination cluster for the replication. Note that
higher values may affect load balancing in the Isilon cluster by causing workloads to be less distributed. A value of 60
is a good starting point. For example:
You can view the settings for a subnet with a command similar to the following:
Overview
Each DataNode in a cluster is configured with a set of data directories. You can configure each data directory with a
storage type. The storage policy dictates which storage types to use when storing the file or directory.
Some reasons to consider using different types of storage are:
• You have datasets with temporal locality (for example, time-series data). The latest data can be loaded initially
into SSD for improved performance, then migrated out to disk as it ages.
• You need to move cold data to denser archival storage because the data will rarely be accessed and archival
storage is much cheaper. This could be done with simple age-out policies: for example, moving data older than
six months to archival storage.
Storage Types
The storage type identifies the underlying storage media. HDFS supports the following storage types:
• ARCHIVE - Archival storage is for very dense storage and is useful for rarely accessed data. This storage type is
typically cheaper per TB than normal hard disks.
• DISK - Hard disk drives are relatively inexpensive and provide sequential I/O performance. This is the default
storage type.
• SSD - Solid state drives are useful for storing hot data and I/O-intensive applications.
• RAM_DISK - This special in-memory storage type is used to accelerate low-durability, single-replica writes.
When you add the DataNode Data Directory, you can specify which type of storage it uses, by prefixing the path with
the storage type, in brackets. If you do not specify a storage type, it is assumed to be DISK. See Adding Storage
Directories on page 147.
Storage Policies
A storage policy contains information that describes the type of storage to use. This policy also defines the fallback
storage type if the primary type is out of space or out of quota. If a target storage type is not available, HDFS attempts
to place replicas on the default storage type.
Each storage policy consists of a policy ID, a policy name, a list of storage types, a list of fallback storage types for file
creation, and a list of fallback storage types for replication.
HDFS has six preconfigured storage policies.
• Hot - All replicas are stored on DISK.
• Cold - All replicas are stored on ARCHIVE.
• Warm - One replica is stored on DISK and the others are stored on ARCHIVE.
• All_SSD - All replicas are stored on SSD.
• One_SSD - One replica is stored on SSD and the others are stored on DISK.
• Lazy_Persist - The replica is written to RAM_DISK and then lazily persisted to DISK.
Note: You cannot create your own storage policy. You must use one of the six pre-configured policies.
HDFS clients such as HBase may support different storage policies.
7. Start HBase.
Setting a Storage Policy for HDFS
Minimum Required Role: Cluster Administrator (also provided by Full Administrator)
To set a storage policy on a DataNode Data Directory using Cloudera Manager, perform the following tasks:
1. Check the HDFS Service Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml to be sure that
dfs.storage.policy.enabled has not been changed from its default value of true.
2. Specify the storage types for each DataNode Data Directory that is not a standard disk, by adding the storage type
in brackets at the beginning of the directory path. For example:
[SSD]/dfs/dn1
[DISK]/dfs/dn2
[ARCHIVE]/dfs/dn3
3. Open a terminal session on any HDFS host. Run the following hdfs command for each path on which you want to
set a storage policy:
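The command takes the following form; for example, to apply the preconfigured All_SSD policy (the path is a placeholder):
hdfs storagepolicies -setStoragePolicy -path <path> -policy All_SSD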
4. To move the data to the appropriate storage based on the current storage policy, use the mover utility, from any
HDFS host. Use mover -h to get a list of available options. To migrate all data at once (this may take a long time),
you can set the path to /.
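For example, to migrate a single path (shown here as a placeholder):
hdfs mover -p <path>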
Note: Quotas are enforced at the time you set the storage policy or when writing the file, not
when quotas are changed. The Mover tool does not recognize quota violations. It only verifies
that a file is stored on the storage types specified in its policy. For more information about quotas,
see Setting HDFS Quotas on page 157.
• To reset a storage policy, follow the steps used in Setting a Storage Policy for HDFS on page 169.
Migrating Existing Data
To move the data to the appropriate storage based on the current storage policy, use the mover utility, from any HDFS
host. Use mover -h to get a list of available options. To migrate all data at once (this may take a long time), you can
set the path to /.
Note: Quotas are enforced at the time you set the storage policy or when writing the file, not when
quotas are changed. The Mover tool does not recognize quota violations. It only verifies that a file is
stored on the storage types specified in its policy. For more information about quotas, see Setting
HDFS Quotas on page 157.
Managing Hue
Hue is a set of web UIs that enable you to interact with a CDH cluster. This section describes tasks for managing Hue.
6. Select Use Custom Databases for production clusters and input values for database hostname, type, name,
username, and password.
7. Click Test Connection, and when green, click Continue. Cloudera Manager starts the Hue service.
8. Click Continue and Finish.
9. If your cluster uses Kerberos, Cloudera Manager automatically adds a Hue Kerberos Ticket Renewer role to each
host where you assigned the Hue Server role instance. See Enable Hue to Use Kerberos for Authentication.
Adding a Hue Role Instance
Minimum Required Role: Cluster Administrator (also provided by Full Administrator)
Roles are functions that comprise a service and role instances must be assigned to one or more hosts. You can easily
assign roles to hosts in Cloudera Manager.
1. Go to the Hue service.
2. Click the Instances tab.
3. Click the Add Role Instances button.
4. Click Continue to accept the default role assignments; or click the gray field below each role to open the hosts
dialog, customize assignments, and click OK to save.
If a drop down menu displays (indicating that all hosts apply), select All Hosts, or else click Custom to display the
hosts dialog. Click OK to accept custom assignments.
The wizard evaluates host hardware configurations to determine the best hosts for each role. All worker roles are
automatically assigned to the same set of hosts as the HDFS DataNode. You can reassign if necessary. Specify
hostnames by IP address, rack name, or by range:
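For example (the patterns are illustrative):
host[1-3].example.com
host[07-10].example.com
10.1.1.[1-4]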
5. If your cluster uses Kerberos, you must manually add the Hue Kerberos Ticket Renewer role to each host where
you assigned the Hue Server role instance. Cloudera Manager throws a validation error if the new Hue Server role
does not have a colocated KT Renewer role. See Enable Hue to Use Kerberos for Authentication.
6. Click Continue.
To apply this configuration property to other role groups as needed, edit the value for the appropriate role group.
See Modifying Configuration Properties Using Cloudera Manager on page 74.
6. Enter a Reason for change, and then click Save Changes to commit the changes.
7. Restart the Hue service.
Note: Enabling impersonation requires that you grant HBase permissions to each individual user.
Otherwise, grant all HBase permissions to the Hue user.
4. If you have a Kerberos cluster with doAs and force principal names to lower case, be sure to exclude the HTTP
principal:
a. Go to the HDFS service.
b. Filter by Scope > HDFS (Service-Wide) and Category > Security.
c. Search on Additional Rules to Map Kerberos Principals to Short Names (auth_to_local) and add two
HTTP rules above your existing rules:
# Exclude HTTP
RULE:[1:$1@$0](HTTP@\QEXAMPLE.COM\E$)s/@\Q.EXAMPLE.COM\E$//
RULE:[2:$1@$0](HTTP@\QEXAMPLE.COM\E$)s/@\Q.EXAMPLE.COM\E$//
The following properties configure TLS/SSL for the HBase Thrift Server over HTTP:
• Enable TLS/SSL for HBase Thrift Server over HTTP: Encrypt communication between clients and HBase Thrift Server
over HTTP using Transport Layer Security (TLS).
• HBase Thrift Server over HTTP TLS/SSL Server JKS Keystore File Location: Path to the TLS/SSL keystore file (in JKS
format) with the TLS/SSL server certificate and private key. Used when HBase Thrift Server over HTTP acts as a
TLS/SSL server.
• HBase Thrift Server over HTTP TLS/SSL Server JKS Keystore File Password: Password for the HBase Thrift Server JKS
keystore file.
• HBase Thrift Server over HTTP TLS/SSL Server JKS Keystore Key Password: Password that protects the private key
contained in the JKS keystore used when HBase Thrift Server over HTTP acts as a TLS/SSL server.
c. Enter a Reason for change, and then click Save Changes to commit the changes.
d. Restart the HBase service.
6. Configure Hue to point to the Thrift Server and to a valid HBase configuration directory:
a. Go to the Hue service and click the Configuration tab.
b. Filter by Scope > All and Category > Main.
c. Set the property, HBase Service, to the service for which you enabled the Thrift Server role (if you have more
than one HBase service instance).
d. Set the property, HBase Thrift Server, to the Thrift Server role for Hue to use.
e. Filter by Category > Advanced.
f. Edit the property, Hue Service Advanced Configuration Snippet (Safety Valve) for hue_safety_valve.ini, by
adding a valid HBase configuration directory as follows:
[hbase]
hbase_conf_dir={{HBASE_CONF_DIR}}
g. Enter a Reason for change, and then click Save Changes to commit the changes.
Managing Impala
The first part of this section describes the following Impala configuration topics:
• The Impala Service on page 174
• Post-Installation Configuration for Impala on page 178
• Impala Security Overview
• Modifying Impala Startup Options on page 177
The rest of this section describes how to configure Impala to accept connections from applications that use popular
programming APIs:
• Configuring Impala to Work with ODBC on page 180
• Configuring Impala to Work with JDBC on page 182
This type of configuration is especially useful when using Impala in combination with Business Intelligence tools, which
use these standard interfaces to query different kinds of database and Big Data systems.
Note:
When you set this property, Cloudera Manager regenerates the keytabs for Impala Daemon roles.
The principal in these keytabs contains the load balancer hostname.
If there is a Hue service that depends on this Impala service, it also uses the load balancer to
communicate with Impala.
6. Enter a Reason for change, and then click Save Changes to commit the changes.
Impala Web Servers
• Impala Daemon
1. Go to the Impala service.
2. Click the Instances tab.
3. Click an Impala Daemon instance.
4. Click Impala Daemon Web UI.
• Impala Catalog Server
1. Go to the Impala service.
2. Select Web UI > Impala Catalog Web UI.
• Impala Llama ApplicationMaster
1. Go to the Impala service.
2. Click the Instances tab.
3. Click an Impala Llama ApplicationMaster instance.
4. Click Llama Web UI.
Important: If Cloudera Manager cannot find the .pem file on the host for a specific role instance,
that role will fail to start.
When you access the Web UI for the Impala Catalog Server, Daemon, and StateStore, https will be used.
Note: Queries that exceed the specified memory limit are aborted. Percentage limits are based
on the physical memory of the machine and do not consider cgroups.
• Core dump enablement. Enable the Enable Core Dump setting for the Impala service.
Note:
• The location of core dump files may vary according to your operating system configuration.
• Other security settings may prevent Impala from writing core dumps even when this option
is enabled.
• The default location for core dumps is on a temporary filesystem, which can lead to
out-of-space issues if the core dumps are large, frequent, or not removed promptly. To
specify an alternative location for the core dumps, filter the Impala configuration settings
to find the core_dump_dir option. This option lets you specify a different directory for core
dumps for each of the Impala-related daemons.
• Authorization using the open source Sentry plugin. See Enabling Sentry for Impala in Cloudera Manager for details.
• Auditing for successful or blocked Impala queries, another aspect of security. Specify the
--audit_event_log_dir=directory_path option and optionally the
--max_audit_event_log_file_size=number_of_queries and --abort_on_failed_audit_event
options as part of the IMPALA_SERVER_ARGS settings, for each Impala node, to enable and customize auditing.
See Auditing Impala Operations for details.
• Password protection for the Impala web UI, which listens on port 25000 by default. This feature involves adding
some or all of the --webserver_password_file, --webserver_authentication_domain, and
--webserver_certificate_file options to the IMPALA_SERVER_ARGS and IMPALA_STATE_STORE_ARGS
settings. See Security Guidelines for Impala for details.
• Another setting you might add to IMPALA_SERVER_ARGS is a comma-separated list of query options and values:
--default_query_options='option=value,option=value,...'
These options control the behavior of queries performed by this impalad instance. The option values you specify
here override the default values for Impala query options, as shown by the SET statement in impala-shell.
• During troubleshooting, Cloudera Support might direct you to change other values, particularly for IMPALA_SERVER_ARGS, to work
around issues or gather debugging information.
Note:
These startup options for the impalad daemon are different from the command-line options for the
impala-shell command. For the impala-shell options, see impala-shell Configuration Options.
Note: If you use Cloudera Manager, you can enable short-circuit reads through a checkbox in the
user interface and that setting takes effect for Impala as well.
1. Copy the client core-site.xml and hdfs-site.xml configuration files from the Hadoop configuration directory
to the Impala configuration directory. The default Impala configuration location is /etc/impala/conf.
2. On all Impala nodes, configure the following properties in Impala's copy of hdfs-site.xml as shown:
<property>
<name>dfs.client.read.shortcircuit</name>
<value>true</value>
</property>
<property>
<name>dfs.domain.socket.path</name>
<value>/var/run/hdfs-sockets/dn</value>
</property>
<property>
<name>dfs.client.file-block-storage-locations.timeout.millis</name>
<value>10000</value>
</property>
Note: If you are also going to enable block location tracking, you can skip copying configuration
files and restarting DataNodes and go straight to Optional: Block Location Tracking. Configuring
short-circuit reads and block location tracking require the same process of copying files and
restarting services, so you can complete that process once when you have completed all
configuration changes. Whether you copy files and restart services now or during configuring
block location tracking, short-circuit reads are not enabled until you complete those final steps.
<property>
<name>dfs.datanode.hdfs-blocks-metadata.enabled</name>
<value>true</value>
</property>
2. Copy the client core-site.xml and hdfs-site.xml configuration files from the Hadoop configuration directory
to the Impala configuration directory. The default Impala configuration location is /etc/impala/conf.
3. After applying these changes, restart all DataNodes.
Important: As of late 2015, most business intelligence applications are certified with the 2.x ODBC
drivers. Although the instructions on this page cover both the 2.x and 1.x drivers, expect to use the
2.x drivers exclusively for most ODBC applications connecting to Impala. CDH 6.0 has been tested with
the Impala ODBC driver version 2.5.42, and Cloudera recommends that you use this version when you
start using CDH 6.0.
See the database drivers section on the Cloudera downloads web page to download and install the driver.
Configuring the ODBC Port
Versions 2.5 and 2.0 of the Cloudera ODBC Connector, currently certified for some but not all BI applications, use the
HiveServer2 protocol, corresponding to Impala port 21050. Impala supports Kerberos authentication with all the
supported versions of the driver, and requires ODBC 2.05.13 for Impala or higher for LDAP username/password
authentication.
Version 1.x of the Cloudera ODBC Connector uses the original HiveServer1 protocol, corresponding to Impala port
21000.
Example of Setting Up an ODBC Application for Impala
To illustrate the outline of the setup process, here is a transcript of a session to set up all required drivers and a business
intelligence application that uses the ODBC driver, under Mac OS X. Each .dmg file runs a GUI-based installer, first for
the underlying IODBC driver needed for non-Windows systems, then for the Cloudera ODBC Connector, and finally for
the BI tool itself.
$ ls -1
Cloudera-ODBC-Driver-for-Impala-Install-Guide.pdf
BI_Tool_Installer.dmg
iodbc-sdk-3.52.7-macosx-10.5.dmg
ClouderaImpalaODBC.dmg
$ open iodbc-sdk-3.52.7-macosx-10.5.dmg
Install the IODBC driver using its installer
$ open ClouderaImpalaODBC.dmg
Install the Cloudera ODBC Connector using its installer
$ installer_dir=$(pwd)
$ cd /opt/cloudera/impalaodbc
$ ls -1
Cloudera ODBC Driver for Impala Install Guide.pdf
Readme.txt
Setup
lib
ErrorMessages
Release Notes.txt
Tools
$ cd Setup
$ ls
odbc.ini odbcinst.ini
$ cp odbc.ini ~/.odbc.ini
$ vi ~/.odbc.ini
$ cat ~/.odbc.ini
[ODBC]
# Specify any global ODBC configuration here such as ODBC tracing.
# Values for HOST, PORT, KrbFQDN, and KrbServiceName should be set here.
# They can also be specified on the connection string.
HOST=hostname.sample.example.com
PORT=21050
Schema=default
# General settings
TSaslTransportBufSize=1000
RowsFetchedPerBlock=10000
SocketTimeout=0
StringColumnLength=32767
UseNativeQuery=0
$ pwd
/opt/cloudera/impalaodbc/Setup
$ cd $installer_dir
$ open BI_Tool_Installer.dmg
Install the BI tool using its installer
$ ls /Applications | grep BI_Tool
BI_Tool.app
$ open -a BI_Tool.app
In the BI tool, connect to a data source using port 21050
Notes about JDBC and ODBC Interaction with Impala SQL Features
Most Impala SQL features work equivalently through the impala-shell interpreter or the JDBC or ODBC APIs. The
following are some exceptions to keep in mind when switching between the interactive shell and applications using
the APIs:
Note: If your JDBC or ODBC application connects to Impala through a load balancer such as haproxy,
be cautious about reusing the connections. If the load balancer has set up connection timeout values,
either check the connection frequently so that it never sits idle longer than the load balancer timeout
value, or check the connection validity before using it and create a new one if the connection has
been closed.
Note: The latest JDBC driver, corresponding to Hive 0.13, provides substantial performance
improvements for Impala queries that return large result sets. Impala 2.0 and later are compatible
with the Hive 0.13 driver. If you already have an older JDBC driver installed, and are running Impala
2.0 or higher, consider upgrading to the latest Hive JDBC driver for best performance with JDBC
applications.
If you are using JDBC-enabled applications on hosts outside the CDH cluster, you cannot use the CDH install procedure
on the non-CDH hosts. Install the JDBC driver on at least one CDH host using the preceding procedure. Then download
the JAR files to each client machine that will use JDBC with Impala:
commons-logging-X.X.X.jar
hadoop-common.jar
hive-common-X.XX.X-cdhX.X.X.jar
hive-jdbc-X.XX.X-cdhX.X.X.jar
hive-metastore-X.XX.X-cdhX.X.X.jar
hive-service-X.XX.X-cdhX.X.X.jar
httpclient-X.X.X.jar
httpcore-X.X.X.jar
libfb303-X.X.X.jar
libthrift-X.X.X.jar
log4j-X.X.XX.jar
slf4j-api-X.X.X.jar
slf4j-logXjXX-X.X.X.jar
To enable JDBC support for Impala on the system where you run the JDBC application:
1. Download the JAR files listed above to each client machine.
Note: For Maven users, see this sample github page for an example of the dependencies you
could add to a pom file instead of downloading the individual JARs.
2. Store the JAR files in a location of your choosing, ideally a directory already referenced in your CLASSPATH setting.
For example:
• On Linux, you might use a location such as /opt/jars/.
• On Windows, you might use a subdirectory underneath C:\Program Files.
3. To successfully load the Impala JDBC driver, client programs must be able to locate the associated JAR files. This
often means setting the CLASSPATH for the client process to include the JARs. Consult the documentation for
your JDBC client for more details on how to install new JDBC drivers, but some examples of how to set CLASSPATH
variables include:
• On Linux, if you extracted the JARs to /opt/jars/, you might issue the following command to prepend the
JAR files path to an existing classpath:
export CLASSPATH=/opt/jars/*.jar:$CLASSPATH
• On Windows, use the System Properties control panel item to modify the Environment Variables for your
system. Modify the environment variables to include the path to which you extracted the files.
Note: If the existing CLASSPATH on your client machine refers to some older version of the
Hive JARs, ensure that the new JARs are the first ones listed. Either put the new JAR files
earlier in the listings, or delete the other references to Hive JAR files.
Note: If your JDBC or ODBC application connects to Impala through a load balancer such as haproxy,
be cautious about reusing the connections. If the load balancer has set up connection timeout values,
either check the connection frequently so that it never sits idle longer than the load balancer timeout
value, or check the connection validity before using it and create a new one if the connection has
been closed.
The Cloudera JDBC Connector uses connection strings of the following form:
jdbc:impala://Host:Port[/Schema];Property1=Value;Property2=Value;...
With the Hive JDBC driver, to connect to an instance of Impala that does not require authentication, you might use a
connection string such as:
jdbc:hive2://myhost.example.com:21050/;auth=noSasl
To connect to an instance of Impala that requires Kerberos authentication, use a connection string of the form
jdbc:hive2://host:port/;principal=principal_name. The principal must be the same user principal you
used when starting Impala. For example, you might use:
jdbc:hive2://myhost.example.com:21050/;principal=impala/[email protected]
To connect to an instance of Impala that requires LDAP authentication, use a connection string of the form
jdbc:hive2://host:port/db_name;user=ldap_userid;password=ldap_password. For example, you might
use:
jdbc:hive2://myhost.example.com:21050/test_db;user=fred;password=xyz123
Note:
Prior to CDH 5.7 / Impala 2.5, the Hive JDBC driver did not support connections that use both Kerberos
authentication and SSL encryption. If your cluster is running an older release that has this restriction,
to use both of these security features with Impala through a JDBC application, use the Cloudera JDBC
Connector as the JDBC driver.
Notes about JDBC and ODBC Interaction with Impala SQL Features
Most Impala SQL features work equivalently through the impala-shell interpreter or the JDBC or ODBC APIs. The
following are some exceptions to keep in mind when switching between the interactive shell and applications using
the APIs:
• Complex type considerations:
– Queries involving the complex types (ARRAY, STRUCT, and MAP) require notation that might not be available
in all levels of JDBC and ODBC drivers. If you have trouble querying such a table due to the driver level or
inability to edit the queries used by the application, you can create a view that exposes a “flattened” version
of the complex columns and point the application at the view. See Complex Types ( or higher only) for details.
– The complex types available in and higher are supported by the JDBC getColumns() API. Both MAP and
ARRAY are reported as the JDBC SQL Type ARRAY, because this is the closest matching Java SQL type. This
behavior is consistent with Hive. STRUCT types are reported as the JDBC SQL Type STRUCT.
To be consistent with Hive's behavior, the TYPE_NAME field is populated with the primitive type name for
scalar types, and with the full toSql() for complex types. The resulting type names are somewhat inconsistent,
because nested types are printed differently than top-level types. For example, the following list shows how
toSQL() for Impala types are translated to TYPE_NAME values:
1. On the Home > Status tab, click the drop-down menu to the right of the cluster name and select Add a Service. A
list of service types displays. You can add one type of service at a time.
2. Select the Key-Value Store Indexer service and click Continue.
3. Select the services on which the new service should depend. All services must depend on the same ZooKeeper
service. Click Continue.
4. Customize the assignment of role instances to hosts. The wizard evaluates the hardware configurations of the
hosts to determine the best hosts for each role. The wizard assigns all worker roles to the same set of hosts to
which the HDFS DataNode role is assigned. You can reassign role instances.
Click a field below a role to display a dialog box containing a list of hosts. If you click a field containing multiple
hosts, you can also select All Hosts to assign the role to all hosts, or Custom to display the hosts dialog box.
The following shortcuts for specifying hostname patterns are supported:
• Range of hostnames (without the domain portion)
• IP addresses
• Rack name
Click the View By Host button for an overview of the role assignment by hostname ranges.
5. Click Continue.
6. Review the configuration changes to be applied. Confirm the settings entered for file system paths. The file paths
required vary based on the services to be installed. If you chose to add the Sqoop service, indicate whether to use
the default Derby database or the embedded PostgreSQL database. If the latter, type the database name, host,
and user credentials that you specified when you created the database.
Warning: Do not place DataNode data directories on NAS devices. When resizing an NAS, block
replicas can be deleted, which will result in reports of missing blocks.
7. Click Continue.
8. Click Finish.
Managing Kudu
This topic describes the tasks you can perform to manage the Kudu service using Cloudera Manager. You can use
Cloudera Manager to upgrade the Kudu service, start and stop the Kudu service, monitor operations, and configure
the Kudu master and tablet servers, among other tasks. Depending on your deployment, there are several different
configuration settings you may need to modify.
For detailed information about Apache Kudu, view the Apache Kudu Guide.
• To configure a different dump directory for the Kudu master, modify the value of the Kudu Master Core
Dump Directory property.
• To configure a different dump directory for the Kudu tablet servers, modify the value of the Kudu Tablet
Server Core Dump Directory property.
6. Click Save Changes.
Managing Solr
You can install the Solr service through the Cloudera Manager installation wizard, using either parcels or packages.
See Cloudera Installation Guide.
You can elect to have the service created and started as part of the Installation wizard. If you elect not to create the
service using the Installation wizard, you can use the Add Service wizard to perform the installation. The wizard will
automatically configure and start the dependent services and the Solr service. See Adding a Service on page 249 for
instructions.
For further information on the Solr service, see Search Guide.
The following sections describe how to configure other CDH components to work with the Solr service.
Configuring the Flume Morphline Solr Sink for Use with the Solr Service
Minimum Required Role: Configurator (also provided by Cluster Administrator, Full Administrator)
To use a Flume Morphline Solr sink, the Flume service must be running on your cluster. See the Flume Near Real-Time
Indexing Reference for information about the Flume Morphline Solr Sink and Configuring Apache Flume.
1. Go to the Flume service.
2. Click the Configuration tab.
3. Select Scope > Agent
4. Select Category > Flume-NG Solr Sink.
5. Edit the following settings, which are templates that you must modify for your deployment:
• Morphlines File (morphlines.conf) - Configures Morphlines for Flume agents. You must use $ZK_HOST in
this field instead of specifying a ZooKeeper quorum. Cloudera Manager automatically replaces the $ZK_HOST
variable with the correct value during the Flume configuration deployment.
• Custom MIME-types File (custom-mimetypes.xml) - Configuration for the detectMimeTypes command.
See the Cloudera Morphlines Reference Guide for details on this command.
• Grok Dictionary File (grok-dictionary.conf) - Configuration for the grok command. See the Cloudera
Morphlines Reference Guide for details on this command.
To apply this configuration property to other role groups as needed, edit the value for the appropriate role group.
See Modifying Configuration Properties Using Cloudera Manager on page 74.
Once configuration is complete, Cloudera Manager automatically deploys the required files to the Flume agent process
directory when it starts the Flume agent. Therefore, you can reference the files in the Flume agent configuration using
their relative path names. For example, you can use the name morphlines.conf to refer to the location of the
Morphlines configuration file.
Note:
When you set this property, Cloudera Manager regenerates the keytabs for Solr roles. The principal
in these keytabs contains the load balancer hostname.
If there is a Hue service that depends on this Solr service, it also uses the load balancer to
communicate with Solr.
6. Enter a Reason for change, and then click Save Changes to commit the changes.
Note: Information about Solr nodes can also be found in clusterstate.json, but that file
only lists nodes currently hosting replicas. Nodes running Solr but not currently hosting replicas
are not listed in clusterstate.json.
2. Add the new replica on solr02.example.com using the ADDREPLICA API call.
http://solr01.example.com:8983/solr/admin/collections?action=ADDREPLICA&collection=email&shard=shard1&node=solr02.example.com:8983_solr
3. Verify that the replica creation succeeds and moves from recovery state to ACTIVE. You can check the replica
status in the Cloud view, which can be found at a URL similar to:
http://solr02.example.com:8983/solr/#/~cloud.
Note: Do not delete the original replica until the new one is in the ACTIVE state. When the newly
added replica is listed as ACTIVE, the index has been fully replicated to the newly added replica.
The total time to replicate an index varies according to factors such as network bandwidth and
the size of the index. Replication times on the scale of hours are not uncommon and do not
necessarily indicate a problem.
You can use the details command to get an XML document that contains information about
replication progress. Use curl or a browser to access a URI similar to:
http://solr02.example.com:8983/solr/email_shard1_replica2/replication?command=details
Accessing this URI returns an XML document that contains content about replication progress. A
snippet of the XML content might appear as follows:
...
<str name="numFilesDownloaded">126</str>
<str name="replicationStartTime">Tue Jan 21 14:34:43 PST 2014</str>
<str name="timeElapsed">457s</str>
<str name="currentFile">4xt_Lucene41_0.pos</str>
<str name="currentFileSize">975.17 MB</str>
<str name="currentFileSizeDownloaded">545 MB</str>
<str name="currentFileSizePercent">55.0</str>
<str name="bytesDownloaded">8.16 GB</str>
<str name="totalPercent">73.0</str>
<str name="timeRemaining">166s</str>
<str name="downloadSpeed">18.29 MB</str>
...
4. Use the CLUSTERSTATUS API call to retrieve information about the cluster, including current cluster status:
http://solr01.example.com:8983/solr/admin/collections?action=clusterstatus&wt=json&indent=true
Review the returned information to find the correct replica to remove. An example of the JSON file might appear
as follows:
5. Delete the old replica on solr01.example.com server using the DELETEREPLICA API call:
http://solr01.example.com:8983/solr/admin/collections?action=DELETEREPLICA&collection=email&shard=shard1&replica=core_node2
Managing Spark
Apache Spark is a general framework for distributed computing that offers high performance for both batch and
interactive processing.
To run applications distributed across a cluster, Spark requires a cluster manager. In CDH 6, Cloudera supports only
the YARN cluster manager. When run on YARN, Spark application processes are managed by the YARN ResourceManager
and NodeManager roles. Spark Standalone is no longer supported.
In CDH 6, Cloudera only supports running Spark applications on a YARN cluster manager. The Spark Standalone cluster
manager is not supported.
Related Information
• Spark Guide
• Monitoring Spark Applications on page 340
• Tuning Apache Spark Applications on page 413
• Spark Authentication
• Cloudera Spark forum
• Apache Spark documentation
This section describes how to manage Spark services.
1. On the Home > Status tab, click the drop-down menu to the right of the cluster name and select Add a Service. A
list of service types displays. You can add one type of service at a time.
2. Select the Sqoop 1 Client service and click Continue.
3. Select the services on which the new service should depend. All services must depend on the same ZooKeeper
service. Click Continue.
4. Customize the assignment of role instances to hosts. The wizard evaluates the hardware configurations of the
hosts to determine the best hosts for each role. The wizard assigns all worker roles to the same set of hosts to
which the HDFS DataNode role is assigned. You can reassign role instances.
Click a field below a role to display a dialog box containing a list of hosts. If you click a field containing multiple
hosts, you can also select All Hosts to assign the role to all hosts, or Custom to display the hosts dialog box.
The following shortcuts for specifying hostname patterns are supported:
• Range of hostnames (without the domain portion)
• IP addresses
• Rack name
Click the View By Host button for an overview of the role assignment by hostname ranges.
5. Click Continue. The client configuration deployment command runs.
6. Click Continue and click Finish.
The following sections show how to install the most common JDBC drivers.
Note:
• The JDBC drivers need to be installed only on the machine where Sqoop runs; you do not need
to install them on all hosts in your Hadoop cluster.
• Kerberos authentication is not supported by the Sqoop Connector for Teradata.
• Use the JDBC driver jar that your database server and java version support.
The driver JAR files are placed in the /var/lib/sqoop directory. Create the directory if it does not already exist, and set
its ownership and permissions:
mkdir -p /var/lib/sqoop
chown sqoop:sqoop /var/lib/sqoop
chmod 755 /var/lib/sqoop
Note:
At the time of publication, the latest available version was 5.1.31, but the version may have changed by the time
you read this.
Important:
Make sure you have at least version 5.1.31. Some systems ship with an earlier version
that may not work correctly with Sqoop.
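For example, after downloading the Connector/J archive from the MySQL website (the archive and JAR names shown
are illustrative and vary by version):
tar xzf mysql-connector-java-5.1.31.tar.gz
sudo cp mysql-connector-java-5.1.31/mysql-connector-java-5.1.31-bin.jar /var/lib/sqoop/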
curl -L
'http://download.microsoft.com/download/0/2/A/02AAE597-3865-456C-AE7F-613F99F850A8/sqljdbc_4.0.2206.100_enu.tar.gz'
| tar xz
sudo cp sqljdbc_4.0/enu/sqljdbc4.jar /var/lib/sqoop/
curl -L 'http://jdbc.postgresql.org/download/postgresql-9.2-1002.jdbc4.jar' -o
postgresql-9.2-1002.jdbc4.jar
sudo cp postgresql-9.2-1002.jdbc4.jar /var/lib/sqoop/
jdbc:mysql://<HOST>:<PORT>/<DATABASE_NAME>
Example:
jdbc:mysql://my_mysql_server_hostname:3306/my_database_name
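A minimal Sqoop import using this connection string might look like the following (the credentials, table, and target
directory are illustrative):
sqoop import --connect jdbc:mysql://my_mysql_server_hostname:3306/my_database_name --username my_user --password my_password --table my_table --target-dir /user/my_user/my_table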
jdbc:oracle:thin:@<HOST>:<PORT>:<DATABASE_NAME>
Example:
jdbc:oracle:thin:@my_oracle_server_hostname:1521:my_database_name
jdbc:postgresql://<HOST>:<PORT>/<DATABASE_NAME>
Example:
jdbc:postgresql://my_postgres_server_hostname:5432/my_database_name
jdbc:netezza://<HOST>:<PORT>/<DATABASE_NAME>
Example:
jdbc:netezza://my_netezza_server_hostname:5480/my_database_name
Note: Kerberos authentication is not supported by the Sqoop Connector for Teradata.
CDP does not support Sqoop exports using the Hadoop jar command (the Java API). The connector documentation
from Teradata includes instructions that use this API. CDP users have reportedly mistaken the unsupported
API commands, such as -forcestage, for the supported Sqoop commands, such as --staging-force. Cloudera supports
the use of Sqoop only with commands documented in Using the Cloudera Connector Powered by Teradata. Cloudera
does not support using Sqoop with Hadoop JAR commands, such as those described in the Teradata Connector for
Hadoop Tutorial.
Syntax:
jdbc:teradata://<HOST>/DBS_PORT=1025/DATABASE=<DATABASE_NAME>
Example:
jdbc:teradata://my_teradata_server_hostname/DBS_PORT=1025/DATABASE=my_database_name
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
Note: This page contains references to CDH 5 components or features that have been removed from
CDH 6. These references are only applicable if you are managing a CDH 5 cluster with Cloudera Manager
6. For more information, see Deprecated Items.
CDH supports two versions of the MapReduce computation framework: MRv1 and MRv2, which are implemented by
the MapReduce (MRv1) and YARN (MRv2) services. YARN is backwards-compatible with MapReduce. (All jobs that run
against MapReduce also run in a YARN cluster).
The MapReduce v2 (MRv2) or YARN architecture splits the two primary responsibilities of the JobTracker — resource
management and job scheduling/monitoring — into separate daemons: a global ResourceManager and per-application
ApplicationMasters. With YARN, the ResourceManager and per-host NodeManagers form the data-computation
framework. The ResourceManager service effectively replaces the functions of the JobTracker, and NodeManagers
run on worker hosts instead of TaskTracker daemons. The per-application ApplicationMaster is, in effect, a
framework-specific library and negotiates resources from the ResourceManager and works with the NodeManagers
to run and monitor the tasks. For details of this architecture, see Apache Hadoop NextGen MapReduce (YARN).
• The Cloudera Manager Admin Console has different methods for displaying MapReduce and YARN job history.
See Monitoring MapReduce Jobs on page 309 and Monitoring YARN Applications on page 327.
• For information on configuring the MapReduce and YARN services for high availability, see MapReduce (MRv1)
and YARN (MRv2) High Availability on page 488.
• For information on configuring MapReduce and YARN resource management features, see Resource Management
on page 431.
Once you have migrated to YARN and deleted the MapReduce service, you can remove local data from each TaskTracker
host. The mapred.local.dir parameter is a directory on the local filesystem of each TaskTracker that contains
temporary data for MapReduce. Once the service is stopped, you can remove this directory to free disk space on each
host.
For detailed information on migrating from MapReduce to YARN, see Migrating from MapReduce 1 (MRv1) to MapReduce
2 (MRv2).
MapReduce and YARN use separate sets of configuration files. No files are removed or altered when you change to a
different framework. To change from YARN to MapReduce (or vice versa):
1. (Optional) Configure the new MapReduce or YARN service.
2. Update dependent services to use the chosen framework.
3. Configure the alternatives priority.
4. Redeploy the Oozie ShareLib.
5. Redeploy the client configuration.
6. Start the framework service that you want to switch to.
7. (Optional) Stop the unused framework service to free up the resources it uses.
2. Put the AWS secret in the same .jceks file created in the previous step.
3. Set your hadoop.security.credential.provider.path to the path of the .jceks file in the job configuration
so that the MapReduce framework loads AWS credentials from the .jceks file in HDFS. The following example
shows a Teragen MapReduce job that writes to an S3 bucket.
hadoop jar <path to the Hadoop MapReduce example jar file> teragen \
-Dhadoop.security.credential.provider.path=jceks://hdfs/<hdfs directory>/<file name>.jceks \
100 s3a://<bucket name>/teragen1
You can specify the variables <hdfs directory>, <file name>, <AWS access key id>, and <AWS secret access key>. <hdfs
directory> is the HDFS directory where you store the .jceks file. <file name> is the name of the .jceks file in HDFS.
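For reference, the credential aliases read by the S3A connector can be created with the hadoop credential command.
The following is a sketch only, assuming the same <hdfs directory> and <file name> placeholders; each command
prompts for the corresponding AWS key value:
hadoop credential create fs.s3a.access.key \
-provider jceks://hdfs/<hdfs directory>/<file name>.jceks
hadoop credential create fs.s3a.secret.key \
-provider jceks://hdfs/<hdfs directory>/<file name>.jceks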
To configure Oozie to submit S3 MapReduce jobs, see Configuring Oozie to Enable MapReduce Jobs To Read/Write
from Amazon S3.
Managing YARN
Note: This page contains references to CDH 5 components or features that have been removed from
CDH 6. These references are only applicable if you are managing a CDH 5 cluster with Cloudera Manager
6. For more information, see Deprecated Items.
For an overview of computation frameworks, insight into their usage and restrictions, and examples of common tasks
they perform, see Managing YARN (MRv2) and MapReduce (MRv1) on page 196.
Adding the YARN Service
Minimum Required Role: Cluster Administrator (also provided by Full Administrator)
1. On the Home > Status tab, click
to the right of the cluster name and select Add a Service. A list of service types displays. You can add one type of
service at a time.
2. Select YARN (MR2 Included) and click Continue.
3. Select the services on which the new service should depend. All services must depend on the same ZooKeeper
service. Click Continue.
4. Customize the assignment of role instances to hosts. The wizard evaluates the hardware configurations of the
hosts to determine the best hosts for each role. The wizard assigns all worker roles to the same set of hosts to
which the HDFS DataNode role is assigned. You can reassign role instances.
Click a field below a role to display a dialog box containing a list of hosts. If you click a field containing multiple
hosts, you can also select All Hosts to assign the role to all hosts, or Custom to display the hosts dialog box.
The following shortcuts for specifying hostname patterns are supported:
• Range of hostnames (without the domain portion)
• IP addresses
• Rack name
Click the View By Host button for an overview of the role assignment by hostname ranges.
The following list shows, for each container memory setting, the Cloudera Manager property name, the CDH property
name, the default configuration, and the Cloudera tuning guideline:
• Container Memory Minimum (yarn.scheduler.minimum-allocation-mb) - Default: 1 GB; Tuning guideline: 0
• Container Memory Maximum (yarn.scheduler.maximum-allocation-mb) - Default: 64 GB; Tuning guideline: amount of memory on the largest host
• Container Memory Increment (yarn.scheduler.increment-allocation-mb) - Default: 512 MB; Tuning guideline: use a fairly large value, such as 128 MB
• Container Memory (yarn.nodemanager.resource.memory-mb) - Default: 8 GB; Tuning guideline: 8 GB
Configuring Directories
Minimum Required Role: Cluster Administrator (also provided by Full Administrator)
Warning:
In addition to importing configuration settings, the import process:
• Configures services to use YARN as the MapReduce computation framework instead of MapReduce.
• Overwrites existing YARN configuration and role assignments.
You can import MapReduce configurations to YARN as part of the upgrade wizard. If you do not import configurations
during upgrade, you can manually import the configurations at a later time:
1. Go to the YARN service page.
2. Stop the YARN service.
3. Select Actions > Import MapReduce Configuration. The import wizard presents a warning letting you know that
it will import your configuration, restart the YARN service and its dependent services, and update the client
configuration.
4. Click Continue to proceed. The next page indicates some additional configuration required by YARN.
5. Verify or modify the configurations and click Continue. The Switch Cluster to MR2 step proceeds.
6. When all steps have been completed, click Finish.
7. (Optional) Remove the MapReduce service.
a. Click the Cloudera Manager logo to return to the Home page.
b. In the MapReduce row, right-click
To apply this configuration property to other role groups as needed, edit the value for the appropriate role group.
See Modifying Configuration Properties Using Cloudera Manager on page 74.
6. Enter a Reason for change, and then click Save Changes to commit the changes.
7. Restart the YARN service.
The Fair Scheduler does not support the Dominant Resource Calculator. The fair share policy that the Fair Scheduler
uses considers only memory for the fairShare and minShare calculations; therefore, GPU devices are allocated from a
common pool.
Install NVIDIA GPU with Ubuntu Linux
The following is an example of how to set up the NVIDIA drivers and NVIDIA-smi. This example is specific to Ubuntu
16.04 or later.
1. Add repository for the driver and the toolkit:
2. Update repositories:
4. Install the NVIDIA toolkit which contains the GPU management tools like NVIDIA-smi:
sudo reboot
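The commands behind these steps might look like the following sketch. The PPA, package names, and driver version
are assumptions and vary by Ubuntu release; consult the NVIDIA documentation for the exact packages:
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt-get install nvidia-<driver version> nvidia-utils-<driver version>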
Enable GPU
16. Find NodeManager GPU Detection Executable and define the location of nvidia-smi.
By default this property has no value, which means that YARN will check the following paths to find nvidia-smi:
• /usr/bin
• /bin
• /usr/local/nvidia/bin
17. Click Save Changes.
18. Select Hosts / All Hosts.
19. Find the host with the ResourceManager role and then click the ResourceManager role.
20. Click the Process tab.
21. Open the fair-scheduler.xml and copy its content.
22. Return to the Home page by clicking the Cloudera Manager logo.
23. Select the YARN service.
24. Click the Configuration tab.
• To achieve a fairer GPU allocation, set schedulingPolicy to drf (Dominant Resource Fairness)
for the queues to which you want to allocate GPUs:
<schedulingPolicy>drf</schedulingPolicy>
Important: When Fair Scheduler XML Advanced Configuration Snippet (Safety Valves) is used,
it overwrites the Dynamic Resource Pool UI.
For more information about the Allocation file format, see YARN documentation on Fair Scheduler.
29. Click Save changes.
30. Click the Stale Service Restart icon next to the service to invoke the cluster restart wizard.
As a result, the changes in the stale configuration are displayed.
In container-executor.cfg:
[gpu]
module.enabled = true
[cgroups]
root = /var/lib/yarn-ce/cgroups
yarn-hierarchy = /hadoop-yarn
In yarn-site.xml:
<name>yarn.resource-types</name>
<value>yarn.io/gpu</value>
</property>
<property>
<name>yarn.resource-types</name>
<value>yarn.io/gpu</value>
</property>
<property>
<name>yarn.nodemanager.resource-plugins</name>
<value>yarn.io/gpu</value>
</property>
<property>
<name>yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices</name>
<value>auto</value>
</property>
<property>
<name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
- <value>org.apache.hadoop.yarn.server.nodemanager.util.DefaultLCEResourcesHandler</value>
+ <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
<property>
<name>yarn.nodemanager.linux-container-executor.cgroups.hierarchy</name>
<value>{{CGROUP_GROUP_CPU}}/hadoop-yarn</value>
</property>
<property>
<name>yarn.nodemanager.linux-container-executor.cgroups.mount-path</name>
<value>/var/lib/yarn-ce/cgroups</value>
</property>
<property>
In System Resources:
@@ -1,4 +1,7 @@
{"dynamic": true, "directory": null, "file": null, "tcp_listen": null, "cpu": {"shares":
1024}, "named_cpu": null, "io": null, "memory": null, "rlimits": null, "contents":
null, "install": null, "named_resource": null}
{"dynamic": true, "directory": null, "file": null, "tcp_listen": null, "cpu": null,
"named_cpu": null, "io": {"weight": 500}, "memory": null, "rlimits": null, "contents":
null, "install": null, "named_resource": null}
{"dynamic": false, "directory": null, "file": null, "tcp_listen": null, "cpu": null,
"named_cpu": null, "io": null, "memory": {"soft_limit": -1, "hard_limit": -1}, "rlimits":
null, "contents": null, "install": null, "named_resource": null}
... ...
@@ -6,4 +9,7 @@
{"dynamic": false, "directory": null, "file": null, "tcp_listen": null, "cpu": {"shares":
1024}, "named_cpu": null, "io": null, "memory": null, "rlimits": null, "contents":
null, "install": null, "named_resource": {"type": "cpu", "name": "hadoop-yarn", "user":
"yarn", "group": "hadoop", "mode": 509, "cpu": null, "blkio": null, "memory": null}}
{"dynamic": false, "directory": null, "file": null, "tcp_listen": null, "cpu": null,
"named_cpu": null, "io": null, "memory": null, "rlimits": null, "contents": null,
"install": null, "named_resource": {"type": "devices", "name": "hadoop-yarn", "user":
"yarn", "group": "hadoop", "mode": 509, "cpu": null, "blkio": null, "memory": null}}
{"dynamic": false, "directory": {"path": "/var/lib/yarn-ce/cgroups", "user": "yarn",
"group": "hadoop", "mode": 509, "bytes_free_warning_threshhold_bytes": 0}, "file": null,
"tcp_listen": null, "cpu": null, "named_cpu": null, "io": null, "memory": null,
"rlimits": null, "contents": null, "install": null, "named_resource": null}
Important: When Fair Scheduler XML Advanced Configuration Snippet (Safety Valves) is
used it overwrites the Dynamic Resource Pool UI.
For more information about the Allocation file format, see YARN documentation on FairScheduler.
<name>yarn.resource-types</name>
<value>fpga</value>
</property>
<property>
<name>yarn.resource-types.fpga.minimum-allocation</name>
<value>0</value>
</property>
<property>
<name>yarn.resource-types.fpga.maximum-allocation</name>
<value>10</value>
</property>
<property>
In yarn-site.xml:
<name>yarn.resource-types</name>
<value>fpga</value>
</property>
<property>
<name>yarn.resource-types.fpga.minimum-allocation</name>
<value>0</value>
</property>
<property>
<name>yarn.resource-types.fpga.maximum-allocation</name>
<value>10</value>
</property>
<property>
<name>yarn.resource-types</name>
<value>fpga</value>
</property>
<property>
<name>yarn.nodemanager.resource-type.fpga</name>
<value>2</value>
</property>
<property>
Note: You cannot use the wildcard (*) character along with a list of users and/or groups in
the same ACL. If you use the wildcard character it must be the only item in the ACL.
Note: In all cases where a single space is required, you will see: <single space>.
• Users only
user1,user2,userN
Use a comma-separated list of user names. Do not place spaces after the commas separating the users in the list.
• Groups only
<single space>HR,marketing,support
You must begin group-only ACLs with a single space. Group-only ACLs use the same syntax as users, except each
entry is a group name rather than user name.
• Users and Groups
fred,alice,haley<single space>datascience,marketing,support
A comma-separated list of user names, followed by a single space, followed by a comma-separated list of group
names. This sample ACL authorizes access to users “fred”, “alice”, and “haley”, and to those users in the groups
“datascience”, “marketing”, and “support”.
Examples
The following ACL entry authorizes access only to the members of “my_group”:
<single space>my_group
The following ACL authorizes access to the users “john”, “jane”, and the group “HR”:
john,jane<single space>HR
In this example, six groups (“group_1” through “group_6”) are defined in the system. The following ACL authorizes
access to a subset of the defined groups, allowing access to all members of groups 1 through 5 (and implicitly denies
access to members of the group “group_6”):
<single space>group_1,group_2,group_3,group_4,group_5
Important: See YARN Admin ACL on page 209 before activating YARN ACLs, because you must configure
the YARN Admin ACL before you enable ACL enforcement.
In a default Cloudera Manager managed YARN deployment, ACL checks are turned on but do not provide any security,
which means that any user can execute administrative commands or submit an application to any YARN queue. To
provide security the ACL must be changed from its default value, the wildcard character (*).
In non-Cloudera Manager managed clusters, the default YARN ACL setting is false: ACLs are turned off and provide no
security out of the box.
Activate YARN ACLs via the yarn.acl.enable property (values are either true or false):
<property>
<name>yarn.acl.enable</name>
<value>true</value>
</property>
YARN ACLs are independent of HDFS or protocol ACLs, which secure communications between clients and servers at
a low level.
YARN ACL Types
This section describes the types of YARN ACLs available for use:
• YARN Admin ACL on page 209
(yarn.admin.acl)
• Queue ACL on page 210
(aclSubmitApps and aclAdministerApps)
• Application ACL
(mapreduce.job.acl-view-job and mapreduce.job.acl-modify-job)
Important: The YARN Admin ACL is triggered and applied only when you run YARN sub-commands
via yarn rmadmin <cmd>. If you run other YARN commands via the YARN command line (for example,
starting the ResourceManager or NodeManager), it does not trigger the YARN Admin ACL check or provide
the same level of security.
The default YARN Admin ACL is set to the wildcard character (*), meaning all users and groups have YARN Administrator
access and privileges. So after YARN ACL enforcement is enabled, (via the yarn.acl.enable property) every user
has YARN ACL Administrator access. Unless you wish for all users to have YARN Admin ACL access, edit the
yarn.admin.acl setting upon initial YARN configuration, and before enabling YARN ACLs.
A typical YARN Admin ACL looks like the following, where the system's Hadoop administrator and multiple groups are
granted access:
hadoopadmin<space>yarnadmgroup,hadoopadmgroup
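Expressed as a yarn-site.xml property (a sketch; the literal space between the user list and the group list replaces
<space> above), this would be:
<property>
<name>yarn.admin.acl</name>
<value>hadoopadmin yarnadmgroup,hadoopadmgroup</value>
</property>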
Queue ACL
Use Queue ACLs to identify and control which users and/or groups can take actions on particular queues. Configure
Queue ACLs using the aclSubmitApps and aclAdministerApps properties, which are set per queue. Queue ACLs are
scheduler dependent, and the implementation and enforcement differ per scheduler type.
Note: Cloudera only supports the Fair Scheduler in CDH. Cloudera does not support Scheduler
Reservations (including aclAdministerReservations, aclListReservations, and
aclSubmitReservations) and their related ACLs. For details, see YARN Unsupported Features.
Unlike the YARN Admin ACL, Queue ACLs are not enabled and enforced by default. Instead, you must explicitly enable
Queue ACLs. Queue ACLs are defined, per queue, in the Fair Scheduler configuration. By default, neither of the Queue
ACL property types is set on any queue, and access is allowed or open to any user.
Important: The users and groups defined in the yarn.admin.acl are considered to be part of the
Queue ACL, aclAdministerApps. So any user or group that is defined in the yarn.admin.acl can
submit to any queue and kill any running application in the system.
Following is an example of a Queue ACL with both types defined. Note that the single space in aclAdministerApps
indicates a group-only rule:
<queue name="Marketing">
<aclSubmitApps>john,jane</aclSubmitApps>
<aclAdministerApps><single space>others</aclAdministerApps>
</queue>
As mentioned earlier, applications are scheduled on leaf queues only. You specify queues as children of other queues
by placing them as sub-elements of their parents in the Fair Scheduler allocation file (fair-scheduler.xml). The
default Queue ACL setting for all parent and leaf queues is “ “ (a single space), which means that by default, no one
can access any of these queues.
Queue ACL inheritance is enforced by assessing the ACLs defined in the queue hierarchy in a bottom-up order to the
root queue. So within this hierarchy, access evaluations start at the level of the bottom-most leaf queue. If the ACL
does not provide access, then the parent Queue ACL is checked. These evaluations continue upward until the root
queue is checked.
Queue ACLs do not interact directly with the placement policy rules (the rules that determine the pools to which
applications and queries are assigned) and are not part of the placement policy rules, which are executed before the
ACLs are checked. The policy rules return a final result in the form of a queue name. The queue is then evaluated for
access, as described earlier. The Queue ACL allows or denies access to this final queue, which means that an application
can be rejected even if the placement policy returns a queue.
Important:
In all YARN systems, the default setting for the root queue is reversed compared to all other queues: the
root queue has a default setting of "*", which means everyone has access.
So even when the Queue ACLs are turned on by default, everyone has access because the root queue
ACL is inherited by all the leaf queues.
Best practice: A best practice for securing an environment is to set the root queue aclSubmitApps ACL to <single
space>, and specify a limited set of users and groups in aclAdministerApps. Set the ACLs for all other queues to
provide submit or administrative access as appropriate.
The order in which the two types of Queue ACLs are evaluated is always:
1. aclSubmitApps
2. aclAdministerApps
The following diagram shows the evaluation flow for Queue ACLs:
Application ACLs
Use Application ACLs to provide a user or group, other than the owner, with access to an application. The most
common use case for Application ACLs occurs when you have a team of users collaborating on or managing a set of
applications, and you need to provide read access to logs and job statistics, or access to allow for the modification of
a job (killing the job) and/or application. Application ACLs are set per application and are managed by the application
owner.
Users who start an application (the owners) always have access to the application they start, which includes the
application logs, job statistics, and ACLs. No other user can remove or change owner access. By default, no other users
have access to the application data because the Application ACL defaults to “ “ (single space), which means no one has
access.
MapReduce
Create and use the following MapReduce Application ACLs to view YARN logs:
• mapreduce.job.acl-view-job
Provides read access to the MapReduce history and the YARN logs.
• mapreduce.job.acl-modify-job
Provides the same access as mapreduce.job.acl-view-job, and also allows the user to modify a running job.
Note: Job modification is currently limited to killing the job. No other YARN system modifications
are supported.
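As an illustration only, these ACLs can be passed on the command line when submitting a MapReduce job that
implements the Hadoop Tool interface; the user and group names are placeholders, and the leading space in the
modify ACL value marks it as a group-only list:
hadoop jar <path to job jar> <main class> \
-Dmapreduce.job.acl-view-job="jane,bob support" \
-Dmapreduce.job.acl-modify-job=" support" \
<job arguments>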
During a search or other activities, you may come across the following two legacy settings from MapReduce; they are
not supported by YARN. Do not use them:
• mapreduce.cluster.acls.enabled
• mapreduce.cluster.administrators
Spark
Spark ACLs follow a slightly different format, using a separate property for users and groups. Both user and group lists
use a comma-separated list of entries. The wildcard character “*” allows access to anyone, and the single space “ “
allows access to no one. Enable Spark ACLs using the property spark.acls.enable, which is set to false by default
(not enabled) and must be changed to true to enforce ACLs at the Spark level.
Create and use the following Application ACLs for the Spark application:
• Set spark.acls.enable to true (default is false).
• Set spark.admin.acls and spark.admin.acls.groups for administrative access to all Spark applications.
• Set spark.ui.view.acls and spark.ui.view.acls.groups for view access to the specific Spark application.
• Set spark.modify.acls and spark.modify.acls.groups for administrative access to the specific Spark
application.
Refer to Spark Security and Spark Configuration Security for additional details.
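A sketch of how these properties might be supplied at submission time; the class, user, and group names are
placeholders:
spark-submit --class <main class> \
--conf spark.acls.enable=true \
--conf spark.admin.acls=hadoopadmin \
--conf spark.admin.acls.groups=hadoopadmgroup \
--conf spark.ui.view.acls=jane \
--conf spark.modify.acls=john \
<application jar>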
For clusters that do not have log aggregation, logs for running applications are kept on the node where the container
runs. You can access these logs through the ResourceManager and NodeManager web interfaces, which perform the ACL
checks.
Killing an Application
The Application ACL mapreduce.job.acl-modify-job determines whether or not a user can modify a job, but in
the context of YARN, this only allows the user to kill an application. The kill action is application agnostic and part of
the YARN framework. Other application types, like MapReduce or Spark, implement their own kill action independent
of the YARN framework. MapReduce provides the kill actions via the mapred command.
For YARN, the following three groups of users are allowed to kill a running application:
• The application owner
• A cluster administrator defined in yarn.admin.acl
• A queue administrator defined in aclAdministerApps for the queue in which the application is running
Note that for the queue administrators, ACL inheritance applies, as described earlier.
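The kill itself is issued through the standard CLI tools; for example, using the application ID from the scenario below:
yarn application -kill application_1536220066338_0002
For MapReduce jobs, the equivalent mapred command is:
mapred job -kill <job ID>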
The following diagram shows a sample queue structure, starting with leaf queues on the bottom, up to root queue at
the top; use it to follow the examples of killing an application and viewing a log:
Example: Moving the Application and Viewing the Log in the Queue "Test"
For this Application ACL evaluation flow example, assume the following for application_1536220066338_0002
running in the queue "Test":
• Application owner: John
• "Marketing" and "Dev" queue administrator: Jane
• Jane has log view rights via the mapreduce.job.acl-view-job ACL
• YARN cluster administrator: Bob
In this use case, John attempts to view the logs for his job, which is allowed because he is the application owner.
Jane attempts to access application_1536220066338_0002 in the queue "Test" to move the application to the
"Marketing" queue. She is denied access to the "Test" queue via the queue ACLs–so she cannot submit to or administer
the queue "Test". She is also unable to kill a job running in queue "Test". She then attempts to access the logs for
application_1536220066338_0002 and is allowed access via the mapreduce.job.acl-view-job ACL.
Bob attempts to access application_1536220066338_0002 in the queue "Test" to move the application to the
"Marketing" queue. As the YARN cluster administrator, he has access to all queues and can move the application.
Note: Permissions on the log files are also set at the filesystem level and are enforced by the filesystem:
the filesystem can block you from accessing the file, which means that you cannot open or read the file
to check the ACLs that are contained in the file.
Managing MapReduce
Note: This page contains references to CDH 5 components or features that have been removed from
CDH 6. These references are only applicable if you are managing a CDH 5 cluster with Cloudera Manager
6. For more information, see Deprecated Items.
For an overview of computation frameworks, insight into their usage and restrictions, and examples of common tasks
they perform, see Managing YARN (MRv2) and MapReduce (MRv1) on page 196.
Configuring the MapReduce Scheduler
Minimum Required Role: Configurator (also provided by Cluster Administrator, Full Administrator)
The MapReduce service is configured by default to use the FairScheduler. You can change the scheduler type to FIFO
or Capacity Scheduler. You can also modify the Fair Scheduler and Capacity Scheduler configuration. For further
information on schedulers, see YARN (MRv2) and MapReduce (MRv1) Schedulers on page 449.
Managing ZooKeeper
Note: This page contains references to CDH 5 components or features that have been removed from
CDH 6. These references are only applicable if you are managing a CDH 5 cluster with Cloudera Manager
6. For more information, see Deprecated Items.
This topic describes how to add, remove, and replace ZooKeeper roles.
com.cloudera.cmf.command.CmdExecException:java.lang.RuntimeException:
java.lang.IllegalStateException: Assumption violated:
getAllDependencies returned multiple distinct services of the same type
at SeqFlowCmd.java line 120
in com.cloudera.cmf.command.flow.SeqFlowCmd run()
CDH services that are not dependent can use different ZooKeeper services. For example, Kafka does not depend on
any services other than ZooKeeper. You might have one ZooKeeper service for Kafka, and one ZooKeeper service for
the rest of your CDH services.
Note: If the data directories are not initialized, the ZooKeeper servers cannot be started.
In a production environment, you should deploy ZooKeeper as an ensemble with an odd number of servers. As long
as a majority of the servers in the ensemble are available, the ZooKeeper service will be available. The minimum
recommended ensemble size is three ZooKeeper servers, and Cloudera recommends that each server run on a separate
machine. In addition, the ZooKeeper server process should have its own dedicated disk storage if possible.
Replacing a ZooKeeper Role Using Cloudera Manager with ZooKeeper Service Downtime
Minimum Required Role: Full Administrator
1. Go to ZooKeeper Instances.
2. Stop the ZooKeeper role on the old host.
3. Remove the ZooKeeper role from old host on the ZooKeeper Instances page.
4. Add a new ZooKeeper role on the new host.
5. Restart the old ZooKeeper servers that have outdated configuration.
6. Confirm the ZooKeeper service has elected one of the restarted hosts as a leader on the ZooKeeper Status page.
See Confirming the Election Status of a ZooKeeper Service.
Replacing a ZooKeeper Role Using Cloudera Manager without ZooKeeper Service Downtime
Minimum Required Role: Full Administrator
Note: This process is valid only if SASL authentication is not turned on between the ZooKeeper
servers. You can check this in Cloudera Manager > ZooKeeper > Configuration > Enable Server to
Server SASL Authentication.
1. Go to ZooKeeper Instances.
2. Stop the ZooKeeper role on the old host.
3. Confirm the ZooKeeper service has elected one of the remaining hosts as a leader on the ZooKeeper Status page.
See Confirming the Election Status of a ZooKeeper Service.
4. On the ZooKeeper Instances page, remove the ZooKeeper role from the old host.
5. Add a new ZooKeeper role on the new host.
6. Change the individual configuration of the newly added ZooKeeper role so that it has the highest ZooKeeper Server
ID set in the cluster.
7. Go to ZooKeeper > Instances and click the newly added Server instance.
8. On the individual Server page, select Start this Server from the Actions dropdown menu to start the new ZooKeeper
role.
Note: If you try to start the role from anywhere else, you may see an error message.
9. On the ZooKeeper Status page, confirm that there is a leader and all other hosts are followers.
10. Restart the ZooKeeper server that has an outdated configuration and is a follower.
11. Restart the leader ZooKeeper server that has an outdated configuration.
12. Confirm that a leader has been elected after the restart and that the whole ZooKeeper service is healthy (green).
13. Restart/rolling restart any dependent services such as HBase, HDFS, YARN, Hive, or other services that are marked
to have stale configuration.
Trying 10.1.2.154...
Connected to server.example.org.
Escape character is '^]'.
stat
Zookeeper version: 3.4.5-cdh5.4.4--1, built on 07/06/2015 23:54 GMT
...
Mode: follower
Oozie
1. Go to /var/lib/oozie on each Oozie server and, even if the LZO JAR is present, symlink the Hadoop LZO JAR:
• CDH 5 - /opt/cloudera/parcels/GPLEXTRAS/lib/hadoop/lib/hadoop-lzo.jar
• CDH 4 - /opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/hadoop-lzo.jar
2. Restart Oozie.
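For CDH 5, the symlink created in step 1 might look like the following sketch; the link name is an assumption:
sudo ln -s /opt/cloudera/parcels/GPLEXTRAS/lib/hadoop/lib/hadoop-lzo.jar /var/lib/oozie/hadoop-lzo.jar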
HBase
Restart HBase.
Impala
Restart Impala.
Hive
Restart the Hive server.
Sqoop 1
1. Add the following entries to the Sqoop 1 Client Client Advanced Configuration Snippet (Safety Valve)
• HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/opt/cloudera/parcels/GPLEXTRAS/lib/hadoop/lib/
• JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:/opt/cloudera/parcels/GPLEXTRAS/lib/hadoop/lib/native
2. Re-deploy the client configuration.
Managing Hosts
Cloudera Manager provides a number of features that let you configure and manage the hosts in your clusters.
The Hosts screen has the following tabs:
• You can also search for hosts by selecting a value from the facets in the Filters section at the left of the page.
• If the Configuring Agent Heartbeat and Health Status Options on page 47 are configured as follows:
– Send Agent heartbeat every x
– Set health status to Concerning if the Agent heartbeats fail y
– Set health status to Bad if the Agent heartbeats fail z
The value v for a host's Last Heartbeat facet is computed as follows:
– v < x * y = Good
– v >= x * y and <= x * z = Concerning
– v >= x * z = Bad
Disks Overview
Click the Disks Overview tab to display an overview of the status of all disks in the deployment. The statistics exposed
match or build on those in iostat, and are shown in a series of histograms that by default cover every physical disk
in the system.
Adjust the endpoints of the time line to see the statistics for different time periods. Specify a filter in the box to limit
the displayed data. For example, to see the disks for a single rack rack1, set the filter to: logicalPartition =
false and rackId = "rack1" and click Filter. Click a histogram to drill down and identify outliers. Mouse over
the graph and click to display additional information about the chart.
Status
The Status page is displayed when a host is initially selected and provides summary information about the status of
the selected host. Use this page to gain a general understanding of work being done by the system, the configuration,
and health status.
If this host has been decommissioned or is in maintenance mode, the corresponding status icon appears in the top
bar of the page next to the status message.
Details
This panel provides basic system configuration such as the host's IP address, rack, health status summary, and disk
and CPU resources. This information summarizes much of the detailed information provided in other panes on this
tab. To view details about the Host agent, click the Host Agent link in the Details section.
Health Tests
Cloudera Manager monitors a variety of metrics that are used to indicate whether a host is functioning as expected.
The Health Tests panel shows health test results in an expandable/collapsible list, typically with the specific metrics
that the test returned. (You can Expand All or Collapse All from the links at the upper right of the Health Tests panel).
• The color of the text (and the background color of the field) for a health test result indicates the status of the
results. The tests are sorted by their health status – Good, Concerning, Bad, or Disabled. The list of entries for
good and disabled health tests are collapsed by default; however, Bad or Concerning results are shown expanded.
• The text of a health test also acts as a link to further information about the test. Clicking the text will pop up a
window with further information, such as the meaning of the test and its possible results, suggestions for actions
you can take or how to make configuration changes related to the test. The help text for a health test also provides
a link to the relevant monitoring configuration section for the service. See Configuring Monitoring Settings on
page 280 for more information.
Health History
The Health History provides a record of state transitions of the health tests for the host.
• Click the arrow symbol at the left to view the description of the health test state change.
• Click the View link to open a new page that shows the state of the host at the time of the transition. In this view
some of the status settings are greyed out, as they reflect a time in the past, not the current status.
File Systems
The File systems panel provides information about disks, their mount points and usage. Use this information to determine
if additional disk space is required.
Roles
Use the Roles panel to see the role instances running on the selected host, as well as each instance's status and health.
Hosts are configured with one or more role instances, each of which corresponds to a service. The role indicates which
daemon runs on the host. Examples of roles include NameNode, Secondary NameNode, Balancer, JobTracker,
DataNode, RegionServer, and so on. Typically a host will run multiple roles in support of the various services running
in the cluster.
Clicking the role name takes you to the role instance's status page.
You can delete a role from the host from the Instances tab of the Service page for the parent service of the role. You
can add a role to a host in the same way. See Role Instances on page 265.
Charts
Charts are shown for each host instance in your cluster.
See Viewing Charts for Cluster, Service, Role, and Host Instances on page 279 for detailed information on the charts
that are presented, and the ability to search and display metrics of your choice.
Processes
The Processes page provides information about each of the processes that are currently running on this host. Use this
page to access management web UIs, check process status, and access log information.
Note: The Processes page may display exited startup processes. Such processes are cleaned up within
a day.
Resources
The Resources page provides information about the resources (CPU, memory, disk, and ports) used by every service
and role instance running on the selected host.
Each entry on this page lists:
• The service name
• The name of the particular instance of this service
• A brief description of the resource
• The amount of the resource being consumed or the settings for the resource
The resource information provided depends on the type of resource:
Commands
The Commands page shows you running or recent commands for the host you are viewing. See Viewing Running and
Recent Commands on page 301 for more information.
Configuration
Minimum Required Role: Full Administrator
The Configuration page for a host lets you set properties for the selected host. You can set properties in the following
categories:
• Advanced - Advanced configuration properties. These include the Java Home Directory, which explicitly sets the
value of JAVA_HOME for all processes. This overrides the auto-detection logic that is normally used.
• Monitoring - Monitoring properties for this host. The monitoring settings you make on this page will override the
global host monitoring settings you make on the Configuration tab of the Hosts page. You can configure monitoring
properties for:
– health check thresholds
– the amount of free space on the filesystem containing the Cloudera Manager Agent's log and process directories
– a variety of conditions related to memory usage and other properties
– alerts for health check events
For some monitoring properties, you can set thresholds as either a percentage or an absolute value (in bytes).
• Other - Other configuration properties.
• Parcels - Configuration properties related to parcels. Includes the Parcel Directory property, the directory that
parcels will be installed into on this host. If the parcel_dir variable is set in the Agent's config.ini file, it will
override this value.
• Resource Management - Enables resource management using control groups (cgroups).
For more information, see the description for each property or see Modifying Configuration Properties Using Cloudera
Manager on page 74.
Components
The Components page lists every component installed on this host. This may include components that have been
installed but have not been added as a service (such as YARN, Flume, or Impala).
This includes the following information:
• Component - The name of the component.
• Version - The version of CDH from which each component came.
• Component Version - The detailed version number for each component.
Audits
The Audits page lets you filter for audit events related to this host. See Lifecycle and Security Auditing on page 363 for
more information.
Charts Library
The Charts Library page for a host instance provides charts for all metrics kept for that host instance, organized by
category. Each category is collapsible/expandable. See Viewing Charts for Cluster, Service, Role, and Host Instances
on page 279 for more information.
Important: As of February 1, 2021, all downloads of CDH and Cloudera Manager require a username
and password and use a modified URL. You must use the modified URL, including the username and
password, when downloading the repository contents described below. You may need to upgrade
Cloudera Manager to a newer version that uses the modified URLs.
This can affect new installations, upgrades, adding new hosts to a cluster, downloading a new parcel,
and adding a cluster.
For more information, see Updating an existing CDH/Cloudera Manager deployment to access
downloads with authentication.
You can add one or more hosts to your cluster using the Add Hosts wizard, which installs the Oracle JDK, CDH, and
Cloudera Manager Agent software. After the software is installed and the Cloudera Manager Agent is started, the
Agent connects to the Cloudera Manager Server and you can use the Cloudera Manager Admin Console to manage
and monitor CDH on the new host.
The Add Hosts wizard does not create roles on the new host; once you have successfully added the host(s) you can
either add roles, one service at a time, or apply a host template, which can define role configurations for multiple roles.
Important:
• Unqualified hostnames (short names) must be unique in a Cloudera Manager instance. For
example, you cannot have both host01.example.com and host01.standby.example.com managed
by the same Cloudera Manager Server.
• All hosts in a single cluster must be running the same version of CDH.
• When you add a new host, you must install the same version of CDH to enable the new host to
work with the other hosts in the cluster. The installation wizard lets you select the version of CDH
to install, and you can choose a custom repository to ensure that the version you install matches
the version on the other hosts.
• If you are managing multiple clusters, select the version of CDH that matches the version in use
on the cluster where you plan to add the new host.
• When you add a new host, the following occurs:
– YARN topology.map is updated to include the new host
– Any service that includes topology.map in its configuration—Flume, Hive, Hue, Oozie, Solr,
Spark, Sqoop 2, YARN—is marked stale
At a convenient point after adding the host you should restart the stale services to pick up the
new configuration.
Important: This step temporarily puts the existing cluster hosts in an unmanageable state; they are
still configured to use TLS and so cannot communicate with the Cloudera Manager Server. Roles on
these hosts continue to operate normally, but Cloudera Manager is unable to detect errors and issues
in the cluster and reports all hosts as being in bad health. To work around this issue, you can manually
install the Cloudera Manager Agent on the new host. See Alternate Method of Installing Cloudera
Manager Agent without Disabling TLS on page 230.
OS Command
RHEL sudo scp mynode.example.com:/etc/yum.repos.d/cloudera-manager.repo /etc/yum.repos.d/cloudera-manager.repo
2. Remove cached package lists and other transient data by running the following command:
OS Command
RHEL sudo yum clean all
3. Install the Oracle JDK package from the Cloudera Manager repository. Install the same version as is used on other
cluster hosts. Only JDK 1.8 is supported:
OS Command
RHEL sudo yum install jdk1.8.0_144-cloudera
4. Set up the TLS certificates using the same procedure that was used to set them up on other cluster hosts. See
Manually Configuring TLS Encryption for Cloudera Manager. If you have set up a custom truststore, copy that file
from an existing host to the same location on the new host.
5. Install the Cloudera Manager Agent:
OS Command
RHEL sudo yum install cloudera-manager-agent
6. Copy the Cloudera Manager Agent configuration file from an existing cluster host that is already configured for
TLS to the same location on the new host. For example:
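A sketch of this copy, assuming the same source host as in step 1 and the default Agent configuration path:
sudo scp mynode.example.com:/etc/cloudera-scm-agent/config.ini /etc/cloudera-scm-agent/config.ini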
7. Create and secure the file containing the password used to protect the private key of the Agent:
a. Use a text editor to create a file called agentkey.pw that contains the password. Save the file in the
/etc/cloudera-scm-agent directory.
b. Change ownership of the file to root:
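For example (a sketch; the path follows from the previous step):
sudo chown root:root /etc/cloudera-scm-agent/agentkey.pw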
9. Log in to Cloudera Manager and go to Hosts > All Hosts page and verify that the new host is recognized by Cloudera
Manager.
5. If the cluster uses Kerberos authentication, ensure that the Kerberos packages are installed on the new hosts. If
necessary, use the package commands provided on the Add Hosts screen to install these packages.
6. Select the Repository Location where Cloudera Manager can find the software to install on the new hosts. Select
Public Cloudera Repository or Custom Repository and enter the URL of a custom repository available on your
local network. See Cloudera Manager 6 Version and Download Information for a list of public repositories.
7. Follow the instructions in the wizard to install the Oracle JDK.
8. Enter Login Credentials:
a. Select root for the root account, or select Another user and enter the username for an account that has
password-less sudo privileges.
b. Select an authentication method:
• If you choose password authentication, enter and confirm the password.
• If you choose public-key authentication, provide a passphrase and path to the required key files.
You can modify the default SSH port if necessary.
c. Specify the maximum number of host installations to run at once. The default and recommended value is 10.
You can adjust this based on your network capacity.
d. Click Continue.
The Install Agents page displays and Cloudera Manager installs the Agent software on the new hosts.
e. When the agent installation finishes, click Continue.
9. The Host Inspector runs and displays any problems with the hosts. Correct the problems before continuing.
10. After correcting any problems, click Continue.
Enable Kerberos
If you have previously enabled Kerberos on your cluster:
1. Install the packages required to kinit on the new host (see the list in Getting Started).
2. If you have set up Cloudera Manager to manage krb5.conf, it will automatically deploy the file on the new host.
Note that Cloudera Manager will deploy krb5.conf only if you use the Kerberos wizard. If you have used the
API, you will need to manually perform the commands that the wizard calls.
If Cloudera Manager does not manage krb5.conf, you must manually update the file at /etc/krb5.conf.
Cloudera Manager supports nested rack specifications. For example, you could specify the rack /rack3, or
/group5/rack3 to indicate the third rack in the fifth group. All hosts in a cluster must have the same number of path
components in their rack specifications.
To specify racks for hosts:
1. Click the Hosts tab.
2. Check the checkboxes next to the host(s) for a particular rack, such as all hosts for /rack123.
3. Click Actions for Selected (n) > Assign Rack, where n is the number of selected hosts.
4. Enter a rack name or ID that starts with a slash /, such as /rack123 or /aisle1/rack123, and then click Confirm.
5. Optionally restart affected services. Rack assignments are not automatically updated for running services.
Host Templates
Minimum Required Role: Full Administrator
Host templates let you designate a set of role groups that can be applied in a single operation to a host or a set of
hosts. This significantly simplifies the process of configuring new hosts when you need to expand your cluster.
Important: A host template can only be applied on a host with a version of CDH that matches the
CDH version running on the cluster to which the host template belongs.
You can create and manage host templates under the Templates tab from the Hosts page.
1. Click the Hosts tab on the main Cloudera Manager navigation bar.
2. Click the Templates tab on the Hosts page.
Templates are not required; Cloudera Manager assigns roles and role groups to the hosts of your cluster when you
perform the initial cluster installation. However, if you want to add new hosts to your cluster, a host template can
make this much easier.
If there are existing host templates, they are listed on the page, along with links to each role group included in the
template.
If you are managing multiple clusters, you must create separate host templates for each cluster, as the templates
specify role configurations specific to the roles in a single cluster. Existing host templates are listed under the cluster
to which they apply.
• You can click a role group name to be taken to the Edit configuration page for that role group, where you can
modify the role group settings.
• From the Actions menu associated with the template you can edit the template, clone it, or delete it.
Decommissioning Hosts
Minimum Required Role: Limited Operator (also provided by Operator, Configurator, Cluster Administrator, or Full
Administrator)
Note that the Limited Operator and Operator roles do not allow you to suppress or enable alerts.
Note: Hosts with DataNodes and DataNode roles themselves can only be decommissioned if the
resulting action leaves enough DataNodes commissioned to maintain the configured HDFS replication
factor (by default 3). If you attempt to decommission a DataNode or a host with a DataNode in such
situations, the decommission process will not complete and must be aborted.
Cloudera Manager manages the host decommission and recommission process and allows you the option to specify
whether to replicate the data to other DataNodes, and whether or not to suppress alerts.
Decommissioning a host decommissions and stops all roles on the host without requiring you to individually
decommission the roles on each service. Decommissioning applies only to HDFS DataNode, MapReduce TaskTracker,
YARN NodeManager, and HBase RegionServer roles. If the host has other roles running on it, those roles are stopped.
To decommission one or more hosts:
1. If the host has a DataNode, and you are planning to replicate data to other hosts (for longer term maintenance
operations or to permanently decommission or repurpose the host), perform the steps in Tuning HDFS Prior to
Decommissioning DataNodes on page 239.
2. In Cloudera Manager, select the cluster where you want to decommission hosts.
3. Click Hosts > All Hosts.
4. Select the hosts that you want to decommission.
5. Select Actions for Selected > Begin Maintenance (Suppress Alerts/Decommission).
(If you are logged in as a user with the Limited Operator or Operator role, the menu item is labeled Decommission
Host(s) and you will not see the option to suppress alerts.)
The Begin Maintenance (Suppress Alerts/Decommission) dialog box opens. The role instances running on the
hosts display at the top.
6. To decommission the hosts and suppress alerts, select Decommission Host(s). When you select this option for
hosts running a DataNode role, choose one of the following (if the host is not running a DataNode role, you will
only see the Decommission Host(s) option):
• Decommission DataNodes
This option re-replicates data to other DataNodes in the cluster according to the configured replication factor.
Depending on the amount of data and other factors, this can take a significant amount of time and uses a
great deal of network bandwidth. This option is appropriate when replacing disks, repurposing hosts for
non-HDFS use, or permanently retiring hardware.
• Take DataNode Offline
This option does not re-replicate HDFS data to other DataNodes until the amount of time you specify has
passed, making it less disruptive to active workloads. After this time has passed, the DataNode is automatically
recommissioned, but the DataNode role is not started. This option is appropriate for short-term maintenance
tasks not involving disks, such as rebooting, CPU/RAM upgrades, or switching network cables.
Note:
• You cannot start roles on a decommissioned host.
• When a DataNode is decommissioned, although HDFS data is replicated to other DataNodes,
local files containing the original data blocks are not automatically removed from the storage
directories on the host. If you want to permanently remove these files from the host to reclaim
disk space, you must do so manually.
Recommissioning Hosts
Minimum Required Role: Operator (also provided by Configurator, Cluster Administrator, Full Administrator)
Only hosts that are decommissioned using Cloudera Manager can be recommissioned.
1. In Cloudera Manager, select the cluster where you want to recommission hosts.
2. Click Hosts > All Hosts.
3. Select the hosts that you want to recommission.
4. Select Actions for Selected > End Maintenance (Suppress Alerts/Decommission).
The End Maintenance (Suppress Alerts/Decommission) dialog box opens. The role instances running on the hosts
display at the top.
5. To recommission the hosts, select Recommission Host(s).
6. Choose one of the following:
• Bring hosts online and start all roles
All decommissioned roles will be recommissioned and started. HDFS DataNodes will be started first and
brought online before decommissioning to avoid excess replication.
• Bring hosts online
All decommissioned roles will be recommissioned but remain stopped. You can restart the roles later.
small batches. If a DataNode has thousands of blocks, decommissioning can take several hours. Before decommissioning
hosts with DataNodes, you should first tune HDFS:
1. Run the following command to identify any problems in the HDFS file system:
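A sketch of such a check using standard hdfs fsck options; the output file location is an assumption:
hdfs fsck / -list-corruptfileblocks -openforwrite -files -blocks -locations 2>&1 > /tmp/hdfs-fsck.txt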
2. Fix any issues reported by the fsck command. If the command output lists corrupted files, use the fsck command
either to move them to the lost+found directory or to delete them, as shown in the sketches below.
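Sketches of the two variants, using standard hdfs fsck options; the path is a placeholder:
hdfs fsck <path to corrupted file> -move
hdfs fsck <path to corrupted file> -delete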
3. Raise the heap size of the DataNodes. DataNodes should be configured with at least 4 GB heap size to allow for
the increase in iterations and max streams.
a. Go to the HDFS service page.
b. Click the Configuration tab.
c. Select Scope > DataNode.
d. Select Category > Resource Management.
e. Set the Java Heap Size of DataNode in Bytes property as recommended.
To apply this configuration property to other role groups as needed, edit the value for the appropriate role
group. See Modifying Configuration Properties Using Cloudera Manager on page 74.
f. Enter a Reason for change, and then click Save Changes to commit the changes.
4. Increase the replication work multiplier per iteration to a larger number (the default is 2, however 10 is
recommended):
a. Select Scope > NameNode.
b. Expand the Category > Advanced category.
c. Configure the Replication Work Multiplier Per Iteration property to a value such as 10.
To apply this configuration property to other role groups as needed, edit the value for the appropriate role
group. See Modifying Configuration Properties Using Cloudera Manager on page 74.
d. Enter a Reason for change, and then click Save Changes to commit the changes.
5. Increase the replication maximum threads and maximum replication thread hard limits:
a. Select Scope > NameNode.
b. Expand the Category > Advanced category.
c. Configure the Maximum number of replication threads on a DataNode and Hard limit on the number of
replication threads on a DataNode properties to 50 and 100 respectively. You can decrease the number of
threads (or use the default values) to minimize the impact of decommissioning on the cluster, but the trade
off is that decommissioning will take longer.
To apply this configuration property to other role groups as needed, edit the value for the appropriate role
group. See Modifying Configuration Properties Using Cloudera Manager on page 74.
d. Enter a Reason for change, and then click Save Changes to commit the changes.
6. Restart the HDFS service.
For additional tuning recommendations, see Performance Considerations on page 241.
Performance Considerations
Decommissioning a DataNode does not happen instantly because the process requires replication of a potentially large
number of blocks. During decommissioning, the performance of your cluster may be impacted. This section describes
the decommissioning process and suggests solutions for several common performance issues.
Decommissioning occurs in two steps:
1. The Commission State of the DataNode is marked as Decommissioning and the data is replicated from this node
to other available nodes. Until all blocks are replicated, the node remains in a Decommissioning state. You can
view this state from the NameNode Web UI. (Go to the HDFS service and select Web UI > NameNode Web UI.)
2. When all data blocks are replicated to other nodes, the node is marked as Decommissioned.
Decommissioning can impact performance in the following ways:
• There must be enough disk space on the other active DataNodes for the data to be replicated. After
decommissioning, the remaining active DataNodes have more blocks and therefore decommissioning these
DataNodes in the future may take more time.
• There will be increased network traffic and disk I/O while the data blocks are replicated.
• Data balance and data locality can be affected, which can lead to a decrease in performance of any running or
submitted jobs.
• Decommissioning a large number of DataNodes at the same time can decrease performance.
• If you are decommissioning a minority of the DataNodes, the speed of data reads from these nodes limits the
performance of decommissioning because decommissioning maxes out network bandwidth when reading data
blocks from the DataNode and spreads the bandwidth used to replicate the blocks among other DataNodes in
the cluster. To avoid performance impacts in the cluster, Cloudera recommends that you only decommission a
minority of the DataNodes at the same time.
• You can decrease the number of replication threads to decrease the performance impact of the replications, but
this will cause the decommissioning process to take longer to complete. See Tuning HDFS Prior to Decommissioning
DataNodes on page 239.
Cloudera recommends that you add DataNodes and decommission DataNodes in parallel, in smaller groups. For
example, if the replication factor is 3, then you should add two DataNodes and decommission two DataNodes at the
same time.
Troubleshooting Performance of Decommissioning
The following conditions can also impact performance when decommissioning DataNodes:
• Open Files on page 241
• A block cannot be relocated because there are not enough DataNodes to satisfy the block placement policy. on
page 242
Open Files
Write operations on the DataNode do not involve the NameNode. If there are blocks associated with open files
located on a DataNode, they are not relocated until the file is closed. This commonly occurs with:
• Clusters using HBase
• Open Flume files
• Long running tasks
To find open files, run the following command:
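As a sketch, on recent HDFS releases that include the -listOpenFiles option (availability depends on your CDH
version), you can list files that are currently open for write with the dfsadmin tool:
hdfs dfsadmin -listOpenFiles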
After you find the open files, perform the appropriate action to restart the process that holds them open so that
the files are closed. For example, for HBase, a major compaction closes all files in a region.
Alternatively, you may evict writers to those decommissioning DataNodes with the following command:
For example:
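(The following is a sketch using the stock HDFS dfsadmin tool, which provides an -evictWriters option in recent
releases; the host and port below are placeholders, with 50020 being the usual default DataNode IPC port.)
hdfs dfsadmin -evictWriters <datanode_host:ipc_port>
hdfs dfsadmin -evictWriters datanode1.example.com:50020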
A block cannot be relocated because there are not enough DataNodes to satisfy the block placement policy.
For example, for a 10 node cluster, if the mapred.submit.replication is set to the default of 10 while attempting
to decommission one DataNode, there will be difficulties relocating blocks that are associated with map/reduce
jobs. This condition produces block placement errors in the NameNode logs.
Use the following steps to find the number of files where the block replication policy is equal to or above your
current cluster size:
1. Provide a listing of open files, their blocks, the locations of those blocks by running the following command:
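A sketch using the standard fsck tool; the output file name is arbitrary:
hdfs fsck / -files -blocks -locations -openforwrite > openfiles.out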
2. Run the following command to return a list of how many files have a given replication factor:
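A sketch that parses the fsck output from the previous step; it assumes the openfiles.out file produced above and
relies on the repl= field that fsck prints for each block:
grep -o 'repl=[0-9]*' openfiles.out | sort | uniq -c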
3. Examine the paths, and decide whether to reduce the replication factor of the files, or remove them from the
cluster.
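For instance, to lower the replication factor of a particular file to 3 (the path shown is only an illustration), you can
use the standard setrep command:
hdfs dfs -setrep -w 3 /user/jdoe/data/part-00000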
Maintenance Mode
Minimum Required Role: Configurator (also provided by Cluster Administrator, Full Administrator)
Maintenance mode allows you to suppress alerts for a host, service, role, or an entire cluster. This can be useful when
you need to take actions in your cluster (make configuration changes and restart various elements) and do not want
to see the alerts that will be generated due to those actions.
Putting an entity into maintenance mode does not prevent events from being logged; it only suppresses the alerts that
those events would otherwise generate. You can see a history of all the events that were recorded for entities during
the period that those entities were in maintenance mode.
For example:
• If you set the HBase service into maintenance mode, then its roles (HBase Master and all RegionServers) are put
into effective maintenance mode.
• If you set a host into maintenance mode, then any roles running on that host are put into effective maintenance
mode.
Entities that have been explicitly put into maintenance mode show the icon . Entities that have entered effective
maintenance mode as a result of inheritance from a higher-level entity show the icon .
When an entity (role, host or service) is in effective maintenance mode, it can only be removed from maintenance
mode when the higher-level entity exits maintenance mode. For example, if you put a service into maintenance mode,
the roles associated with that service are entered into effective maintenance mode, and remain in effective maintenance
mode until the service exits maintenance mode. You cannot remove them from maintenance mode individually.
Alternatively, an entity that is in effective maintenance mode can be put into explicit maintenance mode. In this case,
the entity remains in maintenance mode even when the higher-level entity exits maintenance mode. For example,
suppose you put a host into maintenance mode, (which puts all the roles on that host into effective maintenance
mode). You then select one of the roles on that host and put it explicitly into maintenance mode. When you have the
host exit maintenance mode, that one role remains in maintenance mode. You need to select it individually and
specifically have it exit maintenance mode.
You can view the status of maintenance mode across a cluster by clicking to the right of the cluster name and selecting View Maintenance Mode Status.
The roles will be put in explicit maintenance mode. If the roles were already in effective maintenance mode (because
their service or host was put into maintenance mode), the roles will now be in explicit maintenance mode. This means
that they will not exit maintenance mode automatically if their host or service exits maintenance mode; they must be
explicitly removed from maintenance mode.
5. Deselect the Recommission Host(s) option to take the host out of Maintenance Mode and re-enable alerts from
the hosts. Hosts that are currently in Maintenance Mode display the icon on the All Hosts page.
6. Click End Maintenance.
Changing Hostnames
Minimum Required Role: Full Administrator
Important:
• The process described here requires Cloudera Manager and cluster downtime.
• If any user created scripts reference specific hostnames those must also be updated.
• Due to the length and complexity of the following procedure, changing cluster hostnames is not
recommended by Cloudera.
• To change hostnames for Apache Kudu, follow the procedure mentioned in Perform Hostname
Changes.
After you have installed Cloudera Manager and created a cluster, you may need to update the names of the hosts
running the Cloudera Manager Server or cluster services. To update a deployment with new hostnames, follow these
steps:
1. Verify if TLS/SSL certificates have been issued for any of the services and make sure to create new TLS/SSL certificates
in advance for services protected by TLS/SSL. See Encryption Mechanisms Overview.
2. Export the Cloudera Manager configuration using one of the following methods:
• Open a browser and go to this URL http://cm_hostname:7180/api/api_version/cm/deployment.
Save the displayed configuration.
• From terminal type:
$ curl -u admin:admin http://cm_hostname:7180/api/api_version/cm/deployment > cme-cm-export.json
where cm_hostname is the name of the Cloudera Manager host and api_version is the correct version of the API
for the version of Cloudera Manager you are using. For example,
http://tcdn5-1.ent.cloudera.com:7180/api/v32/cm/deployment.
3. Stop all services on the cluster.
4. Stop the Cloudera Management Service.
5. Stop the Cloudera Manager Server.
6. Stop the Cloudera Manager Agents on the hosts whose hostnames will be changed.
7. Back up the Cloudera Manager Server database using mysqldump, pg_dump, or another preferred backup utility.
Store the backup in a safe location.
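For example, if the Cloudera Manager Server database is MySQL and uses the commonly configured database name
scm (a sketch only; substitute your own database name, credentials, and output location):
mysqldump -u root -p --databases scm > /root/scm-backup-$(date +%Y%m%d).sql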
8. Update names and principals:
a. Update the target hosts using standard per-OS/name service methods (/etc/hosts, dns,
/etc/sysconfig/network, hostname, and so on). Ensure that you remove the old hostname.
b. If you are changing the hostname of the host running Cloudera Manager Server do the following:
a. Change the hostname per step 8.a.
b. Update the Cloudera Manager hostname in /etc/cloudera-scm-agent/config.ini on all Agents (an example of this change appears after step 8).
c. If the cluster is configured for Kerberos security, do the following:
• For an MIT KDC: open cluster-princ.txt and remove any non-cluster service principal entries, so that the
default krbtgt and other principals you created, or that were created by Kerberos by default, are not removed
when you run the following: for i in `cat cluster-princ.txt`; do yes yes | kadmin.local -q "delprinc $i"; done.
(A sketch of how to generate cluster-princ.txt follows this list.)
• For an Active Directory KDC, an AD administrator must manually delete the principals for the old
hostname from Active Directory.
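A sketch of how cluster-princ.txt can be generated on an MIT KDC and the old-host principals then removed; the
grep pattern is a placeholder for the old hostname, and you should review the file before running the deletion loop:
kadmin.local -q "listprincs" | grep -E "old-host-name" > cluster-princ.txt
for i in `cat cluster-princ.txt`; do yes yes | kadmin.local -q "delprinc $i"; done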
b. Start the Cloudera Manager database and Cloudera Manager Server.
c. Start the Cloudera Manager Agents on the newly renamed hosts. The Agents should show a current
heartbeat in Cloudera Manager.
d. Within the Cloudera Manager Admin Console click the Hosts tab.
e. Select the checkbox next to the host with the new name.
f. Select Actions > Regenerate Keytab.
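For step 8.b above, the Agent-side change is typically a single line in /etc/cloudera-scm-agent/config.ini on each
host (a sketch; the hostname below is a placeholder for your new Cloudera Manager Server hostname):
server_host=new-cm-hostname.example.com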
9. If one of the hosts that was renamed has a NameNode configured with high availability and automatic failover
enabled, reconfigure the ZooKeeper Failover Controller znodes to reflect the new hostname.
a. Start ZooKeeper Servers.
Warning: All other services, and most importantly HDFS, and the ZooKeeper Failover
Controller (FC) role within the HDFS, should not be running.
b. On one of the hosts that has a ZooKeeper Server role, run zookeeper-client.
a. If the cluster is configured for Kerberos security, configure ZooKeeper authorization as follows:
a. Go to the HDFS service.
b. Click the Instances tab.
c. Click the Failover Controller role.
d. Click the Process tab.
e. In the Configuration Files column of the hdfs/hdfs.sh ["zkfc"] program, expand Show.
f. Inspect core-site.xml in the displayed list of files and determine the value of the
ha.zookeeper.auth property, which will be something like:
digest:hdfs-fcs:TEbW2bgoODa96rO3ZTn7ND5fSOGx0h. The part after digest:hdfs-fcs:
is the password (in the example it is TEbW2bgoODa96rO3ZTn7ND5fSOGx0h)
g. Run the addauth command with the password:
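For example, using the password from the core-site.xml example above (a sketch of the zookeeper-client addauth
syntax):
addauth digest hdfs-fcs:TEbW2bgoODa96rO3ZTn7ND5fSOGx0h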
d. If you are not running JobTracker in a high availability configuration, delete the HA znode: rmr
/hadoop-ha.
Deleting Hosts
Minimum Required Role: Full Administrator
You can remove a host from a cluster in two ways:
• Delete the host entirely from Cloudera Manager.
• Remove a host from a cluster, but leave it available to other clusters managed by Cloudera Manager.
Both methods decommission the hosts, delete roles, and remove managed service software, but preserve data
directories.
Managing Services
Cloudera Manager service configuration features let you manage the deployment and configuration of CDH and
managed services. You can add new services and roles if needed, gracefully start, stop and restart services or roles,
and decommission and delete roles or services if necessary. Further, you can modify the configuration properties for
services or for individual role instances. If you have a Cloudera Enterprise license, you can view past configuration
changes and roll back to a previous revision. You can also generate client configuration files, enabling you to easily
distribute them to the users of a service.
The topics in this chapter describe how to configure and use the services on your cluster. Some services have unique
configuration requirements or provide unique features: those are covered in Managing Services on page 132.
Adding a Service
Minimum Required Role: Full Administrator
Important: As of February 1, 2021, all downloads of CDH and Cloudera Manager require a username
and password and use a modified URL. You must use the modified URL, including the username and
password when downloading the repository contents described below. You may need to upgrade
Cloudera Manager to a newer version that uses the modified URLs.
This can affect new installations, upgrades, adding new hosts to a cluster, downloading a new parcel,
and adding a cluster.
For more information, see Updating an existing CDH/Cloudera Manager deployment to access
downloads with authentication.
After initial installation, you can use the Add a Service wizard to add and configure new service instances. For example,
you may want to add a service such as Oozie that you did not select in the wizard during the initial installation.
The binaries for the following services are not packaged in CDH and must be installed individually before adding
the service:
If you do not add the binaries before adding the service, the service will fail to start.
To add a service:
1. On the Home > Status tab, click
to the right of the cluster name and select Add a Service. A list of service types display. You can add one type of
service at a time.
2. Select a service and click Continue. If you are missing required binaries, a pop-up displays asking if you want to
continue with adding the service.
3. Select the services on which the new service should depend. All services must depend on the same ZooKeeper
service. Click Continue.
4. Customize the assignment of role instances to hosts. The wizard evaluates the hardware configurations of the
hosts to determine the best hosts for each role. The wizard assigns all worker roles to the same set of hosts to
which the HDFS DataNode role is assigned. You can reassign role instances.
Click a field below a role to display a dialog box containing a list of hosts. If you click a field containing multiple
hosts, you can also select All Hosts to assign the role to all hosts, or Custom to display the hosts dialog box.
The following shortcuts for specifying hostname patterns are supported:
• Range of hostnames (without the domain portion)
• IP addresses
• Rack name
Click the View By Host button for an overview of the role assignment by hostname ranges.
5. Review and modify configuration settings, such as data directory paths and heap sizes and click Continue.
If you are installing the Accumulo Service, select Initialize Accumulo to initialize the service as part of the installation
process.
The service is started.
6. Click Continue then click Finish. You are returned to the Home page.
7. Verify the new service is started properly by checking the health status for the new service. If the Health Status
is Good, then the service started properly.
The service's properties will be displayed showing the values for each property for the selected clusters. The filters on
the left side can be used to limit the properties displayed.
You can also view property configuration values that differ between clusters across a deployment by selecting
Non-uniform Values on the Configuration tab of the Cloudera Manager Home > Status tab. For more information, see
Cluster-Wide Configuration on page 108.
Add-on Services
Minimum Required Role: Full Administrator
Cloudera Manager supports adding new types of services (referred to as an add-on service) to Cloudera Manager,
allowing such services to leverage Cloudera Manager distribution, configuration, monitoring, resource management,
and life-cycle management features. An add-on service can be provided by Cloudera or an independent software
vendor (ISV). If you have multiple clusters managed by Cloudera Manager, an add-on service can be deployed on any
of the clusters.
Note: If the add-on service is already installed and running on hosts that are not currently being
managed by Cloudera Manager, you must first add the hosts to a cluster that's under management.
See Adding a Host to the Cluster on page 228 for details.
2. Log on to the Cloudera Manager Server host, and place the CSD file under the location configured for CSD files.
3. Set the file ownership to cloudera-scm:cloudera-scm with permission 644.
4. Restart the Cloudera Manager Server:
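For example, steps 3 and 4 might look like the following on a host where the CSD directory is the default
/opt/cloudera/csd (a sketch; the path and jar name are placeholders for your environment):
sudo chown cloudera-scm:cloudera-scm /opt/cloudera/csd/MYSERVICE-1.0.jar
sudo chmod 644 /opt/cloudera/csd/MYSERVICE-1.0.jar
sudo service cloudera-scm-server restart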
5. Log into the Cloudera Manager Admin Console and restart the Cloudera Management Service.
a. Do one of the following:
• 1. Select Clusters > Cloudera Management Service.
2. Select Actions > Restart.
• On the Home > Status tab, click
Note: It is not required that the Cloudera Manager server host be part of a managed cluster and have
an agent installed. Although you initially copy the CSD file to the Cloudera Manager server, the Parcel
for the add-on service will not be installed on the Cloudera Manager Server host unless the host is
managed by Cloudera Manager.
If you have already installed the external software onto your cluster, you can skip these steps and proceed to Adding
an Add-on Service on page 253.
1.
Click in the main navigation bar. If the vendor has included the location of the repository in the CSD, the
parcel should already be present and ready for downloading. If the parcel is available, skip to step 7.
2. Use one of the following methods to open the parcel settings page:
• Navigation bar
1. Click the parcel icon in the top navigation bar or click Hosts and click the Parcels tab.
2. Click the Configuration button.
• Menu
1. Select Administration > Settings.
2. Select Category > Parcels.
3. In the Remote Parcel Repository URLs list, click the addition symbol to open an additional row.
4. Enter the path to the repository.
5. Enter a Reason for change, and then click Save Changes to commit the changes.
6.
Click . The external parcel should appear in the set of parcels available for download.
7. Download, distribute, and activate the parcel. See Managing Parcels.
5. After the server has restarted, log into the Cloudera Manager Admin Console and restart the Cloudera Management
Service.
6. Optionally remove the parcel.
Note: If you are unable to start the HDFS service, it's possible that one of the role instances, such
as a DataNode, was running on a host that is no longer connected to the Cloudera Manager Server
host, perhaps because of a hardware or network failure. If this is the case, the Cloudera Manager
Server will be unable to connect to the Cloudera Manager Agent on that disconnected host to start
the role instance, which will prevent the HDFS service from starting. To work around this, you can
stop all services, abort the pending command to start the role instance on the disconnected host, and
then restart all services again without that role instance. For information about aborting a pending
command, see Aborting a Pending Command on page 257.
Restarting a Service
It is sometimes necessary to restart a service, which is essentially a combination of stopping a service and then starting
it again. For example, if you change the hostname or port where the Cloudera Manager Server is running, or you enable TLS
security, you must restart the Cloudera Management Service to update the URL to the Server.
1. On the Home > Status tab, click
Rolling Restart
Note: This page contains references to CDH 5 components or features that have been removed from
CDH 6. These references are only applicable if you are managing a CDH 5 cluster with Cloudera Manager
6. For more information, see Deprecated Items.
Minimum Required Role: Operator (also provided by Configurator, Cluster Administrator, Full Administrator)
Important: This feature requires a Cloudera Enterprise license. It is not available in Cloudera Express.
See Managing Licenses on page 50 for more information.
Rolling restart allows you to conditionally restart the role instances of the following services to update software or use
a new configuration:
• Flume
• HBase
• HDFS
• Kafka
• Key Trustee KMS
• Key Trustee Server
• MapReduce
• Oozie
• YARN
• ZooKeeper
If the service is not running, rolling restart is not available for that service. You can specify a rolling restart of each
service individually.
If you have HDFS high availability enabled, you can also perform a cluster-level rolling restart. At the cluster level, the
rolling restart of worker hosts is performed on a host-by-host basis, rather than per service, to avoid all roles for a
service potentially being unavailable at the same time. During a cluster restart, to avoid having your NameNode (and
thus the cluster) be unavailable during the restart, Cloudera Manager forces a failover to the standby NameNode.
MapReduce (MRv1) JobTracker High Availability on page 492 and YARN (MRv2) ResourceManager High Availability on
page 488 are not required for a cluster-level rolling restart. However, if you have JobTracker or ResourceManager high
availability enabled, Cloudera Manager will force a failover to the standby JobTracker or ResourceManager.
Note:
• HDFS - If you do not have HDFS high availability configured, a warning appears reminding
you that the service will become unavailable during the restart while the NameNode is
restarted. Services that depend on that HDFS service will also be disrupted. Cloudera
recommends that you restart the DataNodes one at a time—one host per batch, which is
the default.
• HBase
– Administration operations such as any of the following should not be performed during
the rolling restart, to avoid leaving the cluster in an inconsistent state:
– Split
– Create, disable, enable, or drop table
– Metadata changes
– Create, clone, or restore a snapshot. Snapshots rely on the RegionServers being
up; otherwise the snapshot will fail.
– To increase the speed of a rolling restart of the HBase service, set the Region Mover
Threads property to a higher value. This increases the number of regions that can be
moved in parallel, but places additional strain on the HMaster. In most cases, Region
Mover Threads should be set to 5 or lower.
– Another option to increase the speed of a rolling restart of the HBase service is to set
the Skip Region Reload During Rolling Restart property to true. This setting can cause
regions to be moved around multiple times, which can degrade HBase client performance.
• MapReduce - If you restart the JobTracker, all current jobs will fail.
• YARN - If you restart ResourceManager and ResourceManager HA is enabled, current jobs
continue running: they do not restart or fail. ResourceManager HA is supported for CDH 5.2
and higher.
• ZooKeeper and Flume - For both ZooKeeper and Flume, the option to restart roles in batches
is not available. They are always restarted one by one.
4. If you select an HDFS, HBase, or MapReduce service, you can have their worker roles restarted in batches. You
can configure:
• How many roles should be included in a batch - Cloudera Manager restarts the worker roles rack-by-rack in
alphabetical order, and within each rack, hosts are restarted in alphabetical order. If you are using the default
replication factor of 3, Hadoop tries to keep the replicas on at least two different racks, so if you have multiple
racks you can use a batch size higher than the default of 1. Be aware that too high a batch size means that
fewer worker roles are active at any time during the upgrade, which can cause temporary performance
degradation. If you are using a single rack only, restart one worker node at a time to ensure data availability
during the upgrade.
• How long Cloudera Manager should wait before starting the next batch.
• The number of batch failures that will cause the entire rolling restart to fail (this is an advanced feature). For
example, if you have a very large cluster, you can use this option to tolerate some batch failures when you know
that your cluster will remain functional even if some worker roles are down.
5. Click Restart to start the rolling restart. While the restart is in progress, the Command Details page shows the
steps for stopping and restarting the services.
You can click the indicator ( ) with the blue badge, which shows the number of commands that are currently
running in your cluster (if any). This indicator is positioned just to the left of the Support link at the right hand side of
the navigation bar. Unlike the Commands tab for a role or service, this indicator includes all commands running for all
services or roles in the cluster. In the Running Commands window, click Abort to abort the pending command. For
more information, see Viewing Running and Recent Commands on page 301.
To abort a pending command for a service or role:
1. Go to the Service > Instances tab for the service where the role instance you want to stop is located. For example,
go to the HDFS Service > Instances tab if you want to abort a pending command for a DataNode.
2. In the list of instances, click the link for role instance where the command is running (for example, the instance
that is located on the disconnected host).
3. Go to the Commands tab.
4. Find the command in the list of Running Commands and click Abort Command to abort the running command.
Deleting Services
Minimum Required Role: Full Administrator
1. Stop the service. For information on starting and stopping services, see Starting, Stopping, and Restarting Services
on page 253.
Renaming a Service
Minimum Required Role: Full Administrator
A service is given a name upon installation, and that name is used as an identifier internally. However, Cloudera Manager
allows you to provide a display name for a service, and that name will appear in the Cloudera Manager Admin Console
instead of the original (internal) name.
1. On the Home > Status tab, click
NameNode:
*.sink.graphite.class=org.apache.hadoop.metrics2.sink.GraphiteSink
*.period=10
namenode.sink.graphite.server_host=<hostname>
namenode.sink.graphite.server_port=<port>
namenode.sink.graphite.metrics_prefix=<prefix>

SecondaryNameNode:
*.sink.graphite.class=org.apache.hadoop.metrics2.sink.GraphiteSink
*.period=10
secondarynamenode.sink.graphite.server_host=<hostname>
secondarynamenode.sink.graphite.server_port=<port>
secondarynamenode.sink.graphite.metrics_prefix=<prefix>

ResourceManager:
*.sink.graphite.class=org.apache.hadoop.metrics2.sink.GraphiteSink
*.period=10
resourcemanager.sink.graphite.server_host=<hostname>
resourcemanager.sink.graphite.server_port=<port>
resourcemanager.sink.graphite.metrics_prefix=<prefix>
Note: To use metrics, set values for each context. For example, for MapReduce1, add values for both
the JobTracker Default Group and the TaskTracker Default Group.
hbase.period=10
hbase.servers=<graphite hostname>:<port>
jvm.class=org.apache.hadoop.metrics.graphite.GraphiteContext
jvm.period=10
jvm.servers=<graphite hostname>:<port>
rpc.class=org.apache.hadoop.metrics.graphite.GraphiteContext
rpc.period=10
rpc.servers=<graphite hostname>:<port>

mapred.class=org.apache.hadoop.metrics.graphite.GraphiteContext
mapred.period=10
mapred.servers=<graphite hostname>:<port>
jvm.class=org.apache.hadoop.metrics.graphite.GraphiteContext
jvm.period=10
jvm.servers=<graphite hostname>:<port>
rpc.class=org.apache.hadoop.metrics.graphite.GraphiteContext
rpc.period=10
rpc.servers=<graphite hostname>:<port>
7. To add optional parameters for socket connection retry, modify this example as necessary:
8. To define a filter, which is recommended for preventing YARN metrics from overwhelming the Ganglia server, do
so on the sink side. For example:
*.source.filter.class=org.apache.hadoop.metrics2.filter.GlobFilter
*.record.filter.class=${*.source.filter.class}
*.metric.filter.class=${*.source.filter.class}
nodemanager.sink.ganglia.record.filter.exclude=ContainerResource*
DataNode:
*.period=10
datanode.sink.ganglia.servers=<hostname>:<port>

NameNode:
*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
*.period=10
namenode.sink.ganglia.servers=<hostname>:<port>

SecondaryNameNode:
*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
*.period=10
secondarynamenode.sink.ganglia.servers=<hostname>:<port>

NodeManager:
*.period=10
nodemanager.sink.ganglia.servers=<hostname>:<port>

ResourceManager:
*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
*.period=10
resourcemanager.sink.ganglia.servers=<hostname>:<port>

JobHistoryServer:
*.period=10
jobhistoryserver.sink.ganglia.servers=<hostname>:<port>
Note: To use metrics, set values for each context. For example, for MapReduce1, add values for both
the JobTracker Default Group and the TaskTracker Default Group.
jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
jvm.period=10
jvm.servers=<hostname>:<port>
rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
rpc.period=10
rpc.servers=<hostname>:<port>
mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext
mapred.period=10
mapred.servers=<hostname>:<port>
jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext
jvm.period=10
jvm.servers=<hostname>:<port>
rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext
rpc.period=10
rpc.servers=<hostname>:<port>
Managing Roles
When Cloudera Manager configures a service, it configures hosts in your cluster with one or more functions (called
roles in Cloudera Manager) that are required for that service. The role determines which Hadoop daemons run on a
given host. For example, when Cloudera Manager configures an HDFS service instance it configures one host to run
the NameNode role, another host to run as the Secondary NameNode role, another host to run the Balancer role, and
some or all of the remaining hosts to run DataNode roles.
Configuration settings are organized in role groups. A role group includes a set of configuration properties for a specific
group, as well as a list of role instances associated with that role group. Cloudera Manager automatically creates default
role groups.
For role types that allow multiple instances on multiple hosts, such as DataNodes, TaskTrackers, RegionServers (and
many others), you can create multiple role groups to allow one set of role instances to use different configuration
settings than another set of instances of the same role type. In fact, upon initial cluster setup, if you are installing on
identical hosts with limited memory, Cloudera Manager will (typically) automatically create two role groups for each
worker role — one group for the role instances on hosts with only other worker roles, and a separate group for the
instance running on the host that is also hosting master roles.
The HDFS service is an example of this: Cloudera Manager typically creates one role group (DataNode Default Group)
for the DataNode role instances running on the worker hosts, and another group (HDFS-1-DATANODE-1) for the
DataNode instance running on the host that is also running the master roles such as the NameNode, JobTracker, HBase
Master and so on. Typically the configurations for those two classes of hosts will differ in terms of settings such as
memory for JVMs.
Cloudera Manager configuration screens offer two layout options: classic and new. The new layout is the default;
however, on each configuration page you can easily switch between layouts using the Switch to XXX layout link at the
top right of the page. For more information, see Cluster Configuration Overview on page 73.
Gateway Roles
A gateway is a special type of role whose sole purpose is to designate a host that should receive a client configuration
for a specific service, when the host does not have any roles running on it. Gateway roles enable Cloudera Manager
to install and manage client configurations on that host. There is no process associated with a gateway role, and its
status will always be Stopped. You can configure gateway roles for HBase, HDFS, Hive, Kafka, MapReduce, Solr, Spark,
Sqoop 1 Client, and YARN.
Role Instances
Adding a Role Instance
Minimum Required Role: Cluster Administrator (also provided by Full Administrator)
After creating services, you can add role instances to the services. For example, after initial installation in which you
created the HDFS service, you can add a DataNode role instance to a host where one was not previously running. Upon
upgrading a cluster to a new version of CDH you might want to create a role instance for a role added in the new
version.
1. Go to the service for which you want to add a role instance. For example, to add a DataNode role instance, go to
the HDFS service.
2. Click the Instances tab.
3. Click the Add Role Instances button.
4. Customize the assignment of role instances to hosts. The wizard evaluates the hardware configurations of the
hosts to determine the best hosts for each role. The wizard assigns all worker roles to the same set of hosts to
which the HDFS DataNode role is assigned. You can reassign role instances.
Click a field below a role to display a dialog box containing a list of hosts. If you click a field containing multiple
hosts, you can also select All Hosts to assign the role to all hosts, or Custom to display the hosts dialog box.
The following shortcuts for specifying hostname patterns are supported:
• Range of hostnames (without the domain portion)
• IP addresses
• Rack name
Click the View By Host button for an overview of the role assignment by hostname ranges.
5. Click Continue.
6. In the Review Changes page, review the configuration changes to be applied. Confirm the settings entered for file
system paths. The file paths required vary based on the services to be installed. For example, you might confirm
the NameNode Data Directory and the DataNode Data Directory for HDFS. Click Continue. The wizard finishes by
performing any actions necessary to prepare the cluster for the new role instances. For example, new DataNodes
are added to the NameNode dfs_hosts_allow.txt file. The new role instance is configured with the default
role group for its role type, even if there are multiple role groups for the role type. If you want to use a different
role group, follow the instructions in Managing Role Groups on page 269 for moving role instances to a different
role group. The new role instances are not started automatically.
Important: Use Cloudera Manager to stop the Node Manager service. If it is stopped manually, it can
cause jobs to fail.
1. Go to the service that contains the role instances to start, stop, or restart.
2. Click the Instances tab.
3. Check the checkboxes next to the role instances to start, stop, or restart (such as a DataNode instance).
4. Select Actions for Selected > Start, Stop, or Restart, and then click Start, Stop, or Restart again to start the process.
When you see a Finished status, the process has finished.
Also see Rolling Restart on page 254.
decommission a DataNode or a host with a DataNode in such situations, the decommission process will not complete
and must be aborted.
A role will be decommissioned if its host is decommissioned. See Tuning and Troubleshooting Host Decommissioning
on page 239 for more details.
To remove a DataNode from the cluster, you decommission the DataNode role as described here and then perform a
few additional steps to remove the role. See Removing a DataNode on page 146.
To decommission role instances:
1. If you are decommissioning DataNodes, perform the steps in Tuning HDFS Prior to Decommissioning DataNodes
on page 239.
2. Click the service instance that contains the role instance you want to decommission.
3. Click the Instances tab.
4. Check the checkboxes next to the role instances to decommission.
5. Select Actions for Selected > Decommission, and then click Decommission again to start the process. A
Decommission Command pop-up displays that shows each step or decommission command as it is run. In the
Details area, click to see the subcommands that are run. Depending on the role, the steps may include adding
the host to an "exclusions list" and refreshing the NameNode, JobTracker, or NodeManager; stopping the Balancer
(if it is running); and moving data blocks or regions. Roles that do not have specific decommission actions are
stopped.
You can abort the decommission process by clicking the Abort button, but you must recommission and restart
the role.
The Commission State facet in the Filters list displays Decommissioning while decommissioning is in progress,
and Decommissioned when the decommissioning process has finished. When the process is complete, a is
added in front of Decommission Command.
Note: Deleting a role instance does not clean up the associated client configurations that have been
deployed in the cluster.
Role Groups
Minimum Required Role: Configurator (also provided by Cluster Administrator, Full Administrator)
A role group is a set of configuration properties for a role type, as well as a list of role instances associated with that
group. Cloudera Manager automatically creates a default role group named Role Type Default Group for each role
type. Each role instance can be associated with only a single role group.
Role groups provide two types of properties: those that affect the configuration of the service itself and those that
affect monitoring of the service, if applicable (the Monitoring subcategory). (Not all services have monitoring properties).
For more information about monitoring properties see Configuring Monitoring Settings on page 280.
When you run the installation or upgrade wizard, Cloudera Manager configures the default role groups it adds, and
adds any other required role groups for a given role type. For example, a DataNode role on the same host as the
NameNode might require a different configuration than DataNode roles running on other hosts. Cloudera Manager
creates a separate role group for the DataNode role running on the NameNode host and uses the default configuration
for DataNode roles running on other hosts.
You can modify the settings of the default role group, or you can create new role groups and associate role instances
to whichever role group is most appropriate. This simplifies the management of role configurations when one group
of role instances may require different settings than another group of instances of the same role type—for example,
due to differences in the hardware the roles run on. You modify the configuration for any of the service's role groups
through the Configuration tab for the service. You can also override the settings inherited from a role group for a role
instance.
If there are multiple role groups for a role type, you can move role instances from one group to another. When you
move a role instance to a different group, it inherits the configuration settings for its new group.
set missing properties (for example the TaskTracker Local Data Directory List property, which is not populated
if you select None) and clear other validation warnings and errors.
Rename
1. Click the role group name, click

Delete
You cannot delete any of the default groups. The group must first be empty; if you want to delete a group you've
created, you must first move any role instances to a different role group.
1. Click the role group name.
2. Click next to the role group name on the right, select Delete, and confirm by clicking Delete. Deleting a role
group removes it from host templates.
Time Line
The Time Line appears on many pages in Cloudera Manager. When you view the top level service and Hosts tabs, the
Time Line shows status and health only for a specific point in time. When you are viewing the Logs and Events tabs,
and when you are viewing the Status, Commands, Audits, Jobs, Applications, and Queries pages of individual services,
roles, and hosts, the Time Line appears as a Time Range Selector, which lets you highlight a range of time over which
to view historical data.
Click the ( ) icon at the far right to turn on and turn off the display of the Time Line.
Cloudera Manager displays timestamped data using the time zone of the host where Cloudera Manager server is
running. The time zone information can be found under the Support > About menu.
The background chart in the Time Line shows the percentage of CPU utilization on all hosts in the cluster, updated at
approximately one-minute intervals, depending on the total visible time range. You can use this graph to identify
periods of activity that may be of interest.
In the pages that support a time range selection, the area between the handles shows the selected time range.
There are a variety of ways to change the time range in this mode.
The Reports screen (Clusters > Reports) does not support the Time Range Selector: the historical reports accessed
from the Reports screen have their own time range selection mechanism.
You can select the point in time in one of the following ways:
• By moving the Time Marker ( )
• When the Time Marker is set to a past time, you can quickly switch back to view the current time using the Now
button ( ).
• By clicking the date, choosing the date and time, and clicking Apply.
to open the time selection widget. Enter a start and end time and click Apply to put your choice into effect.
• When you are under the Clusters tab with an individual activity selected, a Zoom to Duration button is available.
This lets you zoom the time selection to include just the time range that corresponds to the duration of your
selected activity.
Health Tests
Cloudera Manager monitors the health of the services, roles, and hosts that are running in your clusters using health
tests. The Cloudera Management Service also provides health tests for its roles. Role-based health tests are enabled
by default. For example, a simple health test is whether there's enough disk space in every NameNode data directory.
A more complicated health test may compare how long ago the last HDFS checkpoint occurred against a threshold, or
check whether a DataNode is connected to a NameNode. Some of these health tests also aggregate other health tests: in a
distributed system like HDFS, it is normal to have a few DataNodes down (assuming you have dozens of hosts), so
thresholds can be set on what percentage of hosts being down marks the entire service as unhealthy.
Health tests can return one of three values: Good, Concerning, and Bad. A test returns Concerning health if the test
falls below a warning threshold. A test returns Bad if the test falls below a critical threshold. The overall health of a
service or role instance is a roll-up of its health tests. If any health test is Concerning (but none are Bad) the role's or
service's health is Concerning; if any health test is Bad, the service's or role's health is Bad.
In the Cloudera Manager Admin Console, health tests results are indicated with colors: Good , Concerning , and
Bad .
There are two types of health tests:
• Pass-fail tests - there are two types:
– Compare a property to a yes-no value. For example, whether a service or role started as expected, a DataNode
is connected to its NameNode, or a TaskTracker is (or is not) blacklisted.
– Exercise a service lightly to confirm it is working and responsive. HDFS (NameNode role), HBase, and ZooKeeper
services perform these tests, which are referred to as "canary" tests.
Both types of pass-fail tests result in the health reported as being either Good or Bad.
• Metric tests - compare a property to a numeric value. For example, the number of file descriptors in use, the
amount of disk space used or free, how much time spent in garbage collection, or how many pages were swapped
to disk in the previous 15 minutes. In these tests the property is compared to thresholds that determine whether
everything is Good (for example, plenty of disk space available), Concerning (disk space getting low), or Bad (a
critically low amount of disk space).
By default most health tests are enabled and (if appropriate) configured with reasonable thresholds. You can modify
threshold values by editing the monitoring properties under the entity's Configuration tab. You can also enable or
disable individual or summary health tests, and in some cases specify what should be included in the calculation of
overall health for the service, role instance, or host. See Configuring Monitoring Settings on page 280 for more
information.
Note: Suppressing a health test is different than disabling a health test. A disabled health test never
runs, whereas a suppressed health test runs but its results are hidden.
Status
The Status tab contains:
• Clusters - The clusters being managed by Cloudera Manager. Each cluster is displayed either in summary form or
in full form depending on the configuration of the Administration > Settings > Other > Maximum Cluster Count
Shown In Full property. When the number of clusters exceeds the value of the property, only cluster summary
information displays.
– Summary Form - A list of links to cluster status pages. Click Customize to jump to the Administration >
Settings > Other > Maximum Cluster Count Shown In Full property.
– Full Form - A separate section for each cluster containing a link to the cluster status page and a table containing
links to the Hosts page and the status pages of the services running in the cluster.
Each service row in the table has a menu of actions that you select by clicking , and one or more of the following indicators:

Health issue
Click the indicator to display the Health Issues pop-up dialog box. By default only Bad health test results are
shown in the dialog box. To display Concerning health test results, click the Also show n concerning issue(s)
link. Click the link to display the Status page containing details about the health test result.

Configuration issue
Indicates that the service has at least one configuration issue. The indicator shows the number of configuration
issues at the highest severity level. If there are configuration errors, the indicator is red. If there are no errors
but configuration warnings exist, then the indicator is yellow. No indicator is shown if there are no configuration
notifications.
Click the indicator to display the Configuration Issues pop-up dialog box. By default only notifications at the
Error severity level are listed, grouped by service name. To display Warning notifications, click the Also show
n warning(s) link. Click the message associated with an error or warning to be taken to the configuration property
for which the notification has been issued, where you can address the issue. See Managing Services on page 249.

Restart Needed / Refresh Needed (configuration modified)
Indicates that at least one of a service's roles is running with a configuration that does not match the current
configuration settings in Cloudera Manager. Click the indicator to display the Stale Configurations on page 91
page. To bring the cluster up-to-date, click the Refresh or Restart button on the Stale Configurations page or
follow the instructions in Refreshing a Cluster on page 105, Restarting a Cluster on page 106, or Restarting
Services and Instances after Configuration Changes on page 78.

Client configuration redeployment required
Indicates that the client configuration for a service should be redeployed. Click the indicator to display the Stale
Configurations on page 91 page. To bring the cluster up-to-date, click the Deploy Client Configuration button
on the Stale Configurations page or follow the instructions in Manually Redeploying Client Configuration Files
on page 94.
– Cloudera Management Service - A table containing a link to the Cloudera Manager Service. The Cloudera
Manager Service has a menu of actions that you select by clicking
.
– Charts - A set of charts (dashboard) that summarize resource utilization (IO, CPU usage) and processing
metrics.
Click a line, stack area, scatter, or bar chart to expand it into a full-page view with a legend for the individual
charted entities as well as more fine-grained axis divisions.
By default the time scale of a dashboard is 30 minutes. To change the time scale, click a duration link
at the top-right of the dashboard.
To set the dashboard type, click and select one of the following:
• Custom - displays a custom dashboard.
• Default - displays a default dashboard.
• Reset - resets the custom dashboard to the predefined set of charts, discarding any customizations.
Displays all commands run recently across the clusters. A badge indicates how many recent commands are
still running. Click the command link to display details about the command and child commands. See also Viewing
Running and Recent Commands on page 301.
Note: You can configure the Cloudera Manager Admin Console to automatically log out a user after
a configurable period of time. See Automatic Logout on page 25.
Note: Time values that appear in Cloudera Manager charts reflect the time zone setting on the
Cloudera Manager client machine, but time values returned by the Cloudera Manager API (including
those that appear in JSON and CSV files exported from charts) reflect Coordinated Universal Time
(UTC). For more information on the timestamp format, see the Cloudera Manager API documentation,
for example, ApiTimeSeriesData.java.
to the upper right of the chart to switch between custom and default dashboards.
• Charts can also be added to a custom dashboard. Click the icon at the top right and click Add to Dashboard.
You can add the chart to an existing dashboard by selecting Add chart to an existing custom or system dashboard
and selecting the dashboard name. Add the chart to a new dashboard by clicking Add chart to a new custom
dashboard and entering a new name in the Dashboard Name field.
Note: This page contains references to CDH 5 components or features that have been removed from
CDH 6. These references are only applicable if you are managing a CDH 5 cluster with Cloudera Manager
6. For more information, see Deprecated Items.
Minimum Required Role: Configurator (also provided by Cluster Administrator, Full Administrator)
There are several types of monitoring settings you can configure in Cloudera Manager:
• Health tests - For a service or role for which monitoring is provided, you can enable and disable selected health
tests and events, configure how those health tests factor into the overall health of the service, and modify thresholds
for the status of certain health tests. For hosts you can disable or enable selected health tests, modify thresholds,
and enable or disable health alerts.
• Free space - For hosts, you can set threshold-based monitoring of free space in the various directories on the
hosts Cloudera Manager monitors.
• Activities - For MapReduce, YARN, and Impala services, you can configure aspects of how Cloudera Manager
monitors activities, applications, and queries.
• Alerts - For all roles you can configure health alerts and configuration change alerts. You can also configure some
service specific alerts and how alerts are delivered.
• Log events - For all roles you can configure logging thresholds, log directories, log event capture, when log messages
become events, and when to generate log alerts.
• Monitoring roles - For the Cloudera Management Service you can configure monitoring settings for the monitoring
roles themselves—enable and disable health tests on the monitoring processes as well as configuring some general
settings related to events and alerts (specifically with the Event Server and Alert Publisher). Each of the Cloudera
Management Service roles has its own parameters that can be modified to specify how much data is retained by
that service. For some monitoring functions, the amount of retained data can grow very large, so it may become
necessary to adjust the limits.
For general information about modifying configuration settings, see Modifying Configuration Properties Using Cloudera
Manager on page 74.
Depending on the service or role you select, and the configuration category, you can enable or disable health tests,
determine when health tests cause alerts, or determine whether specific health tests are used in computing the overall
health of a role or service. In most cases you can disable these "roll-up" health tests separately from the individual
health tests.
As a rule, a health test whose result is considered "Concerning" or "Bad" is forwarded as an event to the Event Server.
That includes health tests whose results are based on configured Warning or Critical thresholds, as well as pass-fail type
health tests. An event is also published when the health test result returns to normal.
You can control when an individual health test is forwarded as an event or as an alert by modifying the threshold values
for the relevant health test.
For example, if the rule set is:
foo=10
bar=20
any activity named "foo" is marked slow if it runs for more than 10 minutes, and any activity named "bar" is marked
slow if it runs for more than 20 minutes.
Since Java regular expressions can be used, if the rule set is:
foo.*=10
bar=20
any activity with a name that starts with "foo" (for example, fool, food, foot) matches the first rule.
If there is no match for an activity, then that activity is not monitored for job duration. However, you can add a "catch-all"
as the last rule that always matches any name:
foo.*=10
bar=20
baz=30
.*=60
In this case, any job that runs longer than 60 minutes is marked slow and generates an event.
8. Click the icon that is next to any stale services to invoke the cluster restart wizard.
Configuring Alerts
Enabling Activity Monitor Alerts
You can enable alerts when an activity runs too slowly or fails.
1. Go to the MapReduce service.
2. Click the Configuration tab.
3. Select Scope > MapReduce service_name (Service-Wide).
4. Click the Monitoring category.
5. Check the Alert on Slow Activities or Alert on Activity Failure checkboxes.
6. Enter a Reason for change, and then click Save Changes to commit the changes.
7. Return to the Home page by clicking the Cloudera Manager logo.
8. Click the icon that is next to any stale services to invoke the cluster restart wizard.
Enabling Configuration Change Alerts
Configuration change alerts can be set service wide, or on specific roles for the service.
1. Click a service, role, or host.
Note: If alerting is enabled for events, you can search for and view alerts in the Events tab, even if
you do not have email notification configured.
Important: We do not recommend logging to a network-mounted file system. If a role is writing its
logs across the network, a network failure or the failure of a remote file system can cause that role
to freeze up until the network recovers.
Configuring Logs
1. Go to a service.
2. Click the Configuration tab.
3. Select role_name (Service-Wide) > Logs.
4. Edit a log property.
5. Enter a Reason for change, and then click Save Changes to commit the changes.
6. Return to the Home page by clicking the Cloudera Manager logo.
7. Click the icon that is next to any stale services to invoke the cluster restart wizard.
TRACE produces the greatest number of messages at the lowest severity. The default setting is INFO.
1. Go to a service.
2. Click the Configuration tab.
3. Enter Logging Threshold in the Search text field.
4. For the desired role group, select a logging threshold level.
5. Enter a Reason for change, and then click Save Changes to commit the changes.
6. Return to the Home page by clicking the Cloudera Manager logo.
7. Click the icon that is next to any stale services to invoke the cluster restart wizard.
2. Enter a Reason for change, and then click Save Changes to commit the changes.
3. Return to the Home page by clicking the Cloudera Manager logo.
4. Click the icon that is next to any stale services to invoke the cluster restart wizard.
to search for the property by name. Cloudera Manager shows all copies of the property that match the search
filter.
6. In the Content field, edit the rules as needed. Rules can be written as regular expressions.
7. Enter a Reason for change, and then click Save Changes to commit the changes.
8. Return to the Home page by clicking the Cloudera Manager logo.
9. Click the icon that is next to any stale services to invoke the cluster restart wizard.
Cloudera defines a number of rules by default. For example:
• The line {"rate": 10, "threshold":"FATAL"}, means log entries with severity FATAL should be forwarded
as events, up to 10 a minute.
• The line {"rate": 0, "exceptiontype": "java.io.EOFException"}, means log entries with the exception
java.io.EOFException should always be forwarded as an event.
The syntax for these rules is defined in the Description field for this property: the syntax lets you create rules that
identify log messages based on log4j severity, message content matching, or the exception type. These rules must
result in valid JSON.
Note: Editing these rules is not recommended. Cloudera Manager provides a default set of rules that
should be sufficient for most users.
Monitoring Clusters
There are several ways to monitor clusters.
The Clusters tab in the top navigation bar displays each cluster's services in its own section, with the Cloudera
Management Service separately below. You can select the following cluster-specific pages: hosts, reports, activities,
and resource management.
The Home > Status tab displays the clusters being managed by Cloudera Manager. Each cluster is displayed either in
summary form or in full form depending on the configuration of the Administration > Settings > Other > Maximum
Cluster Count Shown In Full property. When the number of clusters exceeds the value of the property, only cluster
summary information displays.
To display a cluster Status page, click the cluster name on the Home > Status tab. The cluster Status page
displays a table containing links to the Hosts page and the status pages of the services running in the cluster.
Each service row in the table has a menu of actions that you select by clicking
Health issues
Click the indicator to display the Health Issues pop-up dialog box. By default, only Bad health test results are shown in the dialog box. To display Concerning health test results, click the Also show n concerning issue(s) link. Click the link to display the Status page containing details about the health test result.
Configuration issue
Indicates that the service has at least one configuration issue. The indicator shows the number of configuration issues at the highest severity level. If there are configuration errors, the indicator is red. If there are no errors but configuration warnings exist, the indicator is yellow. No indicator is shown if there are no configuration notifications.
Click the indicator to display the Configuration Issues pop-up dialog box. By default, only notifications at the Error severity level, grouped by service name, are shown in the dialog box. To display Warning notifications, click the Also show n warning(s) link. Click the message associated with an error or warning to be taken to the configuration property for which the notification has been issued, where you can address the issue. See Managing Services on page 249.
Restart Needed / Refresh Needed
Configuration modified. Indicates that at least one of a service's roles is running with a configuration that does not match the current configuration settings in Cloudera Manager.
Click the indicator to display the Stale Configurations page (page 91). To bring the cluster up-to-date, click the Refresh or Restart button on the Stale Configurations page or follow the instructions in Refreshing a Cluster on page 105, Restarting a Cluster on page 106, or Restarting Services and Instances after Configuration Changes on page 78.
Client configuration redeployment required
Indicates that the client configuration for a service should be redeployed.
Click the indicator to display the Stale Configurations page (page 91). To bring the cluster up-to-date, click the Deploy Client Configuration button on the Stale Configurations page or follow the instructions in Manually Redeploying Client Configuration Files on page 94.
The right side of the status page displays charts (dashboard) that summarize resource utilization (IO, CPU usage) and
processing metrics.
Note: If you delete a cluster, the deleted cluster still displays in some charts. This is because the charts
also show historical data. Over time, data from the deleted cluster will drop off as older data is replaced
by more current data. You can work around this by:
• Waiting for the data from the deleted cluster to drop off.
• Editing the where clause of the query for the chart to include only the cluster(s) you are interested
in. (For example: clusterDisplayName=Cluster_1). You can revert to the original query at
a later date, after the data for the deleted cluster has dropped off. See Charting Time-Series Data
on page 365.
• Deleting all data in the Host Monitor and Service Monitor storage directories and starting from
scratch. You will, however, lose all historical data from both current and deleted clusters. See
Configuring Host Monitor Data Storage on page 454 and Configuring Service Monitor Data Storage
on page 454 to learn where the storage directories are located.
Running the Network Performance Inspector From the Cloudera Manager Admin Console
To run the Network Performance Inspector:
1. Open the Network Performance Inspector from one of the following pages in the Cloudera Manager Admin
Console:
• From the All Hosts page:
1. Select Hosts > All Hosts.
2. Click the Inspect Network Performance button to launch the inspector.
• From the Cluster Status page:
1. Select Clusters > Cluster Name
2. In the Status section, click the Hosts link at the top of the list of Cluster Services to open the Hosts page.
3. Click the Inspect Cluster Network Performance button to launch the inspector.
Note: Cloudera recommends that you do not abort the command. If you have selected the bandwidth
test and then abort the job before it completes, subsequent runs of the test may fail, and you
will need to perform a workaround on all the tested hosts to reset the inspector. See Known
Issues.
The Cluster Performance Inspector command window opens. When the Inspector finishes, click the Show Inspector
Results button to open the Network Performance Inspector Results page.
Select Show Hosts with Issues to display any problems found by the inspector, or select Show All Hosts. The
Network Diagnostic Result page displays a table of results. Each row represents a single host. The Target Hosts
Summary column summarizes the performance of the host. Click the summary text to view detailed performance
statistics from this host to each of the other hosts.
The inspector summarizes the latency and bandwidth tests for the hosts using three icons:
Table 10:
Orange (Concerning network performance): Bandwidth test result of 250 - 500 MBits/second, or any host with a ping latency in the range of 1 to 4 milliseconds.
Red (Bad network performance): Bandwidth test result of less than 250 MBits/second, any target host that is unreachable by hostname (has a 100% packet loss), any host with a ping latency greater than 4 milliseconds, or any target host with a packet loss of 1% or greater.
Poor performance can result from firewall settings, router configurations, network topology, and other factors.
You may need to work with your network administrator to mitigate these issues.
Running the Network Performance Inspector From the Cloudera Manager API
You can invoke the Network Performance Inspector using the Cloudera Manager API and the following endpoints:
• /cm/commands/hostsPerfInspector
Invokes the inspector across an arbitrary set of hosts (including hosts that are not part of the cluster).
• /cm/commands/clusterPerfInspector
Invokes the inspector across the hosts in two clusters.
• /cm/clusters/cluster name/commands/perfInspector (where cluster name is the name of the cluster)
Invokes the inspector across the hosts of a specified cluster.
For more information, see the Cloudera Manager REST API documentation.
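As a rough illustration, the cluster-scoped endpoint can be invoked and polled with a short script. This is a sketch only: the host, port, credentials, API version, and cluster name below are placeholders, and the exact request body the command accepts is described in the REST API documentation for your release.

    import time
    import requests

    BASE = "http://cm-host.example.com:7180/api/v19"   # placeholder host and API version
    AUTH = ("admin", "admin")                          # placeholder credentials
    CLUSTER = "Cluster1"                               # placeholder cluster name

    # Start the inspector across the hosts of the cluster. Optional ping and
    # bandwidth arguments, if any, would go in the JSON body.
    resp = requests.post("%s/clusters/%s/commands/perfInspector" % (BASE, CLUSTER),
                         auth=AUTH, json={})
    resp.raise_for_status()
    command = resp.json()

    # Poll the returned command until it completes, then report the outcome.
    while command.get("active"):
        time.sleep(10)
        command = requests.get("%s/commands/%s" % (BASE, command["id"]), auth=AUTH).json()

    print("success:", command.get("success"), "-", command.get("resultMessage"))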
Monitoring Services
Cloudera Manager's Service Monitoring feature monitors dozens of service health and performance metrics about the
services and role instances running on your cluster:
• Presents health and performance data in a variety of formats including interactive charts
• Monitors metrics against configurable thresholds
• Generates events related to system and service health and critical log entries and makes them available for
searching and alerting
• Maintains a complete record of service-related actions and configuration changes
This opens the Status page where you can view a variety of information about a service and its performance. See
Viewing Service Status on page 293 for details.
Note: Not all service types provide complete monitoring and health information. Hive, Hue, Oozie,
Solr, and YARN (CDH 4 only) only provide the basic Status Summary on page 294.
Each service that supports monitoring provides a set of monitoring properties where you can enable or disable health
tests and events, and set or modify the thresholds for the status of certain health tests. For more information, see
Configuring Monitoring Settings on page 280.
The HDFS, MapReduce, HBase, ZooKeeper, and Flume services also provide additional information: a snapshot of
service-specific metrics, health test results, health history, and a set of charts that provide a historical view of metrics
of interest.
service. The Actions menu is disabled while viewing a past status to ensure that you cannot accidentally act on outdated
status information.
See Time Line on page 271 for more details.
Status Summary
The Status Summary shows the status of each service instance being managed by Cloudera Manager. Even services
such as Hue, Oozie, or YARN (which are not monitored by Cloudera Manager) show a status summary. The overall
status for a service is a roll-up of the health test results for the service and all its role instances. The Status can be:
Unknown health The status of a service or role instance is unknown. This can occur for a
number of reasons, such as the Service Monitor is not running, or
connectivity to the Agent doing the health monitoring has been lost.
To see the status of one or more role instances, click the role type link under Status Summary. If there is a single
instance of the role type, the link directs you to the Status page of the role instance.
If there are multiple role instances (such as for DataNodes, TaskTrackers, and RegionServers), the role type link directs
you to the Role Instances page for that role type. Click on each instance, under Role Type, to be taken to the
corresponding Status page.
To display the results for each health test that applies to this role type, expand the Health Tests filter on the left and
expand Good Health, Warnings, Bad Health, or Disabled Health. Health test results that have been filtered out by
your role type selection appear as unavailable.
Service Summary
Some services (specifically HDFS, MapReduce, HBase, Flume, and ZooKeeper) provide additional statistics about their
operation and performance. These are shown in a Summary panel at the left side of the page. The contents of this
panel depend on the service:
• The HDFS Summary shows disk space usage.
• The MapReduce Summary shows statistics on slot usage, jobs and so on.
• The Flume Summary provides a link to a page of Flume metric details. See Flume Metric Details on page 296.
• The ZooKeeper Summary provides links to the ZooKeeper role instances (nodes) as well as Zxid information if you
have a ZooKeeper Quorum (multiple ZooKeeper servers).
For example:
Other services such as Hue, Oozie, Impala, and Cloudera Manager itself, do not provide a Service Summary.
Charts
HDFS, MapReduce, HBase, ZooKeeper, Flume, and Cloudera Management Service all display charts of some of the
critical metrics related to their performance and health. Other services such as Hive, Hue, Oozie, and Solr do not provide
charts.
See Viewing Charts for Cluster, Service, Role, and Host Instances on page 279 for detailed information on the charts
that are presented, and the ability to search and display metrics of your choice.
Note: The information on this page is always the Current information for the selected service and
roles. This page does not support a historical view: thus, the Time Range Selector is not available.
• Whether the role is currently in maintenance mode. If the role has been set into maintenance mode explicitly,
you will see the following icon ( ). If it is in effective maintenance mode due to the service or its host having
been set into maintenance mode, the icon will be this ( ).
• Whether the role is currently decommissioned.
You can sort or filter the Instances list by criteria in any of the displayed columns:
• Sort
1. Click the column header by which you want to sort. A small arrow indicates whether the sort is in ascending
or descending order.
2. Click the column header again to reverse the sort order.
• Filter - Type a property value in the Search box or select the value from the facets at the left of the page.
Unknown health The status of a service or role instance is unknown. This can occur for a
number of reasons, such as the Service Monitor is not running, or
connectivity to the Agent doing the health monitoring has been lost.
Summary
The Summary panel provides basic information about the role instance, where it resides, and the health of its host.
All role types provide the Summary panel. Some role instances related to HDFS, MapReduce, and HBase also provide
a Health Tests panel and associated charts.
• Clicking the Details link for a health test displays further information about the test, such as the meaning of the
test and its possible results, suggestions for actions you can take or how to make configuration changes related
to the test. The help text may include a link to the relevant monitoring configuration section for the service. See
Configuring Monitoring Settings on page 280 for more information.
• In the Health Tests panel:
– Clicking displays the lists of health tests that contributed to the health test.
– Clicking the Details link displays further information about the health test.
• In the Health History panel:
– Clicking displays the lists of health tests that contributed to the health history.
– Clicking the Show link moves the time range to the historical time period.
Status Summary
The Status Summary panel reports a roll-up of the status of all the roles.
Charts
Charts are shown for roles that are related to HDFS, MapReduce, HBase, ZooKeeper, Flume, and Cloudera Management
Service. Roles related to other services such as Hue, Hive, Oozie, and YARN, do not provide charts.
See Viewing Charts for Cluster, Service, Role, and Host Instances on page 279 for detailed information on the charts
that are presented, and the ability to search and display metrics of your choice.
The indicator positioned just to the left of the Search field on the right side of the Admin Console main navigation
bar displays the number of commands currently running for all services or roles. To display the running commands,
click the indicator.
To display all commands that have run and finished recently, do one of the following:
• Click the All Recent Commands button in the window that pops up when you click the indicator. This command
displays information on all running and recent commands in the same form, as described below.
• Click the Cloudera Manager logo in the main navigation bar and click the All Recent Commands tab.
Command Details
The details available for a command depend on whether the command is running or recently completed.
Running Commands
The Running Commands area shows commands that are in progress.
While the status of the command is In Progress, an Abort button displays so that you can abort the command if
necessary.
The Commands status information is updated automatically while the command is running.
After the command has finished running (all its subcommands have finished), the status is updated, the Abort buttons
disappear, and the information for Recent Commands appears as described below.
Recent Commands
The Recent Commands area shows commands that were run and finished within the search time range you specified.
If no commands were run during the selected time range, you can double the time range selection by clicking the Try
expanding the time range selection link. If you are in the "current time" mode, the beginning time will move; if you
are looking at a time range in the past, both the beginning and ending times of the range are changed. You can also
change the time range using the options described in Time Line on page 271.
Command Details
In the Running Commands dialog box or Recent Commands page, click a command in the Command column to display
its details and any subcommands. The page title is the name of the command.
The Summary section at the top shows information about the command:
• The current status
• The context, which can be a cluster, service, host, or role
• The time the command started
• The duration of the command
• A message about the command completion
• If the context is a role, links to role instance logs
The Details section shows how many steps, if any, the selected command has and lists any subcommands.
Expand a command to view subcommands. In the Running Commands dialog box, each subcommand also has an Abort
button that is present as long as the subcommand is in progress.
You can perform the following actions:
• Select the option to display all the subcommands or only failed or running commands.
• Click the link in the Context column to go to the Status page for the component (host, service, or role instance)
to which this command is related.
• Click a Role Log tab to display the log for that role, and stdout and stderr if available for the role.
Click a duration link at the top right of the charts to change the time period for
which the resource usage displays.
• Status - a summary of the virtual CPU cores and memory that can be allocated by the YARN scheduler.
• Resource Pools Usage - a list of pools that have been explicitly configured and pools created by YARN, and properties
of the pools. The Configuration link takes you to the Dynamic Resource Pool Configuration page.
– Allocated Memory - The memory assigned to the pool that is currently allocated to applications and queries.
– Allocated VCores - The number of virtual CPU cores assigned to the pool that are currently allocated to
applications and queries.
– Allocated Containers - The number of YARN containers assigned to the pool whose resources have been
allocated.
– Pending Containers - The number of YARN containers assigned to the pool whose resources are pending.
Monitoring Hosts
Cloudera Manager's Host Monitoring features let you manage and monitor the status of the hosts in your clusters.
• You can also search for hosts by selecting a value from the facets in the Filters section at the left of the page.
• If the Configuring Agent Heartbeat and Health Status Options on page 47 are configured as follows:
– Send Agent heartbeat every x
– Set health status to Concerning if the Agent heartbeats fail y
– Set health status to Bad if the Agent heartbeats fail z
The value v for a host's Last Heartbeat facet is computed as follows (a short illustrative sketch follows this list):
– v < x * y = Good
– v >= x * y and <= x * z = Concerning
– v >= x * z = Bad
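Expressed as code, the classification might look like the following sketch (illustrative only; x is the heartbeat interval in seconds, y and z are the Concerning and Bad multipliers, and v is the time since the last heartbeat):

    def last_heartbeat_health(v, x, y, z):
        """Classify the Last Heartbeat facet value for a host."""
        if v >= x * z:
            return "Bad"
        if v >= x * y:
            return "Concerning"
        return "Good"

    # Example with illustrative settings: heartbeat every 15 seconds,
    # Concerning after 5 missed heartbeats, Bad after 10.
    print(last_heartbeat_health(v=200, x=15, y=5, z=10))  # prints "Bad"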
Role Assignments
You can view the assignment of roles to hosts as follows:
1. Click the Roles tab.
2. Click a cluster name or All Clusters.
Disks Overview
Click the Disks Overview tab to display an overview of the status of all disks in the deployment. The statistics exposed
match or build on those in iostat, and are shown in a series of histograms that by default cover every physical disk
in the system.
Adjust the endpoints of the time line to see the statistics for different time periods. Specify a filter in the box to limit
the displayed data. For example, to see the disks for a single rack rack1, set the filter to: logicalPartition =
false and rackId = "rack1" and click Filter. Click a histogram to drill down and identify outliers. Mouse over
the graph and click to display additional information about the chart.
Host Details
You can view detailed information about each host, including:
• Name, IP address, rack ID
• Health status of the host and last time the Cloudera Manager Agent sent a heartbeat to the Cloudera Manager
Server
• Number of cores
• System load averages for the past 1, 5, and 15 minutes
• Memory usage
• File system disks, their mount points, and usage
• Health test results for the host
• Charts showing a variety of metrics and health test results over time.
• Role instances running on the host and their health
• CPU, memory, and disk resources used for each role instance
To view detailed host information:
1. Click the Hosts tab.
2. Click the name of one of the hosts. The Status page is displayed for the host you selected.
3. Click tabs to access specific categories of information. Each tab provides various categories of information about
the host, its services, components, and configuration.
From the status page you can view details about several categories of information.
Status
The Status page is displayed when a host is initially selected and provides summary information about the status of
the selected host. Use this page to gain a general understanding of work being done by the system, the configuration,
and health status.
If this host has been decommissioned or is in maintenance mode, you will see the following icon(s) ( , ) in the top
bar of the page next to the status message.
Details
This panel provides basic system configuration such as the host's IP address, rack, health status summary, and disk
and CPU resources. This information summarizes much of the detailed information provided in other panes on this
tab. To view details about the Host agent, click the Host Agent link in the Details section.
Health Tests
Cloudera Manager monitors a variety of metrics that are used to indicate whether a host is functioning as expected.
The Health Tests panel shows health test results in an expandable/collapsible list, typically with the specific metrics
that the test returned. (You can Expand All or Collapse All from the links at the upper right of the Health Tests panel).
• The color of the text (and the background color of the field) for a health test result indicates the status of the
results. The tests are sorted by their health status – Good, Concerning, Bad, or Disabled. The list of entries for
good and disabled health tests are collapsed by default; however, Bad or Concerning results are shown expanded.
• The text of a health test also acts as a link to further information about the test. Clicking the text will pop up a
window with further information, such as the meaning of the test and its possible results, suggestions for actions
you can take or how to make configuration changes related to the test. The help text for a health test also provides
a link to the relevant monitoring configuration section for the service. See Configuring Monitoring Settings on
page 280 for more information.
Health History
The Health History provides a record of state transitions of the health tests for the host.
• Click the arrow symbol at the left to view the description of the health test state change.
• Click the View link to open a new page that shows the state of the host at the time of the transition. In this view
some of the status settings are greyed out, as they reflect a time in the past, not the current status.
File Systems
The File systems panel provides information about disks, their mount points and usage. Use this information to determine
if additional disk space is required.
Roles
Use the Roles panel to see the role instances running on the selected host, as well as each instance's status and health.
Hosts are configured with one or more role instances, each of which corresponds to a service. The role indicates which
daemon runs on the host. Some examples of roles include the NameNode, Secondary NameNode, Balancer, JobTrackers,
DataNodes, RegionServers and so on. Typically a host will run multiple roles in support of the various services running
in the cluster.
Clicking the role name takes you to the role instance's status page.
You can delete a role from the host from the Instances tab of the Service page for the parent service of the role. You
can add a role to a host in the same way. See Role Instances on page 265.
Charts
Charts are shown for each host instance in your cluster.
See Viewing Charts for Cluster, Service, Role, and Host Instances on page 279 for detailed information on the charts
that are presented, and the ability to search and display metrics of your choice.
Processes
The Processes page provides information about each of the processes that are currently running on this host. Use this
page to access management web UIs, check process status, and access log information.
Note: The Processes page may display exited startup processes. Such processes are cleaned up within
a day.
• Stderr - A link to the stderr log (a file external to Cloudera Manager) for this host.
• Stdout - A link to the stdout log (a file external to Cloudera Manager) for this host.
Resources
The Resources page provides information about the resources (CPU, memory, disk, and ports) used by every service
and role instance running on the selected host.
Each entry on this page lists:
• The service name
• The name of the particular instance of this service
• A brief description of the resource
• The amount of the resource being consumed or the settings for the resource
The resource information provided depends on the type of resource:
• CPU - An approximate percentage of the CPU resource consumed.
• Memory - The number of bytes consumed.
• Disk - The disk location where this service stores information.
• Ports - The port number being used by the service to establish network connections.
Commands
The Commands page shows you running or recent commands for the host you are viewing. See Viewing Running and
Recent Commands on page 301 for more information.
Configuration
Minimum Required Role: Full Administrator
The Configuration page for a host lets you set properties for the selected host. You can set properties in the following
categories:
• Advanced - Advanced configuration properties. These include the Java Home Directory, which explicitly sets the
value of JAVA_HOME for all processes. This overrides the auto-detection logic that is normally used.
• Monitoring - Monitoring properties for this host. The monitoring settings you make on this page will override the
global host monitoring settings you make on the Configuration tab of the Hosts page. You can configure monitoring
properties for:
– health check thresholds
– the amount of free space on the filesystem containing the Cloudera Manager Agent's log and process directories
– a variety of conditions related to memory usage and other properties
– alerts for health check events
For some monitoring properties, you can set thresholds as either a percentage or an absolute value (in bytes).
• Other - Other configuration properties.
• Parcels - Configuration properties related to parcels. Includes the Parcel Directory property, the directory that
parcels will be installed into on this host. If the parcel_dir variable is set in the Agent's config.ini file, it will
override this value (see the sketch below).
• Resource Management - Enables resource management using control groups (cgroups).
For more information, see the description for each property or see Modifying Configuration Properties Using Cloudera
Manager on page 74.
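For example, the following sketch checks whether the Agent-side parcel_dir override is in effect. It assumes the conventional Agent configuration path /etc/cloudera-scm-agent/config.ini and that parcel_dir, when set, appears in the [General] section; verify both against your installation.

    import configparser

    AGENT_CONFIG = "/etc/cloudera-scm-agent/config.ini"  # assumed conventional path

    parser = configparser.ConfigParser()
    parser.read(AGENT_CONFIG)

    # If parcel_dir is set (and uncommented) in config.ini, it overrides the
    # Parcel Directory property configured for the host in Cloudera Manager.
    parcel_dir = parser.get("General", "parcel_dir", fallback=None)

    if parcel_dir:
        print("Agent override in effect; parcels install into:", parcel_dir)
    else:
        print("No Agent override; the Parcel Directory set in Cloudera Manager applies.")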
Components
The Components page lists every component installed on this host. This may include components that have been
installed but have not been added as a service (such as YARN, Flume, or Impala).
This includes the following information:
Audits
The Audits page lets you filter for audit events related to this host. See Lifecycle and Security Auditing on page 363 for
more information.
Charts Library
The Charts Library page for a host instance provides charts for all metrics kept for that host instance, organized by
category. Each category is collapsible/expandable. See Viewing Charts for Cluster, Service, Role, and Host Instances
on page 279 for more information.
Host Inspector
You can use the host inspector to gather information about hosts that Cloudera Manager is currently managing. You
can review this information to better understand system status and troubleshoot any existing issues. For example, you
might use this information to investigate potential DNS misconfiguration.
The inspector runs tests to gather information for functional areas including:
• Networking
• System time
• User and group configuration
• HDFS settings
• Component versions
Common cases in which this information is useful include:
• Installing components
• Upgrading components
• Adding hosts to a cluster
• Removing hosts from a cluster
3. If the command is too far in the past, you can use the Time Range Selector to move the time range back to cover
the time period you want.
4. When you find the Host Inspector command, click its name to display its subcommands.
5. Click the Show Inspector Results button to view the report.
See Viewing Running and Recent Commands on page 301 for more information about viewing past command activity.
Monitoring Activities
Cloudera Manager's activity monitoring capability monitors the MapReduce, Pig, Hive, Oozie, and streaming jobs,
Impala queries, and YARN applications that are running or have run on your cluster. When the individual jobs are part
of larger workflows (using Oozie, Hive, or Pig), these jobs are aggregated into MapReduce jobs that can be monitored
as a whole, as well as by the component jobs.
If you are running multiple clusters, there will be a separate link in the Clusters tab for each cluster's MapReduce
activities, Impala queries, and YARN applications.
The following sections describe how to view and monitor activities that run on your cluster.
You can use the Time Range Selector or a duration link ( ) to set the time range.
(See Time Line on page 271 for details).
Note: Activity Monitor treats the original job start time as immutable. If a job is resubmitted due to
failover it will retain its original start time.
You can select an activity and drill down to look at the jobs and tasks spawned by that job:
• View the children (MapReduce jobs) of a Pig or Hive activity.
• View the task attempts generated by a MapReduce job.
• View the children (MapReduce, Pig, or Hive activities) of an Oozie job.
• View the activity or job statistics in a detail report format.
• Compare the selected activity to a set of other similar activities, to determine if the selected activity showed
anomalous behavior. For example, if a standard job suddenly runs much longer than usual, this may indicate issues
with your cluster.
• Display the distribution of task attempts that made up a job, by different metrics compared to task duration. You
can use this, for example, to determine if tasks running on a certain host are performing slower than average.
• Kill a running job, if necessary.
Note: Some activity data is sampled at one-minute intervals. This means that if you run a very short
job that both starts and ends within the sampling interval, it may not be detected by the Activity
Monitor, and thus will not appear in the Activities list or charts.
Children For a Pig, Hive or Oozie activity, takes you to the Children tab of the individual activity
page. You can also go to this page by clicking the activity ID in the activity list. This
command only appears for Pig, Hive or Oozie activities.
Tasks For a MapReduce job, takes you to the Tasks tab of the individual job page. You can
also go to this page by clicking the job ID in the activity or activity children list. This
command only appears for a MapReduce job.
Details Takes you to the Details tab where you can view the activity or job statistics in report
form.
Compare Takes you to the Compare tab where you can see how the selected activity compares
to other similar activities in terms of a wide variety of metrics.
Task Distribution Takes you to the Task Distribution tab where you can view the distribution of task
attempts that made up this job, by amount of data and task duration. This command
is available for MapReduce and Streaming jobs.
Kill Job A pop-up asks for confirmation that you want to kill the job. This command is available
only for MapReduce and Streaming jobs.
• The second column shows a chart icon ( ). Select this to chart statistics for the job. If there are charts
showing similar statistics for the cluster or for other jobs, the statistics for the job are added to the chart.
See Activity Charts on page 312 for more details.
• The third column shows the status of the job, if the activity is a MapReduce job:
MapReduce job
Pig job
Hive job
Oozie job
Streaming job
Note: You cannot hide the shortcut menu or chart icon columns. Also, column selections are retained
only for the current session.
to the right of the Search button and select the filter you want to run. There are predefined filters to search by
job type (for example Pig activities, MapReduce jobs, and so on) or for running, failed, or long-running activities.
To create a filter:
1. Click the
4. To create a compound filter, click the plus icon at the end of the filter row to add another row. If you combine
filter criteria, all criteria must be true for an activity to match.
5. To remove a filter criteria from a compound filter, click the minus icon at the end of the filter row. Removing the
last row removes the filter.
6. To include any children of a Pig, Hive, or Oozie activity in your search results, check the Include Child Activities
checkbox. Otherwise, only the top-level activity will be included, even if one or more child activities matched the
filter criteria.
7. Click the Search button (which appears when you start creating the filter) to run the filter.
Note: Filters are remembered across user sessions — that is, if you log out the filter will be preserved
and will still be active when you log back in. Newly-submitted activities will appear in the Activity List
only if they match the filter criteria.
Activity Charts
By default the charts show aggregated statistics about the performance of the cluster: Tasks Running, CPU Usage, and
Memory Usage. There are additional charts you can enable from a pop-up panel. You can also superimpose individual
job statistics on any of the displayed charts.
Most charts display multiple metrics within the same chart. For example, the Tasks Running chart shows two metrics:
Cluster, Running Maps and Cluster, Running Reduces in the same chart. Each metric appears in a different color.
• To see the exact values at a given point in time, move the cursor over the chart – a movable vertical line pinpoints
a specific time, and a tooltip shows you the values at that point.
• You can use the time range selector at the top of the page to zoom in – the chart display will follow. In order to
zoom out, you can use the Time Range Selector at the top of the page or click the link below the chart.
To select additional charts:
1. Click at the top right of the chart panel to open the Customize dialog box.
2. Check or uncheck the boxes next to the charts you want to show or hide.
To show or hide cluster-wide statistics:
• Check or uncheck the Cluster checkbox at the top of the Charts panel.
To chart statistics for an individual job:
• Click the chart icon ( ) in the row next to the job you want to show on the charts. The job ID will appear in
the top bar next to the Cluster checkbox, and the statistics will appear on the appropriate chart.
• To remove a job's statistics from the chart, click the next to the job ID in the top bar of the chart.
Note: Chart selections are retained only for the current session.
2. Click the Pig, Hive or Oozie activity you want to inspect. This presents a list of the jobs that make up the Pig, Hive
or Oozie activity.
The functions under the Children tab are the same as those seen under the Activities tab. You can filter the job list,
show and hide columns in the job list, show and hide charts and plot job statistics on those charts.
• Click an individual job to view Task information and other information for that child. See Viewing and Filtering
MapReduce Activities on page 309 for details of how the functions on this page work.
In addition, viewing a Pig, Hive or Oozie activity provides the following tabs:
• The Details tab shows Activity details in a report form. See Viewing Activity Details in a Report Format for more
information.
• The Compare tab compares this activity to other similar activities. The main difference between this and a comparison
for a single MapReduce activity is that the comparison is done looking at other activities of the same type (Pig,
Hive or Oozie) but does include the child jobs of the activity. See Comparing Similar Activities for an explanation
of that tab.
Task Attempts
The Tasks tab contains a list of the Map and Reduce task attempts that make up a job.
Viewing a Job's Task Attempts
1. From the Clusters tab, in the section marked Other, select the activity you want to inspect.
• If the activity is a MapReduce job, the Tasks tab opens.
• If the activity is a Pig, Hive, or Oozie activity, select the job you want to inspect from the activity's Children
tab to open the Tasks tab.
The columns shown under the Tasks tab display statistics about the performance of and resources used by the task
attempts spawned by the selected job. By default only a subset of the possible metrics are displayed — you can modify
the columns that are displayed to add or remove the columns in the display.
• The status of an attempt is shown in the Attempt Status column:
to the right of the Search button and select the filter you want to run. There are predefined filters to search by
job type (for example Pig activities, MapReduce jobs, and so on) or for running, failed, or long-running activities.
To create a filter:
1. Click the
Note: The filter persists only for this user session; when you log out, the task list filter is removed.
2. Select the Details tab after the list of child jobs is displayed.
This displays information about the Pig, Oozie, or Hive job as a whole.
Note that this is the same data you would see for the activity if you displayed all possible columns in the Activities list.
Note: The Impala query monitoring feature requires Impala 1.0.1 and higher.
Viewing Queries
1. Do one of the following:
• Select Clusters > Cluster name > Impala service name Queries.
• On the Home > Status tab, select Impala service name and click the Queries tab.
The Impala queries run during the selected time range display in the Results Tab on page 317.
You can also perform the following actions on this page:
• Filter the displayed queries - Create filter expressions manually, select preconfigured filters, or use the Workload
Summary section to build a query interactively. See Filtering Queries on page 318.
• Select additional attributes for display - Click Select Attributes. Selected attributes also display as available filters
in the Workload Summary section. To display information about attributes, hover over a field label. See Filter
Attributes on page 320. Only attributes that support filtering appear in the Workload Summary section. See the
Table 14: Attributes on page 321 table.
• View a histogram of the attribute values - Click the icon to the right of each attribute displayed in the Workload
Summary section.
• Display charts based on the filter expression and selected attributes - Click the Charts tab.
• View charts that help identify whether Impala best practices are being followed - Click the Best Practices link.
• Export a JSON file with the query results that you can use for further analysis - Click Export.
Results Tab
Queries appear on the Results tab, with the most recent at the top. Each query has summary and detail information.
A query summary includes the following default attributes: start and end timestamps, statement, duration, rows
produced, user, coordinator, database, and query type. For example:
You can add additional attributes to the summary by clicking the Attribute Selector. In each query summary, the query
statement is truncated if it is too long to display. To display the entire statement, click . The query entry expands to
display the entire query string. To collapse the query display, click . To display information about query attributes
and possible values, hover over a field in a query. For example:
If an error occurred while processing the query, displays under the complete timestamp.
Use the Actions drop-down menu to the right of each query listing to do the following. (Not all options display,
depending on the type of job.)
• Query Details – Opens a details page for the job. See query details.
• User's Impala Queries – Displays a list of queries run by the user for the current job.
• Cancel (running queries only) – Cancel a running query (administrators only). Canceling a running query creates
an audit event. When you cancel a query, replaces the progress bar.
• Queries in the same YARN pool – Displays queries that use the same resource pool.
Filtering Queries
You filter queries by selecting a time range and specifying a filter expression in the search box.
You can use the Time Range Selector or a duration link ( ) to set the time range.
(See Time Line on page 271 for details).
Filter Expressions
Filter expressions specify which entries should display when you run the filter. The simplest expression consists of
three components:
• Attribute - Query language name of the attribute.
• Operator - Type of comparison between the attribute and the attribute value. Cloudera Manager supports the
standard comparator operators =, !=, >, <, >=, <=, and RLIKE. (RLIKE performs regular expression matching as
specified in the Java Pattern class documentation.) Numeric values can be compared with all operators. String
values can be compared with =, !=, and RLIKE. Boolean values can be compared with = and !=.
• Value - The value of the attribute. The value depends on the type of the attribute. For a Boolean value, specify
either true or false. When specifying a string value, enclose the value in double quotes.
You create compound filter expressions using the AND and OR operators. When more than one operator is used in an
expression, AND is evaluated first, then OR. To change the order of evaluation, enclose subexpressions in parentheses.
Compound Expressions
To find all the queries issued by the root user that produced over 100 rows, use the expression:
user = "root" AND rowsProduced > 100
To find all the executing queries issued by users Jack or Jill, use the expression:
executing = true AND (user = "Jack" OR user = "Jill")
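The same filter syntax can also be supplied programmatically when retrieving Impala queries through the Cloudera Manager API. The sketch below is illustrative only: the host, credentials, API version, cluster and service names, and time range are placeholders, and the endpoint and its filter parameter should be confirmed against the REST API documentation for your release.

    import requests

    BASE = "http://cm-host.example.com:7180/api/v19"   # placeholder host and API version
    AUTH = ("admin", "admin")                          # placeholder credentials

    params = {
        # Same expression syntax as the Queries page search box.
        "filter": 'user = "root" AND rowsProduced > 100',
        "from": "2021-01-01T00:00:00",                 # placeholder time range
        "to": "2021-01-02T00:00:00",
    }

    resp = requests.get("%s/clusters/Cluster1/services/impala/impalaQueries" % BASE,
                        auth=AUTH, params=params)
    resp.raise_for_status()

    queries = resp.json().get("queries", [])
    print("matched %d queries" % len(queries))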
to the right of the Search button to display a list of sample and recently run filters, and select a filter. The
filter text displays in the text box.
• Construct a Filter from the Workload Summary Attributes
Optionally, click Select Attributes to display a dialog box where you can choose which attributes to display in
the Workload Summary section. Select the checkbox next to one or more attributes, and click Close.
The attributes display in the Workload Summary section along with values or ranges of values that you can
filter on. The values and ranges display as links with checkboxes. Select one or more checkboxes to add the
range or value to the query. Click a link to run a query on that value or range. For example:
• Type a Filter
1. Start typing or press Spacebar in the text box. As you type, filter attributes matching the typed letter
display. If you press Spacebar, standard filter attributes display. These suggestions are part of typeahead,
which helps build valid queries. For information about the attribute name and supported values for each
field, hover over the field in an existing query.
2. Select an attribute and press Enter.
3. Press Spacebar to display a drop-down list of operators.
4. Select an operator and press Enter.
5. Specify an attribute value in one of the following ways:
• For attribute values that support typeahead, press Spacebar to display a drop-down list of values
and press Enter.
• Type a value.
2. Click in the text box and press Enter or click Search. The list displays the results that match the specified filter.
The Workload Summary section refreshes to show only the values for the selected filter. The filter is added to the
Recently Run list.
Filter Attributes
The following table includes available filter attributes and their names in Cloudera Manager, types, and descriptions.
Note: Only attributes for which the Supports Filtering? column value is TRUE appear in the Workload
Summary section.
Display Name (Attribute Name) | Type | Supports Filtering? | Description
Database (database) | STRING | TRUE | The database on which the query was run. Called 'database' in searches.
DDL Type (ddl_type) | STRING | TRUE | The type of DDL query. Called 'ddl_type' in searches.
Delegated User (delegated_user) | STRING | TRUE | The effective user for the query. This is set only if delegation is in use. Called 'delegated_user' in searches.
Duration (query_duration) | MILLISECONDS | TRUE | The duration of the query in milliseconds. Called 'query_duration' in searches.
Estimated per Node Peak Memory (estimated_per_node_peak_memory) | BYTES | TRUE | The planning process's estimate of per-node peak memory usage for the query. Called 'estimated_per_node_peak_memory' in searches.
Executing (executing) | BOOLEAN | FALSE | Whether the query is currently executing. Called 'executing' in searches.
File Formats (file_formats) | STRING | FALSE | An alphabetically sorted list of all the file formats used in the query. Called 'file_formats' in searches.
HBase Bytes Read (hbase_bytes_read) | BYTES | TRUE | The total number of bytes read from HBase by this query. Called 'hbase_bytes_read' in searches.
HBase Scanner Average Read Throughput (hbase_scanner_average_bytes_read_per_second) | BYTES_PER_SECOND | TRUE | The average HBase scanner read throughput for this query. This is computed by dividing the total bytes read from HBase by the total time spent reading by all HBase scanners. Called 'hbase_scanner_average_bytes_read_per_second' in searches.
HDFS Average Scan Range (hdfs_average_scan_range) | BYTES | TRUE | The average HDFS scan range size for this query. HDFS scan nodes that contained only a single scan range are not included in this computation. Low numbers for a query might indicate reading many small files which negatively impacts performance. Called 'hdfs_average_scan_range' in searches.
HDFS Bytes Read (hdfs_bytes_read) | BYTES | TRUE | The total number of bytes read from HDFS by this query. Called 'hdfs_bytes_read' in searches.
HDFS Bytes Read From Cache (hdfs_bytes_read_from_cache) | BYTES | TRUE | The total number of bytes read from HDFS that were read from the HDFS cache. This is only for completed queries. Called 'hdfs_bytes_read_from_cache' in searches.
HDFS Bytes Read From Cache Percentage (hdfs_bytes_read_from_cache_percentage) | NUMBER | TRUE | The percentage of all bytes read by this query that were read from the HDFS cache. This is only for completed queries. Called 'hdfs_bytes_read_from_cache_percentage' in searches.
HDFS Bytes Skipped (hdfs_bytes_skipped) | BYTES | TRUE | The total number of bytes that had to be skipped by this query while reading from HDFS. Any number above zero may indicate a problem. Called 'hdfs_bytes_skipped' in searches.
HDFS Bytes Written (hdfs_bytes_written) | BYTES | TRUE | The total number of bytes written to HDFS by this query. Called 'hdfs_bytes_written' in searches.
HDFS Local Bytes Read (hdfs_bytes_read_local) | BYTES | TRUE | The total number of local bytes read from HDFS by this query. This is only for completed queries. Called 'hdfs_bytes_read_local' in searches.
HDFS Local Bytes Read Percentage (hdfs_bytes_read_local_percentage) | NUMBER | TRUE | The percentage of all bytes read from HDFS by this query that were local. This is only for completed queries. Called 'hdfs_bytes_read_local_percentage' in searches.
HDFS Remote Bytes Read (hdfs_bytes_read_remote) | BYTES | TRUE | The total number of remote bytes read from HDFS by this query. This is only for completed queries. Called 'hdfs_bytes_read_remote' in searches.
HDFS Remote Bytes Read Percentage (hdfs_bytes_read_remote_percentage) | NUMBER | TRUE | The percentage of all bytes read from HDFS by this query that were remote. This is only for completed queries. Called 'hdfs_bytes_read_remote_percentage' in searches.
HDFS Scanner Average Read Throughput (hdfs_scanner_average_bytes_read_per_second) | BYTES_PER_SECOND | TRUE | The average HDFS scanner read throughput for this query. This is computed by dividing the total bytes read from HDFS by the total time spent reading by all HDFS scanners. Called 'hdfs_scanner_average_bytes_read_per_second' in searches.
HDFS Short Circuit Bytes Read (hdfs_bytes_read_short_circuit) | BYTES | TRUE | The total number of bytes read from HDFS by this query that used short-circuit reads. This is only for completed queries. Called 'hdfs_bytes_read_short_circuit' in searches.
HDFS Short Circuit Bytes Read Percentage (hdfs_bytes_read_short_circuit_percentage) | NUMBER | TRUE | The percentage of all bytes read from HDFS by this query that used short-circuit reads. This is only for completed queries. Called 'hdfs_bytes_read_short_circuit_percentage' in searches.
Impala Version (impala_version) | STRING | TRUE | The version of the Impala Daemon coordinating this query. Called 'impala_version' in searches.
Memory Accrual (memory_accrual) | BYTE_SECONDS | TRUE | The total accrued memory usage by the query. This is computed by multiplying the average aggregate memory usage of the query by the query's duration. Called 'memory_accrual' in searches.
Memory Spilled (memory_spilled) | BYTES | TRUE | Amount of memory spilled to disk. Called 'memory_spilled' in searches.
Network Address (network_address) | STRING | TRUE | The network address that issued this query. Called 'network_address' in searches.
Node with Peak Memory Usage (memory_per_node_peak_node) | STRING | TRUE | The node with the highest peak memory usage for this query. See Per Node Peak Memory Usage for the actual peak value. Called 'memory_per_node_peak_node' in searches.
Out of Memory (oom) | BOOLEAN | TRUE | Whether the query ran out of memory. Called 'oom' in searches.
Per Node Peak Memory Usage (memory_per_node_peak) | BYTES | TRUE | The highest amount of memory allocated by any single node that participated in this query. See Node with Peak Memory Usage for the name of the peak node. Called 'memory_per_node_peak' in searches.
Planning Wait Time (planning_wait_time) | MILLISECONDS | TRUE | The total amount of time the query spent waiting for planning to complete. Called 'planning_wait_time' in searches.
Planning Wait Time Percentage (planning_wait_time_percentage) | NUMBER | TRUE | The total amount of time the query spent waiting for planning to complete divided by the query duration. Called 'planning_wait_time_percentage' in searches.
Pool (pool) | STRING | TRUE | The name of the resource pool in which this query executed. Called 'pool' in searches. If YARN is in use, this corresponds to a YARN pool. Within YARN, a pool is referred to as a queue.
Query ID (query_id) | STRING | FALSE | The id of this query. Called 'query_id' in searches.
Query State (query_state) | STRING | TRUE | The current state of the query (running, finished, and so on). Called 'query_state' in searches.
Query Status (query_status) | STRING | TRUE | The status of the query. If the query hasn't failed the status will be 'OK', otherwise it will provide more information on the cause of the failure. Called 'query_status' in searches.
Query Type (query_type) | STRING | TRUE | The type of the query's SQL statement (DML, DDL, Query). Called 'query_type' in searches.
Resource Reservation Wait Time (resources_reserved_wait_time) | MILLISECONDS | TRUE | The total amount of time the query spent waiting for pool resources to become available. Called 'resources_reserved_wait_time' in searches.
Resource Reservation Wait Time Percentage (resources_reserved_wait_time_percentage) | NUMBER | TRUE | The total amount of time the query spent waiting for pool resources to become available divided by the query duration. Called 'resources_reserved_wait_time_percentage' in searches.
Rows Inserted (rows_inserted) | NUMBER | TRUE | The number of rows inserted by the query. Called 'rows_inserted' in searches.
Rows Produced (rows_produced) | NUMBER | TRUE | The number of rows produced by the query. Called 'rows_produced' in searches.
Service Name (service_name) | STRING | FALSE | The name of the Impala service. Called 'service_name' in searches.
Session ID (session_id) | STRING | TRUE | The ID of the session that issued this query. Called 'session_id' in searches.
Session Type (session_type) | STRING | TRUE | The type of the session that issued this query. Called 'session_type' in searches.
Statistics Missing (stats_missing) | BOOLEAN | TRUE | Whether the query was flagged with missing table or column statistics warning during the planning process. Called 'stats_missing' in searches.
Threads: CPU Time (thread_cpu_time) | MILLISECONDS | TRUE | The sum of the CPU time used by all threads of the query. Called 'thread_cpu_time' in searches.
Threads: CPU Time Percentage (thread_cpu_time_percentage) | NUMBER | TRUE | The sum of the CPU time used by all threads of the query divided by the total thread time. Called 'thread_cpu_time_percentage' in searches.
Threads: Network Receive Wait Time (thread_network_receive_wait_time) | MILLISECONDS | TRUE | The sum of the time spent waiting to receive data over the network by all threads of the query. A query will almost always have some threads waiting to receive data from other nodes in the query's execution tree. Unlike other wait times, network receive wait time does not usually indicate an opportunity for improving a query's performance. Called 'thread_network_receive_wait_time' in searches.
Threads: Network Receive Wait Time Percentage (thread_network_receive_wait_time_percentage) | NUMBER | TRUE | The sum of the time spent waiting to receive data over the network by all threads of the query divided by the total thread time. A query will almost always have some threads waiting to receive data from other nodes in the query's execution tree. Unlike other wait times, network receive wait time does not usually indicate an opportunity for improving a query's performance. Called 'thread_network_receive_wait_time_percentage' in searches.
Threads: Network Send Wait Time (thread_network_send_wait_time) | MILLISECONDS | TRUE | The sum of the time spent waiting to send data over the network by all threads of the query. Called 'thread_network_send_wait_time' in searches.
Threads: Network Send Wait Time Percentage (thread_network_send_wait_time_percentage) | NUMBER | TRUE | The sum of the time spent waiting to send data over the network by all threads of the query divided by the total thread time. Called 'thread_network_send_wait_time_percentage' in searches.
Threads: Storage Wait Time (thread_storage_wait_time) | MILLISECONDS | TRUE | The sum of the time spent waiting for storage by all threads of the query. Called 'thread_storage_wait_time' in searches.
Threads: Storage Wait Time Percentage (thread_storage_wait_time_percentage) | NUMBER | TRUE | The sum of the time spent waiting for storage by all threads of the query divided by the total thread time. Called 'thread_storage_wait_time_percentage' in searches.
Threads: Total Time (thread_total_time) | MILLISECONDS | TRUE | The sum of thread CPU, storage wait and network wait times used by all threads of the query. Called 'thread_total_time' in searches.
User (user) | STRING | TRUE | The effective user for the query. This is the delegated user if delegation is in use. Otherwise, this is the connected user. Called 'user' in searches.
Work CPU Time (cm_cpu_milliseconds) | MILLISECONDS | TRUE | Attribute measuring the sum of CPU time used by all threads of the query, in milliseconds. Called 'work_cpu_time' in searches. For Impala queries, CPU time is calculated based on the 'TotalCpuTime' metric. For YARN MapReduce applications, this is calculated from the 'cpu_milliseconds' metric.
Examples
Consider the following filter expressions: user = "root", rowsProduced > 0, fileFormats RLIKE ".TEXT.*",
and executing = true. In the examples:
• The filter attributes are user, rowsProduced, fileFormats, and executing.
• The operators are =, >, and RLIKE.
• The filter values are root, 0, .TEXT.*, and true.
Query Details
The Query Details page contains the low-level details of how a SQL query is processed through Impala. The initial
information on the page can help you tune the performance of some kinds of queries, primarily those involving joins.
The more detailed information on the page is primarily for troubleshooting with the assistance of Cloudera Support;
you might be asked to attach the contents of the page to a trouble ticket. The Query Details page displays the following
information that is also available in Query Profile.
• Query Plan
• Query Info
• Query Timeline on page 326
• Planner Timeline on page 327
• Query Fragments
To download the contents of the query profile details, select one of the following:
• Download Profile... or Download Profile... > Download Text Profile... - to download a text version of the query
detail.
• Download Profile... > Download Thrift Encoded Profile... - to download a binary version of the query detail.
Query Plan
The Query Plan section can help you diagnose and tune performance issues with queries. This information is especially
useful to understand performance issues with join queries, such as inefficient order of tables in the SQL statement,
lack of table and column statistics, and the need for query hints to specify a more efficient join mechanism. You can
also learn valuable information about how queries are processed for partitioned tables.
The information in this section corresponds to the output of the EXPLAIN statement for the Impala query. Each
fragment shown in the query plan corresponds to a processing step that is performed by the central coordinator host
or distributed across the hosts in the cluster.
Query Timeline
The Query Timeline section reports statistics about the execution time for phases of the query.
Planner Timeline
The Planner Timeline reports statistics about the execution time for phases of the query planner.
Query Info
The Query Info section reports the attributes of the query, start and end time, duration, and statistics about HDFS
access. You can hover over an attribute for information about the attribute name and supported values (for enumerated
values). For example:
Query Fragments
The Query Fragments section reports detailed low-level statistics for each query plan fragment, involving physical
aspects such as CPU utilization, disk I/O, and network traffic. This is the primary information that Cloudera Support
might use to help troubleshoot performance issues and diagnose bugs. The details for each fragment display on separate
tabs.
Viewing Jobs
1. Do one of the following:
• Select Clusters > Cluster name > YARN service name > Applications.
• On the Home > Status tab, select YARN service name and click the Applications tab.
The YARN jobs run during the selected time range display in the Results Tab on page 328. The results displayed can be
filtered by creating filter expressions.
You can also perform the following actions on this page:
Action Description
Filter jobs that display. Create filter expressions manually, select preconfigured
filters, or use the Workload Summary section to build a
query interactively. See Filtering Jobs on page 328.
Select additional attributes for display. Click Select Attributes. Selected attributes also display as
available filters in the Workload Summary section. To
display information about attributes, hover over a field
label. See Filter Attributes on page 330.
Only attributes that support filtering appear in the
Workload Summary section. See the Table 16: Attributes
on page 331 table.
View a histogram of the attribute values. Click the icon to the right of each attribute displayed in
the Workload Summary section.
Display charts based on the filter expression and selected attributes. Click the Charts tab.
Send a YARN application diagnostic bundle to Cloudera support. Click Collect Diagnostics Data. See Sending
Diagnostic Data to Cloudera for YARN Applications on page 340.
Export a JSON file with the query results that you can use for further analysis. Click Export.
Results Tab
Jobs are ordered with the most recent at the top. Each job has summary and detail information. A job summary includes
start and end timestamps, query (if the job is part of a Hive query), name, pool, job type, job ID, and user. For example:
Use the Actions drop-down menu to the right of each job listing to do the following. (Not all options display,
depending on the type of job.)
• Application Details – Open a details page for the job.
• Collect Diagnostic Data – Send a YARN application diagnostic bundle to Cloudera support.
• Similar MR2 Jobs – Display a list of similar MapReduce 2 jobs.
• User's YARN Applications – Display a list of all jobs run by the user of the current job.
• View on JobHistory Server – View the application in the YARN JobHistory Server.
• Kill (running jobs only) – Kill a job (administrators only). Killing a job creates an audit event. When you kill a job,
an indicator that the job was killed replaces the progress bar.
• Applications in Hive Query (Hive jobs only)
• Applications in Oozie Workflow (Oozie jobs only)
• Applications in Pig Script (Pig jobs only)
Filtering Jobs
You filter jobs by selecting a time range and specifying a filter expression in the search box.
You can use the Time Range Selector or a duration link to set the time range (see Time Line on page 271 for details).
Filter Expressions
Filter expressions specify which entries should display when you run the filter. The simplest expression consists of
three components:
• Attribute - Query language name of the attribute.
• Operator - Type of comparison between the attribute and the attribute value. Cloudera Manager supports the
standard comparator operators =, !=, >, <, >=, <=, and RLIKE. (RLIKE performs regular expression matching as
specified in the Java Pattern class documentation.) Numeric values can be compared with all operators. String
values can be compared with =, !=, and RLIKE. Boolean values can be compared with = and !=.
• Value - The value of the attribute. The value depends on the type of the attribute. For a Boolean value, specify
either true or false. When specifying a string value, enclose the value in double quotes.
You create compound filter expressions using the AND and OR operators. When more than one operator is used in an
expression, AND is evaluated first, then OR. To change the order of evaluation, enclose subexpressions in parentheses.
Compound Expressions
To find all the jobs issued by the root user that ran for longer than ten seconds, use the expression:
To find all the jobs that had more than 200 maps issued by users Jack or Jill, use the expression:
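For illustration only (the attribute names user, application_duration, and maps_completed come from the attribute
table later in this section, durations are in milliseconds, and the exact thresholds are assumptions), such expressions
might be written as:
user = "root" AND application_duration > 10000
maps_completed > 200 AND (user = "Jack" OR user = "Jill")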
1. Do one of the following:
• Select a Suggested or Recently Run Filter
Click the arrow to the right of the Search button to display a list of sample and recently run filters, and
select a filter. The filter text displays in the text box.
• Construct a Filter from the Workload Summary Attributes
Optionally, click Select Attributes to display a dialog box where you can choose attributes to display in the
Workload Summary section. Select the checkbox next to one or more attributes and click Close. Only attributes
that support filtering appear in the Workload Summary section. See the Table 16: Attributes on page 331
table.
The attributes display in the Workload Summary section along with values or ranges of values that you can
filter on. The values and ranges display as links with checkboxes. Select one or more checkboxes to add the
range or value to the query. Click a link to run a query on that value or range. For example:
• Type a Filter
1. Start typing or press Spacebar in the text box. As you type, filter attributes matching the typed letter
display. If you press Spacebar, standard filter attributes display. These suggestions are part of typeahead,
which helps build valid queries. For information about the attribute name and supported values for each
field, hover over the field in an existing query.
2. Select an attribute and press Enter.
3. Press Spacebar to display a drop-down list of operators.
4. Select an operator and press Enter.
5. Specify an attribute value in one of the following ways:
• For attribute values that support typeahead, press Spacebar to display a drop-down list of values
and press Enter.
• Type a value.
2. Click in the text box and press Enter or click Search. The list displays the results that match the specified filter. If
the histograms are showing, they are redrawn to show only the values for the selected filter. The filter is added
to the Recently Run list.
Filter Attributes
Filter attributes are enumerated below, each with the display name used in Cloudera Manager, the attribute name
used in searches (shown in parentheses), the attribute type, whether it supports filtering, and a description.
Note: Only attributes where the Supports Filtering? column value is TRUE appear in the Workload
Summary section.
Application State STRING TRUE The state of this YARN application. This reflects the
ResourceManager state while the application is
(state)
running and the JobHistory Server state after the
application has completed. Called 'state' in searches.
Application Tags STRING FALSE A list of tags for the application. Called
'application_tags' in searches.
(application_tags)
Application Type STRING TRUE The type of the YARN application. Called
'application_type' in searches.
(application_type)
Completed Maps and Reduces NUMBER TRUE The number of completed map and reduce tasks in
this MapReduce job. Called 'tasks_completed' in
(tasks_completed)
searches. Available only for running jobs.
Data Local Maps NUMBER TRUE Data local maps. Called 'data_local_maps' in
searches.
(data_local_maps)
Data Local Maps Percentage NUMBER TRUE The number of data local maps as a percentage of
the total number of maps. Called
(data_local_maps_percentage)
'data_local_maps_percentage' in searches.
Diagnostics STRING FALSE Diagnostic information on the YARN application. If
the diagnostic information is long, this may only
(diagnostics)
contain the beginning of the information. Called
'diagnostics' in searches.
Duration MILLISECONDS TRUE How long YARN took to run this application. Called
'application_duration' in searches.
(application_duration)
Failed Map and Reduce Attempts NUMBER TRUE The number of failed map and reduce attempts for
this MapReduce job. Called 'failed_tasks_attempts'
(failed_tasks_attempts)
in searches. Available only for failed jobs.
Failed Map Attempts NUMBER TRUE The number of failed map attempts for this
MapReduce job. Called 'failed_map_attempts' in
(failed_map_attempts)
searches. Available only for running jobs.
Failed Maps NUMBER TRUE Failed maps. Called 'num_failed_maps' in searches.
(num_failed_maps)
Failed Reduce Attempts NUMBER TRUE The number of failed reduce attempts for this
MapReduce job. Called 'failed_reduce_attempts' in
(failed_reduce_attempts)
searches. Available only for running jobs.
Failed Reduces NUMBER TRUE Failed reduces. Called 'num_failed_reduces' in
searches.
(num_failed_reduces)
Failed Tasks NUMBER TRUE The total number of failed tasks. This is the sum of
'num_failed_maps' and 'num_failed_reduces'. Called
(num_failed_tasks)
'num_failed_tasks' in searches.
Fallow Map Slots Time MILLISECONDS TRUE Fallow map slots time. Called
'fallow_slots_millis_maps' in searches.
(fallow_slots_millis_maps)
Fallow Reduce Slots Time MILLISECONDS TRUE Fallow reduce slots time. Called
'fallow_slots_millis_reduces' in searches.
(fallow_slots_millis_reduces)
Fallow Slots Time MILLISECONDS TRUE Total fallow slots time. This is the sum of
'fallow_slots_millis_maps' and
(fallow_slots_millis)
'fallow_slots_millis_reduces'. Called
'fallow_slots_millis' in searches.
File Bytes Read BYTES TRUE File bytes read. Called 'file_bytes_read' in searches.
(file_bytes_read)
File Bytes Written BYTES TRUE File bytes written. Called 'file_bytes_written' in
searches.
(file_bytes_written)
File Large Read Operations NUMBER TRUE File large read operations. Called
'file_large_read_ops' in searches.
(file_large_read_ops)
File Read Operations NUMBER TRUE File read operations. Called 'file_read_ops' in
searches.
(file_read_ops)
File Write Operations NUMBER TRUE File write operations. Called 'file_write_ops' in
searches.
(file_write_ops)
Garbage Collection Time MILLISECONDS TRUE Garbage collection time. Called 'gc_time_millis' in
searches.
(gc_time_millis)
HDFS Bytes Read BYTES TRUE HDFS bytes read. Called 'hdfs_bytes_read' in
searches.
(hdfs_bytes_read)
HDFS Bytes Written BYTES TRUE HDFS bytes written. Called 'hdfs_bytes_written' in
searches.
(hdfs_bytes_written)
HDFS Large Read Operations NUMBER TRUE HDFS large read operations. Called
'hdfs_large_read_ops' in searches.
(hdfs_large_read_ops)
HDFS Read Operations NUMBER TRUE HDFS read operations. Called 'hdfs_read_ops' in
searches.
(hdfs_read_ops)
HDFS Write Operations NUMBER TRUE HDFS write operations. Called 'hdfs_write_ops' in
searches.
(hdfs_write_ops)
Hive Query ID STRING FALSE If this MapReduce job ran as a part of a Hive query,
this field contains the ID of the Hive query. Called
(hive_query_id)
'hive_query_id' in searches.
Hive Query String STRING TRUE If this MapReduce job ran as a part of a Hive query,
this field contains the string of the query. Called
(hive_query_string)
'hive_query_string' in searches.
Hive Sentry Subject Name STRING TRUE If this MapReduce job ran as a part of a Hive query
on a secured cluster using impersonation, this field
(hive_sentry_subject_name)
contains the name of the user that initiated the
query. Called 'hive_sentry_subject_name' in
searches.
Input Directory STRING TRUE The input directory for this MapReduce job. Called
'input_dir' in searches.
(input_dir)
Input Split Bytes BYTES TRUE Input split bytes. Called 'split_raw_bytes' in searches.
(split_raw_bytes)
Killed Map and Reduce Attempts NUMBER TRUE The number of map and reduce attempts that were
killed by user(s) for this MapReduce job. Called
(killed_tasks_attempts)
'killed_tasks_attempts' in searches. Available only
for killed jobs.
Killed Map Attempts NUMBER TRUE The number of map attempts killed by user(s) for
this MapReduce job. Called 'killed_map_attempts'
(killed_map_attempts)
in searches. Available only for running jobs.
Killed Reduce Attempts NUMBER TRUE The number of reduce attempts killed by user(s) for
this MapReduce job. Called 'killed_reduce_attempts'
(killed_reduce_attempts)
in searches. Available only for running jobs.
Launched Map Tasks NUMBER TRUE Launched map tasks. Called 'total_launched_maps'
in searches.
(total_launched_maps)
Map and Reduce Attempts in NEW NUMBER TRUE The number of map and reduce attempts in NEW
State state for this MapReduce job. Called
'new_tasks_attempts' in searches. Available only for
(new_tasks_attempts)
running jobs.
Map Attempts in NEW State NUMBER TRUE The number of map attempts in NEW state for this
MapReduce job. Called 'new_map_attempts' in
(new_map_attempts)
searches. Available only for running jobs.
Map Class STRING TRUE The class used by the map tasks in this MapReduce
job. Called 'mapper_class' in searches. You can search
(mapper_class)
for the mapper class using the class name alone, for
example 'QuasiMonteCarlo$QmcMapper', or the
fully qualified classname, for example, 'org.apache.
hadoop.examples.QuasiMonteCarlo$QmcMapper'.
Map CPU Allocation NUMBER TRUE Map CPU allocation. Called 'vcores_millis_maps' in
searches.
(vcores_millis_maps)
Map Input Records NUMBER TRUE Map input records. Called 'map_input_records' in
searches.
(map_input_records)
Map Memory Allocation NUMBER TRUE Map memory allocation. Called 'mb_millis_maps' in
searches.
(mb_millis_maps)
Map Output Bytes BYTES TRUE Map output bytes. Called 'map_output_bytes' in
searches.
(map_output_bytes)
Map Output Materialized Bytes BYTES TRUE Map output materialized bytes. Called
'map_output_materialized_bytes' in searches.
(map_output_materialized_bytes)
Map Output Records NUMBER TRUE Map output records. Called 'map_output_records'
in searches.
(map_output_records)
Map Progress NUMBER TRUE The percentage of maps completed for this
MapReduce job. Called 'map_progress' in searches.
(map_progress)
Available only for running jobs.
Maps Completed NUMBER TRUE The number of map tasks completed as a part of this
MapReduce job. Called 'maps_completed' in
(maps_completed)
searches.
Map Slots Time MILLISECONDS TRUE Total time spent by all maps in occupied slots. Called
'slots_millis_maps' in searches.
(slots_millis_maps)
Maps Pending NUMBER TRUE The number of maps waiting to be run for this
MapReduce job. Called 'maps_pending' in searches.
(maps_pending)
Available only for running jobs.
Maps Running NUMBER TRUE The number of maps currently running for this
MapReduce job. Called 'maps_running' in searches.
(maps_running)
Available only for running jobs.
Maps Total NUMBER TRUE The number of Map tasks in this MapReduce job.
Called 'maps_total' in searches.
(maps_total)
Memory Allocation NUMBER TRUE Total memory allocation. This is the sum of
'mb_millis_maps' and 'mb_millis_reduces'. Called
(mb_millis)
'mb_millis' in searches.
Merged Map Outputs NUMBER TRUE Merged map outputs. Called 'merged_map_outputs'
in searches.
(merged_map_outputs)
Oozie Workflow ID STRING FALSE If this MapReduce job ran as a part of an Oozie
workflow, this field contains the ID of the Oozie
(oozie_id)
workflow. Called 'oozie_id' in searches.
Other Local Maps NUMBER TRUE Other local maps. Called 'other_local_maps' in
searches.
(other_local_maps)
Other Local Maps Percentage NUMBER TRUE The number of other local maps as a percentage of
the total number of maps. Called
(other_local_maps_percentage)
'other_local_maps_percentage' in searches.
Output Directory STRING TRUE The output directory for this MapReduce job. Called
'output_dir' in searches.
(output_dir)
Pending Maps and Reduces NUMBER TRUE The number of maps and reduces waiting to be run
for this MapReduce job. Called 'tasks_pending' in
(tasks_pending)
searches. Available only for running jobs.
Physical Memory BYTES TRUE Physical memory. Called 'physical_memory_bytes'
in searches.
(physical_memory_bytes)
Pig Script ID STRING FALSE If this MapReduce job ran as a part of a Pig script,
this field contains the ID of the Pig script. Called
(pig_id)
'pig_id' in searches.
Pool STRING TRUE The name of the resource pool in which this
application ran. Called 'pool' in searches. Within
(pool)
YARN, a pool is referred to as a queue.
Progress NUMBER TRUE The progress reported by the application. Called
'progress' in searches.
(progress)
Rack Local Maps NUMBER TRUE Rack local maps. Called 'rack_local_maps' in searches.
(rack_local_maps)
Rack Local Maps Percentage NUMBER TRUE The number of rack local maps as a percentage of
the total number of maps. Called
(rack_local_maps_percentage)
'rack_local_maps_percentage' in searches.
Reduce Attempts in NEW State NUMBER TRUE The number of reduce attempts in NEW state for this
MapReduce job. Called 'new_reduce_attempts' in
(new_reduce_attempts)
searches. Available only for running jobs.
Reduce Class STRING TRUE The class used by the reduce tasks in this MapReduce
job. Called 'reducer_class' in searches. You can search
(reducer_class)
for the reducer class using the class name alone, for
example 'QuasiMonteCarlo$QmcReducer', or fully
qualified classname, for example, 'org.apache.
hadoop.examples.QuasiMonteCarlo$QmcReducer'.
Reduce CPU Allocation NUMBER TRUE Reduce CPU allocation. Called 'vcores_millis_reduces'
in searches.
(vcores_millis_reduces)
Reduce Input Groups NUMBER TRUE Reduce input groups. Called 'reduce_input_groups'
in searches.
(reduce_input_groups)
Reduce Input Records NUMBER TRUE Reduce input records. Called 'reduce_input_records'
in searches.
(reduce_input_records)
Reduce Progress NUMBER TRUE The percentage of reduces completed for this
MapReduce job. Called 'reduce_progress' in searches.
(reduce_progress)
Available only for running jobs.
Reduces Completed NUMBER TRUE The number of reduce tasks completed as a part of
this MapReduce job. Called 'reduces_completed' in
(reduces_completed)
searches.
Reduce Shuffle Bytes BYTES TRUE Reduce shuffle bytes. Called 'reduce_shuffle_bytes'
in searches.
(reduce_shuffle_bytes)
Reduce Slots Time MILLISECONDS TRUE Total time spent by all reduces in occupied slots.
Called 'slots_millis_reduces' in searches.
(slots_millis_reduces)
Reduces Pending NUMBER TRUE The number of reduces waiting to be run for this
MapReduce job. Called 'reduces_pending' in
(reduces_pending)
searches. Available only for running jobs.
Reduces Running NUMBER TRUE The number of reduces currently running for this
MapReduce job. Called 'reduces_running' in searches.
(reduces_running)
Available only for running jobs.
Reduces Total NUMBER TRUE The number of reduce tasks in this MapReduce job.
Called 'reduces_total' in searches.
(reduces_total)
Running Containers NUMBER FALSE The number of containers currently running for the
application. Called 'running_containers' in searches.
(running_containers)
Running Map and Reduce Attempts NUMBER TRUE The number of map and reduce attempts currently
running for this MapReduce job. Called
(running_tasks_attempts)
'running_tasks_attempts' in searches. Available only
for running jobs.
Running Map Attempts NUMBER TRUE The number of running map attempts for this
MapReduce job. Called 'running_map_attempts' in
(running_map_attempts)
searches. Available only for running jobs.
Running MapReduce Application NUMBER TRUE How long it took, in seconds, to retrieve information
Information Retrieval Duration. about the MapReduce application.
(running_application_info_retrieval_time)
Running Maps and Reduces NUMBER TRUE The number of maps and reduces currently running
for this MapReduce job. Called 'tasks_running' in
(tasks_running)
searches. Available only for running jobs.
Running Reduce Attempts NUMBER TRUE The number of running reduce attempts for this
MapReduce job. Called 'running_reduce_attempts'
(running_reduce_attempts)
in searches. Available only for running jobs.
Service Name STRING FALSE The name of the YARN service. Called 'service_name'
in searches.
(service_name)
Shuffle Bad ID Errors NUMBER TRUE Shuffle bad ID errors. Called 'shuffle_errors_bad_id'
in searches.
(shuffle_errors_bad_id)
Shuffle Wrong Length Errors NUMBER TRUE Shuffle wrong length errors. Called
'shuffle_errors_wrong_length' in searches.
(shuffle_errors_wrong_length)
Shuffle Wrong Map Errors NUMBER TRUE Shuffle wrong map errors. Called
'shuffle_errors_wrong_map' in searches.
(shuffle_errors_wrong_map)
Shuffle Wrong Reduce Errors NUMBER TRUE Shuffle wrong reduce errors. Called
'shuffle_errors_wrong_reduce' in searches.
(shuffle_errors_wrong_reduce)
Slots Time MILLISECONDS TRUE Total slots time. This is the sum of 'slots_millis_maps'
and 'slots_millis_reduces'. Called 'slots_millis' in
(slots_millis)
searches.
Spilled Records NUMBER TRUE Spilled Records. Called 'spilled_records' in searches.
(spilled_records)
Successful Map and Reduce Attempts NUMBER TRUE The number of successful map and reduce attempts
for this MapReduce job. Called
(successful_tasks_attempts)
'successful_tasks_attempts' in searches. Available
only for successful jobs.
Successful Map Attempts NUMBER TRUE The number of successful map attempts for this
MapReduce job. Called 'successful_map_attempts'
(successful_map_attempts)
in searches. Available only for running jobs.
Successful Reduce Attempts NUMBER TRUE The number of successful reduce attempts for this
MapReduce job. Called 'successful_reduce_attempts'
(successful_reduce_attempts)
in searches. Available only for running jobs.
Total Maps and Reduces Number NUMBER TRUE The number of map and reduce tasks in this
MapReduce job. Called 'tasks_total' in searches.
(total_task_num)
Available only for running jobs.
Total Tasks NUMBER TRUE The total number of tasks. This is the sum of
'total_launched_maps' and 'total_launched_reduces'.
(total_launched_tasks)
Called 'total_launched_tasks' in searches.
Tracking Url STRING FALSE The MapReduce application tracking URL.
(tracking_url)
Uberized Job BOOLEAN FALSE Whether this MapReduce job is uberized - running
completely in the ApplicationMaster. Called
(uberized)
'uberized' in searches. Available only for running jobs.
Unused Memory Seconds NUMBER TRUE The amount of memory the application has allocated
but not used (megabyte-seconds). This metric is
(unused_memory_seconds)
calculated hourly if container usage metric
aggregation is enabled. Called
'unused_memory_seconds' in searches.
Unused VCore Seconds NUMBER TRUE The amount of CPU resources the application has
allocated but not used (virtual core-seconds). This
(unused_vcore_seconds)
metric is calculated hourly if container usage metric
aggregation is enabled. Called
'unused_vcore_seconds' in searches.
Used Memory Max NUMBER TRUE The maximum container memory usage for a YARN
application. This metric is calculated hourly if
(used_memory_max)
container usage metric aggregation is enabled and
a Cloudera Manager Container Usage Metrics
Directory is specified.
For information about how to enable metric
aggregation and the Container Usage Metrics
Directory, see Enabling the Cluster Utilization Report
on page 458.
User STRING TRUE The user who ran the YARN application. Called 'user'
in searches.
(user)
Work CPU Time MILLISECONDS TRUE Attribute measuring the sum of CPU time used by all
threads of the query, in milliseconds. Called
(cm_cpu_milliseconds)
'work_cpu_time' in searches. For Impala queries,
CPU time is calculated based on the 'TotalCpuTime'
metric. For YARN MapReduce applications, this is
calculated from the 'cpu_milliseconds' metric.
Examples
Consider the following filter expressions: user = "root", rowsProduced > 0, fileFormats RLIKE ".TEXT.*",
and executing = true. In the examples:
• The filter attributes are user, rowsProduced, fileFormats, and executing.
• The operators are =, >, and RLIKE.
3. In the Send YARN Applications Diagnostic Data dialog box, provide the following information:
• If applicable, the Cloudera Support ticket number of the issue being experienced on the cluster.
• Optionally, add a comment to help the support team understand the issue.
4. Click the checkbox Send Diagnostic Data to Cloudera.
5. Click the button Collect and Send Diagnostic Data.
2. Filter the event stream to choose a time window, log level, and display the NodeManager source.
3. For any event, click View Log File to view the entire log file.
Note: In CDH 5.10 and higher, and in CDK 2.x Powered By Apache Spark, the Storage tab of the Spark
History Server is always blank. To see storage information while an application is running, use the web
UI of the application as described in the previous section. After the application is finished, storage
information is not available.
The following screenshot shows the timeline of the events in the application including the jobs that were run and the
allocation and deallocation of executors. Each job shows the last action, saveAsTextFile, run for the job. The timeline
shows that the application acquires executors over the course of running the first job. After the second job finishes,
the executors become idle and are returned to the cluster.
The web page for Job 1 shows how preceding stages are skipped because Spark retains the results from those stages:
If you click the show link you see the DAG of the job. Clicking the Details link on this page displays the logical query
plan:
Note: The following example demonstrates the Spark driver web UI. Streaming information is not
captured in the Spark History Server.
The Spark driver web application UI also supports displaying the behavior of streaming applications in the Streaming
tab. If you run the example described in Spark Streaming Example, and provide three bursts of data, the top of the tab
displays a series of visualizations of the statistics summarizing the overall behavior of the streaming application:
The application has one receiver that processed 3 bursts of event batches, which can be observed in the events,
processing time, and delay graphs. Further down the page you can view details of individual batches:
To view the details of a specific batch, click a link in the Batch Time column. Clicking the 2016/06/16 14:23:20 link, with
8 events in the batch, provides the following details:
Events
An event is a record that something of interest has occurred – a service's health has changed state, a log message (of
the appropriate severity) has been logged, and so on. Many events are enabled and configured by default.
From the Events page you can filter for events for services or role instances, hosts, users, commands, and much more.
You can also search against the content information returned by the event.
The Event Server aggregates relevant events and makes them available for alerting and for searching. This way, you
have a view into the history of all relevant events that occur cluster-wide.
Cloudera Manager supports the following categories of events:
Category Description
ACTIVITY_EVENT Generated by the Activity Monitor; specifically, for jobs that fail, or that run slowly (as
determined by comparison with duration limits). In order to monitor your workload for
slow-running jobs, you must specify Activity Duration Rules on page 282.
AUDIT_EVENT Generated by actions performed
• In Cloudera Manager, such as creating, configuring, starting, stopping, and deleting
services or roles
• By services that are being audited by Cloudera Navigator.
HBASE Generated by HBase with the exception of log messages, which have the LOG_MESSAGE
category.
HEALTH_CHECK Indicate that certain health test activities have occurred, or that health test results have
met specific conditions (thresholds).
Thresholds for various health tests can be set under the Configuration tabs for HBase,
HDFS, Impala, and MapReduce service instances, at both the service and role level. See
Configuring Health Monitoring on page 280 for more information.
LOG_MESSAGE Generated for certain types of log messages from HDFS, MapReduce, and HBase services
and roles. Log events are created when a log entry matches a set of rules for identifying
messages of interest. The default set of rules is based on Cloudera experience supporting
Hadoop clusters. You can configure additional log event rules if necessary.
SYSTEM Generated by system events such as parcel availability.
Viewing Events
The Events page lets you display events and alerts that have occurred within a time range you select anywhere in your
clusters. From the Events page you can filter for events for services or role instances, hosts, users, commands, and
much more. You can also search against the content information returned by the event.
To view events, click the Diagnostics tab on the top navigation bar, then select Events.
Events List
Event entries are ordered (within the time range you've selected) with the most recent at the top. If the event generated
an alert, that is indicated by a red alert icon in the entry.
This page supports infinite scrolling: you can scroll to the end of the displayed results and the page will fetch more
results and add them to the end of the list automatically.
To display event details, click Expand at the right side of the event entry.
Clicking the View link at the far right of the entry has different results depending on the category of the entry:
• ACTIVITY_EVENT - Displays the activity Details page.
• AUDIT_EVENT - If the event was a restart, displays the service's Commands page. If the event was a configuration
change, the Revision Details dialog box displays.
• HBASE - Displays a health report or log details.
• HEALTH_CHECK - Displays the status page of the role instance.
• LOG_MESSAGE - Displays the event's log entry. You can also click Expand to display details of the entry, then
click the URL link. When you perform one of these actions the time range in the Time Line is shifted to the time
the event occurred.
• SYSTEM - Displays the Parcels page.
Filtering Events
You filter events by selecting a time range and adding filters.
You can use the Time Range Selector or a duration link to set the time range (see Time Line on page 271 for details).
The time it takes to perform a search will typically increase for a longer time range, as the number of events to be searched
range, as the number of events to be searched will be larger.
Adding a Filter
To add a filter, do one of the following:
• Click the icon that displays next to a property when you hover in one of the event entries. A filter containing
the property, operator, and its value is added to the list of filters at the left and Cloudera Manager redisplays all
events that match the filter.
• Click the Add a filter link. A filter control is added to the list of filters.
1. Choose a property in the drop-down list. You can search by properties such as Username, Service, Command,
or Role. The properties vary depending on the service or role.
2. If the property allows it, choose an operator in the operator drop-down list.
3. Type a property value in the value text field. For some properties you can include multiple values in the value
field. For example, you can create a filter like Category = HEALTH_CHECK LOG_MESSAGE. To drop individual
values, click the icon to the right of the value. For properties where the list of values is finite and known, you
can start typing and then select from a drop-down list of potential matches.
4. Click Search. The log displays all events that match the filter criteria.
Note: You can filter on a string by adding a filter, selecting the property CONTENT, operator =, and
typing the string to search for in the value field.
Removing a Filter
1. Click the icon at the right of the filter. The filter is removed.
2. Click Search. The log displays all events that match the filter criteria.
Re-running a Search
To re-run a recently performed search, click the icon to the right of the Search button and select a search.
Alerts
An alert is an event that is considered especially noteworthy and is triggered by a selected event. Alerts are shown
with an alert badge when they appear in a list of events. You can configure the Alert Publisher to send alert
notifications by email or by SNMP trap to a trap receiver.
Service instances of type HDFS, MapReduce, and HBase (and their associated roles) can generate alerts if so configured.
Alerts can also be configured for the monitoring roles that are a part of the Cloudera Management Service.
The settings to enable or disable specific alerts are found under the Configuration tab for the services to which they
pertain. See Configuring Alerts on page 283 for more information on setting up alerting.
For information about configuring the Alert Publisher to send email or SNMP notifications for alerts, see Configuring
Alert Delivery on page 285.
Managing Alerts
Minimum Required Role: Full Administrator
The Administration > Alerts page provides a summary of the settings for alerts in your clusters.
Alert Type The left column lets you select by alert type (Health, Log, or Activity) and within that by service instance.
In the case of Health alerts, you can look at alerts for Hosts as well. You can select an individual service to see just the
alert settings for that service.
Health/Log/Activity Alert Settings Depending on your selection in the left column, the right-hand column shows you
the list of alerts that are enabled or disabled for the selected service type.
To change the alert settings for a service, click Edit next to the service name. This will take you to the Monitoring
section of the Configuration tab for the service. From here you can enable or disable alerts and configure thresholds
as needed.
Recipients You can also view the list of recipients configured for the enabled alerts.
To apply this configuration property to other role groups as needed, edit the value for the appropriate role group.
See Modifying Configuration Properties Using Cloudera Manager on page 74.
5. Click the Save Changes button at the top of the page to save your settings.
6. Restart the Alert Publisher role.
Important: This feature requires a Cloudera Enterprise license. It is not available in Cloudera Express.
See Managing Licenses on page 50 for more information.
5. Locate the SNMP NMS Hostname property and click the ? icon to display the property description.
6. Click the SNMP MIB link.
Important: This feature requires a Cloudera Enterprise license. It is not available in Cloudera Express.
See Managing Licenses on page 50 for more information.
You can configure the Alert Publisher to run a user-written script in response to an alert. The Alert Publisher passes a
single argument to the script that is a UTF-8 JSON file containing a list of alerts. The script runs on the host where the
Alert Publisher service is running and must have read and execute permissions for the cloudera-scm user. Only one
instance of a script runs at a time. The standard out and standard error messages from the script are logged to the
Alert Publisher log file.
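For example, a minimal custom script might simply append each alert batch it receives to a local log file; the script
content, path, and log location below are illustrative assumptions, not part of the product:
#!/bin/bash
# Hypothetical custom alert script.
# The Alert Publisher passes one argument: the path to a UTF-8 JSON file
# containing a list of alerts.
ALERT_FILE="$1"
LOG_FILE="/var/log/custom-alerts/alerts.log"
# Record when the batch arrived and append the raw alert JSON.
echo "$(date) received alert batch: ${ALERT_FILE}" >> "${LOG_FILE}"
cat "${ALERT_FILE}" >> "${LOG_FILE}"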
You use the Alert Publisher: Maximum Batch Size and Alert Publisher: Maximum Batch Interval properties to configure when
the Alert Publisher delivers alerts. See Configuring Alerts on page 283.
To configure the Alert Publisher to deliver alerts using a script:
1. Save the script on the host where the Alert Publisher role is running.
2. Change the owner of the file to cloudera-scm and set its permissions to read and execute (example commands
are shown after this procedure).
3. Open the Cloudera Manager Admin console and select Clusters > Cloudera Management Service.
4. Click the Configuration tab.
5. Select Scope > Alert Publisher .
6. Enter the path to the script in the Custom Alert Script property.
7. Enter a Reason for change, and then click Save Changes to commit the changes.
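For step 2, commands along the following lines can be used; the script path is hypothetical:
sudo chown cloudera-scm:cloudera-scm /usr/local/bin/custom_alert.sh
sudo chmod u+rx /usr/local/bin/custom_alert.sh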
Sample JSON Alert File
When a custom script runs, it passes a JSON file that contains the alerts. For example:
[ {
"body" : {
"alert" : {
"content" : "The health test result for MAPREDUCE_HA_JOB_TRACKER_HEALTH has become
bad: JobTracker summary: myCluster.com (Availability: Active, Health: Bad). This health
test reflects the health of the active JobTracker.",
"timestamp" : {
"iso8601" : "2015-06-11T03:52:56Z",
"epochMs" : 1433994776083
},
"source" :
"http://myCluster.com:7180/cmf/eventRedirect/89521139-0859-4bef-bf65-eb141e63dbba",
"attributes" : {
"__persist_timestamp" : [ "1433994776172" ],
"ALERT_SUPPRESSED" : [ "false" ],
"HEALTH_TEST_NAME" : [ "MAPREDUCE_HA_JOB_TRACKER_HEALTH" ],
"SEVERITY" : [ "CRITICAL" ],
"HEALTH_TEST_RESULTS" : [ {
"content" : "The health test result for MAPREDUCE_HA_JOB_TRACKER_HEALTH has
become bad: JobTracker summary: myCluster.com (Availability: Active, Health: Bad). This
health test reflects the health of the active JobTracker.",
"testName" : "MAPREDUCE_HA_JOB_TRACKER_HEALTH",
"eventCode" : "EV_SERVICE_HEALTH_CHECK_BAD",
"severity" : "CRITICAL"
} ],
"CLUSTER_DISPLAY_NAME" : [ "Cluster 1" ],
"ALERT" : [ "true" ],
"CATEGORY" : [ "HEALTH_CHECK" ],
"BAD_TEST_RESULTS" : [ "1" ],
"SERVICE_TYPE" : [ "MAPREDUCE" ],
"EVENTCODE" : [ "EV_SERVICE_HEALTH_CHECK_BAD", "EV_SERVICE_HEALTH_CHECK_GOOD"
],
"ALERT_SUMMARY" : [ "The health of service MAPREDUCE-1 has become bad." ],
"CLUSTER_ID" : [ "1" ],
"SERVICE" : [ "MAPREDUCE-1" ],
"__uuid" : [ "89521139-0859-4bef-bf65-eb141e63dbba" ],
"CLUSTER" : [ "Cluster 1" ],
"CURRENT_COMPLETE_HEALTH_TEST_RESULTS" : [ "{\"content\":\"The health test result
for MAPREDUCE_HA_JOB_TRACKER_HEALTH has become bad: JobTracker summary: myCluster.com
(Availability: Active, Health: Bad). This health test reflects the health of the active
JobTracker.\",\"testName\":\"MAPREDUCE_HA_JOB_TRACKER_HEALTH\",\"eventCode\":\"EV_SERVICE_HEALTH_CHECK_BAD\",\"severity\":\"CRITICAL\"}",
"{\"content\":\"The health test result for MAPREDUCE_TASK_TRACKERS_HEALTHY has become
good: Healthy TaskTracker: 3. Concerning TaskTracker: 0. Total TaskTracker: 3. Percent
healthy: 100.00%. Percent healthy or concerning:
100.00%.\",\"testName\":\"MAPREDUCE_TASK_TRACKERS_HEALTHY\",\"eventCode\":\"EV_SERVICE_HEALTH_CHECK_GOOD\",\"severity\":\"INFORMATIONAL\"}"
],
"PREVIOUS_HEALTH_SUMMARY" : [ "GREEN" ],
"CURRENT_HEALTH_SUMMARY" : [ "RED" ],
"MONITOR_STARTUP" : [ "false" ],
"PREVIOUS_COMPLETE_HEALTH_TEST_RESULTS" : [ "{\"content\":\"The health test
result for MAPREDUCE_HA_JOB_TRACKER_HEALTH has become good: JobTracker summary:
myCluster.com (Availability: Active, Health:
Good)\",\"testName\":\"MAPREDUCE_HA_JOB_TRACKER_HEALTH\",\"eventCode\":\"EV_SERVICE_HEALTH_CHECK_GOOD\",\"severity\":\"INFORMATIONAL\"}",
"{\"content\":\"The health test result for MAPREDUCE_TASK_TRACKERS_HEALTHY has become
good: Healthy TaskTracker: 3. Concerning TaskTracker: 0. Total TaskTracker: 3. Percent
healthy: 100.00%. Percent healthy or concerning:
100.00%.\",\"testName\":\"MAPREDUCE_TASK_TRACKERS_HEALTHY\",\"eventCode\":\"EV_SERVICE_HEALTH_CHECK_GOOD\",\"severity\":\"INFORMATIONAL\"}"
],
"SERVICE_DISPLAY_NAME" : [ "MAPREDUCE-1" ]
}
}
},
"header" : {
"type" : "alert",
"version" : 2
}
}, {
"body" : {
"alert" : {
"content" : "The health test result for JOB_TRACKER_SCM_HEALTH has become bad:
This role's process exited. This role is supposed to be started.",
"timestamp" : {
"iso8601" : "2015-06-11T03:52:56Z",
"epochMs" : 1433994776083
},
"source" :
"http://myCluster.com:7180/cmf/eventRedirect/67b4d1c4-791b-428e-a9ea-8a09d4885f5d",
"attributes" : {
"__persist_timestamp" : [ "1433994776173" ],
"ALERT_SUPPRESSED" : [ "false" ],
"HEALTH_TEST_NAME" : [ "JOB_TRACKER_SCM_HEALTH" ],
"SEVERITY" : [ "CRITICAL" ],
"ROLE" : [ "MAPREDUCE-1-JOBTRACKER-10624c438dee9f17211d3f33fa899957" ],
"HEALTH_TEST_RESULTS" : [ {
"content" : "The health test result for JOB_TRACKER_SCM_HEALTH has become bad:
This role's process exited. This role is supposed to be started.",
"testName" : "JOB_TRACKER_SCM_HEALTH",
"eventCode" : "EV_ROLE_HEALTH_CHECK_BAD",
"severity" : "CRITICAL"
} ],
"CLUSTER_DISPLAY_NAME" : [ "Cluster 1" ],
"HOST_IDS" : [ "75e763c2-8d22-47a1-8c80-501751ae0db7" ],
"ALERT" : [ "true" ],
"ROLE_TYPE" : [ "JOBTRACKER" ],
"CATEGORY" : [ "HEALTH_CHECK" ],
"BAD_TEST_RESULTS" : [ "1" ],
"SERVICE_TYPE" : [ "MAPREDUCE" ],
"EVENTCODE" : [ "EV_ROLE_HEALTH_CHECK_BAD", "EV_ROLE_HEALTH_CHECK_GOOD",
"EV_ROLE_HEALTH_CHECK_DISABLED" ],
"ALERT_SUMMARY" : [ "The health of role jobtracker (nightly-1) has become bad."
],
"CLUSTER_ID" : [ "1" ],
"SERVICE" : [ "MAPREDUCE-1" ],
"__uuid" : [ "67b4d1c4-791b-428e-a9ea-8a09d4885f5d" ],
"CLUSTER" : [ "Cluster 1" ],
"CURRENT_COMPLETE_HEALTH_TEST_RESULTS" : [ "{\"content\":\"The health test result
for JOB_TRACKER_SCM_HEALTH has become bad: This role's process exited. This role is
supposed to be
started.\",\"testName\":\"JOB_TRACKER_SCM_HEALTH\",\"eventCode\":\"EV_ROLE_HEALTH_CHECK_BAD\",\"severity\":\"CRITICAL\"}",
"{\"content\":\"The health test result for JOB_TRACKER_UNEXPECTED_EXITS has become
good: This role encountered 0 unexpected exit(s) in the previous 5
minute(s).\",\"testName\":\"JOB_TRACKER_UNEXPECTED_EXITS\",\"eventCode\":\"EV_ROLE_HEALTH_CHECK_GOOD\",\"severity\":\"INFORMATIONAL\"}",
"{\"content\":\"The health test result for JOB_TRACKER_FILE_DESCRIPTOR has become good:
Open file descriptors: 244. File descriptor limit: 32,768. Percentage in use:
0.74%.\",\"testName\":\"JOB_TRACKER_FILE_DESCRIPTOR\",\"eventCode\":\"EV_ROLE_HEALTH_CHECK_GOOD\",\"severity\":\"INFORMATIONAL\"}",
"{\"content\":\"The health test result for JOB_TRACKER_SWAP_MEMORY_USAGE has become
good: 0 B of swap memory is being used by this role's
process.\",\"testName\":\"JOB_TRACKER_SWAP_MEMORY_USAGE\",\"eventCode\":\"EV_ROLE_HEALTH_CHECK_GOOD\",\"severity\":\"INFORMATIONAL\"}",
"{\"content\":\"The health test result for JOB_TRACKER_LOG_DIRECTORY_FREE_SPACE has
become good: This role's Log Directory (/var/log/hadoop-0.20-mapreduce) is on a filesystem
with more than 20.00% of its space
free.\",\"testName\":\"JOB_TRACKER_LOG_DIRECTORY_FREE_SPACE\",\"eventCode\":\"EV_ROLE_HEALTH_CHECK_GOOD\",\"severity\":\"INFORMATIONAL\"}",
"{\"content\":\"The health test result for JOB_TRACKER_HOST_HEALTH has become good:
The health of this role's host is
good.\",\"testName\":\"JOB_TRACKER_HOST_HEALTH\",\"eventCode\":\"EV_ROLE_HEALTH_CHECK_GOOD\",\"severity\":\"INFORMATIONAL\"}",
"{\"content\":\"The health test result for JOB_TRACKER_WEB_METRIC_COLLECTION has become
good: The web server of this role is responding with metrics. The most recent collection
took 49
millisecond(s).\",\"testName\":\"JOB_TRACKER_WEB_METRIC_COLLECTION\",\"eventCode\":\"EV_ROLE_HEALTH_CHECK_GOOD\",\"severity\":\"INFORMATIONAL\"}",
"{\"content\":\"The health test result for JOB_TRACKER_GC_DURATION has become good:
Average time spent in garbage collection was 0 second(s) (0.00%) per minute over the
previous 5
minute(s).\",\"testName\":\"JOB_TRACKER_GC_DURATION\",\"eventCode\":\"EV_ROLE_HEALTH_CHECK_GOOD\",\"severity\":\"INFORMATIONAL\"}",
"{\"content\":\"The health test result for JOB_TRACKER_HEAP_DUMP_DIRECTORY_FREE_SPACE
has become disabled: Test disabled because role is not configured to dump heap when
out of memory. Test of whether this role's heap dump directory has enough free
space.\",\"testName\":\"JOB_TRACKER_HEAP_DUMP_DIRECTORY_FREE_SPACE\",\"eventCode\":\"EV_ROLE_HEALTH_CHECK_DISABLED\",\"severity\":\"INFORMATIONAL\"}"
],
"CURRENT_HEALTH_SUMMARY" : [ "RED" ],
"PREVIOUS_HEALTH_SUMMARY" : [ "GREEN" ],
"MONITOR_STARTUP" : [ "false" ],
"ROLE_DISPLAY_NAME" : [ "jobtracker (nightly-1)" ],
"PREVIOUS_COMPLETE_HEALTH_TEST_RESULTS" : [ "{\"content\":\"The health test
result for JOB_TRACKER_SCM_HEALTH has become good: This role's status is as expected.
The role is
started.\",\"testName\":\"JOB_TRACKER_SCM_HEALTH\",\"eventCode\":\"EV_ROLE_HEALTH_CHECK_GOOD\",\"severity\":\"INFORMATIONAL\"}",
"{\"content\":\"The health test result for JOB_TRACKER_UNEXPECTED_EXITS has become
good: This role encountered 0 unexpected exit(s) in the previous 5
minute(s).\",\"testName\":\"JOB_TRACKER_UNEXPECTED_EXITS\",\"eventCode\":\"EV_ROLE_HEALTH_CHECK_GOOD\",\"severity\":\"INFORMATIONAL\"}",
"{\"content\":\"The health test result for JOB_TRACKER_FILE_DESCRIPTOR has become good:
Open file descriptors: 244. File descriptor limit: 32,768. Percentage in use:
0.74%.\",\"testName\":\"JOB_TRACKER_FILE_DESCRIPTOR\",\"eventCode\":\"EV_ROLE_HEALTH_CHECK_GOOD\",\"severity\":\"INFORMATIONAL\"}",
"{\"content\":\"The health test result for JOB_TRACKER_SWAP_MEMORY_USAGE has become
good: 0 B of swap memory is being used by this role's
process.\",\"testName\":\"JOB_TRACKER_SWAP_MEMORY_USAGE\",\"eventCode\":\"EV_ROLE_HEALTH_CHECK_GOOD\",\"severity\":\"INFORMATIONAL\"}",
"{\"content\":\"The health test result for JOB_TRACKER_LOG_DIRECTORY_FREE_SPACE has
become good: This role's Log Directory (/var/log/hadoop-0.20-mapreduce) is on a filesystem
with more than 20.00% of its space
free.\",\"testName\":\"JOB_TRACKER_LOG_DIRECTORY_FREE_SPACE\",\"eventCode\":\"EV_ROLE_HEALTH_CHECK_GOOD\",\"severity\":\"INFORMATIONAL\"}",
"{\"content\":\"The health test result for JOB_TRACKER_HOST_HEALTH has become good:
The health of this role's host is
good.\",\"testName\":\"JOB_TRACKER_HOST_HEALTH\",\"eventCode\":\"EV_ROLE_HEALTH_CHECK_GOOD\",\"severity\":\"INFORMATIONAL\"}",
"{\"content\":\"The health test result for JOB_TRACKER_WEB_METRIC_COLLECTION has become
good: The web server of this role is responding with metrics. The most recent collection
took 49
millisecond(s).\",\"testName\":\"JOB_TRACKER_WEB_METRIC_COLLECTION\",\"eventCode\":\"EV_ROLE_HEALTH_CHECK_GOOD\",\"severity\":\"INFORMATIONAL\"}",
"{\"content\":\"The health test result for JOB_TRACKER_GC_DURATION has become good:
Average time spent in garbage collection was 0 second(s) (0.00%) per minute over the
previous 5
minute(s).\",\"testName\":\"JOB_TRACKER_GC_DURATION\",\"eventCode\":\"EV_ROLE_HEALTH_CHECK_GOOD\",\"severity\":\"INFORMATIONAL\"}",
"{\"content\":\"The health test result for JOB_TRACKER_HEAP_DUMP_DIRECTORY_FREE_SPACE
has become disabled: Test disabled because role is not configured to dump heap when
out of memory. Test of whether this role's heap dump directory has enough free
space.\",\"testName\":\"JOB_TRACKER_HEAP_DUMP_DIRECTORY_FREE_SPACE\",\"eventCode\":\"EV_ROLE_HEALTH_CHECK_DISABLED\",\"severity\":\"INFORMATIONAL\"}"
],
"SERVICE_DISPLAY_NAME" : [ "MAPREDUCE-1" ],
"HOSTS" : [ "myCluster.com" ]
}
}
},
"header" : {
"type" : "alert",
"version" : 2
}
}, {
"body" : {
"alert" : {
"content" : "The health test result for JOB_TRACKER_UNEXPECTED_EXITS has become
bad: This role encountered 1 unexpected exit(s) in the previous 5 minute(s).This included
1 exit(s) due to OutOfMemory errors. Critical threshold: any.",
"timestamp" : {
"iso8601" : "2015-06-11T03:53:41Z",
"epochMs" : 1433994821940
},
"source" :
"http://myCluster.com:7180/cmf/eventRedirect/b8c4468d-08c2-4b5b-9bda-2bef892ba3f5",
"attributes" : {
"__persist_timestamp" : [ "1433994822027" ],
"ALERT_SUPPRESSED" : [ "false" ],
"HEALTH_TEST_NAME" : [ "JOB_TRACKER_UNEXPECTED_EXITS" ],
"SEVERITY" : [ "CRITICAL" ],
"ROLE" : [ "MAPREDUCE-1-JOBTRACKER-10624c438dee9f17211d3f33fa899957" ],
"HEALTH_TEST_RESULTS" : [ {
"content" : "The health test result for JOB_TRACKER_UNEXPECTED_EXITS has become
bad: This role encountered 1 unexpected exit(s) in the previous 5 minute(s).This included
1 exit(s) due to OutOfMemory errors. Critical threshold: any.",
"testName" : "JOB_TRACKER_UNEXPECTED_EXITS",
"eventCode" : "EV_ROLE_HEALTH_CHECK_BAD",
"severity" : "CRITICAL"
} ],
"CLUSTER_DISPLAY_NAME" : [ "Cluster 1" ],
"HOST_IDS" : [ "75e763c2-8d22-47a1-8c80-501751ae0db7" ],
"ALERT" : [ "true" ],
"ROLE_TYPE" : [ "JOBTRACKER" ],
"CATEGORY" : [ "HEALTH_CHECK" ],
"BAD_TEST_RESULTS" : [ "1" ],
"SERVICE_TYPE" : [ "MAPREDUCE" ],
"EVENTCODE" : [ "EV_ROLE_HEALTH_CHECK_BAD", "EV_ROLE_HEALTH_CHECK_GOOD",
"EV_ROLE_HEALTH_CHECK_DISABLED" ],
"ALERT_SUMMARY" : [ "The health of role jobtracker (nightly-1) has become bad."
],
"CLUSTER_ID" : [ "1" ],
"SERVICE" : [ "MAPREDUCE-1" ],
"__uuid" : [ "b8c4468d-08c2-4b5b-9bda-2bef892ba3f5" ],
"CLUSTER" : [ "Cluster 1" ],
"CURRENT_COMPLETE_HEALTH_TEST_RESULTS" : [ "{\"content\":\"The health test result
for JOB_TRACKER_SCM_HEALTH has become bad: This role's process exited. This role is
supposed to be
started.\",\"testName\":\"JOB_TRACKER_SCM_HEALTH\",\"eventCode\":\"EV_ROLE_HEALTH_CHECK_BAD\",\"severity\":\"CRITICAL\"}",
"{\"content\":\"The health test result for JOB_TRACKER_UNEXPECTED_EXITS has become bad:
This role encountered 1 unexpected exit(s) in the previous 5 minute(s).This included
1 exit(s) due to OutOfMemory errors. Critical threshold:
any.\",\"testName\":\"JOB_TRACKER_UNEXPECTED_EXITS\",\"eventCode\":\"EV_ROLE_HEALTH_CHECK_BAD\",\"severity\":\"CRITICAL\"}",
"{\"content\":\"The health test result for JOB_TRACKER_FILE_DESCRIPTOR has become good:
Open file descriptors: 244. File descriptor limit: 32,768. Percentage in use:
0.74%.\",\"testName\":\"JOB_TRACKER_FILE_DESCRIPTOR\",\"eventCode\":\"EV_ROLE_HEALTH_CHECK_GOOD\",\"severity\":\"INFORMATIONAL\"}",
"{\"content\":\"The health test result for JOB_TRACKER_SWAP_MEMORY_USAGE has become
good: 0 B of swap memory is being used by this role's
process.\",\"testName\":\"JOB_TRACKER_SWAP_MEMORY_USAGE\",\"eventCode\":\"EV_ROLE_HEALTH_CHECK_GOOD\",\"severity\":\"INFORMATIONAL\"}",
"{\"content\":\"The health test result for JOB_TRACKER_LOG_DIRECTORY_FREE_SPACE has
become good: This role's Log Directory (/var/log/hadoop-0.20-mapreduce) is on a filesystem
with more than 20.00% of its space
free.\",\"testName\":\"JOB_TRACKER_LOG_DIRECTORY_FREE_SPACE\",\"eventCode\":\"EV_ROLE_HEALTH_CHECK_GOOD\",\"severity\":\"INFORMATIONAL\"}",
"{\"content\":\"The health test result for JOB_TRACKER_HOST_HEALTH has become good:
The health of this role's host is
good.\",\"testName\":\"JOB_TRACKER_HOST_HEALTH\",\"eventCode\":\"EV_ROLE_HEALTH_CHECK_GOOD\",\"severity\":\"INFORMATIONAL\"}",
Triggers
A trigger is a statement that specifies an action to be taken when one or more specified conditions are met for a
service, role, role configuration group, or host. The conditions are expressed as a tsquery statement, and the action
to be taken is to change the health for the service, role, role configuration group, or host to either Concerning (yellow)
or Bad (red).
Triggers can be created for services, roles, role configuration groups, or hosts. Create a trigger by doing one of the
following:
• Directly editing the configuration for the service, role (or role configuration group), or host configuration.
• Clicking Create Trigger on the drop-down menu for most charts. Note that the Create Trigger command is not
available on the drop-down menu for charts where no context (role, service, and so on) is defined, such as on the
Home > Status tab.
• Use the Create Trigger expression builder. See Creating a Trigger Using the Expression Editor on page 358.
Important: Because triggers are a new and evolving feature, backward compatibility between
releases is not guaranteed at this time.
Name (required)
A trigger's name must be unique in the context for which the trigger is defined. That is, there cannot be two triggers
for the same service or role with the same name. Different services or different roles can have triggers with the same
name.
Expression (required)
A trigger expression takes the form:
IF (CONDITIONS) DO HEALTH_ACTION
When the conditions of the trigger are met, the trigger is considered to be firing. A condition is any valid tsquery
statement. In most cases conditions employ stream filters to filter out streams below or above a certain threshold.
For example, the following tsquery can be used to retrieve the streams for DataNodes with more than 500 open file
descriptors:
SELECT fd_open WHERE roleType=DataNode AND last(fd_open) > 500
The stream filter used here, last(fd_open) > 500, is composed of four parts:
• A scalar producing function "last" that takes a stream and returns its last data point
• A metric to operate on
• A comparator
• A scalar value
Other scalar producing functions are available, like min or max, and they can be combined to create arbitrarily complex
expressions:
IF ((SELECT fd_open WHERE roleType=DataNode AND last(fd_open) > 500) OR (SELECT fd_open
WHERE roleType=NameNode AND last(fd_open) > 500)) DO health:bad
A condition is met if it returns more than the number of streams specified by the streamThreshold (see below). A
trigger fires if the logical evaluation of all of its conditions results in a met condition. When a trigger fires, two actions
can be taken: health:concerning or health:bad. These actions change the health of the entity on which the
trigger is defined.
Enabled (optional)
Whether the trigger is enabled. The default is true (enabled).
Trigger Example
The following is a JSON formatted trigger that fires if there are more than 10 DataNodes with more than 500 file
descriptors opened:
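A minimal sketch of such a trigger definition (the field names triggerName, triggerExpression, streamThreshold, and
enabled are assumptions) might look like:
[ {
  "triggerName" : "sample-trigger",
  "triggerExpression" : "IF (SELECT fd_open WHERE roleType = DataNode AND last(fd_open) > 500) DO health:bad",
  "streamThreshold" : 10,
  "enabled" : "true"
} ]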
Important: If you select Edit Manually, changes you make manually do not appear in the
expression builder when you click Use Editor.
As you work, the Preview shows the resulting chart and current status of your host.
7. Scroll down and choose whether to apply this trigger to All hosts.
8. Click Create Trigger.
The metric expression function for average over time, moving_avg, is not available from the pop-up menu in the
editor. You can edit the expression directly using tsquery language.
login failed and succeeded), and provides an audit UI and API to view, filter, and export such events. The Cloudera
Manager audit log does not track the progress or results of commands (such as starting or stopping a service or creating
a directory for a service), it just notes the command that was executed and the user who executed it. To view the
progress or results of a command, follow the procedures in Viewing Running and Recent Commands on page 301.
The Cloudera Navigator Audit Server records service access events and the Cloudera Navigator Metadata Server
provides an audit UI and API to view, filter, and export both service access events and the lifecycle and security events
retrieved from Cloudera Manager. For information on Cloudera Navigator auditing features, see Exploring Audit Data.
Object Procedure
Cluster 1. Click the Audits tab on the top navigation bar.
Audit event entries are ordered with the most recent at the top.
You can use the Time Range Selector or a duration link to set the time range (see Time Line on page 271 for details).
When you select the time range, the log displays all events in that range. The
time it takes to perform a search will typically increase for a longer time range, as the number of events to be searched
will be larger.
Adding a Filter
To add a filter, do one of the following:
• Click the icon that displays next to a property when you hover in one of the event entries. A filter containing
the property, operator, and its value is added to the list of filters at the left and Cloudera Manager redisplays all
events that match the filter.
• Click the Add a filter link. A filter control is added to the list of filters.
1. Choose a property in the drop-down list. You can search by properties such as Username, Service, Command,
or Role. The properties vary depending on the service or role.
2. If the property allows it, choose an operator in the operator drop-down list.
3. Type a property value in the value text field. To match a substring, use the like operator and specify %
around the string. For example, to see all the audit events for files created in the folder /user/joe/out
specify Source like %/user/joe/out%.
4. Click Search. The log displays all events that match the filter criteria.
5. Click to add more filters and repeat steps 1 through 4.
Removing a Filter
1. Click the icon at the right of the filter. The filter is removed.
2. Click Search. The log displays all events that match the filter criteria.
service,username,command,ipAddress,resource,allowed,timestamp
hdfs1,cloudera,setPermission,10.20.187.242,/user/hive,false,"2013-02-09T00:59:34.430Z"
hdfs1,cloudera,getfileinfo,10.20.187.242,/user/cloudera,true,"2013-02-09T00:59:22.667Z"
hdfs1,cloudera,getfileinfo,10.20.187.242,/,true,"2013-02-09T00:59:22.658Z"
In this example, the first event access was denied, and therefore the allowed field has the value false.
Terminology
Entity
A Cloudera Manager component that has metrics associated with it, such as a service, role, or host.
Metric
A property that can be measured to quantify the state of an entity or activity, such as the number of open file descriptors
or CPU utilization percentage.
Time series
A list of (time, value) pairs that is associated with some (entity, metric) pair, such as (datanode-1, fd_open) or
(hostname, cpu_percent). In more complex cases, the time series can represent operations on other time series.
For example, (datanode-1, cpu_user + cpu_system).
Facet
A display grouping of a set of time series. By default, when a query returns multiple time series, they are displayed in
individual charts. Facets allow you to display the time series in separate charts, in a single chart, or grouped by various
attributes of the set of time series.
to the right of the Build Chart button to display a list of recently run statements and select a statement.
The statement text displays in the text box and the chart(s) that display that time series will display.
• Select from the list of Chart Examples
1. Click the question mark icon
to the right of the Build Chart button to display a list of examples with descriptions.
2. Click Try it to create a chart based on the statement text in the example.
• Type a new statement
1. Press Spacebar in the text box. tsquery statement components display in a drop-down list. These
suggestions are part of type ahead, which helps build valid queries. Scroll to the desired component and
click Enter. Continue choosing query components by pressing Spacebar and Enter until the tsquery
statement is complete.
For example, the query SELECT jvm_heap_used_mb where clusterId = 1 could return a set of charts like the
following:
Notice the $HOSTNAME portion of the query string. $HOSTNAME is a variable that will be resolved to a specific value
based on the page before the query is actually issued. In this case, $HOSTNAME will become
nightly53-2.ent.cloudera.com.
Context-sensitive variables are useful because they allow portable queries to be written. For example, the query above
may be used on the host status page or on any role status page to display data for the appropriate host. Variables cannot
be used in queries that are part of user-defined dashboards since those dashboards have no service, role or host
context.
Chart Properties
By default, the time-series data retrieved by the tsquery is displayed on its own chart, using a Line style chart, a default
size, and a default minimum and maximum for the Y-axis. You can change the chart type, facet the data, set the chart
scale and size, and set X- and Y-axis ranges.
• Histogram - Displays the time series values as a set of bars where each bar represents a range of metric values
and the height of the bar represents the number of entities whose value falls within the range. The following
histogram shows the number of roles in each range of the last value of the resident memory.
• Table - Displays the time series values as a table with each row containing the data for a single time value.
Note: Heatmaps and histograms render charts for a single point as opposed to time series charts that
render a series of points. For queries that return time series, Cloudera Manager will generate the
heatmap or histogram based on the last recorded point in the series, and will issue the warning: "Query
returned more than one value per stream. Only the last value was used." To eliminate this warning,
use a scalar returning function to choose a point. For example, use select last(cpu_percent)
to use the last point or select max(cpu_percent) to use the maximum value (in the selected time
range).
example, ZooKeeper) contain just a single line. When a chart contains multiple lines, each entity is identified by a
different color line.
Changing Scale
You can set the scale of the chart to linear, logarithmic, or power.
Changing Dimensions
You can change the size of your charts by modifying the values in the Dimension fields. They change in 50-pixel
increments when you click the up or down arrows, and you can type values in as long as they are multiples of 50. If
you have multiple charts, depending on the dimensions you specify and the size of your browser window, your charts
may appear in rows of multiple charts. If the Resize Proportionally checkbox is checked, you can modify one dimension
and the other will be modified automatically to maintain the chart's width and height proportions.
The following chart shows the same query as the previous chart, but with All Combined selected (which shows all time
series in a single chart) and with the Dimension values increased to expand the chart.
Changing Axes
You can change the Y-axis range using the Y Range minimum and maximum fields.
The X-axis is based on clock time, and by default shows the last hour of data. You can use the Time Range Selector or
a duration link ( ) to set the time range. (See Time Line on page 271 for details).
– If there are service, role, or host instances in the chart, click the View link to display the instance's
Status page.
– Click the Close button to return to the regular chart view.
• Heatmap - Clicking a square in a heatmap displays a line chart of the time series for that entity.
• Histogram -
– Mousing over the upper right corner of a histogram and clicking opens a pop-up containing the query that
generated the chart, an expanded view of the chart, a list of entity names and links to the entities whose
metrics are represented by the histogram bars, and the value of the metric for each entity. For example,
clicking the following histogram
– Clicking a bar in the expanded histogram displays a line chart of the time series from which the histogram
was generated:
Clicking the < Back link at the bottom left of the line chart returns to the expanded histogram.
Editing a Chart
You can edit a chart from the custom dashboard and save it back into the same or another existing dashboard, or to
a new custom dashboard. Editing a chart only affects the copy of the chart in the current dashboard – if you have
copied the chart into other dashboards, those charts are not affected by your edits.
1. Move the cursor over the chart, and click the gear icon.
Saving a Chart
Minimum Required Role: Configurator (also provided by Cluster Administrator, Full Administrator)
After editing a chart you can save it to a new or existing custom dashboard.
1. Modify the chart's properties and click Build Chart.
2. Click Save to open the Save Chart dialog box, and select one of the following:
a. Update chart in current dashboard: <name of current dashboard>.
b. Add chart to another dashboard.
c. Add chart to a new custom dashboard.
3. Click Save Chart.
4. Click View Dashboard to go to the dashboard where the chart has been saved.
Dashboards
A dashboard is a set of charts. This topic covers:
Dashboard Types
A default dashboard is a predefined set of charts that you cannot change. In a default dashboard you can:
• Display chart details.
• Edit a chart and then save back to a new or existing custom dashboard.
A custom dashboard contains a set of charts that you can change. In a custom dashboard you can:
• Display chart details.
• Edit a chart and then save back to a new or existing custom dashboard.
• Save a chart, make any modifications, and then save to a new or existing dashboard.
• Remove a chart.
When you first display a page containing charts it has a custom dashboard with the same charts as a default dashboard.
Creating a Dashboard
1. Do one of the following:
• Select Charts > New Dashboard.
• Select Charts > Manage Dashboards and click Create Dashboard.
• Save a chart to a new dashboard.
2. Specify a name and optionally a duration.
Conventional dashboard names follow the patterns given by one of the following Java regular expressions:
Managing Dashboards
To manage dashboards, select Charts > Manage Dashboards. You can create, clone, edit, export, import, and remove
dashboards.
• Create Dashboard - create a new dashboard.
• Clone - clones an existing dashboard.
• Edit - edit an existing dashboard.
• Export - exports the specifications for the dashboard as a JSON file.
• Import Dashboard - reads an exported JSON file and recreates the dashboard.
• Remove - deletes the dashboard.
Configuring Dashboards
You can change the time scale of a dashboard, switch between default and custom dashboards, and reset a custom
dashboard.
To set the dashboard type, click and select one of the following:
• Custom - displays a custom dashboard.
• Default - displays a default dashboard.
• Reset - resets the custom dashboard to the predefined set of charts, discarding any customizations.
to the right of the Build Chart button and select a metric from the List of Metrics, type a metric name or
description into the Basic text field, or type a query into the Advanced field.
b. Click Build Chart. The charts that result from your query are displayed, and you can modify their chart type,
combine them using facets, change their size and so on.
2. Click Add.
to the right of the Build Chart button and select a metric from the List of Metrics, type a metric name
or description into the Basic text field, or type a query into the Advanced field.
2. Click Build Chart. The charts that result from your query are displayed, and you can modify their chart
type, combine them using facets, change their size and so on.
2. Click Add.
Note: If the query you've chosen has resulted in multiple charts, all the charts are added to the
dashboard as a set. Although the individual charts in this set can be copied, you can only edit the set
as a whole.
1. Move the cursor over the chart, and click the icon at the top right.
2. Click Remove.
tsquery Language
The tsquery language is used to specify statements for retrieving time-series data from the Cloudera Manager time-series
datastore.
Before diving into the tsquery language specification, here's how you perform some common queries using the tsquery
language:
1. Retrieve time series for all metrics for all DataNodes.
3. Retrieve the jvm_heap_used_mb metric time series divided by 1024 and the jvm_heap_committed metric time
series divided by 1024 for all roles running on the host named "my host".
4. Retrieve the jvm_total_threads and jvm_blocked_threads metric time series for all entities for which
Cloudera Manager collects these two metrics.
select jvm_total_threads,jvm_blocked_threads
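For example, statements along the following lines express examples 1 and 3 above; the host name is a placeholder,
and dividing the heap metrics by 1024 converts megabytes to gigabytes:
select * where roleType = DATANODE
select jvm_heap_used_mb/1024, jvm_heap_committed/1024 where category = ROLE and hostname = "my host"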
tsquery Syntax
A tsquery statement has the following structure:
SELECT [metric expression] WHERE [predicate]
• The metric expression can be replaced with an asterisk (*), as shown in example 1. In that case, all metrics that
are applicable for selected entities, such as DATANODE in example 1, are returned.
• The predicate can be omitted, as shown in example 4. In such cases, time series for all entities for which the
metrics are appropriate are returned. For this query you would see the jvm_total_threads and jvm_blocked_threads
metrics for NameNodes, DataNodes, TaskTrackers, and so on.
Metric Expressions
A metric expression generates the time series. It is a comma-delimited list of one or more metric expression statements.
A metric expression statement is the name of a metric, a metric expression function, or a scalar value, joined by one
or more metric expression operators.
See the FAQ on page 384 which answers questions concerning how to discover metrics and use cases for scalar values.
Metric expressions support the binary operators: +, -, *, /.
Here are some examples of metric expressions:
• jvm_heap_used_mb, cpu_user, 5
• 1000 * jvm_gc_time_ms / jvm_gc_count
• total_cpu_user + total_cpu_system
• max(total_cpu_user)
dt(metric expression) - Returns scalar: N. Derivative with negative values. The change of the underlying metric
expression per second. For example: dt(jvm_gc_count).
dt0(metric expression) - Returns scalar: N. Derivative where negative values are skipped (useful for dealing with counter
resets). The change of the underlying metric expression per second. For example: dt0(jvm_gc_time_ms) / 10.
getClusterFact(string factName, double defaultValue) - Returns scalar: Y. Retrieves a fact about a cluster. Currently
supports one fact: numCores. If the number of cores cannot be determined, defaultValue is returned.
getHostFact(string factName, double defaultValue) - Returns scalar: Y. Retrieves a fact about a host. Currently supports
one fact: numCores. If the number of cores cannot be determined, defaultValue is returned. For example, select
dt(total_cpu_user) / getHostFact(numCores, 2) where category=HOST divides the results of
dt(total_cpu_user) by the current number of cores for each host.
greatest(metric expression, scalar metric expression) - Returns scalar: N. Compares two metric expressions, one of
which is a scalar metric expression. Returns a time series where each point is the result of evaluating max(point,
scalar metric expression).
integral(metric expression) - Returns scalar: N. Computes the integral value for a stream and returns a time-series
stream within which each data point is the integral value of the corresponding data point from the original stream.
For example, select integral(maps_failed_rate) will return the count of the failed number of maps.
counter_delta(metric expression) - Returns scalar: N. Computes the difference in counter value for a stream and
returns a time-series stream within which each data point is the difference in counter value of the corresponding data
point from the counter value of the previous data point in the original stream. For example, select
counter_delta(maps_failed_rate) returns the count of the failed number of maps. This method is more accurate
than the integral() function. However, there are a few caveats:
• This function is only implemented for single time-series streams. For streams of cross-entity aggregates, continue
to use the integral() function.
• If you apply this method to time-series streams that were created using a version of Cloudera Manager older than
5.7, Cloudera Manager fills in the older data points using the integral() function.
last(metric expression) - Returns scalar: Y. Returns the last point of a time series. For example, to use the last point of
the cpu_percent metric time series, use the expression select last(cpu_percent).
least(metric expression, scalar metric expression) - Returns scalar: N. Compares two metric expressions, of which one
is a scalar metric expression. Returns a time series where each point is the result of evaluating min(point, scalar
metric expression).
max(metric expression) - Returns scalar: Y. Computes the maximum value of the time series. For example, select
max(cpu_percent).
stats(metric expression, stats name) - Returns scalar: N. Some time-series streams have additional statistics for each
data point. These include rollup time-series streams, cross-entity aggregates, and rate metrics. The following statistics
are available for rollup and cross-entity aggregates: max, min, avg, std_dev, and sample. For rate metrics, the
underlying counter value is available using the "counter" statistics.
Predicates
A predicate limits the number of streams in the returned series and can take one of the following forms:
• time_series_attribute operator value, where
– time_series_attribute is one of the supported attributes.
– operator is one of = and rlike
– value is an attribute value subject to the following constraints:
– For attribute values that contain spaces, or values of attributes of the form xxxName such as
displayName, use quoted strings.
– The value for the rlike operator must be specified in quotes. For example: hostname rlike
"host[0-3]+.*".
– value can be any regular expression as specified in regular expression constructs in the Java Pattern class
documentation.
2. Retrieve all time series for all metrics for DataNodes or TaskTrackers that are running on host named "myhost".
3. Retrieve the total_cpu_user metric time series for all hosts with names that match the regular expression
"host[0-3]+.*"
2. Return the number of open file descriptors where processes have more than 500Mb of mem_rss:
For example, the following expression limits the stream to events that occurred only on weekdays:
day in (1,2,3,4,5)
An hour in expression takes the form hour in [#:#].
For example, the following expression limits the stream to events that occur only between 9:00 a.m. and 5:00 p.m.:
hour in [9:17]
Add the day or time range expression after the WHERE clause. Do not use the AND keyword. For example:
select fd_open where category = ROLE and roleType = SERVICEMONITOR day in (1,2,3,4,5)
You can also combine day in and hour in expressions. Always put the day expression before the hour expression.
The following example limits the stream to weekdays between 9:00 a.m. and 5:00 p.m.:
select fd_open where category = ROLE and roleType = SERVICEMONITOR day in (1,2,3,4,5)
hour in [9:17]
Name Description
active Indicates whether the entities to be retrieved must be active. A nonactive entity is an
entity that has been removed or deleted from the cluster. The default is to retrieve only
active entities (that is, active=true). To access time series for deleted or removed
entities, specify active=false in the query. For example:
SELECT fd_open WHERE roleType=DATANODE and active=false
agentName A Flume agent name.
applicationName One of the Cloudera Manager monitoring daemon names.
cacheId The HDFS cache directive ID.
category The category of the entities returned by the query: CLUSTER, DIRECTORY, DISK,
FILESYSTEM, FLUME_SOURCE, FLUME_CHANNEL, FLUME_SINK, HOST, HTABLE,
IMPALA_QUERY_STREAM, NETWORK_INTERFACE, ROLE, SERVICE, USER,
YARN_APPLICATION_STREAM, YARN_QUEUE.
Some metrics are collected for more than one type of entity. For example,
total_cpu_user is collected for entities of category HOST and ROLE. To retrieve the
data only for hosts use:
select total_cpu_user where category=HOST
The ROLE category applies to all role types (see roleType attribute). The SERVICE
category applies to all service types (see serviceType attribute). For example, to
retrieve the committed heap for all roles on host1 use:
select jvm_committed_heap_mb where category=ROLE and
hostname="host1"
path A filesystem path associated with the time-series entity.
poolName A pool name. For example, hdfs cache pool, yarn pools.
queueName The name of a YARN queue.
rackId A Rack ID. For example, /default.
roleConfigGroup The role group that a role belongs to.
roleName The role ID. For example,
HBASE-1-REGIONSERVER-0b0ad09537621923e2b460e5495569e7.
roleState The role state: BUSY, HISTORY_NOT_AVAILABLE, NA, RUNNING, STARTING, STOPPED,
STOPPING, UNKNOWN
serviceType The service type: ACCUMULO, FLUME, HDFS, HBASE, HIVE, HUE, IMPALA, KS_INDEXER,
MAPREDUCE, MGMT, OOZIE, SOLR, SPARK, SQOOP, YARN, ZOOKEEPER.
Entity Attributes
All Roles roleType, hostId, hostname, rackId, serviceType, serviceName
All Services serviceName, serviceType, clusterId, version, serviceDisplayName, clusterDisplayName
Agent roleType, hostId, hostname, rackId, serviceType, serviceName, clusterId, version,
agentName, serviceDisplayName, clusterDisplayName
Cluster clusterId, version, clusterDisplayName
Directory roleName, hostId, path, roleType, hostname, rackId, serviceType, serviceName, clusterId,
version, agentName, clusterDisplayName
Disk device, logicalPartition, hostId, rackId, clusterId, version, hostname, clusterDisplayName
File System hostId, mountpoint, rackId, clusterId, version, partition, hostname, clusterDisplayName
Flume Channel serviceName, hostId, rackId, roleName, flumeComponent, roleType, serviceType,
clusterId, version, agentName, serviceDisplayName, clusterDisplayName
Flume Sink serviceName, hostId, rackId, roleName, flumeComponent, roleType, serviceType,
clusterId, version, agentName, serviceDisplayName, clusterDisplayName
Flume Source serviceName, hostId, rackId, roleName, flumeComponent, roleType, serviceType,
clusterId, version, agentName, serviceDisplayName, clusterDisplayName
HDFS Cache Pool serviceName, poolName, nameserviceName, serviceType, clusterId, version, groupName,
ownerName, serviceDisplayName, clusterDisplayName
HNamespace serviceName, namespaceName, serviceType, clusterId, version, serviceDisplayName,
clusterDisplayName
Host hostId, rackId, clusterId, version, hostname, clusterDisplayName
HRegion htableName, hregionName, hregionStartTimeMs, namespaceName, serviceName,
tableName, serviceType, clusterId, version, roleType, hostname, roleName, hostId,
rackId, serviceDisplayName, clusterDisplayName
HTable namespaceName, serviceName, tableName, serviceType, clusterId, version,
serviceDisplayName, clusterDisplayName
Network Interface hostId, networkInterface, rackId, clusterId, version, hostname, clusterDisplayName
Rack rackId
Service serviceName, serviceType, clusterId, serviceDisplayName
Solr Collection serviceName, serviceType, clusterId, version, serviceDisplayName, clusterDisplayName
Solr Replica serviceName, solrShardName, solrReplicaName, solrCollectionName, serviceType,
clusterId, version, roleType, hostId, hostname, rackId, roleName, serviceDisplayName,
clusterDisplayName
Solr Shard serviceName, solrCollectionName, solrShardName, serviceType, clusterId, version,
serviceDisplayName, clusterDisplayName
Time Series Table tableName, roleName, roleType, applicationName, rollup, path
User userName
YARN Pool serviceName, queueName, schedulerType
FAQ
How do I compare information across hosts?
1. Click Hosts in the top navigation bar and click a host link.
2. In the Charts pane, choose a chart (for example, Host CPU Usage), and then select Open in Chart Builder from the chart's menu.
3. In the text box, remove the where entityName=$HOSTID clause and click Build Chart.
4. In the Facets list, click hostname to compare the values across hosts.
5. Configure the time scale, minimums and maximums, and dimension. For example:
How do I compare all disk IO for all the DataNodes that belong to a specific HDFS service?
Use a query of the form sketched below, replacing hdfs1 with your HDFS service name. Then facet by metricDisplayName
and compare all DataNode byte_reads and byte_writes metrics at once. See Grouping (Faceting) Time Series on page 369 for more details
about faceting.
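A sketch of such a query follows; the metric names bytes_read and bytes_written are illustrative, so substitute the
DataNode disk I/O metrics you want to compare:
select bytes_read, bytes_written where roleType = DATANODE and serviceName = hdfs1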
When would I use a derivative function?
Some metrics represent a counter, for example, bytes_read. For such metrics it is sometimes useful to see the
rate of change instead of the absolute counter value. Use dt or dt0 derivative functions.
When should I use the dt0 function?
Some metrics, like bytes_read represent a counter that always grows. For such metrics a negative rate means
that the counter has been reset (for example, process restarted, host restarted, and so on). Use dt0 for these
metrics.
How do I display a threshold on a chart?
Suppose that you want to retrieve the latencies for all disks on your hosts, compare them, and show a threshold
on the chart to easily detect outliers. Use a query like the one sketched below to retrieve the metrics and the threshold,
and then choose All Combined (1) in the Facets list. The scalar threshold 250 is also rendered on the chart.
See Grouping (Faceting) Time Series on page 369 for more details about faceting.
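A sketch of such a query, with illustrative disk latency metric names; including the scalar 250 in the metric expression
is what produces the threshold line:
select await_time, await_read_time, await_write_time, 250 where category = DISK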
I get the warning "The query hit the maximum results limit". How do I work around the limit?
There is a limit on the number of results that can be returned by a query. When a query results in more time-series
streams than the limit a warning for "partial results" is issued. To circumvent the problem, reduce the number of
metrics you are trying to retrieve or see Configuring Time-Series Query Results on page 367.
You can also use the rlike operator to limit the query to a subset of entities, as in the sketch below; the second form
of the query retrieves the disk metrics for only ten hosts.
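For illustration (the metric name is a placeholder, and the pattern host1[0-9].* is assumed to match ten host names,
host10 through host19), the unrestricted and restricted forms might be:
select await_time where category = DISK
select await_time where category = DISK and hostname rlike "host1[0-9].*"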
How do I discover which metrics are available for which entities?
• Type Select in the text box and then press Space or continue typing. Metrics matching the letters you type
display in a drop-down list.
• Select Charts > Chart Builder, click the question mark icon
to the right of the Build Chart button and click the List of Metrics link
• Retrieve all metrics for the type of entity:
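For example, to retrieve every metric that Cloudera Manager collects for DataNodes, a statement like the following
can be used; substitute any other roleType or category value as needed:
select * where roleType = DATANODE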
Metric Aggregation
In addition to collecting and storing raw metric values, the Cloudera Manager Service Monitor and Host Monitor
produce a number of aggregate metrics from the raw metric data. Whereas a raw data point is a (timestamp, value) pair,
an aggregate metric point is a timestamp paired with a bundle of statistics including the minimum, maximum, average,
and standard deviation of the data points considered by the aggregate.
Individual metric streams are aggregated across time to produce statistical summaries at different data granularities.
For example, an individual metric stream of the number of open file descriptors on a host will be aggregated over time
to the ten-minute, hourly, six-hourly, daily and weekly data granularities. A point in the hourly aggregate stream will
include the maximum number of open file descriptors seen during that hour, the minimum, the average and so on.
When servicing a time-series request, either for the Cloudera Manager UI or API, the Service Monitor and Host Monitor
automatically choose the appropriate data granularity based on the time-range requested.
The ten-minute cross-time aggregate point covering the window from 9:00 - 9:10 would have statistics and metadata
like those in the example aggregate data points shown at the end of this section.
The Service Monitor and Host Monitor also produce cross-entity aggregates for a number of entities in the system.
Cross-entity aggregates are produced by considering the metric value of a particular metric across a number of entities
of the same type at a particular time. For each stream considered, two metrics are produced. The first tracks statistics
such as the minimum, maximum, average and standard deviation across all considered entities as well as the identities
of the entities that had the minimum and maximum values. The second tracks the sum of the metric across all considered
entities.
An example of the first type of cross-entity aggregate is the fd_open_across_datanodes metric. For an HDFS service
this metric contains aggregate statistics on the fd_open metric value for all the DataNodes in the service. For a rack
this metric contains statistics for all the DataNodes within that rack, and so on. An example of the second type of
cross-entity aggregate is the total_fd_open_across_datanodes metric. For an HDFS service this metric contains
the total number of file descriptors open by all the DataNodes in the service. For a rack this metric contains the total
number of file descriptors open by all the DataNodes within the rack, and so on. Note that unlike the first type of
cross-entity aggregate, this total type of cross-entity aggregate is a simple timestamp, value pair and not a bundle of
statistics.
The cross-entity aggregate fd_open_across_datanodes point for that HDFS service at that time would have statistics
and metadata like those in the example data points shown at the end of this section.
Just like every other metric, cross-entity aggregates are aggregated across time. For example, a point in the hourly
aggregate of fd_open_across_datanodes for an HDFS service will include the maximum fd_open value of any
DataNode in that service over that hour, the average value over the hour, and so on. A point in the hourly aggregate
of total_fd_open_across_datanodes for an HDFS service will contain statistics on the value of the
total_fd_open_across_datanodes for that service over the hour.
{
"timestamp" : "2014-02-24T00:00:00.000Z",
"value" : 0.014541698027508003,
"type" : "SAMPLE",
"aggregateStatistics" : {
"sampleTime" : "2014-02-23T23:59:35.000Z",
"sampleValue" : 0.0,
"count" : 360,
"min" : 0.0,
"minTime" : "2014-02-23T18:00:35.000Z",
"max" : 2.9516129032258065,
"maxTime" : "2014-02-23T19:37:36.000Z",
"mean" : 0.014541698027508003,
"stdDev" : 0.17041289765265377
}
}
{
"timestamp" : "2014-03-26T00:50:15.725Z",
"value" : 3288.0,
"type" : "SAMPLE",
"aggregateStatistics" : {
"sampleTime" : "2014-03-26T00:49:19.000Z",
"sampleValue" : 7232.0,
"count" : 4,
"min" : 1600.0,
"minTime" : "2014-03-26T00:49:42.000Z",
"max" : 7232.0,
"maxTime" : "2014-03-26T00:49:19.000Z",
"mean" : 3288.0,
"stdDev" : 2656.7549127961856,
"crossEntityMetadata" : {
"maxEntityDisplayName" : "cleroy-9-1.ent.cloudera.com",
"minEntityDisplayName" : "cleroy-9-4.ent.cloudera.com",
"numEntities" : 4.0
}
}
}
{
"timestamp" : "2014-03-11T00:00:00.000Z",
"value" : 3220.818863879957,
"type" : "SAMPLE",
"aggregateStatistics" : {
"sampleTime" : "2014-03-10T22:28:48.000Z",
"sampleValue" : 7200.0,
"count" : 933,
"min" : 1536.0,
"minTime" : "2014-03-10T21:02:17.000Z",
"max" : 7200.0,
"maxTime" : "2014-03-10T22:28:48.000Z",
"mean" : 3220.818863879957,
"stdDev" : 2188.6143063503378,
"crossEntityMetadata" : {
"maxEntityDisplayName" : "cleroy-9-1.ent.cloudera.com",
"minEntityDisplayName" : "cleroy-9-4.ent.cloudera.com",
"numEntities" : 3.9787037037037036
}
}
}
These differ from non-aggregate data points by having the aggregateStatistics structure. Note that the value field in
the point structure will always be the same as the aggregateStatistics mean field. The Cloudera Manager UI presents
aggregate statistics in a number of ways. First, aggregate statistics are made available in the hover detail and chart
popover when dealing with aggregate data. Second, it is possible to turn on and turn off the display of minimum and
maximum time-series streams in line charts of aggregate data. These streams are displayed using dotted lines and give
a visual indication of the range of the underlying metric values over the time considered, the entities considered, or
both. These lines are displayed by default for single-stream line charts of aggregate data. For all line charts, this behavior
can be turned on and turned off using the chart popover.
Logs
The Logs page presents log information for Hadoop services, filtered by service, role, host, or search phrase, as well
as log level (severity).
To configure logs, see Configuring Log Events on page 285.
Viewing Logs
1. Select Diagnostics > Logs on the top navigation bar.
2. Click Search.
The logs for all roles display. If any of the hosts cannot be searched, an error message notifies you of the error and the
host(s) on which it occurred.
Logs List
Log results are displayed in a list with the following columns:
• Host - The host where this log entry appeared. Clicking this link will take you to the Host Status page (see Host
Details on page 305).
• Log Level - The log level (severity) associated with this log entry.
• Time - The date and time this log entry was created.
• Source - The class that generated the message.
• Message - The message portion of the log entry. Clicking View Log File displays the Log Details on page 390 page,
which presents a display of the full log, showing the selected message (highlighted) and the 100 messages before
and after it in the log.
If there are more results than can be shown on one page (per the Results per Page setting you selected), Next and
Prev buttons let you view additional results.
Filtering Logs
You filter logs by selecting a time range and specifying filter parameters.
You can use the Time Range Selector or a duration link ( ) to set the time range.
(See Time Line on page 271 for details). However, logs are, by definition, historical, and are meaningful only in that
context. So the Time Marker, used to pinpoint status at a specific point in time, is not available on this page. The Now
button ( ) is available.
1. Specify any of the log filter parameters:
• Search Phrase - A string to match against the log message content. The search is case-insensitive, and the
string can be a regular expression, such that wildcards and other regular expression primitives are supported.
• Select Sources - A list of all the service instances and roles currently instantiated in your cluster. By default,
all services and roles are selected to be included in your log search; the All Sources checkbox lets you select
or clear all services and roles in one operation. You can expand each service and limit the search to specific
roles by selecting or clearing individual roles.
• Hosts - The hosts to be included in the search. As soon as you start typing a hostname, Cloudera Manager
provides a list of hosts that match the partial name. You can add multiple names, separated by commas. The
default is to search all hosts.
• Minimum Log Level - The minimum severity level for messages to be included in the search results. Results
include all log entries at the selected level or higher. This defaults to WARN (that is, a search returns log
entries with severity of WARN, ERROR, or FATAL only).
• Additional Settings
– Search Timeout - A time (in seconds) after which the search will time out. The default is 20 seconds.
– Results per Page - The number of results (log entries) to be displayed per page.
2. Click Search. The Logs list displays the log entries that match the specified filter.
Log Details
The Log Details page presents a portion of the full log, showing the selected message (highlighted), and messages
before and after it in the log. The page shows you:
• The host
• The role
• The full path and name of the log file you are viewing.
• Messages before and after the one you selected.
The log displays the following information for each message:
• Time - the time the entry was logged
• Log Level - the severity of the entry
• Source - the source class that logged the entry
• Log Message
You can switch to display only messages or all columns using the buttons.
Note: You can also view the Cloudera Manager Server log at
/var/log/cloudera-scm-server/cloudera-scm-server.log on the Server host.
Note: You can also manage these configurations using role groups, which you can use to configure
similar hosts with the same configuration values. See Managing Roles on page 265.
Role Type Logging Threshold (not available for all roles) - Logging level used to limit the number of entries saved in
the log file. Default: depends on the role.
Reports
Important: This feature requires a Cloudera Enterprise license. It is not available in Cloudera Express.
See Managing Licenses on page 50 for more information.
The Reports page lets you create reports about the usage of HDFS in your cluster—data size and file count by user,
group, or directory. It also lets you report on the MapReduce activity in your cluster, by user.
To display the Reports page, select Clusters > Cluster name > Reports.
For users with the Administrator role, the Search Files and Manage Directories button on the Reports page opens a
file browser for searching files, managing directories, and setting quotas.
If you are managing multiple clusters, or have multiple nameservices configured (if high availability or federation is
configured), there will be separate reports for each cluster and nameservice.
Click the Reports link next to the Directory Usage title to go back to the Reports menu. You can also click links to go
to the cluster and HDFS service home pages.
Click any column header to sort the display.
Click a directory name to view the files and subdirectories in the directory.
Select one or more rows by checking the boxes on the left and then choose an action to perform on the selection from
the Actions for Selected drop-down menu. You can select the following actions:
• Manage Quota – A dialog box opens in which you can set a quota for the number of files or disk space. These
values are displayed in columns in the file listing.
• Include selected directories in disk usage reports – The selected directories appear in the Disk Usage Reports.
• Exclude selected directories from disk usage reports – The selected directories do not appear in the Disk Usage
Reports.
Filters
You can use filters to limit the display and to search for files. To apply filters to the directory usage report, click the
Filters drop-down menu near the top of the page and select one of the following preconfigured filters:
• Large Files
• Large Directories
• By Specific Owner
• By Specific Group
• Old Files
• Old Directories
• Files with Low Replication
• Overpopulated Directories
• Directories with Quotas
• Directories Watched
To modify any of these filters, click the Customize link and select new criteria. Click Clear to revert to the preconfigured
criteria for the filter.
Click the Search button to display the report with the filters applied.
You can also select Custom from the Filters drop-down menu to create a report in which you define the criteria. To
create a custom report:
1. Select any of the following criteria from the drop-down menu on the left:
• Filename
• Owner
• Group
• Path
• Last Modified
• Size
• Diskspace Quota
• Namespace Quota
• Last Access
• File and Directory Count
• Replication
• Parent
• Raw Size
2. Select an operator from the drop-down menu.
3. Enter a value and units of measure for the comparison.
4. Select the units of measure for the comparison from the drop-down menu. (Some criteria do not require units of
measure.)
5. Click the icon to add additional criteria.
6. Click the Search button to display the directory usage report with the custom filter applied.
The report changes to display the result of applying the filter. A new column, Parent, is added that contains the
full path to each file or subdirectory.
Bytes - The logical number of bytes in the files, aggregated by user, group, or directory. This is based on the actual
file sizes, not taking replication into account.
Raw Bytes - The physical number of bytes (total disk space in HDFS) used by the files, aggregated by user, group, or
directory. This does include replication, and so is actually Bytes times the number of replicas.
File and Directory Count - The number of files, aggregated by user, group, or directory.
Bytes and Raw Bytes are shown in IEC binary prefix notation (1 GiB = 2^30 bytes).
The directories shown in the Current Disk Usage by Directory report are the HDFS directories you have set as watched
directories. You can add or remove directories to or from the watch list from this report; click the Search Files and
Manage Directories button at the top right of the set of reports for the cluster or nameservice (see Designating
Directories to Include in Disk Usage Reports on page 397).
The report data is also shown in chart format:
• Move the cursor over the graph to highlight a specific period on the graph and see the actual value (data size) for
that period.
• You can also move the cursor over the user, group, or directory name (in the graph legend) to highlight the portion
of the graph for that name.
• You can right-click within the chart area to save the whole chart display as a single image (a .PNG file) or as a PDF
file. You can also print to the printer configured for your browser.
at the top and its immediate subdirectories below. Click any directory to drill down into the contents of that directory
or to select that directory for available actions.
3. Click the Generate Report button to generate a custom report containing the search results.
If you search within a directory, only files within that directory will be found. For example, if you browse /user and
do a search, you might find /user/foo/file, but you will not find /bar/baz.
Enabling Snapshots
To enable snapshots for an HDFS directory and its contents, see Managing HDFS Snapshots on page 608.
Setting Quotas
To set quotas for an HDFS directory and its contents, see Setting HDFS Quotas on page 157.
Symptom: You are unable to start service on the Cloudera Manager server; that is, service cloudera-scm-server
start does not work and there are errors in the log file located at
/var/log/cloudera-scm-server/cloudera-scm-server.log.
Reason: The server has been disconnected from the database, or the database has stopped responding or has shut
down.
Solution: Go to /etc/cloudera-scm-server/db.properties and make sure the database you are trying to connect
to is listed there and has been started.
Symptom: Logs include APPARENT DEADLOCK entries for c3p0.
Reason: These deadlock messages are caused by the c3p0 process not making progress at the expected rate. This can
indicate either that c3p0 is deadlocked or that its progress is slow enough to trigger these messages. In many cases,
progress is occurring and these messages should not be seen as catastrophic.
Solution: There are a variety of ways to react to these log entries.
• You may ignore these messages if system performance is not otherwise affected. Because these entries often occur
during slow progress, they may be ignored in some cases.
• You may modify the timer triggers. If c3p0 is making slow progress, increasing the period of time during which
progress is evaluated may stop the log entries from occurring. The default time between Timer triggers is 10 seconds
and is configurable indirectly by configuring maxAdministrativeTaskTime. For more information, see
maxAdministrativeTaskTime.
• You may increase the number of threads in the c3p0 pool, thereby increasing the resources available to make
progress on tasks. For more information, see numHelperThreads.
Starting Services
Symptom: After you click Start to start a service, the Finished status displays but there are error messages. The
subcommands to start service components (such as JobTracker and one or more TaskTrackers) do not start.
Reason: A port specified in the Configuration tab of the service is already being used in your cluster. For example, the
JobTracker port is in use by another process.
Solution: Enter an available port number in the port property (such as JobTracker port) in the Configuration tab of
the service.
Reason: There are incorrect directories specified in the Configuration tab of the service (such as the log directory).
Solution: Enter correct directories in the Configuration tab of the service.
Symptom: Job is failing.
Reason: No space left on device.
Solution: One approach is to use a system monitoring tool such as Nagios to alert on the disk space, or quickly check
disk space across all systems. If you do not have Nagios or an equivalent, you can do the following to determine the
source of the space issue:
• In the JobTracker Web UI, drill down from the job, to the map or reduce, to the task attempt details to see which
TaskTracker the task executed and failed on due to disk space. For example:
http://JTHost:50030/taskdetails.jsp?tipid=TaskID. You can see on which host the task is failing in the Machine column.
• In the NameNode Web UI, inspect the % used column on the NameNode Live Nodes page:
http://namenode:50070/dfsnodelist.jsp?whatNodes=LIVE.
Monitoring Reference
Performance Management
This section describes mechanisms and best practices for improving performance.
Related Information
• Tuning Impala for Performance
This section provides solutions to some performance problems, and describes configuration best practices.
Important: Work with your network administrators and hardware vendors to ensure that you have
the proper NIC firmware, drivers, and configurations in place and that your network performs properly.
Cloudera recognizes that network setup and upgrade are challenging problems, and will do its best
to share useful experiences.
tuned-adm off
tuned-adm list
To see whether transparent hugepages are enabled, run the following commands and check the output:
cat defrag_file_pathname
cat enabled_file_pathname
• RHEL/CentOS 6.x
chmod +x /etc/rc.d/rc.local
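As a sketch, the disable commands for RHEL/CentOS 6.x are typically appended to /etc/rc.d/rc.local before making
the file executable; the enabled_file_pathname and defrag_file_pathname placeholders below stand for the transparent
hugepage paths identified earlier in this section:
echo never > enabled_file_pathname
echo never > defrag_file_pathname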
2. If your cluster hosts are running RHEL/CentOS 7.x, modify the GRUB configuration to disable THP:
• Add the following line to the GRUB_CMDLINE_LINUX options in the /etc/default/grub file:
transparent_hugepage=never
grub2-mkconfig -o /boot/grub2/grub.cfg
inactive processes are swapped out from physical memory. The lower the value, the less they are swapped, forcing
filesystem buffers to be emptied.
On most systems, vm.swappiness is set to 60 by default. This is not suitable for Hadoop clusters because processes
are sometimes swapped even when enough memory is available. This can cause lengthy garbage collection pauses for
important system daemons, affecting stability and performance.
Cloudera recommends that you set vm.swappiness to a value between 1 and 10, preferably 1, for minimum swapping
on systems where the RHEL kernel is 2.6.32-642.el6 or higher.
To view your current setting for vm.swappiness, run:
cat /proc/sys/vm/swappiness
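To apply the recommended value, you can set it for the running kernel and persist it across reboots; a minimal sketch
(run with root privileges; a file under /etc/sysctl.d/ may be preferred on newer systems):
sudo sysctl -w vm.swappiness=1
echo "vm.swappiness = 1" | sudo tee -a /etc/sysctl.conf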
<property>
<name>mapreduce.tasktracker.outofband.heartbeat</name>
<value>true</value>
</property>
Reduce the interval for JobClient status reports on single node systems
The jobclient.progress.monitor.poll.interval property defines the interval (in milliseconds) at which
JobClient reports status to the console and checks for job completion. The default value is 1000 milliseconds; you may
want to set this to a lower value to make tests run faster on a single-node cluster. Adjusting this value on a large
production cluster may lead to unwanted client-server traffic.
<property>
<name>jobclient.progress.monitor.poll.interval</name>
<value>10</value>
</property>
<property>
<name>mapreduce.jobtracker.heartbeat.interval.min</name>
<value>10</value>
</property>
<property>
<name>mapred.reduce.slowstart.completed.maps</name>
<value>0</value>
</property>
To add JARs to the classpath, use -libjars jar1,jar2. This copies the local JAR files to HDFS and uses the distributed
cache mechanism to ensure they are available on the task nodes and added to the task classpath.
The advantage of this, over JobConf.setJar, is that if the JAR is on a task node, it does not need to be copied again
if a second task from the same job runs on that node, though it will still need to be copied from the launch machine
to HDFS.
Note: -libjars works only if your MapReduce driver uses ToolRunner. If it does not, you would
need to use the DistributedCache APIs (Cloudera does not recommend this).
For more information, see item 1 in the blog post How to Include Third-Party Libraries in Your MapReduce Job.
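For example, a driver that uses ToolRunner might be launched as follows; the JAR, class, and path names here are
hypothetical:
hadoop jar my-job.jar com.example.MyDriver -libjars /local/libs/dep1.jar,/local/libs/dep2.jar /user/joe/input /user/joe/output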
Changing the Logging Level on a Job (MRv1)
You can change the logging level for an individual job. You do this by setting the following properties in the job
configuration (JobConf):
• mapreduce.map.log.level
• mapreduce.reduce.log.level
Valid values are NONE, INFO, WARN, DEBUG, TRACE, and ALL.
Example:
conf.set("mapreduce.map.log.level", "DEBUG");
conf.set("mapreduce.reduce.log.level", "TRACE");
...
The Reserved block count is the number of ext3/ext4 filesystem blocks that are reserved. The block size is the size
in bytes. In this example, 150 GB (139.72 Gigabytes) are reserved on this filesystem.
Cloudera recommends reducing the root user block reservation from 5% to 1% for the DataNode volumes. To set
reserved space to 1% with the tune2fs command:
# tune2fs -m 1 /dev/sde1
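To check the current reservation before changing it, you can list the filesystem parameters; the device name here
matches the example above:
# tune2fs -l /dev/sde1 | grep -i "reserved block count"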
– MRv1
"-Dmapred.output.compress=true"
"-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" -outKey
org.apache.hadoop.io.Text -outValue org.apache.hadoop.io.Text input output
-Dsolr.hdfs.blockcache.slab.count=100
Garbage collection options, such as -XX:+PrintGCTimeStamps, can also be set here. Use spaces to separate
multiple parameters.
JAVA_OPTS="-Xmx10g -XX:MaxDirectMemorySize=20g \
-XX:+UseLargePages -Dsolr.hdfs.blockcache.slab.count=100" \
-XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCDetails
<luceneMatchVersion>4.4</luceneMatchVersion>
• Consider using the text type that applies to your language, instead of using String. For example, you might use
text_en. Text types support returning results for subsets of an entry. For example, querying on "john" would
find "John Smith", whereas with the string type, only exact matches are returned.
• For IDs, use the string type.
General Tuning
The following tuning categories can be completed at any time. It is less important to implement these changes before
beginning to use your system.
General Tips
• Enabling multi-threaded faceting can provide better performance for field faceting. When multi-threaded faceting
is enabled, field faceting tasks are completed in parallel, with a thread working on each field faceting task
simultaneously. Performance improvements do not occur in all cases, but improvements are likely when all of the
following are true:
– The system uses highly concurrent hardware.
– Faceting operations apply to large data sets over multiple fields.
– There is not an unusually high number of queries occurring simultaneously on the system. Systems that are
lightly loaded or that are mainly engaged with ingestion and indexing may be helped by multi-threaded
faceting; for example, a system ingesting articles and being queried by a researcher. Systems heavily loaded
by user queries are less likely to be helped by multi-threaded faceting; for example, an e-commerce site with
heavy user-traffic.
Note: Multi-threaded faceting only applies to field faceting and not to query faceting.
• Field faceting identifies the number of unique entries for a field. For example, multi-threaded
faceting could be used to simultaneously facet for the number of unique entries for the
fields, "color" and "size". In such a case, there would be two threads, and each thread would
work on faceting one of the two fields.
• Query faceting identifies the number of unique entries that match a query for a field. For
example, query faceting could be used to find the number of unique entries in the "size"
field are between 1 and 5. Multi-threaded faceting does not apply to these operations.
To enable multi-threaded faceting, add the facet.threads parameter to queries. For example, to use up to 1000 threads,
you might use a query as follows:
http://localhost:8983/solr/collection1/select?q=*:*&facet=true&fl=id&facet.field=f0_ws&facet.threads=1000
For Cloudera Manager environments, you can set these flags at Solr service > Configuration > Category >
Java Configuration Options for Solr Server.
For unmanaged environments, you can configure Java options by adding or modifying the JAVA_OPTS
environment variable in /etc/default/solr:
JAVA_OPTS="-Xmx10g -XX:MaxDirectMemorySize=20g \
-XX:+UseLargePages -Dsolr.hdfs.blockcache.slab.count=100" \
-XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCDetails
Warning: Do not enable the Solr HDFS write cache, because it can lead to index corruption.
Cloudera Search enables Solr to store indexes in an HDFS filesystem. To maintain performance, an HDFS block cache
has been implemented using Least Recently Used (LRU) semantics. This enables Solr to cache HDFS index files on read
and write, storing the portions of the file in JVM direct memory (off heap) by default, or optionally in the JVM heap.
Batch jobs typically do not use the cache, while Solr servers (when serving queries or indexing documents) should.
When running indexing using MapReduce, the MR jobs themselves do not use the block cache. Block write caching is
turned off by default and should be left disabled.
Tuning of this cache is complex and best practices are continually being refined. In general, allocate a cache that is
about 10-20% of the amount of memory available on the system. For example, when running HDFS and Solr on a host
with 96 GB of memory, allocate 10-20 GB of memory using solr.hdfs.blockcache.slab.count. As index sizes
grow you may need to tune this parameter to maintain optimal performance.
Configuration
The following parameters control caching. They can be configured at the Solr process level by setting the respective
Java system property or by editing solrconfig.xml directly. For more information on setting Java system properties,
see Setting Java System Properties for Solr on page 406.
If the parameters are set at the collection level (using solrconfig.xml), the first collection loaded by the Solr server
takes precedence, and block cache settings in all other collections are ignored. Because you cannot control the order
in which collections are loaded, you must make sure to set identical block cache settings in every collection
solrconfig.xml. Block cache parameters set at the collection level in solrconfig.xml also take precedence over
parameters at the process level.
Warning: Do not enable the Solr HDFS write cache, because it can lead to index corruption.
Note:
Increasing the direct memory cache size may make it necessary to increase the maximum direct
memory size allowed by the JVM. Each Solr slab allocates memory, which is 128 MB by default, as
well as allocating some additional direct memory overhead. Therefore, ensure that the
MaxDirectMemorySize is set comfortably above the value expected for slabs alone. The amount of
additional memory required varies according to multiple factors, but for most cases, setting
MaxDirectMemorySize to at least 20-30% more than the total memory configured for slabs is
sufficient. Setting MaxDirectMemorySize to the number of slabs multiplied by the slab size does
not provide enough memory.
To set MaxDirectMemorySize using Cloudera Manager:
1. Go to the Solr service.
2. Click the Configuration tab.
3. In the Search box, type Java Direct Memory Size of Solr Server in Bytes.
4. Set the new direct memory value.
5. Restart Solr servers after editing the parameter.
To set MaxDirectMemorySize in unmanaged environments:
1. Add -XX:MaxDirectMemorySize=20g to the JAVA_OPTS environment variable in
/etc/default/solr.
2. Restart Solr servers:
Solr HDFS optimizes caching when performing NRT indexing using Lucene's NRTCachingDirectory.
Lucene caches a newly created segment if both of the following conditions are true:
• The segment is the result of a flush or a merge and the estimated size of the merged segment is <=
solr.hdfs.nrtcachingdirectory.maxmergesizemb.
• The total cached bytes is <= solr.hdfs.nrtcachingdirectory.maxcachedmb.
The following parameters control NRT caching behavior:
<directoryFactory name="DirectoryFactory">
<bool name="solr.hdfs.blockcache.enabled">${solr.hdfs.blockcache.enabled:true}</bool>
<int name="solr.hdfs.blockcache.slab.count">${solr.hdfs.blockcache.slab.count:1}</int>
<bool
name="solr.hdfs.blockcache.direct.memory.allocation">${solr.hdfs.blockcache.direct.memory.allocation:true}</bool>
<int
name="solr.hdfs.blockcache.blocksperbank">${solr.hdfs.blockcache.blocksperbank:16384}</int>
<bool
name="solr.hdfs.blockcache.read.enabled">${solr.hdfs.blockcache.read.enabled:true}</bool>
<bool
name="solr.hdfs.nrtcachingdirectory.enable">${solr.hdfs.nrtcachingdirectory.enable:true}</bool>
<int
name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">${solr.hdfs.nrtcachingdirectory.maxmergesizemb:16}</int>
<int
name="solr.hdfs.nrtcachingdirectory.maxcachedmb">${solr.hdfs.nrtcachingdirectory.maxcachedmb:192}</int>
</directoryFactory>
The following example illustrates passing Java options by editing the /etc/default/solr or
/opt/cloudera/parcels/CDH-*/etc/default/solr configuration file:
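For instance, a line such as the following (reusing values shown earlier in this section) could be set in that file:
JAVA_OPTS="-Xmx10g -XX:MaxDirectMemorySize=20g -XX:+UseLargePages -Dsolr.hdfs.blockcache.slab.count=100"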
For better performance, Cloudera recommends setting the Linux swap space on all Solr server hosts as shown below:
• Minimize swappiness (for example, by setting vm.swappiness to 1, as recommended earlier in this section).
• Disable swap space until the next reboot:
sudo swapoff -a
Garbage Collection
Choose different garbage collection options for best performance in different environments. Some garbage collection
options typically chosen include:
• Concurrent low pause collector: Use this collector in most cases. This collector attempts to minimize "Stop the
World" events. Avoiding these events can reduce connection timeouts, such as with ZooKeeper, and may improve
user experience. This collector is enabled using the Java system property -XX:+UseConcMarkSweepGC.
• Throughput collector: Consider this collector if raw throughput is more important than user experience. This
collector typically uses more "Stop the World" events so this may negatively affect user experience and connection
timeouts such as ZooKeeper heartbeats. This collector is enabled using the Java system property
-XX:+UseParallelGC. If UseParallelGC "Stop the World" events create problems, such as ZooKeeper timeouts,
consider using the UseParNewGC collector as an alternative collector with similar throughput benefits.
For information on setting Java system properties, see Setting Java System Properties for Solr on page 406.
You can also affect garbage collection behavior by increasing the Eden space to accommodate new objects. With
additional Eden space, garbage collection does not need to run as frequently on new objects.
Replication
You can adjust the degree to which different data is replicated.
Replication Settings
Note: Do not adjust HDFS replication settings for Solr in most cases.
To adjust the Solr replication factor for index files stored in HDFS:
• Cloudera Manager:
1. Go to Solr service > Configuration > Category > Advanced.
2. Click the plus sign next to Solr Service Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml to
add a new property with the following values:
Name: dfs.replication
Value: 2
3. Click Save Changes.
4. Restart the Solr service (Solr service > Actions > Restart).
• Unmanaged:
1. Configure the solr.hdfs.confdir system property to refer to the Solr HDFS configuration files. Typically
the value is /etc/solrhdfs/. For information on setting Java system properties, see Setting Java System
Properties for Solr on page 406.
2. Set the DFS replication value in the HDFS configuration file at the location you specified in the previous step.
For example, to set the replication value to 2, you would change the dfs.replication setting as follows:
<property>
<name>dfs.replication<name>
<value>2<value>
<property>
Replicas
If you have sufficient additional hardware, add more replicas for a linear boost of query throughput. Note that adding
replicas may slow write performance on the first replica, but otherwise this should have minimal negative consequences.
Transaction Log Replication
Beginning with CDH 5.4.1, Search supports configurable transaction log replication levels for replication logs stored in
HDFS. Cloudera recommends leaving the value unchanged at 3 or, barring that, setting it to at least 2.
Configure the transaction log replication factor for a collection by modifying the tlogDfsReplication setting in
solrconfig.xml. The tlogDfsReplication is a new setting in the updateLog settings area. An excerpt of the
solrconfig.xml file where the transaction log replication factor is set is as follows:
<updateHandler class="solr.DirectUpdateHandler2">
<!-- Enables a transaction log, used for real-time get, durability, and
SolrCloud replica recovery. The log can grow as big as
uncommitted changes to the index, so use of a hard autoCommit
is recommended (see below).
"dir" - the target directory for transaction logs, defaults to the
solr data directory. -->
<updateLog>
<str name="dir">${solr.ulog.dir:}</str>
<int name="tlogDfsReplication">${solr.ulog.tlogDfsReplication:3}</int>
<int name="numVersionBuckets">${solr.ulog.numVersionBuckets:65536}</int>
</updateLog>
The default replication level is 3. For clusters with fewer than three DataNodes (such as proof-of-concept clusters),
reduce this number to the number of DataNodes in the cluster. Changing the replication level only applies to new
transaction logs.
Initial testing shows no significant performance regression for common use cases.
Shards
In some cases, oversharding can help improve performance including intake speed. If your environment includes
massively parallel hardware and you want to use these available resources, consider oversharding. You might increase
the number of replicas per host from 1 to 2 or 3. Making such changes creates complex interactions, so you should
continue to monitor your system's performance to ensure that the costs of oversharding do not outweigh the benefits.
Commits
Changing commit values may improve performance in some situations. These changes result in tradeoffs and may not
be beneficial in all cases; a configuration sketch follows this list.
• For hard commit values, the default value of 60000 (60 seconds) is typically effective, though changing this value
to 120 seconds may improve performance in some cases. Note that setting this value higher, such as to 600
seconds, may result in undesirable performance tradeoffs.
• Consider increasing the auto-commit value from 15000 (15 seconds) to 120000 (120 seconds).
• Enable soft commits and set the value to the largest value that meets your requirements. The default value of
1000 (1 second) is too aggressive for some environments.
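A sketch of the corresponding solrconfig.xml elements, using the values discussed above; the exact values you choose should reflect your own durability and visibility requirements:
<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:1000}</maxTime>
</autoSoftCommit>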
Other Resources
• General information on Solr caching is available on the Query Settings in SolrConfig page in the Solr Reference
Guide.
• Information on issues that influence performance is available on the SolrPerformanceFactors page on the Solr
Wiki.
• Resource Management describes how to use Cloudera Manager to manage resources, for example with Linux
cgroups.
• For information on improving querying performance, see How to make searching faster.
• For information on improving indexing performance, see How to make indexing faster.
Operations such as coalesce can result in a task processing multiple input partitions, but the transformation is still
considered narrow because the input records used to compute any single output record can still only reside in a limited
subset of the partitions.
Spark also supports transformations with wide dependencies, such as groupByKey and reduceByKey. In these
dependencies, the data required to compute the records in a single partition can reside in many partitions of the parent
dataset. To perform these transformations, all of the tuples with the same key must end up in the same partition,
processed by the same task. To satisfy this requirement, Spark performs a shuffle, which transfers data around the
cluster and results in a new stage with a new set of partitions.
For example, consider the following code:
sc.textFile("someFile.txt").map(mapFunc).flatMap(flatMapFunc).filter(filterFunc).count()
It runs a single action, count, which depends on a sequence of three transformations on a dataset derived from a text
file. This code runs in a single stage, because none of the outputs of these three transformations depend on data that
comes from different partitions than their inputs.
In contrast, this Scala code finds how many times each character appears in all the words that appear more than 1,000
times in a text file:
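A sketch of such a program; the file name and threshold handling are illustrative:
val tokenized = sc.textFile("someFile.txt").flatMap(_.split(" "))
val wordCounts = tokenized.map((_, 1)).reduceByKey(_ + _)
val filtered = wordCounts.filter(_._2 > 1000)
val charCounts = filtered.flatMap(_._1.toCharArray).map((_, 1)).reduceByKey(_ + _)
charCounts.collect()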
This example has three stages. The two reduceByKey transformations each trigger stage boundaries, because computing
their outputs requires repartitioning the data by keys.
A final example is this more complicated transformation graph, which includes a join transformation with multiple
dependencies:
The pink boxes show the resulting stage graph used to run it:
At each stage boundary, data is written to disk by tasks in the parent stages and then fetched over the network by
tasks in the child stage. Because they incur high disk and network I/O, stage boundaries can be expensive and should
be avoided when possible. The number of data partitions in a parent stage may be different than the number of
partitions in a child stage. Transformations that can trigger a stage boundary typically accept a numPartitions
argument, which specifies into how many partitions to split the data in the child stage. Just as the number of reducers
is an important parameter in MapReduce jobs, the number of partitions at stage boundaries can determine an
application's performance. Tuning the Number of Partitions on page 418 describes how to tune this number.
This results in unnecessary object creation because a new set must be allocated for each record.
Instead, use aggregateByKey, which performs the map-side aggregation more efficiently:
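A sketch of the aggregateByKey form described, assuming rdd is an RDD of (key, String) pairs:
val zero = new scala.collection.mutable.HashSet[String]()
rdd.aggregateByKey(zero)(
  (set, v) => set += v,          // add each value to the per-partition set
  (set1, set2) => set1 ++= set2) // merge sets produced by different partitions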
• flatMap-join-groupBy. When two datasets are already grouped by key and you want to join them and keep
them grouped, use cogroup. This avoids the overhead associated with unpacking and repacking the groups.
rdd1 = someRdd.reduceByKey(...)
rdd2 = someOtherRdd.reduceByKey(...)
rdd3 = rdd1.join(rdd2)
Because no partitioner is passed to reduceByKey, the default partitioner is used, resulting in rdd1 and rdd2 both
being hash-partitioned. These two reduceByKey transformations result in two shuffles. If the datasets have the same
number of partitions, a join requires no additional shuffling. Because the datasets are partitioned identically, the set
of keys in any single partition of rdd1 can only occur in a single partition of rdd2. Therefore, the contents of any single
output partition of rdd3 depends only on the contents of a single partition in rdd1 and single partition in rdd2, and
a third shuffle is not required.
For example, if someRdd has four partitions, someOtherRdd has two partitions, and both the reduceByKeys use
three partitions, the set of tasks that run would look like this:
If rdd1 and rdd2 use different partitioners or use the default (hash) partitioner with different numbers of partitions,
only one of the datasets (the one with the fewer number of partitions) needs to be reshuffled for the join:
To avoid shuffles when joining two datasets, you can use broadcast variables. When one of the datasets is small enough
to fit in memory in a single executor, it can be loaded into a hash table on the driver and then broadcast to every
executor. A map transformation can then reference the hash table to do lookups.
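A sketch of this pattern, assuming smallRdd and largeRdd are pair RDDs keyed the same way and smallRdd fits comfortably in driver and executor memory:
val smallLookup = sc.broadcast(smallRdd.collectAsMap())
largeRdd.flatMap { case (key, value) =>
  // Emit a joined record only when the key exists in the broadcast table.
  smallLookup.value.get(key).map(other => (key, (value, other)))
}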
while not generating enough partitions to use all available cores. In this case, invoking repartition with a high number
of partitions (which triggers a shuffle) after loading the data allows the transformations that follow to use more of the
cluster's CPU.
Another example arises when using the reduce or aggregate action to aggregate data into the driver. When
aggregating over a high number of partitions, the computation can quickly become bottlenecked on a single thread in
the driver merging all the results together. To lighten the load on the driver, first use reduceByKey or aggregateByKey
to perform a round of distributed aggregation that divides the dataset into a smaller number of partitions. The values
in each partition are merged with each other in parallel, before being sent to the driver for a final round of aggregation.
See treeReduce and treeAggregate for examples of how to do that.
This method is especially useful when the aggregation is already grouped by a key. For example, consider an application
that counts the occurrences of each word in a corpus and pulls the results into the driver as a map. One approach,
which can be accomplished with the aggregate action, is to compute a local map at each partition and then merge
the maps at the driver. The alternative approach, which can be accomplished with aggregateByKey, is to perform
the count in a fully distributed way, and then simply collectAsMap the results to the driver.
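A sketch of the fully distributed approach, assuming words is an RDD[String]:
val counts = words
  .map(word => (word, 1L))
  .aggregateByKey(0L)(_ + _, _ + _) // per-word totals computed across the cluster
  .collectAsMap()                   // only the final totals are pulled into the driver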
Secondary Sort
The repartitionAndSortWithinPartitions transformation repartitions the dataset according to a partitioner and, within
each resulting partition, sorts records by their keys. This transformation pushes sorting down into the shuffle machinery,
where large amounts of data can be spilled efficiently and sorting can be combined with other operations.
For example, Apache Hive on Spark uses this transformation inside its join implementation. It also acts as a vital
building block in the secondary sort pattern, in which you group records by key and then, when iterating over the
values that correspond to a key, have them appear in a particular order. This scenario occurs in algorithms that need
to group events by user and then analyze the events for each user, based on the time they occurred.
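A minimal sketch of this pattern, using hypothetical (user, timestamp) keys; only the user determines the target partition, while the shuffle sorts each partition by the full key so each user's events arrive in time order:
import org.apache.spark.Partitioner

class UserPartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = key match {
    case (user: String, _) => math.abs(user.hashCode % numPartitions)
  }
}

val events = sc.parallelize(Seq(
  (("alice", 200L), "purchase"),
  (("alice", 100L), "login"),
  (("bob", 150L), "login")))

val sortedByUser = events.repartitionAndSortWithinPartitions(new UserPartitioner(4))
// Within each partition, a given user's events are now ordered by timestamp.
sortedByUser.collect().foreach(println)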
Consider also how the resources requested by Spark fit into resources YARN has available. The relevant YARN properties
are:
• yarn.nodemanager.resource.memory-mb controls the maximum sum of memory used by the containers on
each host.
• yarn.nodemanager.resource.cpu-vcores controls the maximum sum of cores used by the containers on
each host.
Requesting five executor cores results in a request to YARN for five cores. The memory requested from YARN is more
complex for two reasons:
• The --executor-memory/spark.executor.memory property controls the executor heap size, but executors
can also use some memory off heap, for example, Java NIO direct buffers. The value of the
spark.yarn.executor.memoryOverhead property is added to the executor memory to determine the full memory
request to YARN for each executor; it defaults to max(384, 0.1 * spark.executor.memory).
• The union transformation creates a dataset with the sum of its parents' number of partitions.
• The cartesian transformation creates a dataset with the product of its parents' number of partitions.
Datasets with no parents, such as those produced by textFile or hadoopFile, have their partitions determined by
the underlying MapReduce InputFormat used. Typically, there is a partition for each HDFS block being read. The
number of partitions for datasets produced by parallelize is specified in the method, or
spark.default.parallelism if not specified. To determine the number of partitions in a dataset, call
rdd.partitions().size().
If the number of tasks is smaller than the number of slots available to run them, CPU usage is suboptimal. In addition, more
memory is used by any aggregation operations that occur in each task. In join, cogroup, or *ByKey operations,
objects are held in hashmaps or in-memory buffers to group or sort. join, cogroup, and groupByKey use these data
structures in the tasks for the stages that are on the fetching side of the shuffles they trigger. reduceByKey and
aggregateByKey use data structures in the tasks for the stages on both sides of the shuffles they trigger. If the records
in these aggregation operations exceed memory, the following issues can occur:
• Increased garbage collection, which can lead to pauses in computation.
• Spilling data to disk, causing disk I/O and sorting, which leads to job stalls.
To increase the number of partitions if the stage is reading from Hadoop:
• Use the repartition transformation, which triggers a shuffle.
• Configure your InputFormat to create more splits.
• Write the input data to HDFS with a smaller block size.
If the stage is receiving input from another stage, the transformation that triggered the stage boundary accepts a
numPartitions argument:
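For example, a sketch using reduceByKey, assuming rdd1 is a pair RDD; X is the partition count you are tuning:
val X = 100  // candidate partition count; see the discussion that follows
val rdd2 = rdd1.reduceByKey(_ + _, numPartitions = X)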
Determining the optimal value for X requires experimentation. Find the number of partitions in the parent dataset,
and then multiply that by 1.5 until performance stops improving.
You can also calculate X using a formula, but some quantities in the formula are difficult to calculate. The main goal is
to run enough tasks so that the data destined for each task fits in the memory available to that task. The memory
available to each task is:
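A sketch of that quantity, assuming the legacy Spark 1.x shuffle settings spark.shuffle.memoryFraction (default 0.2) and spark.shuffle.safetyFraction (default 0.8):
(spark.executor.memory * spark.shuffle.memoryFraction * spark.shuffle.safetyFraction) / spark.executor.cores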
Then, round up slightly, because too many partitions is usually better than too few.
When in doubt, err on the side of a larger number of tasks (and thus partitions). This contrasts with recommendations
for MapReduce, which, unlike Spark, has a high startup overhead for tasks.
The spark.serializer property controls the serializer used to convert between these two representations. Cloudera
recommends using the Kryo serializer, org.apache.spark.serializer.KryoSerializer.
The footprint of your records in these two representations has a significant impact on Spark performance. Review the
data types that are passed and look for places to reduce their size. Large deserialized objects result in Spark spilling
data to disk more often and reduce the number of deserialized records Spark can cache (for example, at the MEMORY
storage level). The Apache Spark tuning guide describes how to reduce the size of such objects. Large serialized objects
result in greater disk and network I/O, as well as reducing the number of serialized records Spark can cache (for example,
at the MEMORY_SER storage level). Make sure to register any custom classes you use with the
SparkConf#registerKryoClasses API.
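A sketch of both steps, with a hypothetical application class:
import org.apache.spark.SparkConf

case class Visit(userId: String, timestamp: Long)  // hypothetical application class

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Visit]))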
Tuning YARN
This topic applies to YARN clusters only, and describes how to tune and optimize YARN for your cluster.
Note: Download the Cloudera YARN tuning spreadsheet to help calculate YARN configurations. For
a short video overview, see Tuning YARN Applications.
Overview
This overview provides an abstract description of a YARN cluster and the goals of YARN tuning.
YARN tuning has three phases. The phases correspond to the tabs in the YARN tuning spreadsheet.
1. Cluster configuration, where you configure your hosts.
2. YARN configuration, where you quantify memory and vcores.
3. MapReduce configuration, where you allocate minimum and maximum resources for specific map and reduce
tasks.
YARN and MapReduce have many configurable properties. For a complete list, see Cloudera Manager Configuration
Properties. The YARN tuning spreadsheet lists the essential subset of these properties that are most likely to improve
performance for common MapReduce applications.
Cluster Configuration
In the Cluster Configuration tab, you define the worker host configuration and cluster size for your YARN implementation.
As with any system, the more memory and CPU resources available, the faster the cluster can process large amounts
of data. A machine with 4 CPUs with HyperThreading, each with 6 cores, provides 48 vcores per host.
3 TB hard drives in a 2-unit server installation with 12 available slots in JBOD (Just a Bunch Of Disks) configuration is a
reasonable balance of performance and pricing at the time the spreadsheet was created. The cost of storage decreases
over time, so you might consider 4 TB disks. Larger disks are expensive and not required for all use cases.
Two 1-Gigabit Ethernet ports provide sufficient throughput at the time the spreadsheet was published, but 10-Gigabit
Ethernet ports are an option where price is of less concern than speed.
Start with at least 8 GB for your operating system, and 1 GB for Cloudera Manager. If services outside of CDH require
additional resources, add those numbers under Other Services.
The HDFS DataNode uses a minimum of 1 core and about 1 GB of memory. The same requirements apply to the YARN
NodeManager.
The spreadsheet lists several optional services:
• The Impala daemon requires at least 16 GB of memory.
• HBase RegionServers require 12-16 GB of memory.
YARN Configuration
On the YARN Configuration tab, you verify your available resources and set minimum and maximum limits for each
container.
MapReduce Configuration
On the MapReduce Configuration tab, you can plan for increased task-specific memory capacity.
Continuous Scheduling
Enabling or disabling continuous scheduling changes how often YARN schedules: either continuously or based on
node heartbeats. For larger clusters (more than 75 nodes) with heavy YARN workloads, Cloudera generally recommends
disabling continuous scheduling with the following settings (a configuration sketch follows below):
• yarn.scheduler.fair.continuous-scheduling-enabled should be false
• yarn.scheduler.fair.assignmultiple should be true
On large clusters, continuous scheduling can cause the ResourceManager to appear unresponsive since continuous
scheduling iterates through all the nodes in the cluster.
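For an unmanaged cluster, a sketch of the corresponding yarn-site.xml entries (in Cloudera Manager, set the equivalent Fair Scheduler properties through the YARN service configuration instead):
<property>
  <name>yarn.scheduler.fair.continuous-scheduling-enabled</name>
  <value>false</value>
</property>
<property>
  <name>yarn.scheduler.fair.assignmultiple</name>
  <value>true</value>
</property>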
For more information about continuous scheduling tuning, see the following knowledge base article: FairScheduler
Tuning with assignmultiple and Continuous Scheduling
Note: OpenJDK 11 is only available with Cloudera Manager and CDH 6.3 and higher.
Cloudera Manager alerts you when memory is overcommitted on cluster hosts. To view these alerts and adjust the
allocations:
1. Log in to the Cloudera Manager Admin Console
2. Go to Home > Configuration > Configuration Issues.
3. Look for entries labeled Memory Overcommit Validation Threshold and note the hostname of the affected host.
{{JAVA_GC_ARGS}} -XX:MaxPermSize=512M
7. To replace the default Java options, delete the {{JAVA_GC_ARGS}} placeholder and replace it with one or more
Java options, separated by spaces.
8. The service will now have a stale configuration and must be restarted. See Restarting a Service on page 254.
The default {{JAVA_GC_ARGS}} values vary by role and include options such as the following:
-XX:CMSInitiatingOccupancyFraction=70
-XX:+CMSParallelRemarkEnabled
-verbose:gc
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-Dlibrary.leveldbjni.path={{CMF_CONF_DIR}}
Resource Management
Resource management helps ensure predictable behavior by defining the impact of different services on cluster
resources. Use resource management to:
• Guarantee completion in a reasonable time frame for critical workloads.
• Support reasonable cluster scheduling between groups of users based on fair allocation of resources per group.
• Prevent users from depriving other users of access to the cluster.
Static Allocation
Statically allocating resources using cgroups is configurable through a single static service pool wizard. You allocate
services as a percentage of total resources, and the wizard configures the cgroups.
For example, the following figure illustrates static pools for HBase, HDFS, Impala, and YARN services that are respectively
assigned 20%, 30%, 20%, and 30% of cluster resources.
Dynamic Allocation
You can dynamically apportion resources that are statically allocated to YARN and Impala by using dynamic resource
pools.
Depending on the version of CDH you are using, dynamic resource pools in Cloudera Manager support the following
scenarios:
• YARN - YARN manages the virtual cores, memory, running applications, maximum resources for undeclared
children (for parent pools), and scheduling policy for each pool. In the preceding diagram, three dynamic resource
pools—Dev, Product, and Mktg with weights 3, 2, and 1 respectively—are defined for YARN. If an application starts
and is assigned to the Product pool, and other applications are using the Dev and Mktg pools, the Product resource
pool receives 30% x 2/6 (or 10%) of the total cluster resources. If no applications are using the Dev and Mktg pools,
the YARN Product pool is allocated 30% of the cluster resources.
• Impala - Impala manages memory for pools running queries and limits the number of running and queued queries
in each pool.
The scenarios in which YARN manages resources map to the YARN scheduler policy. The scenarios in which Impala
independently manages resources use Impala admission control.
Note:
• I/O allocation only works when short-circuit reads are enabled.
• I/O allocation does not handle write side I/O because cgroups in the Linux kernel do not currently
support buffered writes.
5. Step 4 of 4: Progress displays the status of the restart commands. Click Finished after the restart commands
complete.
6. After you enable static service pools, there are three additional tasks.
a. Delete everything under the local directory path on NodeManager hosts. The local directory path is
configurable, and can be verified in Cloudera Manager with YARN > Configuration > NodeManager Local
Directories.
b. Enable cgroups for resource management. You can enable cgroups in Cloudera Manager with Yarn >
Configuration > Use CGroups for Resource Management.
c. If you are using the optional Impala scratch directory, delete all files in the Impala scratch directory. The
directory path is configurable, and can be verified in Cloudera Manager with Impala > Configuration > Impala
Daemon Scratch Directories.
Cgroup parameter support (CPU Shares, I/O Weight, Memory Soft Limit, and Memory Hard Limit) varies by Linux
distribution. The supported distributions include Red Hat Enterprise Linux, CentOS, and Oracle Enterprise Linux 6 and
7; SUSE Linux Enterprise Server 11 and 12; Ubuntu 12.04, 14.04, and 16.04 LTS; Debian 6.0, 7.0, and 7.1; and Oracle
Linux 6 and 7.
The exact level of support can be found in the Cloudera Manager Agent log file, shortly after the Agent has started.
See Viewing the Cloudera Manager Server Log on page 391 to find the Agent log. In the log file, look for an entry like
this:
The has_cpu and similar entries correspond directly to support for the CPU, I/O, and memory parameters.
Further Reading
• https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt
• https://www.kernel.org/doc/Documentation/cgroup-v1/blkio-controller.txt
• https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt
• Managing System Resources on Red Hat Enterprise Linux 6
• Managing System Resources on Red Hat Enterprise Linux 7
Limitations
• Cgroup-based resource management parameters for role groups and role instance overrides must be saved one at
a time. Otherwise, some changes that should be reflected dynamically are ignored.
• The role group abstraction is an imperfect fit for resource management parameters, where the goal is often to
take a numeric value for a host resource and distribute it amongst running roles. The role group represents a
"horizontal" slice: the same role across a set of hosts. However, the cluster is often viewed in terms of "vertical"
slices, each being a combination of worker roles (such as TaskTracker, DataNode, RegionServer, Impala Daemon,
and so on). Nothing in Cloudera Manager guarantees that these disparate horizontal slices are "aligned" (meaning,
that the role assignment is identical across hosts). If they are unaligned, some of the role group values will be
incorrect on unaligned hosts. For example, a host whose role groups have been configured with memory limits
but that is missing a role will probably have unassigned memory.
• I/O Weight - The greater the I/O weight, the higher priority will be given to I/O requests made by the role when
I/O is under contention (either by roles managed by Cloudera Manager or by other system processes).
This only affects read requests; write requests remain unprioritized. The Linux I/O scheduler controls when buffered
writes are flushed to disk, based on time and quantity thresholds, and it continually flushes buffered writes from
multiple sources rather than from specific prioritized processes.
Updates to this parameter are dynamically reflected in the running role.
• Memory Soft Limit - When the limit is reached, the kernel will reclaim pages charged to the process if and only if
the host is facing memory pressure. If reclaiming fails, the kernel may kill the process. Both anonymous as well
as page cache pages contribute to the limit.
After updating this parameter, you must restart the role for changes to take effect.
• Memory Hard Limit - When a role's resident set size (RSS) exceeds the value of this parameter, the kernel will
swap out some of the role's memory. If it is unable to do so, it will kill the process. The kernel measures memory
consumption in a manner that does not necessarily match what top or ps reports for RSS, so expect this
limit to be a rough approximation.
After updating this parameter, you must restart the role for changes to take effect.
Action Procedure
CPU 1. Leave DataNode and TaskTracker role group CPU shares at 1024.
2. Set Impala Daemon role group's CPU shares to 256.
3. The TaskTracker role group should be configured with a Maximum Number of Simultaneous
Map Tasks of 2 and a Maximum Number of Simultaneous Reduce Tasks of 1. This yields an
upper bound of three MapReduce tasks at any given time; this is an important detail for
memory sizing.
Memory 1. Set Impala Daemon role group memory limit to 1024 MB.
2. Leave DataNode maximum Java heap size at 1 GB.
3. Leave TaskTracker maximum Java heap size at 1 GB.
4. Leave MapReduce Child Java Maximum Heap Size for Gateway at 1 GB.
5. Leave cgroups hard memory limits alone. We'll rely on "cooperative" memory limits
exclusively, as they yield a nicer user experience than the cgroups-based hard memory limits.
I/O 1. Leave DataNode and TaskTracker role group I/O weight at 500.
2. Set the Impala Daemon role group's I/O weight to 125.
When you're done with configuration, restart all services for these changes to take effect. The results are:
1. When MapReduce jobs are running, all Impala queries together will consume up to a fifth of the cluster's CPU
resources.
2. Individual Impala Daemons will not consume more than 1 GB of RAM. If this figure is exceeded, new queries will
be cancelled.
Pool Hierarchy
YARN resource pools can be nested, with subpools restricted by the settings of their parent pool. This allows you to
specify a set of pools whose resources are limited by the resources of a parent. Each subpool can have its own resource
restrictions; if those restrictions fall within the configuration of the parent pool, the limits for the subpool take effect.
If the limits for the subpool exceed those of the parent, the parent pool limits take precedence.
You create a parent pool either by configuring it as a parent or by creating a subpool under the pool. Once a pool is a
parent, you cannot submit jobs to that pool; they must be submitted to a subpool.
• YARN - Weight, Virtual Cores, Min and Max Memory, Max Running Apps, Max Resources for Undeclared Children,
and Scheduling Policy
• Impala Admission Control - Max Memory, Max Running Queries, Max Queued Queries, Queue Timeout, Minimum
Query Memory Limit, Maximum Query Memory Limit, Clamp MEM_LIMIT Query Option, and Default Query
Memory Limit
To view dynamic resource pool configuration:
1. Select Clusters > Cluster name > Dynamic Resource Pool Configuration. If the cluster has a YARN service, the
YARN > Resource Pools tab displays. If the cluster has an Impala service enabled as described in Enabling or
Disabling Impala Admission Control in Cloudera Manager on page 439, the Impala Admission Control > Resource
Pools tab displays.
2. Click the YARN or Impala Admission Control tab.
Important:
In and higher, admission control and dynamic resource pools are enabled by default. However, until
you configure the settings for the dynamic resource pools, the admission control feature is effectively
not enabled because, with the default settings, each dynamic pool will allow all queries of any memory
requirement to execute in the pool.
If you specify Max Memory, you should specify the amount of memory to allocate to each query in this pool. You
can do this in two ways:
• By setting Maximum Query Memory Limit and Minimum Query Memory Limit. This is preferred in CDH 6.1
and higher and gives Impala flexibility to set aside more memory to queries that are expected to be
memory-hungry.
• By setting Default Query Memory Limit to the exact amount of memory that Impala should set aside for queries
in that pool.
Note that if you do not set any of the above options, or set Default Query Memory Limit to 0, Impala will rely
entirely on memory estimates to determine how much memory to set aside for each query. This is not recommended
because it can result in queries not running or being starved for memory if the estimates are inaccurate.
For example, consider the following scenario:
• The cluster is running impalad daemons on five hosts.
• A dynamic resource pool has Max Memory set to 100 GB.
• The Maximum Query Memory Limit for the pool is 10 GB and Minimum Query Memory Limit is 2 GB. Therefore,
any query running in this pool could use up to 50 GB of memory (Maximum Query Memory Limit * number
of Impala nodes).
• Impala will execute varying numbers of queries concurrently because queries may be given memory limits
anywhere between 2 GB and 10 GB, depending on the estimated memory requirements. For example, Impala
may execute up to 10 small queries with 2 GB memory limits or two large queries with 10 GB memory limits
because that is what will fit in the 100 GB cluster-wide limit when executing on five hosts.
• The executing queries may use less memory than the per-host memory limit or the Max Memory cluster-wide
limit if they do not need that much memory. In general this is not a problem so long as you are able to execute
enough queries concurrently to meet your needs.
Minimum Query Memory Limit and Maximum Query Memory Limit
These two options determine the minimum and maximum per-host memory limit that will be chosen by Impala
Admission control for queries in this resource pool. If set, Impala admission control will choose a memory limit
between the minimum and maximum value based on the per-host memory estimate for the query. The memory
limit chosen determines the amount of memory that Impala admission control will set aside for this query on each
host that the query is running on. The aggregate memory across all of the hosts that the query is running on is
counted against the pool’s Max Memory.
Minimum Query Memory Limit must be less than or equal to Maximum Query Memory Limit and Max Memory.
You can override Impala’s choice of memory limit by setting the MEM_LIMIT query option. If the Clamp MEM_LIMIT
Query Option is selected and the user sets MEM_LIMIT to a value that is outside of the range specified by these
two options, then the effective memory limit will be either the minimum or maximum, depending on whether
MEM_LIMIT is lower than or higher than the range.
Default Query Memory Limit
The default memory limit applied to queries executing in this pool when no explicit MEM_LIMIT query option is set.
The memory limit chosen determines the amount of memory that Impala Admission control will set aside for this
query on each host that the query is running on. The aggregate memory across all of the hosts that the query is
running on is counted against the pool’s Max Memory.
This option is deprecated from CDH 6.1 and higher and is replaced by Maximum Query Memory Limit and Minimum
Query Memory Limit. Do not set this field if either Maximum Query Memory Limit or Minimum Query Memory
Limit is set.
Max Running Queries
Maximum number of concurrently running queries in this pool. The default value is unlimited for CDH 5.7 or higher.
(optional)
The maximum number of queries that can run concurrently in this pool. The default value is unlimited. Any queries
for this pool that exceed Max Running Queries are added to the admission control queue until other queries finish.
You can use Max Running Queries in the early stages of resource management, when you do not have extensive
data about query memory usage, to determine if the cluster performs better overall if throttling is applied to Impala
queries.
For a workload with many small queries, you typically specify a high value for this setting, or leave the default setting
of “unlimited”. For a workload with expensive queries, where some number of concurrent queries saturate the
memory, I/O, CPU, or network capacity of the cluster, set the value low enough that the cluster resources are not
overcommitted for Impala.
Once you have enabled memory-based admission control using other pool settings, you can still use Max Running
Queries as a safeguard. If queries exceed either the total estimated memory or the maximum number of concurrent
queries, they are added to the queue.
Max Queued Queries
Maximum number of queries that can be queued in this pool. The default value is 200 for CDH 5.3 or higher and
50 for previous versions of Impala. (optional)
Queue Timeout
The amount of time, in milliseconds, that a query waits in the admission control queue for this pool before being
canceled. The default value is 60,000 milliseconds.
In the following cases, Queue Timeout is not significant, and you can specify a high value to avoid canceling queries
unexpectedly:
• In a low-concurrency workload where few or no queries are queued
• In an environment without a strict SLA, where it does not matter if queries occasionally take longer than usual
because they are held in admission control
You might also need to increase the value to use Impala with some business intelligence tools that have their own
timeout intervals for queries.
In a high-concurrency workload, especially for queries with a tight SLA, long wait times in admission control can
cause a serious problem. For example, if a query needs to run in 10 seconds, and you have tuned it so that it runs
in 8 seconds, it violates its SLA if it waits in the admission control queue longer than 2 seconds. In a case like this,
set a low timeout value and monitor how many queries are cancelled because of timeouts. This technique helps
you to discover capacity, tuning, and scaling problems early, and helps avoid wasting resources by running expensive
queries that have already missed their SLA.
If you identify some queries that can have a high timeout value, and others that benefit from a low timeout value,
you can create separate pools with different values for this setting.
Clamp MEM_LIMIT Query Option
If this field is not selected, the MEM_LIMIT query option will not be bounded by the Maximum Query Memory
Limit and the Minimum Query Memory Limit values specified for this resource pool. By default, this field is selected
in CDH 6.1 and higher. The field is disabled if both Minimum Query Memory Limit and Maximum Query Memory
Limit are not set.
See Monitoring Dynamic Resource Pools on page 303 for more information.
Configuring ACLs
To configure which users and groups can submit and kill applications in any resource pool:
1. Enable ACLs.
2. In Cloudera Manager select the YARN service.
3. Click the Configuration tab.
4. Search for the Admin ACL property, and specify which users and groups can submit and kill applications.
Note: Group names in YARN Resource Manager ACLs are case sensitive. Consequently, if you
specify an uppercase group name in the ACL, it will not match the group name resolved from the
Active Directory because the Active Directory group name is resolved in lowercase.
5. Enter a Reason for change, and then click Save Changes to commit the changes.
6. Return to the Home page by clicking the Cloudera Manager logo.
7. Click the icon that is next to any stale services to invoke the cluster restart wizard.
8. Click Restart Stale Services.
9. Click Restart Now.
10. Click Finish.
Enabling ACLs
To specify whether ACLs are checked:
1. In Cloudera Manager select the YARN service.
2. Click the Configuration tab.
3. Search for the Enable ResourceManager ACLs property, and select the checkbox for the YARN service (Service-Wide).
4. Enter a Reason for change, and then click Save Changes to commit the changes.
5. Return to the Home page by clicking the Cloudera Manager logo.
6. Click the icon that is next to any stale services to invoke the cluster restart wizard.
7. Click Restart Stale Services.
8. Click Restart Now.
9. Click Finish.
The Off Hour configuration set assigns the production and development pools an equal amount of resources:
If you use an existing configuration set, select one from the drop-down list.
6. Configure the rule to repeat or not:
• To repeat the rule, keep the Repeat field selected and specify the repeat frequency. If the frequency is weekly,
specify the repeat day or days.
• If the rule does not repeat, clear the Repeat field and click the left side of the on field to display a drop-down
calendar where you set the starting date and time. When you specify the date and time, a default time window
of two hours is set in the right side of the on field. Click the right side to adjust the date and time.
7. Click Create.
8. Click Refresh Dynamic Resource Pools.
Cloudera Manager allows you to specify a set of ordered rules for assigning applications and queries to pools. You can
also specify default pool settings directly in the YARN fair scheduler configuration.
Some rules allow you to specify that the pool be created in the dynamic resource pool configuration if it does not
already exist. Allowing pools to be created is optional. If a rule is satisfied but the pool is not created, YARN runs the
job "ad hoc" in a pool to which resources are not assigned or managed.
If no rule is satisfied when the application or query runs, the YARN application or Impala query is rejected.
– root.[pool name] - Use root.pool name, where pool name is the name you specify in the Pool Name
field that displays after you select the rule.
– root.[username] - Use the pool that matches the name of the user that submitted the query. This is not
recommended.
– root.[primary group] - Use the pool that matches the primary group of the user that submitted the
query.
– root.[secondary group] - Use the pool that matches one of the secondary groups of the user that
submitted the query.
– root.default - Use the root.default pool.
Note: Currently, it is not possible to map a username to a resource pool with a different name.
For example, you cannot map the user1 user to the root.pool1 resource pool.
For more information about these rules, see the description of the queuePlacementPolicy element in Allocation
File Format.
5. (YARN only) To indicate that the pool should be created if it does not exist when the application runs, check the
Create pool if it does not exist checkbox.
6. Click Create. The rule is added to the top of the placement rule list and becomes the first rule evaluated.
7. Click Refresh Dynamic Resource Pools.
Disabling Impala Admission Control in Cloudera Manager on page 439, the Impala Admission Control > Resource
Pools tab displays.
2. Click the YARN or Impala Admission Control tab.
3. Click the Placement Rules tab.
4. Click Reorder Placement Rules.
5. Click Move Up or Move Down in a rule row.
6. Click Save.
7. Click Refresh Dynamic Resource Pools.
Example Placement Rules
The following figures show the default pool placement rule setting for YARN:
If a pool is specified at run time, that pool is used for the job and the pool is created if it did not exist. If no pool is
specified at run time, a pool named according to the user submitting the job within the root.users parent pool is
used. If that pool cannot be used (for example, because the root.users pool is a leaf pool), pool root.default is
used.
If you move rule 2 down (which specifies to run the job in a pool named after the user running the job nested within
the parent pool root.users), rule 2 becomes disabled because the previous rule (Use the pool root.default) is
always satisfied.
A scheduler determines which jobs run, where and when they run, and the resources allocated to the jobs. The YARN
(MRv2) and MapReduce (MRv1) computation frameworks support the following schedulers:
• FIFO - Allocates resources based on arrival time.
• Fair - Allocates resources to weighted pools, with fair sharing within each pool. When configuring the scheduling
policy of a pool, Dominant Resource Fairness (DRF) is a type of fair scheduler policy.
In Cloudera Manager the Dynamic Resource Pools Configuration screen provides an enhanced interface for configuring
the Fair Scheduler. In addition to allowing you to configure resource allocation properties, you can define schedules
for changing the values of the properties. Cloudera Manager automatically updates Fair Scheduler configuration files
according to the schedule.
Property Description
yarn.scheduler.fair.allow-undeclared-pools When set to true, pools specified in applications but not explicitly configured are created
at runtime with default settings. When set to false, applications specifying pools not
explicitly configured run in a pool named default. This setting applies both when an
application explicitly specifies a pool and when the application runs in a pool named with
the username associated with the application.
Default: true.
yarn.scheduler.fair.user-as-default-queue When set to true, the Fair Scheduler uses the username as the default pool name when
a pool name is not specified. When set to false, all applications are run in a shared pool,
called default.
Default: true.
yarn.scheduler.fair.preemption When enabled, under certain conditions, the Fair Scheduler preempts applications in
other pools. Preemption guarantees that production applications are not starved while
also allowing the cluster to be used for experimental and research applications. To
minimize wasted computation, the Fair Scheduler preempts the most recently launched
applications.
Default: false.
yarn.scheduler.fair.preemption.cluster-utilization-threshold The cluster utilization threshold above which preemption is triggered. If the cluster
utilization is under this threshold, preemption is not triggered even if there are starved
queues. The utilization is computed as the maximum ratio of usage to capacity among
all resources.
Default: 0.8.
For example:
...
<property>
<name>yarn.scheduler.fair.allow-undeclared-pools</name>
<value>true</value>
</property>
<property>
<name>yarn.scheduler.fair.user-as-default-queue</name>
<value>true</value>
</property>
<property>
<name>yarn.scheduler.fair.preemption</name>
<value>true</value>
</property>
<property>
<name>yarn.scheduler.fair.preemption.cluster-utilization-threshold</name>
<value>0.8</value>
</property>
...
minResources, maxResources Minimum and maximum share of resources that can be allocated to the
resource pool, in the form X mb, Y vcores. Values computed by
the weight settings are limited by (or constrained by) the minimum
and maximum values.
maxAMShare Fraction of the resource pool's fair share that can be used to run
ApplicationMasters. For example, if set to 1.0, ApplicationMasters
in the pool can take up to 100% of both the memory and CPU fair
share. The value of -1.0 disables this feature, and the
ApplicationMaster share is not checked. The default value is 0.5.
maxRunningApps See default elements.
fairSharePreemptionThreshold See default elements.
For example:
<allocations>
<queue name="root">
<weight>1.0</weight>
<schedulingPolicy>drf</schedulingPolicy>
<aclSubmitApps> </aclSubmitApps>
<aclAdministerApps>*</aclAdministerApps>
<queue name="production">
<minResources>1024 mb, 10 vcores</minResources>
<maxResources>5120 mb, 20 vcores</maxResources>
<weight>4.0</weight>
<schedulingPolicy>drf</schedulingPolicy>
<aclSubmitApps>*</aclSubmitApps>
<aclAdministerApps>*</aclAdministerApps>
</queue>
<queue name="development">
<weight>1.0</weight>
<schedulingPolicy>drf</schedulingPolicy>
<aclSubmitApps>*</aclSubmitApps>
<aclAdministerApps>*</aclAdministerApps>
</queue>
</queue>
<defaultQueueSchedulingPolicy>drf</defaultQueueSchedulingPolicy>
<queuePlacementPolicy>
<rule name="specified" create="true"/>
<rule name="user" create="true"/>
</queuePlacementPolicy>
</allocations>
Dynamic resource pools allow you to configure scheduler properties. See Configuring Default YARN Fair Scheduler
Properties on page 442.
Enabling Preemption
1. Select Clusters > Cluster name > Dynamic Resource Pool Configuration. If the cluster has a YARN service, the
YARN > Resource Pools tab displays.
2. Click Default Settings.
3. Click the Enable Fair Scheduler Preemption link.
4. Select the ResourceManager Default Group checkbox.
5. Enter a Reason for change, and then click Save Changes to commit the changes.
6. Return to the Home page by clicking the Cloudera Manager logo.
7. Click the icon that is next to any stale services to invoke the cluster restart wizard.
8. Click Restart Stale Services.
9. Click Restart Now.
10. Click Finish.
Disabling Preemption
1. Select Clusters > Cluster name > Dynamic Resource Pool Configuration. If the cluster has a YARN service, the
YARN > Resource Pools tab displays.
2. In Cloudera Manager select the YARN service.
3. Clear the Enable Fair Scheduler Preemption checkbox.
4. Enter a Reason for change, and then click Save Changes to commit the changes.
5. Return to the Home page by clicking the Cloudera Manager logo.
6. Click the icon that is next to any stale services to invoke the cluster restart wizard.
7. Click Restart Stale Services.
8. Click Restart Now.
9. Click Finish.
ln -s /data/2/impala_data /data/1/service_monitor/impala
To configure memory allocations, determine how many entities are being monitored and then consult the tables below
for required and recommended memory configurations.
To determine the number of entities being monitored:
1. Go to Clusters > Cloudera Management Service.
2. Locate the chart with the title Cloudera Management Service Monitored Entities.
The number of monitored entities for the Host Monitor and Service Monitor displays at the bottom of the chart.
In the following example, the Host Monitor has 46 monitored entities and the Service Monitor has 230 monitored
entities.
3. Use the number of monitored entities for the Host Monitor to determine its memory requirements and
recommendations in the tables below.
4. Use the number of monitored entities for the Service Monitor to determine its memory requirements and
recommendations in the tables below.
Host Monitor memory requirements:
Number of Monitored Entities    Number of Hosts    Required Java Heap Size    Recommended Non-Java Heap Size
0-2,000                         0-100              1 GB                       6 GB
2,000-4,000                     100-200            1.5 GB                     6 GB
4,000-8,000                     200-400            1.5 GB                     12 GB
8,000-16,000                    400-800            2.5 GB                     12 GB
16,000-20,000                   800-1,000          3.5 GB                     12 GB
Service Monitor memory requirements:
Number of Monitored Entities    Number of Hosts    Required Java Heap Size    Recommended Non-Java Heap Size
0-30,000                        0-100              2 GB                       12 GB
30,000-60,000                   100-200            3 GB                       12 GB
60,000-120,000                  200-400            3.5 GB                     12 GB
120,000-240,000                 400-800            8 GB                       20 GB
The Cluster Utilization Report screens in Cloudera Manager display aggregated utilization information for YARN and
Impala jobs. The reports display CPU utilization, memory utilization, resource allocations made by the YARN fair
scheduler, and Impala queries. The report displays aggregated utilization for the entire cluster and also breaks out
utilization by tenant, which is either a user or a resource pool. You can configure the report to display utilization for a
range of dates, specific days of the week, and time ranges.
The report displays the current utilization of CPU and memory resources and the resources that were allocated using
the Cloudera Manager resource management features. See Resource Management on page 431.
Using the information displayed in the Cluster Utilization Report, a CDH cluster administrator can verify that sufficient
resources are available for the number and types of jobs running in the cluster. An administrator can use the reports
to tune resource allocations so that resources are used efficiently and meet business requirements. Tool tips in the
report pages provide suggestions about how to improve performance based on the information displayed in the report.
Hover over a label to see these suggestions and other information. For example:
Important: This feature requires a Cloudera Enterprise license. It is not available in Cloudera Express.
See Managing Licenses on page 50 for more information.
If you want to create your own reports with similar functionality, or if you want to export the report results, see Creating
a Custom Cluster Utilization Report on page 466.
Note: The user that is configured with the Container Usage MapReduce Job User property
in the YARN service requires permissions to read the subdirectories of the HDFS directory
specified with the Cloudera Manager Container Usage Metrics Directory property. The
default umask of 022 allows any user to read from that directory. However, if a more strict
umask (for example, 027) is used, then those directories are not readable by any user. In
that case the user specified with the Container Usage MapReduce Job User property should
be added to the same group that owns the subdirectories.
For example, if the /tmp/cmYarnContainerMetrics/20161010 subdirectory is owned
by user and group yarn:hadoop, the user specified in Container Usage MapReduce Job
User should be added to the hadoop group.
Note: The directories you specify with the Cloudera Manager Container Usage Metrics
Directory and Container Usage Output Directory properties should not be located in
encryption zones.
g. (Optional) Enter the resource pool in which the container usage collection MapReduce job runs in the Container
Usage MapReduce Job Pool parameter. Cloudera recommends that you dedicate a resource pool for running
this MapReduce job.
Note: If you specify a custom resource pool, ensure that the placement rules for the cluster
allow for it. The first rule must be for resource pools to be specified at run time with the
Create pool if it does not exist option selected. Alternatively, ensure that the pool you specify
already exists. If the placement rule is not properly configured or the resource pool does not
already exist, the job may run in a different pool.
h. Enter a Reason for change, and then click Save Changes to commit the changes.
i. Click the Actions drop-down list and select Create CM Container Usage Metrics Dir.
j. Restart the YARN service:
a. Go to the YARN service.
b. Select Actions > Restart.
You can apply a configuration and date range that applies to all tabs in the report:
1. Click the Configuration drop-down menu.
2. Select one the configured options, or create a new configuration:
a. Click Create New Configuration.
b. Enter a Configuration Name.
c. Select the Tenant Type, either Pool or User.
d. Select the days of the week for which you want to report utilization.
e. Select All Day, or use the drop-down menus to specify a utilization time range for the report.
f. Click Create.
The configuration you created is now available from the Configuration drop-down menu.
3. Select a date range for the report:
a. Click the date range button.
b. Select one of the range options (Today, Yesterday, Last 7 Days, Last 30 Days, or This Month) or click Custom
Range and select the beginning and ending dates for the date range.
Note: The report updates utilization information every hour. The utilization information for Impala
and YARN queries does not display in the Cluster Utilization Report until captured by the hourly
update.
Overview Tab
The Overview tab provides a summary of CPU and memory utilization for the entire cluster and also for only YARN
applications and Impala queries. Two sections, CPU Utilization and Memory Utilization, display the following information:
YARN Tab
The YARN tab displays CPU and memory utilization for YARN applications on three tabs:
• Utilization Tab on page 463
• Capacity Planning Tab on page 464
• Preemption Tuning Tab on page 464
For information about managing YARN resources, see:
• YARN (MRv2) and MapReduce (MRv1) Schedulers on page 449
• Enabling and Disabling Fair Scheduler Preemption on page 453
• Dynamic Resource Pools on page 437
Utilization Tab
Impala Tab
The Impala tab displays CPU and memory utilization for Impala queries using three tabs:
Note: This histogram is generated from the minute-level metrics for Impala daemons. If the
minute-level metrics for the timestamp at which peak allocation happened are no longer
present in the Cloudera Service Monitor Time-Series Storage, the histogram shows no data.
To maintain a longer history for the minute-level metrics, increase the value of the Time-Series
Storage property for the Cloudera Service Monitor. (Go to the Cloudera Management
Service > Configuration and search for Time-Series Storage.)
• Max Utilized
– Peak Usage Time – The time when Impala used the maximum amount of memory for queries.
Click the drop-down list next to the date and time and select View Impala Queries Running at the Time to
see details about the queries.
– Max Utilized – The maximum memory that was used by Impala for executing queries. If the percentage is
high, consider increasing the number of hosts in the cluster.
– Reserved at the Time – The amount of memory reserved by Impala at the time when it was using the maximum
memory for executing queries.
Click View Time Series Chart to view a chart of peak memory utilization.
– Histogram of Utilized Memory at Peak Usage Time – Distribution of memory used per Impala daemon for
executing queries at the time Impala used the maximum memory. If some Impala daemons are using memory
close to the configured limit, consider adding more physical memory to the hosts.
Note: This histogram is generated from the minute-level metrics for Impala daemons. If the
minute-level metrics for the timestamp at which peak allocation happened are no longer
present in the Cloudera Service Monitor Time-Series Storage, the histogram shows no data.
To maintain a longer history for the minute-level metrics, increase the value of the Time-Series
Storage property for the Cloudera Service Monitor. (Go to the Cloudera Management
Service > Configuration and search for Time-Series Storage.)
queries you can use to build these custom reports. These reports all use the tsquery Language on page 377 to chart
time-series data.
SELECT
cpu_percent_across_hosts
WHERE
category=CLUSTER
AND clusterName=Cluster_Name
SELECT
total_cores_across_hosts
WHERE
category=CLUSTER
AND clusterName=Cluster_Name
SELECT
100 * total_physical_memory_used_across_hosts/total_physical_memory_total_across_hosts
WHERE
category=CLUSTER
AND clusterName=Cluster_Name
SELECT
total_physical_memory_total_across_hosts
WHERE
category=CLUSTER
AND clusterName=Cluster_Name
SELECT
counter_delta(impala_query_thread_cpu_time_rate)
WHERE
category=CLUSTER
AND clusterName=Cluster_Name
SELECT
counter_delta(impala_query_memory_accrual_rate)
WHERE
category=CLUSTER
AND clusterName=Cluster_Name
SELECT
yarn_reports_containers_used_cpu FROM REPORTS
WHERE
category=SERVICE
AND clusterName=Cluster_Name
SELECT
yarn_reports_containers_used_memory
FROM
REPORTS
WHERE
category=SERVICE
AND clusterName=Cluster_Name
SELECT
counter_delta(impala_query_thread_cpu_time_rate)
WHERE
category=IMPALA_POOL
AND poolName=Pool_Name
SELECT
counter_delta(impala_query_memory_accrual_rate)
WHERE
category=IMPALA_POOL
AND poolName=Pool_Name
SELECT
yarn_reports_containers_used_cpu FROM REPORTS
WHERE
category=YARN_POOL_USER
SELECT
yarn_reports_containers_used_memory
FROM
REPORTS
WHERE
category=YARN_POOL_USER
YARN Metrics
YARN VCore usage
Data Granularity: Raw
Units: VCore seconds
tsquery:
SELECT
yarn_reports_containers_used_vcores
FROM
REPORTS
WHERE
category=SERVICE
AND clusterName=Cluster_Name
SELECT
total_allocated_vcores_across_yarn_pools + total_available_vcores_across_yarn_pools
WHERE
category=SERVICE
AND clusterName=Cluster_Name
SELECT
yarn_reports_containers_used_memory FROM REPORTS
WHERE
category=SERVICE
AND clusterName=Cluster_Name
SELECT
total_available_memory_mb_across_yarn_pools +
total_allocated_memory_mb_across_yarn_pools
WHERE
category=SERVICE
AND clusterName=Cluster_Name
tsquery:
SELECT
yarn_reports_containers_used_vcores FROM REPORTS
WHERE
category=YARN_POOL_USER
To view metrics for a specific pool, add poolName=Pool Name to the tsquery statement.
Pool-level memory usage
The results of this query return the usage for each user in each pool. To see the total usage for a pool, sum all users
of the pool.
Data Granularity: Raw
Units: MB seconds
tsquery:
SELECT
yarn_reports_containers_used_memory FROM REPORTS
WHERE
category=YARN_POOL_USER
To view metrics for a specific pool, add poolName=Pool Name to the tsquery statement.
Pool-level allocated VCores
The results of this query return the usage for each user in each pool. To see the total usage for a pool, sum all users
of the pool.
Data Granularity: raw metric value
Units: VCore seconds
tsquery:
SELECT
yarn_reports_containers_allocated_vcores FROM REPORTS
WHERE
category=YARN_POOL_USER
To view metrics for a specific pool, add poolName=Pool Name to the tsquery statement.
Pool-level allocated memory
The results of this query return the usage for each user in each pool. To see the total usage for a pool, sum all users
of the pool.
Data Granularity: raw metric value
Units: megabyte seconds
tsquery:
SELECT
yarn_reports_containers_allocated_memory
FROM
REPORTS
WHERE
category=YARN_POOL_USER
To view metrics for a specific pool, add poolName=Pool Name to the tsquery statement.
Pool-level steady fair share VCore
Data Granularity: hourly
Units: VCores
tsquery:
SELECT
steady_fair_share_vcores
WHERE
category=YARN_POOL
To view metrics for a specific pool, add poolName=Pool Name to the tsquery statement.
Pool-level fair share VCore
Data Granularity: hourly
Units: VCores
tsquery:
SELECT
fair_share_vcores
WHERE
category=YARN_POOL
SELECT
steady_fair_share_mb
WHERE
category=YARN_POOL
To view metrics for a specific pool, add poolName=Pool Name to the tsquery statement.
Pool-level fair share memory
Data Granularity: hourly
Units: MB
tsquery:
SELECT
fair_share_mb
WHERE
category=YARN_POOL
To view metrics for a specific pool, add poolName=Pool Name to the tsquery statement.
Metric indicating contention
Data Granularity: hourly
Units: percentage
tsquery:
SELECT
container_wait_ratio
WHERE
category=YARN_POOL
To view metrics for a specific pool, add poolName=Pool Name to the tsquery statement.
SELECT
allocated_vcores_with_pending_containers
WHERE
category=YARN_POOL
To view metrics for a specific pool, add poolName=Pool Name to the tsquery statement.
Pool level steady fair share VCores when contention occurs
Data Granularity: hourly
Units: VCores
tsquery:
SELECT
steady_fair_share_vcores_with_pending_containers
WHERE
category=YARN_POOL
To view metrics for a specific pool, add poolName=Pool Name to the tsquery statement.
Pool level fair share VCores when contention occurs
Data Granularity: hourly
Units: VCores
tsquery:
SELECT
fair_share_vcores_with_pending_containers
WHERE
category=YARN_POOL
To view metrics for a specific pool, add poolName=Pool Name to the tsquery statement.
Pool level allocated memory when contention occurs
Data Granularity: hourly
Units: MB
tsquery:
SELECT
allocated_memory_mb_with_pending_containers
WHERE
category=YARN_POOL
To view metrics for a specific pool, add poolName=Pool Name to the tsquery statement.
Pool level steady fair share memory when contention occurs
Data Granularity: hourly
Units: MB
tsquery:
SELECT
steady_fair_share_mb_with_pending_containers
WHERE
category=YARN_POOL
To view metrics for a specific pool, add poolName=Pool Name to the tsquery statement.
Pool level fair share memory when contention occurs
Data Granularity: hourly
Units: MB
tsquery:
SELECT
fair_share_mb_with_pending_containers
WHERE
category=YARN_POOL
To view metrics for a specific pool, add poolName=Pool Name to the tsquery statement.
Impala-Specific Metrics
To view metrics for a specific pool, add poolName=Pool Name to the tsquery statement.
Total reserved memory
Data Granularity: hourly
Units: MB seconds
tsquery:
SELECT
total_impala_admission_controller_local_backend_mem_reserved_across_impala_daemon_pools
WHERE
category=CLUSTER
AND clusterName=Cluster_Name
SELECT
total_impala_admission_controller_local_backend_mem_usage_across_impala_daemon_pools
WHERE
category=CLUSTER
AND clusterName=Cluster_Name
SELECT
total_mem_tracker_process_limit_across_impalads
WHERE
category=CLUSTER
AND clusterName=Cluster_Name
Note: To query for pool-level metrics, change the category to IMPALA_POOL in the above tsquery
statements.
SELECT
counter_delta(queries_ingested_rate)
WHERE
category=IMPALA_POOL
AND clusterName=Cluster_Name
AND serviceName=Service_Name
and Time-Series Metric Data on page 455). This data is aggregated from raw metric values such as minimum, maximum,
etc. within the corresponding data window.
For example, if you do not need the metric data at a specific timestamp but care more about hourly usage, HOURLY
data is usually sufficient. In general, the longer the granularity window, the less storage the data requires, and the
longer that level of data can be kept before it is purged when the storage reaches the configured limit. In the case of
Cloudera Manager Cluster Utilization Reports, Cloudera Manager generates the reports based on an hourly window.
To view the Cloudera Manager Service Monitor data storage granularities, go to Clusters > Cloudera Management
Service > Service Monitor > Charts Library > Service Monitor Storage and scroll down to the Data Duration Covered
table, which shows the earliest available data points for each level of granularity. The value in the last(duration_covered)
column indicates the age of the oldest data in the table.
To configure the Time series storage used by the Service Monitor, go to Clusters > Cloudera Management Service >
Configuration > Charts Library > Service Monitor Storage and search for "Time-Series Storage".
High Availability
This guide is for Apache Hadoop system administrators who want to enable continuous availability by configuring
clusters without single points of failure.
Not all Hadoop components currently support high availability configurations. However, some components that are
currently single points of failure (SPOFs) can be configured to restart automatically in the event of a failure (Auto-Restart
Configurable, in the table below). Some components support high availability implicitly because they comprise distributed
processes (identified with an asterisk (*) in the table). In addition, some components depend on external databases,
which must also be configured to support high availability.
Background
In a standard configuration, the NameNode is a single point of failure (SPOF) in an HDFS cluster. Each cluster has a
single NameNode, and if that host or process becomes unavailable, the cluster as a whole is unavailable until the
NameNode is either restarted or brought up on a new host. The Secondary NameNode does not provide failover
capability.
The standard configuration reduces the total availability of an HDFS cluster in two major ways:
• In the case of an unplanned event such as a host crash, the cluster is unavailable until an operator restarts the
NameNode.
• Planned maintenance events such as software or hardware upgrades on the NameNode machine result in periods
of cluster downtime.
HDFS HA addresses the above problems by providing the option of running two NameNodes in the same cluster, in an
active/passive configuration. These are referred to as the active NameNode and the standby NameNode. Unlike the
Secondary NameNode, the standby NameNode is a hot standby, allowing a fast automatic failover to a new NameNode
in the case that a host crashes, or a graceful administrator-initiated failover for the purpose of planned maintenance.
You cannot have more than two NameNodes.
Implementation
Cloudera Manager and CDH support Quorum-based Storage to implement HA.
Quorum-based Storage
Quorum-based Storage refers to the HA implementation that uses a Quorum Journal Manager (QJM).
For the standby NameNode to keep its state synchronized with the active NameNode in this implementation, both
nodes communicate with a group of separate daemons called JournalNodes. When any namespace modification is
performed by the active NameNode, it durably logs a record of the modification to a majority of the JournalNodes.
The standby NameNode is capable of reading the edits from the JournalNodes, and is constantly watching them for
changes to the edit log. As the standby NameNode sees the edits, it applies them to its own namespace. In the event of a
failover, the standby ensures that it has read all of the edits from the JournalNodes before promoting itself to the
active state. This ensures that the namespace state is fully synchronized before a failover occurs.
To provide a fast failover, it is also necessary that the standby NameNode has up-to-date information regarding the
location of blocks in the cluster. To achieve this, DataNodes are configured with the location of both NameNodes, and
they send block location information and heartbeats to both.
It is vital for the correct operation of an HA cluster that only one of the NameNodes be active at a time. Otherwise,
the namespace state would quickly diverge between the two, risking data loss or other incorrect results. To ensure
this property and prevent the so-called "split-brain scenario," JournalNodes only ever allow a single NameNode to be
a writer at a time. During a failover, the NameNode which is to become active simply takes over the role of writing to
the JournalNodes, which effectively prevents the other NameNode from continuing in the active state, allowing the
new active NameNode to safely proceed with failover.
Note: Because of this, fencing is not required, but it is still useful; see Enabling HDFS HA on page 481.
Automatic Failover
Automatic failover relies on two additional components in an HDFS deployment: a ZooKeeper quorum, and the
ZKFailoverController process (abbreviated as ZKFC). In Cloudera Manager, the ZKFC process maps to the HDFS
Failover Controller role.
Apache ZooKeeper is a highly available service for maintaining small amounts of coordination data, notifying clients
of changes in that data, and monitoring clients for failures. The implementation of HDFS automatic failover relies on
ZooKeeper for the following functions:
• Failure detection - each of the NameNode machines in the cluster maintains a persistent session in ZooKeeper.
If the machine crashes, the ZooKeeper session will expire, notifying the other NameNode that a failover should
be triggered.
• Active NameNode election - ZooKeeper provides a simple mechanism to exclusively elect a node as active. If the
current active NameNode crashes, another node can take a special exclusive lock in ZooKeeper indicating that it
should become the next active NameNode.
The ZKFailoverController (ZKFC) is a ZooKeeper client that also monitors and manages the state of the NameNode.
Each of the hosts that run a NameNode also run a ZKFC. The ZKFC is responsible for:
• Health monitoring - the ZKFC contacts its local NameNode on a periodic basis with a health-check command. So
long as the NameNode responds promptly with a healthy status, the ZKFC considers the NameNode healthy. If
the NameNode has crashed, frozen, or otherwise entered an unhealthy state, the health monitor marks it as
unhealthy.
• ZooKeeper session management - when the local NameNode is healthy, the ZKFC holds a session open in ZooKeeper.
If the local NameNode is active, it also holds a special lock znode. This lock uses ZooKeeper's support for
"ephemeral" nodes; if the session expires, the lock node is automatically deleted.
• ZooKeeper-based election - if the local NameNode is healthy, and the ZKFC sees that no other NameNode currently
holds the lock znode, it will itself try to acquire the lock. If it succeeds, then it has "won the election", and is
responsible for running a failover to make its local NameNode active. The failover process is similar to the manual
failover described above: first, the previous active is fenced if necessary, and then the local NameNode transitions
to active state.
With N JournalNodes, the system can tolerate at most (N - 1) / 2 failures and continue to function normally. If the
requisite quorum is not available, the NameNode will not format or start, and you will see an error similar to this:
12/10/01 17:34:18 WARN namenode.FSEditLog: Unable to determine input streams from QJM
to [10.0.1.10:8485, 10.0.1.10:8486, 10.0.1.10:8487]. Skipping.
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
Note: In an HA cluster, the standby NameNode also performs checkpoints of the namespace state,
and thus it is not necessary to run a Secondary NameNode, CheckpointNode, or BackupNode in an
HA cluster. In fact, to do so would be an error. If you are reconfiguring a non-HA-enabled HDFS cluster
to be HA-enabled, you can reuse the hardware which you had previously dedicated to the Secondary
NameNode.
Enabling HDFS HA
Note: This page contains references to CDH 5 components or features that have been removed from
CDH 6. These references are only applicable if you are managing a CDH 5 cluster with Cloudera Manager
6. For more information, see Deprecated Items.
An HDFS high availability (HA) cluster uses two NameNodes—an active NameNode and a standby NameNode. Only
one NameNode can be active at any point in time. HDFS HA depends on maintaining a log of all namespace modifications
in a location available to both NameNodes, so that in the event of a failure, the standby NameNode has up-to-date
information about the edits and location of blocks in the cluster.
Important: Enabling and disabling HA causes a service outage for the HDFS service and all services
that depend on HDFS. Before enabling or disabling HA, ensure that there are no jobs running on your
cluster.
Important:
• Enabling or disabling HA causes the previous monitoring history to become unavailable.
• Some parameters will be automatically set as follows once you have enabled JobTracker HA. If
you want to change the value from the default for these parameters, use an advanced
configuration snippet.
– mapred.jobtracker.restart.recover: true
– mapred.job.tracker.persist.jobstatus.active: true
– mapred.ha.automatic-failover.enabled: true
– mapred.ha.fencing.methods: shell(true)
Important: Some steps, such as formatting the NameNode, may report failure if the action
was already completed. However, the configuration steps continue to execute after reporting
non-critical failed steps.
5. If you want to use other services in a cluster with HA configured, follow the procedures in Configuring Other CDH
Components to Use HDFS HA on page 485.
rmr /hadoop-ha/nameservice1
Fencing Methods
To ensure that only one NameNode is active at a time, a fencing method is required for the shared edits directory.
During a failover, the fencing method is responsible for ensuring that the previous active NameNode no longer has
access to the shared edits directory, so that the new active NameNode can safely proceed writing to it.
By default, Cloudera Manager configures HDFS to use a shell fencing method (shell(true)).
The fencing parameters are found in the Service-Wide > High Availability category under the configuration properties
for your HDFS service.
Fencing Configuration
dfs.ha.fencing.methods - a list of scripts or Java classes which will be used to fence the active NameNode during
a failover
It is desirable for correctness of the system that only one NameNode be in the active state at any given time.
When you use Quorum-based Storage, only one NameNode will ever be allowed to write to the JournalNodes, so there
is no potential for corrupting the file system metadata in a "split-brain" scenario. This is reflected in the default value
of shell(true) for the dfs.ha.fencing.methods, which does not explicitly try to fence the standby NameNode.
In the absence of explicit fencing, there is a narrow time window where the previously active NameNode may serve
out-of-date responses to reads from clients. This window ends when the previously active NameNode tries to write
to the JournalNodes, at which point the NameNode shuts down.
This window of stale read responses is rarely an issue for applications since there is no danger of split-brain corruption.
In rare or special cases where strong read consistency is required, use an explicit fencing method such as the
agent-based fencer.
Note: If you choose to use the agent-based fencing method, you should still configure something
like shell(true) as a fallback fencing option, because agent-based fencing fails if the other NameNode is
unresponsive.
The fencing methods used during a failover are configured as a carriage-return-separated list; they are attempted in
order until one of them indicates that fencing has succeeded.
For information on implementing your own custom fencing method, see the org.apache.hadoop.ha.NodeFencer
class.
Configuring the shell fencing method
shell - run an arbitrary shell command to fence the active NameNode
The shell fencing method runs an arbitrary shell command, which you can configure as shown below:
<property>
<name>dfs.ha.fencing.methods</name>
<value>shell(/path/to/my/script.sh arg1 arg2 ...)</value>
</property>
The string between '(' and ')' is passed directly to a bash shell and cannot include any closing parentheses.
When executed, the first argument to the configured script will be the address of the NameNode to be fenced, followed
by all arguments specified in the configuration.
The shell command will be run with an environment set up to contain all of the current Hadoop configuration variables,
with the '_' character replacing any '.' characters in the configuration keys. The configuration used has already had any
NameNode-specific configurations promoted to their generic forms - for example dfs_namenode_rpc-address will
contain the RPC address of the target node, even though the configuration may specify that variable as
dfs.namenode.rpc-address.ns1.nn1.
The following variables referring to the target node to be fenced are also available:
Variable Description
$target_host Hostname of the node to be fenced
$target_port IPC port of the node to be fenced
$target_address The above two variables, combined as host:port
$target_nameserviceid The nameservice ID of the NameNode to be fenced
$target_namenodeid The NameNode ID of the NameNode to be fenced
You can also use these environment variables as substitutions in the shell command itself. For example:
<property>
<name>dfs.ha.fencing.methods</name>
<value>shell(/path/to/my/script.sh --nameservice=$target_nameserviceid
$target_host:$target_port)</value>
</property>
If the shell command returns an exit code of 0, the fencing is determined to be successful. If it returns any other exit
code, the fencing was not successful and the next fencing method in the list will be attempted.
Note: This fencing method does not implement any timeout. If timeouts are necessary, they should
be implemented in the shell script itself (for example, by forking a subshell to kill its parent in some
number of seconds).
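The following is a minimal sketch of that pattern for a hypothetical custom fencing script; the script path, the 30-second
timeout, and the placeholder fencing logic are illustrative only:
#!/bin/bash
# Watchdog: kill this script after 30 seconds if fencing has not completed.
( sleep 30 && kill -9 $$ ) &
WATCHDOG=$!
# Site-specific fencing logic for the target NameNode ("$1") goes here.
kill "$WATCHDOG" 2>/dev/null
exit 0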
Note: You may want to stop the Hue and Impala services first, if present, as they depend on the
Hive service.
2. Issue the INVALIDATE METADATA statement from an Impala shell. This one-time operation makes all Impala
daemons across the cluster aware of the latest settings for the Hive metastore database. Alternatively, restart
the Impala service.
Example:
<action name="mr-node">
<map-reduce>
<job-tracker>${jobTracker}</job-tracker>
<name-node>hdfs://ha-nn
Note: For advanced use only: You can set the Force Failover checkbox to force the selected
NameNode to be active, irrespective of its state or the other NameNode's state. Forcing a failover
will first attempt to failover the selected NameNode to active mode and the other NameNode
to standby mode. It will do so even if the selected NameNode is in safe mode. If this fails, it will
proceed to transition the selected NameNode to active mode. To avoid having two NameNodes
be active, use this only if the other NameNode is either definitely stopped, or can be transitioned
to standby mode by the first failover step.
getServiceState
getServiceState - determine whether the given NameNode is active or standby
Connect to the provided NameNode to determine its current state, printing either "standby" or "active" to STDOUT as
appropriate. This subcommand might be used by cron jobs or monitoring scripts which need to behave differently
based on whether the NameNode is currently active or standby.
checkHealth
checkHealth - check the health of the given NameNode
Connect to the provided NameNode to check its health. The NameNode is capable of performing some diagnostics on
itself, including checking if internal services are running as expected. This command will return 0 if the NameNode is
healthy, non-zero otherwise. One might use this command for monitoring purposes.
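Both subcommands are run through the hdfs haadmin tool. For example, assuming nn1 is the NameNode ID of one of
the NameNodes (the ID is a placeholder; use the IDs defined for your nameservice):
hdfs haadmin -getServiceState nn1
hdfs haadmin -checkHealth nn1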
For a full list of haadmin command options, run: hdfs haadmin -help.
Changing a Nameservice Name for Highly Available HDFS Using Cloudera Manager
For background on HDFS high availability, see Enabling HDFS HA Using Cloudera Manager on page 481.
Before you start, make note of the name of the active NameNode role instance. You can find the list of NameNode
instances on the Instances tab for the HDFS service in the Cloudera Manager Admin Console.
Complete the following steps to change the NameService name for HDFS with HA:
1. Stop all services except ZooKeeper.
2. On a ZooKeeper server host, run zookeeper-client.
a. Execute the following to remove the configured nameservice. This example assumes the name of the
nameservice is nameservice1. You can identify the nameservice from the Federation and High Availability
section on the HDFS Instances tab:
rmr /hadoop-ha/nameservice1
3. In the Cloudera Manager Admin Console, update the NameNode nameservice name.
a. Go to the HDFS service.
b. Click the Configuration tab.
c. Type nameservice in the Search field.
d. For the NameNode Nameservice property, type the nameservice name in the NameNode (instance_name)
field. The name must be unique and can contain only alphanumeric characters.
e. Type quorum in the Search field.
f. For the Quorum-based Storage Journal name property, type the nameservice name in the NameNode
(instance_name) field.
g. Enter a Reason for change, and then click Save Changes to commit the changes.
4. Click the Instances tab.
5. In the Federation and High Availability pane, select Actions > Initialize High Availability State in ZooKeeper.
6. Go to the Hive service.
7. Select Actions > Update Hive Metastore NameNodes.
8. Go to the HDFS service.
9. Click the Instances tab.
10. Select the checkboxes next to the JournalNode role instances.
11. Select Actions for Selected > Start.
12. Click on an active NameNode role instance.
13. Select Actions > Initialize Shared Edits Directory.
14. Click the Cloudera Manager logo to return to the Home page.
15. Redeploy client configuration files.
16. Start all services except ZooKeeper.
Architecture
ResourceManager HA is implemented by means of an active-standby pair of ResourceManagers. On start-up, each
ResourceManager is in the standby state; the process is started, but the state is not loaded. When one of the
ResourceManagers is transitioning to the active state, the ResourceManager loads the internal state from the designated
state store and starts all the internal services. The stimulus to transition to active comes from either the administrator
(through the CLI) or through the integrated failover controller when automatic failover is enabled. The subsections
that follow provide more details about the components of ResourceManager HA.
ResourceManager Restart
Restarting the ResourceManager allows for the recovery of in-flight applications if recovery is enabled. To achieve this,
the ResourceManager stores its internal state, primarily application-related data and tokens, to the RMStateStore;
the cluster resources are re-constructed when the NodeManagers connect. The available alternatives for the state
store are MemoryRMStateStore (a memory-based implementation) and ZKRMStateStore (ZooKeeper-based
implementation). Note that MemoryRMStateStore will not work for HA.
Fencing
When running two ResourceManagers, a split-brain situation can arise where both ResourceManagers assume they
are active. To avoid this, only a single ResourceManager should be able to perform active operations and the other
ResourceManager should be "fenced". The ZooKeeper-based state store (ZKRMStateStore) allows only a single
ResourceManager to make changes to the stored state, implicitly fencing the other ResourceManager. This is
accomplished by the ResourceManager claiming exclusive create-delete permissions on the root znode. The ACLs on
the root znode are automatically created based on the ACLs configured for the store; in case of secure clusters, Cloudera
recommends that you set ACLs for the root znode such that both ResourceManagers share read-write-admin access,
but have exclusive create-delete access. The fencing is implicit and does not require explicit configuration (as fencing
in HDFS and MRv1 does). You can plug in a custom "Fencer" if you choose to – for example, to use a different
implementation of the state store.
Configuration and FailoverProxy
In an HA setting, you should configure two ResourceManagers to use different ports (for example, ports on different
hosts). To facilitate this, YARN uses the notion of a ResourceManager Identifier (rm-id). Each ResourceManager has
a unique rm-id, and all the RPC configurations (<rpc-address>; for example yarn.resourcemanager.address) for
that ResourceManager can be configured via <rpc-address>.<rm-id>. Clients, ApplicationMasters, and
NodeManagers use these RPC addresses to talk to the active ResourceManager automatically, even after a failover.
To achieve this, they cycle through the list of ResourceManagers in the configuration. This is done automatically and
does not require any configuration (as it does in HDFS and MapReduce (MRv1)).
Automatic Failover
By default, ResourceManager HA uses ZKFC (ZooKeeper-based failover controller) for automatic failover in case the
active ResourceManager is unreachable or goes down. Internally, the ActiveStandbyElector is used to elect the active
ResourceManager. The failover controller runs as part of the ResourceManager.
You can plug in a custom failover controller if you prefer.
Manual Transitions and Failover
You can use the command-line tool yarn rmadmin to transition a particular ResourceManager to active or standby
state, to fail over from one ResourceManager to the other, to get the HA state of a ResourceManager, and to monitor
a ResourceManager's health.
Important: Enabling or disabling HA will cause the previous monitoring history to become unavailable.
3. Select the host where you want the standby ResourceManager to be installed, and click Continue. Cloudera
Manager proceeds to run a set of commands that stop the YARN service, add a standby ResourceManager, initialize
the ResourceManager high availability state in ZooKeeper, restart YARN, and redeploy the relevant client
configurations.
4. Work preserving recovery is enabled for the ResourceManager by default when you enable ResourceManager HA
in Cloudera Manager. For more information, including instructions on disabling work preserving recovery, see
Work Preserving Recovery for YARN Components on page 490.
Note: ResourceManager HA does not affect the JobHistory Server (JHS). JHS does not maintain any
state, so if the host fails you can simply assign it to a new host. You can also enable process auto-restart
by doing the following:
1. Go to the YARN service.
2. Click the Configuration tab.
3. Select Scope > JobHistory Server.
4. Select Category > Advanced.
5. Locate the Automatically Restart Process property or search for it by typing its name in the Search
box.
6. Click Edit Individual Values.
7. Select the JobHistory Server Default Group.
8. Restart the JobHistory Server role.
[-transitionToActive serviceId]
[-transitionToStandby serviceId]
[-getServiceState serviceId]
[-checkHealth serviceId]
[-help <command>]
Note: Even though -help lists the -failover option, it is not supported by yarn rmadmin.
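For example, assuming rm1 is the rm-id of one of the ResourceManagers (substitute the rm-id values configured for
your cluster), you can check its current HA state with:
yarn rmadmin -getServiceState rm1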
Note: YARN does not support high availability for the JobHistory Server (JHS). If the JHS goes down,
Cloudera Manager will restart it automatically.
Note:
After moving the JobHistory Server to a new host, the URLs listed for the JobHistory Server on the
ResourceManager web UI still point to the old JobHistory Server. This affects existing jobs only. New
jobs started after the move are not affected. For any existing jobs that have the incorrect JobHistory
Server URL, there is no option other than to allow the jobs to roll off the history over time. For new
jobs, make sure that all clients have the updated mapred-site.xml that references the correct
JobHistory Server.
The following example configuration can be used with a Cloudera Manager advanced configuration snippet. Adjust
the configuration to suit your environment.
<property>
<name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
<value>true</value>
<description>Whether to enable work preserving recovery for the Resource
Manager</description>
</property>
<property>
<name>yarn.nodemanager.recovery.enabled</name>
<value>true</value>
<description>Whether to enable work preserving recovery for the Node
Manager</description>
</property>
<property>
<name>yarn.nodemanager.recovery.dir</name>
<value>/home/cloudera/recovery</value>
<description>The location for stored state on the Node Manager, if work preserving
recovery
is enabled.</description>
</property>
<property>
<name>yarn.nodemanager.address</name>
<value>0.0.0.0:45454</value>
</property>
Note: This page contains references to CDH 5 components or features that have been removed from
CDH 6. These references are only applicable if you are managing a CDH 5 cluster with Cloudera Manager
6. For more information, see Deprecated Items.
Follow the instructions in this section to configure high availability (HA) for JobTracker.
Important: Enabling or disabling JobTracker HA will cause the previous monitoring history to become
unavailable.
• You may enter more than one directory, though it is not required. The paths do not need to be the same on
both JobTracker hosts.
• If the directories you specify do not exist, they will be created with the appropriate permissions. If they already
exist, they must be empty and have the appropriate permissions.
• If the directories are not empty, Cloudera Manager will not delete the contents.
5. Optionally use the checkbox under Advanced Options to force initialize the ZooKeeper znode for auto-failover.
6. Click Continue. Cloudera Manager runs a set of commands that stop the MapReduce service, add a standby
JobTracker and Failover controller, initialize the JobTracker high availability state in ZooKeeper, create the job
status directory, restart MapReduce, and redeploy the relevant client configurations.
Cloudera recommends monitoring both Key Trustee Servers. If a Key Trustee Server fails catastrophically, restore it
from backup to a new host with the same hostname and IP address as the failed host. See Backing Up and Restoring
Key Trustee Server and Clients for more information. Cloudera does not support PostgreSQL promotion to convert a
passive Key Trustee Server to an active Key Trustee Server.
Depending on your cluster configuration and the security practices in your organization, you might need to restrict the
allowed versions of TLS/SSL used by Key Trustee Server. For details, see Specifying TLS/SSL Minimum Allowed Version
and Ciphers.
Important: You must assign the Key Trustee Server and Database roles to the same host. Assign the
Active Key Trustee Server and Active Database roles to one host, and the Passive Key Trustee Server
and Passive Database roles to a separate host.
After completing the Add Role Instances wizard, the Passive Key Trustee Server and Passive Database roles fail to start.
Complete the following manual actions to start these roles:
1. Stop the Key Trustee Server service (Key Trustee Server service > Actions > Stop).
2. Run the Set Up Key Trustee Server Database command (Key Trustee Server service > Actions > Set Up Key Trustee
Server Database).
3. Run the following command on the Active Key Trustee Server:
Replace keytrustee02.example.com with the hostname of the Passive Key Trustee Server.
4. Run the following command on the Passive Key Trustee Server:
5. Start the Key Trustee Server service (Key Trustee Server service > Actions > Start).
Important: Starting or restarting the Key Trustee Server service attempts to start the Active
Database and Passive Database roles. If the Active Database is not running when the Passive
Database attempts to start, the Passive Database fails to start. If this occurs, manually restart the
Passive Database role after confirming that the Active Database role is running.
6. Enable synchronous replication (Key Trustee Server service > Actions > Setup Enable Synchronous Replication
in HA mode).
7. Restart the Key Trustee Server service (Key Trustee Server service > Actions > Restart).
For parcel-based Key Trustee Server releases 5.8 and higher, Cloudera Manager automatically backs up Key Trustee
Server (using the ktbackup.sh script) after adding the Key Trustee Server service. It also schedules automatic backups
using cron. For package-based installations, you must manually back up Key Trustee Server and configure a cron job.
Cloudera Manager configures cron to run the backup script hourly. The latest 10 backups are retained in
/var/lib/keytrustee in cleartext. For information about using the backup script and configuring the cron job
(including how to encrypt backups), see Backing Up Key Trustee Server and Key Trustee KMS Using the ktbackup.sh
Script.
4. Click Select hosts and check the box for the host where you want to add the additional Key Management Server
Proxy role. See Resource Planning for Data at Rest Encryption for considerations when selecting a host. Click OK
and then Continue.
5. On the Review Changes page of the wizard, confirm the authorization code, organization name, and settings, and
then click Finish.
Important: The initial startup of the KMS instance may fail with the following error message:
If this occurs, it indicates that the KMS attempted to verify that the Key Trustee private key has been
synchronized with the new instance, but was unable to because that synchronization has not yet
taken place. This is expected behavior at this point in the process. Proceed to the next step, and
the new KMS instance will come up when the KMS service is restarted after the synchronization.
6. If it is not already running, start the new KMS instance. Select the new instance and go to Actions for Selected >
Start.
7. Go to Key Trustee KMS service > Configuration and make sure that the ZooKeeper Service dependency is set to
the ZooKeeper service for your cluster.
8. Synchronize the Key Trustee KMS private key.
Warning: It is very important that you perform this step. Failure to do so leaves Key Trustee KMS
in a state where keys are intermittently inaccessible, depending on which Key Trustee KMS host
a client interacts with, because cryptographic key material encrypted by one Key Trustee KMS
host cannot be decrypted by another. If you are already running multiple Key Trustee KMS hosts
with different private keys, immediately back up all Key Trustee KMS hosts, and contact Cloudera
Support for assistance correcting the issue.
If you fail to maintain proper synchronization of private keys between Key Trustee KMS hosts,
then the GPG validation check that runs automatically when the Key Trustee KMS is restarted
will return the following error and abort the restart operation, forcing you to synchronize private
keys before a restart can occur:
To determine whether the Key Trustee KMS private keys are different, compare the MD5 hash of the private keys.
On each Key Trustee KMS host, run the following command:
md5sum /var/lib/kms-keytrustee/keytrustee/.keytrustee/secring.gpg
If the outputs are different, contact Cloudera Support for assistance. Do not attempt to synchronize existing keys.
If you overwrite the private key and do not have a backup, any keys encrypted by that private key are permanently
inaccessible, and any data encrypted by those keys is permanently irretrievable. If you are configuring Key Trustee
KMS high availability for the first time, continue synchronizing the private keys.
Cloudera recommends following security best practices and transferring the private key using offline media, such
as a removable USB drive. For convenience (for example, in a development or testing environment where maximum
security is not required), you can copy the private key over the network by running the following rsync command
on the original Key Trustee KMS host:
Replace ktkms02.example.com with the hostname of the Key Trustee KMS host that you are adding.
9. Restart the Key Trustee KMS service (Key Trustee KMS service > Actions > Restart).
10. Restart the cluster.
11. Redeploy the client configuration (Home > Cluster-wide > Deploy Client Configuration).
12. Re-run the steps in Validating Hadoop Key Operations.
Warning: The same host must be specified for the Navigator HSM KMS metastore and Navigator
HSM KMS proxy.
4. On the Review Changes page of the wizard, confirm the HSM KMS settings, and then click Finish.
5. Go to HSM KMS service > Configuration and make sure that the ZooKeeper Service dependency is set to the
ZooKeeper service for your cluster.
6. In the Add Role Instance path, the initialize metastore action does not run automatically (as it does in the Add
Service wizard). When a new metastore instance is added, the initialize metastore action must be run manually
before starting the metastore. To do so, stop both role instances (metastore and proxy) and then run the initialize
metastore action.
7. Restart the HSM KMS service (HSM KMS service > Actions > Restart).
8. Restart the cluster.
9. Redeploy the client configuration (Home > Cluster-wide > Select from Cluster drop-down menu (arrow icon) >
Deploy Client Configuration).
10. Re-run the steps in Validating Hadoop Key Operations.
To restore from a backup: bring up a completely new instance of the HSM KMS service, and copy the /var/lib/hsmkp
and /var/lib/hsmkp-meta directories from the backup onto the file system of the restored nodes before starting
HSM KMS for the first time.
includes whether it came from the primary or a secondary RegionServer. If it came from a secondary, the client can
choose to verify the read later or not to treat it as definitive.
Keeping Replicas Current
The read replica feature includes two different mechanisms for keeping replicas up to date:
Using a Timer
In this mode, replicas are refreshed at a time interval controlled by the configuration option
hbase.regionserver.storefile.refresh.period.
Using Replication
In this mode, replicas are kept current between a source and sink cluster using HBase replication. This can potentially
allow for faster synchronization than using a timer. Each time a flush occurs on the source cluster, a notification is
pushed to the sink clusters for the table. To use replication to keep replicas current, you must first set the column
family attribute REGION_MEMSTORE_REPLICATION to false, then set the HBase configuration property
hbase.region.replica.replication.enabled to true.
Important: Read-replica updates using replication are not supported for the hbase:meta table.
Columns of hbase:meta must always have their REGION_MEMSTORE_REPLICATION attribute set
to false.
Important:
Before you enable read-replica support, make sure to account for their increased heap memory
requirements. Although no additional copies of HFile data are created, read-only replica regions have
the same memory footprint as normal regions and need to be considered when calculating the amount
of increased heap memory required. For example, if your table requires 8 GB of heap memory, when
you enable three replicas, you need about 24 GB of heap memory.
To enable support for read replicas in HBase, you must set several properties.
6. Locate the HBase Service Advanced Configuration Snippet (Safety Valve) for hbase-site.xml property or search
for it by typing its name in the Search box.
7. Using the chart above, create a configuration and paste it into the text field. The following example configuration
demonstrates the syntax:
<property>
<name>hbase.regionserver.storefile.refresh.period</name>
<value>0</value>
</property>
<property>
<name>hbase.ipc.client.allowsInterrupt</name>
<value>true</value>
<description>Whether to enable interruption of RPC threads at the client. The default
value of true is
required to enable Primary RegionServers to access other RegionServers in secondary
mode. </description>
</property>
<property>
<name>hbase.client.primaryCallTimeout.get</name>
<value>10</value>
</property>
<property>
<name>hbase.client.primaryCallTimeout.multiget</name>
<value>10</value>
</property>
<topology>
<node name="host1.example.com" rack="/dc1/r1"/>
<node name="host2.example.com" rack="/dc1/r1"/>
<node name="host3.example.com" rack="/dc1/r2"/>
<node name="host4.example.com" rack="/dc1/r2"/>
<node name="host5.example.com" rack="/dc2/r1"/>
<node name="host6.example.com" rack="/dc2/r1"/>
<node name="host7.example.com" rack="/dc2/r2"/>
<node name="host8.example.com" rack="/dc2/r2"/>
</topology>
At Table Creation
To create a new table with read replication capabilities enabled, set the REGION_REPLICATION property on the table.
Use a command like the following, in HBase Shell:
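For example, the following statement (the table and column family names are placeholders) creates a table with three
region replicas, that is, one primary and two read-only copies of each region:
create 'my_table', 'cf1', {REGION_REPLICATION => 3}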
Get Request
Scan Request
Note: This page contains references to CDH 5 components or features that have been removed from
CDH 6. These references are only applicable if you are managing a CDH 5 cluster with Cloudera Manager
6. For more information, see Deprecated Items.
In CDH 5, you can configure multiple active Oozie servers against the same database. Oozie high availability is
"active-active" or "hot-hot" so that both Oozie servers are active at the same time, with no failover. High availability
for Oozie is supported in both MRv1 and MRv2 (YARN).
Important: Enabling or disabling high availability makes the previous monitoring history unavailable.
nightly6x-1.vpc.cloudera.com
• HTTP Port
For example:
5002
• HTTPS Port
For example:
5000
6. Click Continue.
Cloudera Manager stops the Oozie servers, adds another Oozie server, initializes the Oozie server High Availability
state in ZooKeeper, configures Hue to reference the Oozie load balancer, and restarts the Oozie servers and dependent
services. In addition, Cloudera Manager generates Kerberos credentials for the new Oozie server and regenerates
credentials for existing servers.
Load Balancing
Using a proxy server to relay requests to and from the Apache Solr service can help meet availability requirements in
production clusters serving many users.
For information on configuring a load balancer for the Solr service, see Using a Load Balancer with Solr on page 188.
Data Ingestion
Mission critical, large-scale online production systems need to make progress without downtime despite some issues.
Cloudera Search provides two routes to configurable, highly available, and fault-tolerant data ingestion:
• Near Real Time (NRT) ingestion using the Flume Solr Sink
• MapReduce based batch ingestion using the MapReduceIndexerTool
If Cloudera Search throws an exception according to the rules described above, the caller (that is, the Flume Solr Sink
or MapReduceIndexerTool) can catch the exception and retry the task if it meets the criteria for such retries.
agent.sinks.solrSink.isProductionMode = true
agent.sinks.solrSink.isIgnoringRecoverableExceptions = true
In addition, Flume SolrSink automatically attempts to load balance and fail over among the hosts of a SolrCloud before
it considers rolling back and retrying the transaction. Load balancing and failover are done with the help of ZooKeeper, which
itself can be configured to be highly available.
Further, Cloudera Manager can configure Flume so it automatically restarts if its process crashes.
To tolerate extended periods of Solr downtime, you can configure Flume to use a high-performance transactional
persistent queue in the form of a FileChannel. A FileChannel can use any number of local disk drives to buffer significant
amounts of data. For example, you might buffer many terabytes of events corresponding to a week of data. Further,
using the Replicating Channel Selector Flume feature, you can configure Flume to replicate the same data into both
HDFS and Solr. Doing so ensures that if the Flume SolrSink channel runs out of disk space, data is still delivered to
HDFS, and this data can later be ingested from HDFS into Solr using MapReduce.
Many machines with many Flume Solr Sinks and FileChannels can be used in a failover and load balancing configuration
to improve high availability and scalability. Flume SolrSink servers can be either co-located with live Solr servers serving
end user queries, or Flume SolrSink servers can be deployed on separate industry standard hardware for improved
scalability and reliability. By spreading indexing load across a large number of Flume SolrSink servers you can improve
scalability. Indexing load can be replicated across multiple Flume SolrSink servers for high availability, for example
using Flume features such as Load balancing Sink Processor.
Important: Cloudera Support supports all of the configuration and modification to Cloudera software
detailed in this document. However, Cloudera Support is unable to assist with issues or failures with
the third-party software that is used. Use of any third-party software, or software not directly covered
by Cloudera Support, is at the risk of the end user.
Warning: Cloudera Navigator Audit and Lineage do not support this high availability configuration.
Cloudera Navigator will only be available on one of the two Cloudera Manager server hosts. If a failover
is triggered, Cloudera Manager may fail over to a host where Cloudera Navigator is not available and
data loss may result.
The Cloudera Manager Agent software includes an agent and a supervisor process. The agent process handles RPC
communication with Cloudera Manager and with the roles of the Cloudera Management Service, and primarily handles
configuration changes to your roles. The supervisor process handles the local Cloudera-deployed process lifecycle and
handles auto-restarts (if configured) of failed processes.
• A multi-homed TCP load balancer, or two TCP load balancers, capable of proxying requests on specific ports to
one server from a set of backing servers.
– The load balancer does not need to support termination of TLS/SSL connections.
– This load balancer can be hardware or software based, but should be capable of proxying multiple ports.
HTTP/HTTPS-based load balancers are insufficient because Cloudera Manager uses several non-HTTP-based
protocols internally.
– This document uses HAProxy, a small, open-source, TCP-capable load balancer, to demonstrate a workable
configuration.
• A networked storage device that you can configure to be highly available. Typically this is an NFS store, a SAN
device, or a storage array that satisfies the read/write throughput requirements of the Cloudera Management
Service. This document assumes the use of NFS due to the simplicity of its configuration and because it is an easy,
vendor-neutral illustration.
• The procedures in this document require ssh access to all the hosts in the cluster where you are enabling high
availability for Cloudera Manager.
In addition, the second instance of Cloudera Manager is automatically shut down, resulting in messages similar to the
following in the log file:
When a Cloudera Manager instance fails or becomes unavailable and remains offline for more than 30 seconds, any
new instance that is deployed claims ownership of the database and continues to manage the cluster normally.
/etc/default/cloudera-scm-server
2. Add the following property (separate each property with a space) to the line that begins with export
CMF_JAVA_OPTS:
-Dcom.cloudera.server.cmf.components.scmActive.killOnError=false
For example:
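(The following line is illustrative only; keep whatever options already appear on your CMF_JAVA_OPTS line, such as
the heap setting shown here, and append the new property to them.)
export CMF_JAVA_OPTS="-Xmx2G -Dcom.cloudera.server.cmf.components.scmActive.killOnError=false"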
3. Restart the Cloudera Manager server by running the following command on the Cloudera Manager server host:
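For example, on hosts that use the classic init scripts (substitute the equivalent systemctl command if your platform
uses systemd):
service cloudera-scm-server restart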
Note: When you disable automatic shutdown, a message is still logged when more than one instance
of Cloudera Manager is running.
Important:
Unless stated otherwise, run all commands mentioned in this topic as the root user.
You do not need to stop the CDH cluster to configure Cloudera Manager high availability.
Figure 15: High-level layout of components for Cloudera Manager high availability
Note: The hostnames used here are placeholders and are used throughout this document. When
configuring your cluster, substitute the actual names of the hosts you use in your environment.
Note: HAProxy is used here for demonstration purposes. Production-level performance requirements
determine the load balancer that you select for your installation. HAProxy version 1.5.2 is used for
these procedures.
HAProxy 1.5.4-2 has a bug that affects the functioning of tcp-check. Cloudera recommends that you
use version 1.7.x or later.
1. Reserve two hostnames in your DNS system, and assign them to each of the load balancer hosts. (The names
CMSHostname and MGMTHostname are used in this example; substitute the correct hostnames for your
environment.) These hostnames will be the externally accessible hostnames for Cloudera Manager Server and
Cloudera Management Service. (Alternatively, use one load balancer with separate, resolvable IP addresses—one
each to back CMSHostname and MGMTHostname respectively).
• CMSHostname is used to access Cloudera Manager Admin Console.
• MGMTHostname is used for internal access to the Cloudera Management Service from Cloudera Manager
Server and Cloudera Manager Agents.
2. Set up two hosts using any supported Linux distribution (RHEL, CentOS, Ubuntu or SUSE; see CDH and Cloudera
Manager Supported Operating Systems) with the hostnames listed above. See the HAProxy documentation for
recommendations on configuring the hardware of these hosts.
3. Install the version of HAProxy that is recommended for the version of Linux installed on the two hosts:
RHEL/CentOS:
Ubuntu (use a current Personal Package Archive (PPA) for 1.5 from http://haproxy.debian.net):
SUSE:
chkconfig haproxy on
Ubuntu:
5. Configure HAProxy.
• On CMSHostname, edit the /etc/haproxy/haproxy.cfg file and make sure that the ports listed at Ports
Used by Cloudera Manager and Cloudera Navigator for “Cloudera Manager Server” are proxied. For Cloudera
Manager, this list includes the following ports as defaults:
– 7180
– 7182
– 7183
Sample HAProxy Configuration for CMSHostname
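The following is a minimal sketch of such a configuration, assuming CMS1 and CMS2 are the primary and secondary
Cloudera Manager Server hosts; adjust the global and defaults sections, timeouts, and logging for your environment:
listen cmf
    bind :7180
    mode tcp
    option tcplog
    server cmf1 CMS1:7180 check
    server cmf2 CMS2:7180 check
listen cmf-avro
    bind :7182
    mode tcp
    option tcplog
    server cmf-avro1 CMS1:7182 check
    server cmf-avro2 CMS2:7182 check
# TLS on 7183 is passed through without termination at the load balancer.
listen cmf-https
    bind :7183
    mode tcp
    option tcplog
    server cmf-https1 CMS1:7183 check
    server cmf-https2 CMS2:7183 check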
• On MGMTHostname, edit the /etc/haproxy/haproxy.cfg file and make sure that the ports for Cloudera
Management Service are proxied (see Ports Used by Cloudera Manager and Cloudera Navigator). For Cloudera
Manager, this list includes the following ports as defaults:
– 5678
– 7184
– 7185
– 7186
– 7187
– 8083
– 8084
– 8086
– 8087
– 8091
– 9000
– 9994
– 9995
– 9996
– 9997
– 9998
– 9999
– 10101
After updating the configuration, restart HAProxy on both the MGMTHostname and CMSHostname hosts:
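For example, on hosts that use the classic init scripts:
service haproxy restart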
Important: The embedded Postgres database cannot be configured for high availability and
should not be used in a high-availability configuration.
2. Configure your databases to be highly available. Consult the vendor documentation for specific information.
MySQL, PostgreSQL, and Oracle each have many options for configuring high availability. See Database High
Availability Configuration on page 536 for some external references on configuring high availability for your Cloudera
Manager databases.
Note: Using NFS as a shared storage mechanism is used here for demonstration purposes. Refer to
your Linux distribution documentation on production NFS configuration and security. Production-level
performance requirements determine the storage that you select for your installation.
This section describes how to configure an NFS server and assumes that you understand how to configure highly
available remote storage devices. Further details are beyond the scope and intent of this guide.
There are no intrinsic limitations on where this NFS server is located, but because overlapping failure domains can
cause problems with fault containment and error tracing, Cloudera recommends that you not co-locate the NFS server
with any CDH or Cloudera Manager servers or the load-balancer hosts detailed in this document.
1. Install NFS on your designated server:
RHEL/CentOS
Ubuntu
SUSE
RHEL/CentOS:
chkconfig nfs on
service rpcbind start
service nfs start
Ubuntu:
SUSE:
chkconfig nfs on
service rpcbind start
service nfs-kernel-server start
Note: Later sections describe mounting the shared directories and sharing them between the primary
and secondary instances.
Step 2: Installing and Configuring Cloudera Manager Server for High Availability
You can use an existing Cloudera Manager installation and extend it to a high-availability configuration, as long as you
are not using the embedded PostgreSQL database.
This section describes how to install and configure a failover secondary for Cloudera Manager Server that can take
over if the primary fails.
This section does not cover installing instances of Cloudera Manager Agent on CMS1 or CMS2 and configuring them to
be highly available. See Cloudera Installation Guide.
Setting up NFS Mounts for Cloudera Manager Server
1. Create the following directories on the NFS server you created in a previous step:
mkdir -p /media/cloudera-scm-server
2. Mark these mounts by adding these lines to the /etc/exports file on the NFS server:
/media/cloudera-scm-server CMS1(rw,sync,no_root_squash,no_subtree_check)
/media/cloudera-scm-server CMS2(rw,sync,no_root_squash,no_subtree_check)
3. Export the mounts by running the following command on the NFS server:
exportfs -a
Ubuntu:
SUSE:
rm -rf /var/lib/cloudera-scm-server
mkdir -p /var/lib/cloudera-scm-server
c. Mount the following directory to the NFS mounts, on both CMS1 and CMS2:
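A sketch of the mount command, where NFS is the hostname or IP address of the NFS server:
mount -t nfs NFS:/media/cloudera-scm-server /var/lib/cloudera-scm-server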
d. Set up fstab to persist the mounts across restarts by editing the /etc/fstab file on CMS1 and CMS2 and
adding the following lines:
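A sketch of the entry, where NFS is the hostname or IP address of the NFS server; adjust the mount options to suit
your environment:
NFS:/media/cloudera-scm-server /var/lib/cloudera-scm-server nfs defaults 0 0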
Important: Deleting the Cloudera Management Service leads to loss of all existing data from the Host
Monitor and Service Monitor roles that store health and monitoring information for your cluster on
the local disk associated with the host(s) where those roles are installed.
Fresh Installation
Follow the instructions in Installing Cloudera Manager, CDH, and Managed Services to install Cloudera Manager Server,
but do not add “Cloudera Management Service” to your deployment until you complete Step 3: Installing and Configuring
Cloudera Management Service for High Availability on page 519, which describes how to set up the Cloudera Management
Service.
See:
• Installing Cloudera Manager, CDH, and Managed Services
You can now start the freshly-installed Cloudera Manager Server on CMS1:
Before proceeding, verify that you can access the Cloudera Manager Admin Console at http://CMS1:7180.
If you have just installed Cloudera Manager, click the Cloudera Manager logo to skip adding new hosts and to gain
access to the Administration menu, which you need for the following steps.
HTTP Referer Configuration
Cloudera recommends that you disable the HTTP Referer check because it causes problems for some proxies and load
balancers. Check the configuration manual of your proxy or load balancer to determine if this is necessary.
To disable HTTP Referer in the Cloudera Manager Admin Console:
1. Select Administration > Settings.
2. Select Category > Security.
3. Clear the HTTP Referer Check property.
Before proceeding, verify that you can access the Cloudera Manager Admin Console through the load balancer at
http://CMSHostname:7180.
mkdir -p /etc/cloudera-scm-server
scp [<ssh-user>@]CMS1:/etc/cloudera-scm-server/db.properties
/etc/cloudera-scm-server/db.properties
3. If you configured Cloudera Manager TLS encryption or authentication, or Kerberos authentication in your primary
installation, see TLS and Kerberos Configuration for Cloudera Manager High Availability on page 536 for additional
configuration steps.
4. Do not start the cloudera-scm-server service on this host yet, and disable autostart on the secondary to avoid
automatically starting the service on this host.
RHEL/CentOS/SUSE:
Ubuntu:
(You will also disable autostart on the primary when you configure automatic failover in a later step.) Data corruption
can result if both the primary and secondary Cloudera Manager Server instances are running at the same time, and
doing so is not supported.
Testing Failover
Test failover manually by using the following steps:
1. Stop cloudera-scm-server on your primary host (CMS1):
3. Wait a few minutes for the service to load, and then access the Cloudera Manager Admin Console through a web
browser, using the load-balanced hostname (for example: http://CMSHostname:CMS_port).
Now, fail back to the primary before configuring the Cloudera Management Service on your installation:
1. Stop cloudera-scm-server on your secondary machine (CMS2):
3. Wait a few minutes for the service to load, and then access the Cloudera Manager Admin Console through a web
browser, using the load-balanced hostname (for example: http://CMSHostname:7180).
Updating Cloudera Manager Agents to use the Load Balancer
After completing the primary and secondary installation steps listed previously, update the Cloudera Manager Agent
configuration on all of the hosts associated with this Cloudera Manager installation, except the MGMT1, MGMT2, CMS1,
and CMS2 hosts, to use the load balancer address:
1. Connect to a shell on each host where CDH processes are installed and running. (The MGMT1, MGMT2, CMS1, and
CMS2 hosts do not need to be modified as part of this step.)
2. Update the /etc/cloudera-scm-agent/config.ini file and change the server_host line:
server_host = <CMSHostname>
3. Restart the agent (this command starts the agents if they are not running):
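For example, on hosts that use a SysV-style service wrapper (adjust for systemd-based systems):
sudo service cloudera-scm-agent restart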
Step 3: Installing and Configuring Cloudera Management Service for High Availability
This section demonstrates how to set up shared mounts on MGMT1 and MGMT2, and then install Cloudera Management
Service to use those mounts on the primary and secondary servers.
Important: Do not start the primary and secondary servers that are running Cloudera Management
Service at the same time. Data corruption can result.
mkdir -p /media/cloudera-host-monitor
mkdir -p /media/cloudera-scm-agent
mkdir -p /media/cloudera-scm-eventserver
mkdir -p /media/cloudera-scm-headlamp
mkdir -p /media/cloudera-service-monitor
mkdir -p /media/cloudera-scm-navigator
mkdir -p /media/etc-cloudera-scm-agent
2. Mark these mounts by adding the following lines to the /etc/exports file on the NFS server:
/media/cloudera-host-monitor MGMT1(rw,sync,no_root_squash,no_subtree_check)
/media/cloudera-scm-agent MGMT1(rw,sync,no_root_squash,no_subtree_check)
/media/cloudera-scm-eventserver MGMT1(rw,sync,no_root_squash,no_subtree_check)
/media/cloudera-scm-headlamp MGMT1(rw,sync,no_root_squash,no_subtree_check)
/media/cloudera-service-monitor MGMT1(rw,sync,no_root_squash,no_subtree_check)
/media/cloudera-scm-navigator MGMT1(rw,sync,no_root_squash,no_subtree_check)
/media/etc-cloudera-scm-agent MGMT1(rw,sync,no_root_squash,no_subtree_check)
/media/cloudera-host-monitor MGMT2(rw,sync,no_root_squash,no_subtree_check)
/media/cloudera-scm-agent MGMT2(rw,sync,no_root_squash,no_subtree_check)
/media/cloudera-scm-eventserver MGMT2(rw,sync,no_root_squash,no_subtree_check)
/media/cloudera-scm-headlamp MGMT2(rw,sync,no_root_squash,no_subtree_check)
/media/cloudera-service-monitor MGMT2(rw,sync,no_root_squash,no_subtree_check)
/media/cloudera-scm-navigator MGMT2(rw,sync,no_root_squash,no_subtree_check)
/media/etc-cloudera-scm-agent MGMT2(rw,sync,no_root_squash,no_subtree_check)
3. Export the mounts by running the following command on the NFS server:
exportfs -a
Ubuntu:
SUSE:
mkdir -p /var/lib/cloudera-host-monitor
mkdir -p /var/lib/cloudera-scm-agent
mkdir -p /var/lib/cloudera-scm-eventserver
mkdir -p /var/lib/cloudera-scm-headlamp
mkdir -p /var/lib/cloudera-service-monitor
mkdir -p /var/lib/cloudera-scm-navigator
mkdir -p /etc/cloudera-scm-agent
c. Mount the following directories to the NFS mounts, on both MGMT1 and MGMT2 (NFS refers to the server NFS
hostname or IP address):
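For example, for the first directory (repeat for each of the remaining /media exports, including mounting /media/etc-cloudera-scm-agent on /etc/cloudera-scm-agent):
sudo mount -t nfs NFS:/media/cloudera-host-monitor /var/lib/cloudera-host-monitor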
5. Set up fstab to persist the mounts across restarts. Edit the /etc/fstab file and add these lines:
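Entries of the following form can be used, one per shared directory (NFS is the NFS server hostname; the options shown are illustrative):
NFS:/media/cloudera-host-monitor /var/lib/cloudera-host-monitor nfs defaults 0 0
NFS:/media/cloudera-scm-agent /var/lib/cloudera-scm-agent nfs defaults 0 0
Add similar lines for the remaining /media exports, including /media/etc-cloudera-scm-agent mounted on /etc/cloudera-scm-agent.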
6. Configure the Cloudera Manager Agent on MGMT1 to report its hostname as MGMTHostname to Cloudera Manager:
a. Make sure that /etc/cloudera-scm-agent/config.ini has the following lines:
server_host=CMSHostname
listening_hostname=MGMTHostname
b. Edit the /etc/hosts file and add MGMTHostname as an alias for your public IP address for MGMT1 by adding
a line like this at the end of your /etc/hosts file:
<MGMT1-IP> <MGMTHostname>
c. Confirm that the alias has taken effect by running the ping command. For example:
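A simple check, assuming the alias was added as described:
ping MGMTHostname
The replies should come from the MGMT1 IP address.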
d. Make sure that the cloudera-scm user and the cloudera-scm group have access to the mounted directories
under /var/lib, by using the chown command on cloudera-scm. For example, run the following on MGMT1:
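A sketch of the command, covering the directories mounted from NFS (run as root on MGMT1):
sudo chown -R cloudera-scm:cloudera-scm /var/lib/cloudera-host-monitor \
/var/lib/cloudera-scm-agent /var/lib/cloudera-scm-eventserver /var/lib/cloudera-scm-headlamp \
/var/lib/cloudera-service-monitor /var/lib/cloudera-scm-navigator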
Note: The cloudera-scm user and the cloudera-scm group are the default owners as
specified in Cloudera Management Service advanced configuration. If you alter these settings,
modify the above chown instructions to use the altered user or group name.
e. Restart the agent on MGMT1 (this also starts the agent if it is not running):
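For example (SysV-style service command shown; adjust for systemd-based systems):
sudo service cloudera-scm-agent restart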
g. Make sure you install all of the roles of the Cloudera Management Service on the host named MGMTHostname.
h. Proceed through the steps to configure the roles of the service to use your database server, and use defaults
for the storage directory for Host Monitor or Service Monitor.
i. After you have completed the steps, wait for the Cloudera Management Service to finish starting, and verify
the health status of your clusters as well as the health of the Cloudera Management Service as reported in
the Cloudera Manager Admin Console. The health status indicators should be green.
The service health for Cloudera Management Service might, however, show as red.
In this case, check whether the health test failure is caused by the Hostname and Canonical Name Health Check
for the MGMTHostname host.
This test can fail in this way because of the way you modified /etc/hosts on MGMT1 and MGMT2 to allow
the resolution of MGMTHostname locally. This test can be safely disabled on the MGMTHostname host from
the Cloudera Manager Admin Console.
j. If you are configuring Kerberos and TLS/SSL, see TLS and Kerberos Configuration for Cloudera Manager High
Availability on page 536 for configuration changes as part of this step.
4. Configure the agent to report its hostname as MGMTHostname to Cloudera Manager, as described previously in
Installing the Primary on page 521.
a. Make sure that /etc/cloudera-scm-agent/config.ini has the following lines (because this is a shared
mount with the primary, it should be the same as in the primary installation):
server_host=CMSHostname
listening_hostname=MGMTHostname
b. Edit the /etc/hosts file and add MGMTHostname as an alias for your public IP address for MGMT2, by adding
a line like this at the end of your /etc/hosts file:
<MGMT2-IP> <MGMTHostname>
c. Confirm that the alias is working by running the ping command. For example:
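A simple check, assuming the alias was added as described:
ping MGMTHostname
The replies should come from the MGMT2 IP address.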
6. Log into the Cloudera Manager Admin Console in a web browser and start all Cloudera Management Service roles.
This starts the Cloudera Management Service on MGMT2.
a. Wait for the Cloudera Manager Admin Console to report that the services have started.
b. Confirm that the services have started on this host by running the following command on MGMT2:
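A generic process listing is sufficient for this check; for example:
ps -ef | grep -E 'cloudera|supervisord'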
You should see ten total processes running on that host, including the eight Cloudera Management Service
processes, a Cloudera Manager Agent process, and a Supervisor process.
c. Test the secondary installation through the Cloudera Management Admin Console, and inspect the health
of the Cloudera Management Service roles, before proceeding.
Note:
Make sure that the UID and GID for the cloudera-scm user on the primary and secondary Cloudera
Management Service hosts are the same; this ensures that the correct permissions are available on the
shared directories after failover.
Note: The versions referred to for setting up automatic failover in this document are Pacemaker
1.1.11 and Corosync 1.4.7. See http://clusterlabs.org/wiki/Install to determine what works best
for your Linux distribution.
RHEL/CentOS:
Ubuntu:
SUSE:
2. Make sure that the crm tool exists on all of the hosts. This procedure uses the crm tool, which works with Pacemaker
configuration. If this tool is not installed when you installed Pacemaker (verify this by running which crm), you
can download and install the tool for your distribution using the instructions at http://crmsh.github.io/installation.
About Corosync and Pacemaker
• By default, Corosync and Pacemaker are not autostarted as part of the boot sequence. Cloudera recommends
leaving this as is. If a machine crashes and restarts, verify that failover succeeded and determine the cause of the
restart before manually starting these processes again.
– If the /etc/default/corosync file exists, make sure that START is set to yes in that file:
START=yes
– Make sure that Corosync is not set to start automatically, by running the following command:
RHEL/CentOS/SUSE:
Ubuntu:
• Note which version of Corosync is installed. The contents of the configuration file for Corosync (corosync.conf)
that you edit vary based on the version suitable for your distribution. Sample configurations are supplied in this
document and are labeled with the Corosync version.
• This document does not demonstrate configuring Corosync with authentication (with secauth set to on). The
Corosync website demonstrates a mechanism to encrypt traffic using symmetric keys.
• Firewall configuration:
Corosync uses UDP transport on ports 5404 and 5405, and these ports must be open for both inbound and
outbound traffic on all hosts. If you are using iptables, run commands similar to the following:
sudo iptables -I INPUT -m state --state NEW -p udp -m multiport --dports 5404,5405 -j
ACCEPT
sudo iptables -I OUTPUT -m state --state NEW -p udp -m multiport --sports 5404,5405 -j
ACCEPT
Setting Up Corosync
1. Edit the /etc/corosync/corosync.conf file on CMS1 and replace the entire contents with the following text
(use the correct version for your environment):
Corosync version 1.x:
compatibility: whitetank
totem {
version: 2
secauth: off
interface {
member {
memberaddr: CMS1
}
member {
memberaddr: CMS2
}
ringnumber: 0
bindnetaddr: CMS1
mcastport: 5405
}
transport: udpu
}
logging {
fileline: off
to_logfile: yes
to_syslog: yes
logfile: /var/log/cluster/corosync.log
debug: off
timestamp: on
logger_subsys {
subsys: AMF
debug: off
}
}
service {
# Load the Pacemaker Cluster Resource Manager
name: pacemaker
ver: 1
#
}
Corosync version 2.x:
totem {
version: 2
secauth: off
cluster_name: cmf
transport: udpu
}
nodelist {
node {
ring0_addr: CMS1
nodeid: 1
}
node {
ring0_addr: CMS2
nodeid: 2
}
}
quorum {
provider: corosync_votequorum
two_node: 1
}
2. Edit the /etc/corosync/corosync.conf file on CMS2, and replace the entire contents with the following text
(use the correct version for your environment):
Corosync version 1.x:
compatibility: whitetank
totem {
version: 2
secauth: off
interface {
member {
memberaddr: CMS1
}
member {
memberaddr: CMS2
}
ringnumber: 0
bindnetaddr: CMS2
mcastport: 5405
}
transport: udpu
}
logging {
fileline: off
to_logfile: yes
to_syslog: yes
logfile: /var/log/cluster/corosync.log
debug: off
timestamp: on
logger_subsys {
subsys: AMF
debug: off
}
}
service {
# Load the Pacemaker Cluster Resource Manager
name: pacemaker
ver: 1
#
}
Corosync version 2.x:
totem {
version: 2
secauth: off
cluster_name: cmf
transport: udpu
}
nodelist {
node {
ring0_addr: CMS1
nodeid: 1
}
node {
ring0_addr: CMS2
nodeid: 2
}
}
quorum {
provider: corosync_votequorum
two_node: 1
}
3. Restart Corosync on CMS1 and CMS2 so that the new configuration takes effect:
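For example, on each host (SysV-style service command shown; adjust for systemd-based systems):
sudo service corosync restart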
Setting up Pacemaker
You use Pacemaker to set up Cloudera Manager Server as a cluster resource.
See the Pacemaker configuration reference at
http://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ for more details about
Pacemaker options.
The following steps demonstrate one way, recommended by Cloudera, to configure Pacemaker for simple use:
1. Disable autostart for Cloudera Manager Server (because you manage its lifecycle through Pacemaker) on both
CMS1 and CMS2:
RHEL/CentOS/SUSE:
Ubuntu:
2. Make sure that Pacemaker has been started on both CMS1 and CMS2:
/etc/init.d/pacemaker start
# crm status
Last updated: Wed Mar 4 18:55:27 2015
Last change: Wed Mar 4 18:38:40 2015 via crmd on CMS1
Stack: corosync
Current DC: CMS1 (1) - partition with quorum
Version: 1.1.10-42f2063
2 Nodes configured
0 Resources configured
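One way to register cloudera-scm-server as a Pacemaker resource, sketched with the crm shell and assuming the stock cloudera-scm-server init script, is to disable quorum and STONITH checks for this two-node cluster, reduce resource movement, and add the resource:
sudo crm configure property no-quorum-policy=ignore
sudo crm configure property stonith-enabled=false
sudo crm configure rsc_defaults resource-stickiness=100
sudo crm configure primitive cloudera-scm-server lsb:cloudera-scm-server
After the resource is added, verify that Pacemaker has picked it up: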
crm_mon
For example:
$ crm_mon
Last updated: Tue Jan 27 15:01:35 2015
Last change: Mon Jan 27 14:10:11 2015
Stack: classic openais (with plugin)
Current DC: CMS1 - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
1 Resources configured
Online: [ CMS1 CMS2 ]
cloudera-scm-server (lsb:cloudera-scm-server): Started CMS1
At this point, Pacemaker manages the status of the cloudera-scm-server service on hosts CMS1 and CMS2, ensuring
that only one instance is running at a time.
Note: Pacemaker expects all lifecycle actions, such as start and stop, to go through Pacemaker;
therefore, running direct service start or service stop commands breaks that assumption.
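One way to trigger the move, sketched with the crm shell (clear the resulting location constraint afterward with crm resource unmove):
sudo crm resource move cloudera-scm-server CMS2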
Test the resource move by connecting to a shell on CMS2 and verifying that the cloudera-scm-server process is
now active on that host. It usually takes a few minutes for the new services to come up on the new host.
Enabling STONITH (Shoot the other node in the head)
The following link provides an explanation of the problem of fencing and ensuring (within reasonable limits) that only
one host is running a shared resource at a time:
http://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Clusters_from_Scratch/index.html#idm140603947390416
As noted in that link, you can use several methods (such as IPMI) to achieve reasonable guarantees on remote host
shutdown. Cloudera recommends enabling STONITH, based on the hardware configuration in your environment.
Setting up the Cloudera Management Service
Setting Up Corosync
1. Edit the /etc/corosync/corosync.conf file on MGMT1 and replace the entire contents with the contents
below; make sure to use the correct section for your version of Corosync:
Corosync version 1.x:
compatibility: whitetank
totem {
version: 2
secauth: off
interface {
member {
memberaddr: MGMT1
}
member {
memberaddr: MGMT2
}
ringnumber: 0
bindnetaddr: MGMT1
mcastport: 5405
}
transport: udpu
}
logging {
fileline: off
to_logfile: yes
to_syslog: yes
logfile: /var/log/cluster/corosync.log
debug: off
timestamp: on
logger_subsys {
subsys: AMF
debug: off
}
}
service {
# Load the Pacemaker Cluster Resource Manager
name: pacemaker
ver: 1
#
}
Corosync version 2.x:
totem {
version: 2
secauth: off
cluster_name: mgmt
transport: udpu
}
nodelist {
node {
ring0_addr: MGMT1
nodeid: 1
}
node {
ring0_addr: MGMT2
nodeid: 2
}
}
quorum {
provider: corosync_votequorum
two_node: 1
}
2. Edit the /etc/corosync/corosync.conf file on MGMT2 and replace the entire contents with the following text (use the correct version for your environment):
Corosync version 1.x:
compatibility: whitetank
totem {
version: 2
secauth: off
interface {
member {
memberaddr: MGMT1
}
member {
memberaddr: MGMT2
}
ringnumber: 0
bindnetaddr: MGMT2
mcastport: 5405
}
transport: udpu
}
logging {
fileline: off
to_logfile: yes
to_syslog: yes
logfile: /var/log/cluster/corosync.log
debug: off
timestamp: on
logger_subsys {
subsys: AMF
debug: off
}
}
service {
# Load the Pacemaker Cluster Resource Manager
name: pacemaker
ver: 1
#
}
Corosync version 2.x:
totem {
version: 2
secauth: off
cluster_name: mgmt
transport: udpu
}
nodelist {
node {
ring0_addr: MGMT1
nodeid: 1
}
node {
ring0_addr: MGMT2
nodeid: 2
}
}
quorum {
provider: corosync_votequorum
two_node: 1
}
3. Restart Corosync on MGMT1 and MGMT2 for the new configuration to take effect:
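For example, on each host (SysV-style service command shown; adjust for systemd-based systems):
sudo service corosync restart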
4. Test whether Corosync has set up a cluster by using the corosync-cmapctl or corosync-objctl commands.
You should see two members with status joined:
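For example, on Corosync 2.x (use corosync-objctl with a similar filter on Corosync 1.x):
corosync-cmapctl | grep members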
Setting Up Pacemaker
Use Pacemaker to set up Cloudera Management Service as a cluster resource.
RHEL/CentOS/SUSE
Ubuntu:
/etc/init.d/pacemaker start
3. Make sure that the crm command reports two nodes in the cluster; you can run this command on either host:
# crm status
Last updated: Wed Mar 4 18:55:27 2015
Last change: Wed Mar 4 18:38:40 2015 via crmd on MGMT1
Stack: corosync
Current DC: MGMT1 (1) - partition with quorum
Version: 1.1.10-42f2063
2 Nodes configured
0 Resources configured
As with Cloudera Manager Server Pacemaker configuration, this step disables quorum checks, disables STONITH
explicitly, and reduces the likelihood of resources being moved between hosts.
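A sketch of those commands, using the crm shell and mirroring the Cloudera Manager Server configuration:
sudo crm configure property no-quorum-policy=ignore
sudo crm configure property stonith-enabled=false
sudo crm configure rsc_defaults resource-stickiness=100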
5. Create an Open Cluster Framework (OCF) provider on both MGMT1 and MGMT2 for Cloudera Manager Agent for
use with Pacemaker:
a. Create an OCF directory for creating OCF resources for Cloudera Manager:
mkdir -p /usr/lib/ocf/resource.d/cm
#!/bin/sh
#######################################################################
# CM Agent OCF script
#######################################################################
#######################################################################
# Initialization:
: ${__OCF_ACTION=$1}
OCF_SUCCESS=0
OCF_ERROR=1
OCF_STOPPED=7
OCF_ERR_UNIMPLEMENTED=3
#######################################################################
meta_data() {
cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="Cloudera Manager Agent" version="1.0">
<version>1.0</version>
<longdesc lang="en">
This OCF agent handles simple monitoring, start, stop of the Cloudera
Manager Agent, intended for use with Pacemaker/corosync for failover.
</longdesc>
<shortdesc lang="en">Cloudera Manager Agent OCF script</shortdesc>
<parameters />
<actions>
<action name="start" timeout="20" />
<action name="stop" timeout="20" />
<action name="monitor" timeout="20" interval="10" depth="0"/>
<action name="meta-data" timeout="5" />
</actions>
</resource-agent>
END
}
#######################################################################
agent_usage() {
cat <<END
usage: $0 {start|stop|monitor|meta-data}
Cloudera Manager Agent HA OCF script - used for managing Cloudera Manager Agent and
managed processes lifecycle for use with Pacemaker.
END
}
agent_start() {
service cloudera-scm-agent start
if [ $? = 0 ]; then
return $OCF_SUCCESS
fi
return $OCF_ERROR
}
agent_stop() {
service cloudera-scm-agent next_stop_hard
service cloudera-scm-agent stop
if [ $? = 0 ]; then
return $OCF_SUCCESS
fi
return $OCF_ERROR
}
agent_monitor() {
# Monitor _MUST!_ differentiate correctly between running
# (SUCCESS), failed (ERROR) or _cleanly_ stopped (NOT RUNNING).
# That is THREE states, not just yes/no.
service cloudera-scm-agent status
if [ $? = 0 ]; then
return $OCF_SUCCESS
fi
return $OCF_STOPPED
}
case $__OCF_ACTION in
meta-data) meta_data
exit $OCF_SUCCESS
;;
start) agent_start;;
stop) agent_stop;;
monitor) agent_monitor;;
usage|help) agent_usage
exit $OCF_SUCCESS
;;
*) agent_usage
exit $OCF_ERR_UNIMPLEMENTED
;;
esac
rc=$?
exit $rc
#!/bin/sh
#######################################################################
# CM Agent OCF script
#######################################################################
#######################################################################
# Initialization:
: ${__OCF_ACTION=$1}
OCF_SUCCESS=0
OCF_ERROR=1
OCF_STOPPED=7
OCF_ERR_UNIMPLEMENTED=3
#######################################################################
meta_data() {
cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="Cloudera Manager Agent" version="1.0">
<version>1.0</version>
<longdesc lang="en">
This OCF agent handles simple monitoring, start, stop of the Cloudera
Manager Agent, intended for use with Pacemaker/corosync for failover.
</longdesc>
<shortdesc lang="en">Cloudera Manager Agent OCF script</shortdesc>
<parameters />
<actions>
<action name="start" timeout="20" />
<action name="stop" timeout="20" />
<action name="monitor" timeout="20" interval="10" depth="0"/>
<action name="meta-data" timeout="5" />
</actions>
</resource-agent>
END
}
#######################################################################
agent_usage() {
cat <<END
usage: $0 {start|stop|monitor|meta-data}
Cloudera Manager Agent HA OCF script - used for managing Cloudera Manager Agent and
managed processes lifecycle for use with Pacemaker.
END
}
agent_start() {
service cloudera-scm-agent start
if [ $? = 0 ]; then
return $OCF_SUCCESS
fi
return $OCF_ERROR
}
agent_stop() {
service cloudera-scm-agent hard_stop_confirmed
if [ $? = 0 ]; then
return $OCF_SUCCESS
fi
return $OCF_ERROR
}
agent_monitor() {
# Monitor _MUST!_ differentiate correctly between running
# (SUCCESS), failed (ERROR) or _cleanly_ stopped (NOT RUNNING).
# That is THREE states, not just yes/no.
service cloudera-scm-agent status
if [ $? = 0 ]; then
return $OCF_SUCCESS
fi
return $OCF_STOPPED
}
case $__OCF_ACTION in
meta-data) meta_data
exit $OCF_SUCCESS
;;
start) agent_start;;
stop) agent_stop;;
monitor) agent_monitor;;
usage|help) agent_usage
exit $OCF_SUCCESS
;;
*) agent_usage
exit $OCF_ERR_UNIMPLEMENTED
;;
esac
rc=$?
exit $rc
/usr/lib/ocf/resource.d/cm/agent monitor
This script should return the current running status of the SCM agent.
7. Add Cloudera Manager Agent as an OCF-managed resource (either on MGMT1 or MGMT2):
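For example, using the crm shell (the ocf:cm:agent type corresponds to the OCF script created above):
sudo crm configure primitive cloudera-scm-agent ocf:cm:agent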
8. Verify that the primitive has been picked up by Pacemaker by running the following command:
crm_mon
For example:
$ crm_mon
Last updated: Tue Jan 27 15:01:35 2015
Last change: Mon Jan 27 14:10:11 2015
Stack: classic openais (with plugin)
Current DC: MGMT1 - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
1 Resources configured
Online: [ MGMT1 MGMT2 ]
cloudera-scm-agent (ocf:cm:agent): Started MGMT2
Pacemaker starts managing the status of the cloudera-scm-agent service on hosts MGMT1 and MGMT2, ensuring
that only one instance is running at a time.
Note: Pacemaker expects that all lifecycle actions, such as start and stop, go through Pacemaker;
therefore, running direct service start or service stop commands on one of the hosts breaks
that assumption and could cause Pacemaker to start the service on the other host.
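One way to trigger the move, sketched with the crm shell (clear the resulting location constraint afterward with crm resource unmove):
sudo crm resource move cloudera-scm-agent MGMT2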
Test the resource move by connecting to a shell on MGMT2 and verifying that the cloudera-scm-agent and the
associated Cloudera Management Services processes are now active on that host. It usually takes a few minutes for
the new services to come up on the new host.
Database-Specific Mechanisms
• MariaDB:
Configuring MariaDB for high availability requires configuring MariaDB for replication. For more information, see
https://mariadb.com/kb/en/mariadb/setting-up-replication/.
• MySQL:
Configuring MySQL for high availability requires configuring MySQL for replication. Replication configuration
depends on which version of MySQL you are using. For version 5.1,
http://dev.mysql.com/doc/refman/5.1/en/replication-howto.html provides an introduction.
MySQL GTID-based replication is not supported.
• PostgreSQL:
PostgreSQL has extensive documentation on high availability, especially for versions 9.0 and higher. For information
about options available for version 9.1, see http://www.postgresql.org/docs/9.1/static/high-availability.html.
• Oracle:
Oracle supports a wide variety of free and paid upgrades to their database technology that support increased
availability guarantees, such as their Maximum Availability Architecture (MAA) recommendations. For more
information, see
http://www.oracle.com/technetwork/database/features/availability/oracle-database-maa-best-practices-155386.html.
Disk-Based Mechanisms
DRBD is an open-source Linux-based disk replication mechanism that works at the individual write level to replicate
writes on multiple machines. Although not directly supported by major database vendors (at the time of writing of
this document), it provides a way to inexpensively configure redundant distributed disk for disk-consistent databases
(such as MySQL, PostgreSQL, and Oracle). For information, see http://drbd.linbit.com.
Example hostnames used throughout Configuring Cloudera Manager for High Availability With a Load Balancer on page
506 are: CMS1 and CMS2 (primary and secondary Cloudera Manager Server hosts), CMSHostname (load balancer in
front of CMS1 and CMS2), MGMT1 and MGMT2 (primary and secondary Cloudera Management Service hosts), and
MGMTHostname (load balancer in front of MGMT1 and MGMT2).
After TLS is enabled, access the Cloudera Manager Admin Console through the load balancer at:
https://[CMSHostname]:7183
This assumes that the load balancer has been set up for TLS pass-through and that the Cloudera Manager Server host
has been set up as detailed below.
Configure Load Balancers for TLS Pass-Through
As detailed in Configuring Cloudera Manager for High Availability With a Load Balancer on page 506, high availability
for Cloudera Manager Server clusters requires secondary nodes that act as backups for the primary Cloudera Manager
Server and Cloudera Management Service nodes, respectively. Only the primary nodes are active at any time, but if
these fail, requests are redirected by a load balancer (CMSHostname or MGMTHostname) to the appropriate secondary
node.
When the Cloudera Manager Server cluster is configured for TLS in addition to high availability, the load balancers
must be configured for TLS pass-through: traffic from clients is not decrypted until it reaches the actual server host
system. Keep this in mind when you are setting up the load balancer for your Cloudera Manager High Availability
deployment.
Server Certificate Requirements for HA Deployments
TLS-enabled Cloudera Manager Server clusters require certificates that authenticate the host identity prior to encryption,
as detailed in Manually Configuring TLS Encryption for Cloudera Manager. When deploying Cloudera Manager Server
for high availability, however, the certificate must identify the load balancer and both primary and secondary nodes
(rather than the primary host alone). That means you must create your certificate signing request (CSR) as follows:
• Use the FQDN of the load balancer (for example, CMSHostname) for the CN (common name).
• Use the primary and secondary Cloudera Manager Server host names (for example, CMS1 and CMS2, respectively)
for the SubjectAlternativeName (SAN) values.
To create a CSR using these example load balancer and server host names:
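One way to do this uses keytool; the keystore path, alias, passwords, and distinguished-name fields below are placeholders, not values prescribed by this guide:
keytool -genkeypair -alias cms -keyalg RSA -keysize 2048 \
-dname "CN=CMSHostname,OU=Support,O=Example,L=City,ST=State,C=US" \
-ext "SAN=dns:CMS1,dns:CMS2" \
-keystore /opt/cloudera/security/pki/server.jks -storepass password -keypass password
keytool -certreq -alias cms -ext "SAN=dns:CMS1,dns:CMS2" \
-keystore /opt/cloudera/security/pki/server.jks -storepass password -keypass password \
-file /opt/cloudera/security/pki/server.csr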
Alternatively, if the Cloudera Manager Server certificates on the hosts do not specify the load balancer name and SAN
names, you can make the following change to the configuration. From the Cloudera Manager Admin Console, go to:
• Administration > Ports and Addresses
• Enter the FQDN of the load balancer in the Cloudera Manager Hostname Override property.
In addition to using correctly created certificates (or overriding the hostname), you must:
• Store the keystore and truststore in the same path on both primary and secondary Cloudera Manager Server hosts
(CMS1, CMS2), or point to the same shared network mount point from each host.
Cloudera Manager Agent Host Requirements for HA Deployments
Cloudera Manager Server hosts can present their certificate to agents prior to encrypting the connection (see Enable
Server Certificate Verification on Cloudera Manager Agents for details). For a high availability deployment:
• Use the same setting for verify_cert_file (in the /etc/cloudera-scm-agent/config.ini file) on each
agent host system. To simplify setup, share the file path to verify_cert_file or copy the files manually as
specified in the config file between MGMT1 and MGMT2.
Cloudera Manager Agent hosts can present certificates to requesting processes such as the Cloudera Manager Server
prior to encryption (see Configure Agent Certificate Authentication). For a high availability deployment:
• Share the certificate and key for use by all Cloudera Manager Agent host systems on NFS, or copy them to the
same path on both MGMT1 and MGMT2.
Important: Restart cloudera-scm-agent after making changes to the certificates or other files, or
to the configuration.
This is expected. To resolve, re-generate the Kerberos credentials for the roles:
1. Log in to the Cloudera Manager Admin Console.
2. Select Administration > Kerberos > Credentials > Generate Credentials.
Important: This feature requires a Cloudera Enterprise license. It is not available in Cloudera Express.
See Managing Licenses on page 50 for more information.
You can also use Cloudera Manager to schedule, save, and restore snapshots of HDFS directories and HBase tables.
Cloudera Manager provides key functionality in the Cloudera Manager Admin Console:
• Select - Choose datasets that are critical for your business operations.
• Schedule - Create an appropriate schedule for data replication and snapshots. Trigger replication and snapshots
as required for your business needs.
• Monitor - Track progress of your snapshots and replication jobs through a central console and easily identify issues
or files that failed to be transferred.
• Alert - Issue alerts when a snapshot or replication job fails or is aborted so that the problem can be diagnosed
quickly.
BDR functions consistently across HDFS and Hive:
• You can set it up on files or directories in HDFS and on External tables in Hive—without manual translation of Hive
datasets to HDFS datasets, or vice versa. Hive Metastore information is also replicated.
• Applications that depend on External table definitions stored in Hive operate on both replica and source as table
definitions are updated.
You can also perform a “dry run” to verify configuration and understand the cost of the overall operation before actually
copying the entire dataset.
See Ports for more information, including how to verify the current values for these ports.
Data Replication
Cloudera Manager enables you to replicate data across data centers for disaster recovery scenarios. Replications can
include data stored in HDFS, data stored in Hive tables, Hive metastore data, and Impala metadata (catalog server
metadata) associated with Impala tables registered in the Hive metastore. When critical data is stored on HDFS, Cloudera
Manager helps to ensure that the data is available at all times, even in case of complete shutdown of a datacenter.
You can also replicate HDFS data to and from Amazon S3 and you can replicate Hive data and metadata to and from
Amazon S3.
For an overview of data replication, view this video about Backing Up Data Using Cloudera Manager.
You can also use the HBase shell to replicate HBase data. (Cloudera Manager does not manage HBase replications.)
Cloud Storage
BDR supports replicating to or from Amazon S3, Microsoft Azure ADLS Gen1, and Microsoft Azure ADLS Gen2 (ABFS).
TLS
You can use TLS with BDR. Additionally, BDR supports replication scenarios where TLS is enabled for non-Hadoop
services (Hive/Impala) and TLS is disabled for Hadoop services (such as HDFS, YARN, and MapReduce).
Hive replication
Hive replication to and from Microsoft ADLS Gen2 (ABFS) is supported from Cloudera Manager 6.3.4.
Ensure that the following files are available before you replicate Hive data:
1. cp /opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.px.xxxxxx/jars/wildfly-openssl-1.0.4.Final.jar /opt/cloudera/cm/lib/cdh6/
2. cp /opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.px.xxxxxx/jars/hadoop-azure-3.0.0-cdh6.3.4.jar /opt/cloudera/cm/lib/cdh6/
3. chmod 644 /opt/cloudera/cm/lib/cdh6/wildfly-openssl-1.0.4.Final.jar
4. chmod 644 /opt/cloudera/cm/lib/cdh6/hadoop-azure-3.0.0-cdh6.3.4.jar
5. service cloudera-scm-server restart
Note: If you are using Isilon storage for CDH, see Supported Replication Scenarios for Clusters using
Isilon Storage.
Versions
Replicating to or from Cloudera Manager 6 managed clusters with Cloudera Manager versions earlier than 5.14.0
is not supported.
Kerberos
BDR does not support the following replication scenarios when Kerberos authentication is used on a cluster:
• Secure source to an insecure destination.
General
BDR does not support Hadoop Archive (HAR file system) for data replication.
To configure Amazon S3 as a source or destination for HDFS or Hive/Impala replication, you configure AWS Credentials
that specify the type of authentication to use, the Access Key ID, and Secret Key. See How to Configure AWS Credentials.
To configure Microsoft ADLS as a source or destination for HDFS or Hive/Impala replication, you configure the service
principal for ADLS. See Configuring ADLS Gen1 Connectivity on page 673 or Configuring ADLS Gen2 Connectivity on
page 678.
After configuring S3 or ADLS, you can click the Replication Schedules link to define a replication schedule. See HDFS
Replication on page 545 or Hive/Impala Replication on page 559 for details about creating replication schedules. You
can also click Close and create the replication schedules later. Select the AWS Credentials account in the Source or
Destination drop-down lists when creating the schedules.
Important: Automatic log expiration also purges custom replication log and metadata files. These
paths are set by the Log Path and Directory for Metadata fields of each replication schedule in the UI.
It is your responsibility to set valid paths (for example, legal HDFS paths that are writable by the
current user) and to maintain this information for each replication schedule.
Note: If your cluster uses SAML Authentication, see Configuring Peers with SAML Authentication on
page 544 before configuring a peer.
1. Go to the Peers page by selecting Backup > Peers. If there are no existing peers, you will see only an Add Peer
button in addition to a short message. If peers already exist, they display in the Peers list.
2. Click the Add Peer button.
3. In the Add Peer dialog box, provide a name, the URL (including the port) of the Cloudera Manager Server source
for the data to be replicated, and the login credentials for that server.
Important: The role assigned to the login on the source server must be either a User Administrator
or a Full Administrator.
Cloudera recommends that TLS/SSL be used. A warning is shown if the URL scheme is http instead of https.
After configuring both peers to use TLS/SSL, add the remote source Cloudera Manager TLS/SSL certificate to the
local Cloudera Manager truststore, and vice versa. See Manually Configuring TLS Encryption for Cloudera Manager.
4. Click the Add Peer button in the dialog box to create the peer relationship.
The peer is added to the Peers list. Cloudera Manager automatically tests the connection between the Cloudera
Manager Server and the peer. You can also click Test Connectivity to test the connection. Test Connectivity also
tests the Kerberos configuration for the clusters. For more information about this part of the test, see Kerberos
Connectivity Test on page 579.
Modifying Peers
1. Go to the Peers page by selecting Backup > Peers. If there are no existing peers, you will see only an Add Peer
button in addition to a short message. If peers already exist, they display in the Peers list.
2. Do one of the following:
• Edit
1. In the row for the peer, select Edit.
2. Make your changes.
3. Click Update Peer to save your changes.
• Delete - In the row for the peer, click Delete.
You can also use an existing user that has one of these roles. Since you will only use this user to create the peer
relationship, you can delete the user account after adding the peer.
2. Create or modify the peer, as described in this topic.
3. (Optional) Delete the Cloudera Manager user account you just created.
HDFS Replication
Note: This page contains references to CDH 5 components or features that have been removed from
CDH 6. These references are only applicable if you are managing a CDH 5 cluster with Cloudera Manager
6. For more information, see Deprecated Items.
Source Data
While a replication runs, ensure that the source directory is not modified. A file added during replication does not get
replicated. If you delete a file during replication, the replication fails.
Additionally, ensure that all files in the directory are closed. Replication fails if source files are open. If you cannot
ensure that all source files are closed, you can configure the replication to continue despite errors. Uncheck the Abort
on Error option for the HDFS replication. For more information, see Configuring Replication of HDFS Data on page 548.
After the replication completes, you can view the log for the replication to identify opened files. Ensure these files are
closed before the next replication occurs.
Note: Cloudera Manager provides downloadable data that you can use to diagnose HDFS replication
performance. See Monitoring the Performance of HDFS Replications on page 557.
bdr-only-user, bdr-only-user@ElephantRealm
Note: The Run As Username field is used to launch the MapReduce job for copying data. The Run on
Peer as Username field is used to run the copy listing on the source, if different from Run As Username.
Important: Make sure that the value of Run on Peer as Username is the same as Run As Username;
otherwise, BDR reads ACLs from the source as the hdfs user, which pulls the Sentry-provided ACLs over
to the target cluster and applies them to the files in HDFS. This can result in additional NameNode
heap usage in the target cluster.
• Verify that HDFS snapshots are immutable on the source and destination clusters.
In the Cloudera Manager Admin Console, go to Clusters > <HDFS service> > Configuration and search for Enable
Immutable Snapshots.
• Do not use snapshot diff for globbed paths. It is not optimized for globbed paths.
• Set the snapshot root directory as low in the hierarchy as possible.
• To use the snapshot diff feature, the user who is configured to run the job needs to be either a superuser or the
owner of the snapshottable root, because the run-as-user must have permission to list the snapshots.
• Decide if you want BDR to abort on a snapshot diff failure or continue the replication. If you choose to configure
BDR to continue the replication when it encounters an error, BDR performs a complete replication. Note that
continuing the replication can result in a longer duration since a complete replication is performed.
• BDR performs a complete replication when one or more of the following change: Delete Policy, Preserve Policy,
Target Path, or Exclusion Path.
• Paths from both source and destination clusters in the replication schedule must be under a snapshottable root
or should be snapshottable for the schedule to run using snapshot diff.
• Maintain a maximum of one million changes in a snapshot diff for an optimum performance of the snapshot delete
operation.
The time taken by a NameNode to delete a snapshot is proportional to the number of changes between the current
snapshot and the previous snapshot. The changes include the addition, deletion, and modification of files. If a snapshot
contains more than a million changes, the snapshot delete operation might prevent the NameNode from processing
other requests, which may result in premature failover, thereby destabilizing the cluster.
Note: In replication scenarios where a destination cluster has multiple source clusters, all the source
clusters must either be secure or insecure. BDR does not support replication from a mixture of secure
and insecure source clusters.
To enable replication from an insecure cluster to a secure cluster, you need a user that exists on all the hosts on both
the source cluster and destination cluster. Specify this user in the Run As Username field when you create a replication
schedule.
The following steps describe how to add a user (a consolidated command sketch follows the steps):
1. On a host in the source or destination cluster, add a user with the following command:
2. Set the permissions for the user directory with the following command:
For example, the following command makes milton the owner of the milton directory:
3. Create the supergroup group for the user you created in step 1 with the following command:
groupadd supergroup
4. Add the user you created in step 1 to the group you created:
5. Repeat this process for all hosts in the source and destination clusters so that the user and group exists on all of
them.
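A consolidated sketch of these commands, assuming a hypothetical user named milton and that the user directory in question is the user's HDFS home directory /user/milton, which already exists:
sudo useradd milton
sudo -u hdfs hdfs dfs -chown milton /user/milton
sudo groupadd supergroup
sudo usermod -a -G supergroup milton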
After you complete this process, specify the user you created in the Run As Username field when you create a replication
schedule.
<property>
<name>hadoop.proxyuser.hdfsdest.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hdfsdest.hosts</name>
<value>*</value>
</property>
Deploy the client configuration and restart all services on the source cluster.
3. If the source cluster is managed by a different Cloudera Manager server than the destination cluster, configure a
peer relationship. If the source or destination is S3 or ADLS, you must configure AWS credentials or configure
ADLS access.
4. Do one of the following:
1. Select Backup > Replication Schedules
2. Click Create Schedule > HDFS Replication.
or
1. Select Clusters > <HDFS service>.
2. Select Quick Links > Replication.
3. Click Create Schedule > HDFS Replication.
The Create HDFS Replication dialog box displays, opening on the General tab. Click the Peer or AWS
Credentials link if your replication job requires them and you need to create these entities.
5. Select the General tab to configure the following:
a. Click the Name field and add a unique name for the replication schedule.
b. Click the Source field and select the source HDFS service. You can select HDFS services managed by a peer
Cloudera Manager Server, local HDFS services (managed by the Cloudera Manager Server for the Admin
Console you are logged into), or you can select AWS Credentials or Azure Credentials.
c. Enter the Source Path to the directory (or file) you want to replicate.
For replication from Amazon S3, enter the path using the following form:
s3a://bucket name/path
For replication from ADLS Gen 1, enter the path using the following form:
adl://<accountname>.azuredatalakestore.net/<path>
For replication from ADLS Gen 2, enter the path using the following form:
abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>/
You can also use a glob path to specify more than one path for replication.
d. Click the Destination field and select the destination HDFS service from the HDFS services managed by the
Cloudera Manager Server for the Admin Console you are logged into, or select AWS Credentials.
e. Enter the Destination Path where the source files should be saved.
f. For replication to S3, enter the path using the following form:
s3a://bucket name/path
For replication to ADLS Gen 1, enter the path using the following form:
adl://<accountname>.azuredatalakestore.net/<path>
For replication to ADLS Gen 2, enter the path using the following form:
abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>/
g. Select a Schedule:
• Immediate - Run the schedule Immediately.
• Once - Run the schedule one time in the future. Set the date and time.
• Recurring - Run the schedule periodically in the future. Set the date, time, and interval between runs.
h. Enter the user to run the replication job in the Run As Username field. By default this is hdfs. If you want to
run the job as a different user, enter the user name here. If you are using Kerberos, you must provide a user
name here, and it must be one with an ID greater than 1000. (You can also configure the minimum user ID
number with the min.user.id property in the YARN or MapReduce service.) Verify that the user running the
job has a home directory, /user/username, owned by username:supergroup in HDFS. This user must
have permissions to read from the source directory and write to the destination directory.
Note the following:
• The User must not be present in the list of banned users specified with the Banned System Users property
in the YARN configuration (Go to the YARN service, select Configuration tab and search for the property).
For security purposes, the hdfs user is banned by default from running YARN containers.
• The requirement for a user ID that is greater than 1000 can be overridden by adding the user to the
"white list" of users that is specified with the Allowed System Users property. (Go to the YARN service,
select Configuration tab and search for the property.)
• Scheduler Pool – (Optional) Enter the name of a resource pool in the field. The value you enter is used by
the MapReduce Service you specified when Cloudera Manager executes the MapReduce job for the replication.
The job specifies the value using one of these properties:
– MapReduce – Fair scheduler: mapred.fairscheduler.pool
– MapReduce – Capacity scheduler: queue.name
– YARN – mapreduce.job.queuename
• Maximum Map Slots - Limits for the number of map slots per mapper. The default value is 20.
• Maximum Bandwidth - Limits for the bandwidth per mapper. The default is 100 MB.
• Replication Strategy - Whether file replication tasks should be distributed among the mappers statically or
dynamically. (The default is Dynamic.) Static replication distributes file replication tasks among the mappers
up front to achieve a uniform distribution based on the file sizes. Dynamic replication distributes file replication
tasks in small sets to the mappers, and as each mapper completes its tasks, it dynamically acquires and
processes the next unallocated set of tasks.
7. Select the Advanced Options tab to configure the following:
• Add Exclusion - Click the link to exclude one or more paths from the replication.
The Regular Expression-Based Path Exclusion field displays, where you can enter a regular expression-based
path. When you add an exclusion, include the snapshotted relative path for the regex. For example, to exclude
the /user/bdr directory, use the following regular expression, which includes the snapshots for the bdr
directory:
.*/user/\.snapshot/.+/bdr.*
To exclude top-level directories from replication in a globbed source path, you can specify the relative path
for the regex without including .snapshot in the path. For example, to exclude the bdr directory from
replication, use the following regular expression:
.*/user+/bdr.*
Important:
You must skip checksum checks to prevent replication failure due to non-matching
checksums in the following cases:
• Replications from an encrypted zone on the source cluster to an encrypted zone
on a destination cluster.
• Replications from an encryption zone on the source cluster to an unencrypted zone
on the destination cluster.
• Replications from an unencrypted zone on the source cluster to an encrypted zone
on the destination cluster.
Checksums are used for two purposes:
• To skip replication of files that have already been copied. If Skip Checksum Checks
is selected, the replication job skips copying a file if the file lengths and modification
times are identical between the source and destination clusters. Otherwise, the job
copies the file from the source to the destination.
• To redundantly verify the integrity of data. However, checksums are not required
to guarantee accurate transfers between clusters. HDFS data transfers are protected
by checksums during transfer and storage hardware also uses checksums to ensure
that data is accurately stored. These two mechanisms work together to validate
the integrity of the copied data.
– Skip Listing Checksum Checks - Whether to skip checksum check when comparing two files to determine
whether they are same or not. If skipped, the file size and last modified time are used to determine if
files are the same or not. Skipping the check improves performance during the mapper phase. Note that
if you select the Skip Checksum Checks option, this check is also skipped.
– Abort on Error - Whether to abort the job on an error. If selected, files copied up to that point remain
on the destination, but no additional files are copied. Abort on Error is off by default.
– Abort on Snapshot Diff Failures - If a snapshot diff fails during replication, BDR uses a complete copy to
replicate data. If you select this option, the BDR aborts the replication when it encounters an error
instead.
• Preserve - Whether to preserve the block size, replication count, permissions (including ACLs), and extended
attributes (XAttrs) as they exist on the source file system, or to use the settings as configured on the destination
file system. By default source system settings are preserved. When Permission is checked, and both the
source and destination clusters support ACLs, replication preserves ACLs. Otherwise, ACLs are not replicated.
When Extended attributes is checked, and both the source and destination clusters support extended
attributes, replication preserves them. (This option only displays when both source and destination clusters
support extended attributes.)
If you select one or more of the Preserve options and you are replicating to S3 or ADLS, the values of all of these
items are saved in metadata files on S3 or ADLS. When you replicate from S3 or ADLS to HDFS, you can select
which of these options you want to preserve.
See Replication of Encrypted Data on page 580 and HDFS Transparent Encryption.
• Delete Policy - Whether files that were deleted on the source should also be deleted from the destination
directory. This policy also determines the handling of files in the destination location that are unrelated to
the source. Options include:
– Keep Deleted Files - Retains the destination files even when they no longer exist at the source. (This is
the default.)
– Delete to Trash - If the HDFS trash is enabled, files are moved to the trash folder. (Not supported when
replicating to S3 or ADLS.)
– Delete Permanently - Uses the least amount of space; use with caution.
• Alerts - Whether to generate alerts for various state changes in the replication workflow. You can alert on
failure, on start, on success, or when the replication workflow is aborted.
8. Click Save Schedule.
The replication task now appears as a row in the Replication Schedules table. (It can take up to 15 seconds for
the task to appear.)
If you selected Immediate in the Schedule field, the replication job begins running when you click Save Schedule.
To specify additional replication tasks, select Create Schedule > HDFS Replication.
Note: If your replication job takes a long time to complete, and files change before the replication
finishes, the replication may fail. Consider making the directories snapshottable, so that the replication
job creates snapshots of the directories before copying the files and then copies files from these
snapshottable directories when executing the replication. See Using Snapshots with Replication on
page 576.
HOST_WHITELIST=host-1.mycompany.com,host-2.mycompany.com
5. Enter a Reason for change, and then click Save Changes to commit the changes.
Only one job corresponding to a replication schedule can occur at a time; if another job associated with that same
replication schedule starts before the previous one has finished, the second one is canceled.
You can limit the replication jobs that are displayed by selecting filters on the left. If you do not see an expected
schedule, adjust or clear the filters. Use the search box to search the list of schedules for path, database, or table
names.
The Replication Schedules columns are described in the following table.
Column Description
ID An internally generated ID number that identifies the schedule. Provides a convenient way to
identify a schedule.
Click the ID column label to sort the replication schedule table by ID.
Name The unique name you specify when you create a schedule.
Type The type of replication scheduled, either HDFS or Hive.
Source The source cluster for the replication.
Destination The destination cluster for the replication.
Throughput Average throughput per mapper/file of all the files written. Note that throughput does not
include the following information: the combined throughput of all mappers and the time taken
to perform a checksum on a file after the file is written.
Progress The progress of the replication.
Last Run The date and time when the replication last ran. Displays None if the scheduled replication has
not yet been run. Click the date and time link to view the Replication History page for the
replication.
Displays one of the following status icons:
• Successful - Displays the date and time of the last run replication.
• Failed - Displays the date and time of a failed replication.
• None - This scheduled replication has not yet run.
• Running - Displays a spinner and bar showing the progress of the replication.
Click the Last Run column label to sort the Replication Schedules table by the last run date.
Next Run The date and time when the next replication is scheduled, based on the schedule parameters
specified for the schedule. Hover over the date to view additional details about the scheduled
replication.
Click the Next Run column label to sort the Replication Schedules table by the next run date.
Objects Displays on the bottom line of each row, depending on the type of replication:
• Hive - A list of tables selected for replication.
• HDFS - A list of paths selected for replication.
For example:
Column Description
Actions The following items are available from the Action button:
• Show History - Opens the Replication History page for a replication. See Viewing Replication
History on page 555.
• Edit Configuration - Opens the Edit Replication Schedule page.
• Dry Run - Simulates a run of the replication task but does not actually copy any files or
tables. After a Dry Run, you can select Show History, which opens the Replication History
page where you can view any error messages and the number and size of files or tables
that would be copied in an actual replication.
• Click Collect Diagnostic Data to open the Send Diagnostic Data screen, which allows you
to collect replication-specific diagnostic data for the last 10 runs of the schedule:
1. Select Send Diagnostic Data to Cloudera to automatically send the bundle to Cloudera
Support. You can also enter a ticket number and comments when sending the bundle.
2. Click Collect and Send Diagnostic Data to generate the bundle and open the
Replications Diagnostics Command screen.
3. When the command finishes, click Download Result Data to download a zip file
containing the bundle.
• Run Now - Runs the replication task immediately.
• Disable | Enable - Disables or enables the replication schedule. No further replications are
scheduled for disabled replication schedules.
• Delete - Deletes the schedule. Deleting a replication schedule does not delete copied files
or tables.
• While a job is in progress, the Last Run column displays a spinner and progress bar, and each stage of the replication
task is indicated in the message beneath the job's row. Click the Command Details link to view details about the
execution of the command.
• If the job is successful, the number of files copied is indicated. If there have been no changes to a file at the source
since the previous job, then that file is not copied. As a result, after the initial job, only a subset of the files may
actually be copied, and this is indicated in the success message.
• If the job fails, a failure icon displays.
• To view more information about a completed job, select Actions > Show History. See Viewing Replication History
on page 555.
Enabling, Disabling, or Deleting A Replication Schedule
When you create a new replication schedule, it is automatically enabled. If you disable a replication schedule, it can
be re-enabled at a later time.
To enable, disable, or delete a replication schedule:
• Click Actions > Enable|Disable|Delete in the row for a replication schedule.
To enable, disable, or delete multiple replication schedules:
1. Select one or more replication schedules in the table by clicking the check box the in the left column of the table.
2. Click Actions for Selected > Enable|Disable|Delete.
The Replication History page displays a table of previously run replication jobs with the following columns:
Column Description
Start Time Time when the replication job started.
Expand the display and show details of the replication. In this screen, you can:
• Click the View link to open the Command Details page, which displays details and
messages about each step in the execution of the command. Expand the display for a
Step to:
– View the actual command string.
– View the Start time and duration of the command.
– Click the Context link to view the service status page relevant to the command.
– Select one of the tabs to view the Role Log, stdout, and stderr for the command.
See Viewing Running and Recent Commands on page 301.
• Click Collect Diagnostic Data to open the Send Diagnostic Data screen, which allows
you to collect replication-specific diagnostic data for this run of the schedule:
1. Select Send Diagnostic Data to Cloudera to automatically send the bundle to
Cloudera Support. You can also enter a ticket number and comments when sending
the bundle.
Column Description
2. Click Collect and Send Diagnostic Data to generate the bundle and open the
Replications Diagnostics Command screen.
3. When the command finishes, click Download Result Data to download a zip file
containing the bundle.
• (HDFS only) Link to view details on the MapReduce Job used for the replication. See
Viewing and Filtering MapReduce Activities on page 309.
• (Dry Run only) View the number of Replicable Files. Displays the number of files that
would be replicated during an actual replication.
• (Dry Run only) View the number of Replicable Bytes. Displays the number of bytes that
would be replicated during an actual replication.
• Link to download a CSV file containing a Replication Report. This file lists the databases
and tables that were replicated.
• View the number of Errors that occurred during the replication.
• View the number of Impala UDFs replicated. (Displays only for Hive/Impala replications
where Replicate Impala Metadata is selected.)
• Click the link to download a CSV file containing a Download Listing. This file lists the files
and directories that were replicated.
• Click the link to download a CSV file containing Download Status.
• If a user was specified in the Run As Username field when creating the replication job,
the selected user displays.
• View messages returned from the replication job.
Ensure that the following basic permissions are available to provide read-write access to S3 through the S3A connector:
s3:Get*
s3:Delete*
s3:Put*
s3:ListBucket
s3:ListBucketMultipartUploads
s3:AbortMultipartUpload
Note: This page contains references to CDH 5 components or features that have been removed from
CDH 6. These references are only applicable if you are managing a CDH 5 cluster with Cloudera Manager
6. For more information, see Deprecated Items.
You can monitor the progress of an HDFS replication schedule using performance data that you download as a CSV file
from the Cloudera Manager Admin console. This file contains information about the files being replicated, the average
throughput, and other details that can help diagnose performance issues during HDFS replications. You can view this
performance data for running HDFS replication jobs and for completed jobs.
To view the performance data for a running HDFS replication schedule:
1. Go to Backup > Replication Schedules.
2. Locate the schedule.
3. Click Performance Report and select one of the following options:
• HDFS Performance Summary – Download a summary report of the performance of the running replication
job. An HDFS Performance Summary Report includes the last performance sample for each mapper that is
working on the replication job.
• HDFS Performance Full – Download a full report of the performance of the running replication job. An HDFS
Performance Full report includes all samples taken for all mappers during the full execution of the replication
job.
4. To view the data, import the file into a spreadsheet program such as Microsoft Excel.
To view the performance data for a completed HDFS replication schedule:
1. Go to Backup > Replication Schedules.
2. Locate the schedule and click Actions > Show History.
The Replication History page for the replication schedule displays.
3. Click to expand the display for this schedule.
4. Click the Download CSV link and select one of the following options:
• Listing – a list of files and directories copied during the replication job.
• Status - full status report of files where the status of the replication is one of the following:
– ERROR – An error occurred and the file was not copied.
– DELETED – A deleted file.
– SKIPPED – A file where the replication was skipped because it was up-to-date.
• Error Status Only – full status report, filtered to show files with errors only.
• Deleted Status Only – full status report, filtered to show deleted files only.
• Skipped Status Only – full status report, filtered to show skipped files only.
• Performance – summary performance report.
• Full Performance – full performance report.
See Table 32: HDFS Performance Report Columns on page 558 for a description of the data in the performance
reports.
5. To view the data, import the file into a spreadsheet program such as Microsoft Excel.
The performance data is collected every two minutes. Therefore, no data is available during the initial execution of a
replication job because not enough samples are available to estimate throughput and other reported data.
The data returned by the CSV files downloaded from the Cloudera Manager Admin console has the following columns:
• If you employ a proxy user with the form user@domain, performance data is not available through the links.
• If the replication job only replicates small files that can be transferred in less than a few minutes, no performance
statistics are collected.
• For replication schedules that specify the Dynamic Replication Strategy, statistics regarding the last file transferred
by a MapReduce job hide previous transfers performed by that MapReduce job.
• Only the last trace per MapReduce job is reported in the CSV file.
Hive/Impala Replication
Note: This page contains references to CDH 5 components or features that have been removed from
CDH 6. These references are only applicable if you are managing a CDH 5 cluster with Cloudera Manager
6. For more information, see Deprecated Items.
Note: If your deployment includes tables backed by Kudu, BDR filters out Kudu tables for a Hive
replication in order to prevent data loss or corruption. Even though BDR does not replicate the data
in the Kudu tables, it might replicate the tables' metadata entries to the destination.
HOST_WHITELIST=host-1.mycompany.com,host-2.mycompany.com
5. Enter a Reason for change, and then click Save Changes to commit the changes.
Replication of Parameters
Hive replication replicates parameters of databases, tables, partitions, table column stats, indexes, views, and Hive
UDFs.
You can disable replication of parameters:
1. Log in to the Cloudera Manager Admin Console.
2. Go to the Hive service.
3. Click the Configuration tab.
4. Search for "Hive Replication Environment Advanced Configuration Snippet"
5. Add the following parameter:
REPLICATE_PARAMETERS=false
• Name: replication.hive.ignoreTableNotFound
Value: true
• If a Hive replication schedule is created to replicate a database, ensure all the HDFS paths for the tables in that
database are either snapshottable or under a snapshottable root. For example, if the database that is being
replicated has external tables, all the external table HDFS data locations should be snapshottable too. Failing to
do so will cause BDR to fail to generate a diff report. Without a diff report, BDR will not use snapshot diff.
• After every replication, BDR retains a snapshot on the source cluster. Using the snapshot copy on the source
cluster, BDR performs incremental backups for the next replication cycle. BDR retains snapshots on the source
cluster only if:
– The source and target clusters are managed by Cloudera Manager 5.15 or higher.
– The source and target clusters run CDH 5.13.3 or higher, 5.14.2 or higher, or 5.15 or higher.
Note: In replication scenarios where a destination cluster has multiple source clusters, all the source
clusters must either be secure or insecure. BDR does not support replication from a mixture of secure
and insecure source clusters.
To enable replication from an insecure cluster to a secure cluster, you need a user that exists on all the hosts on both
the source cluster and destination cluster. Specify this user in the Run As Username field when you create a replication
schedule.
The following steps describe how to add a user; example commands for each step are sketched after this list:
1. On a host in the source or destination cluster, add a user with the following command:
2. Set the permissions for the user directory with the following command:
For example, the following command makes milton the owner of the milton directory:
3. Create the supergroup group for the user you created in step 1 with the following command:
groupadd supergroup
4. Add the user you created in step 1 to the group you created:
5. Repeat this process for all hosts in the source and destination clusters so that the user and group exists on all of
them.
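The command bodies for steps 1 through 4 are not shown above. A minimal sketch, assuming a Linux host, the example
user milton from step 2, and an HDFS home directory for that user, might look like the following; treat the exact
commands as a starting point to adapt to your environment:
# Step 1: create the user on the host
useradd milton
# Step 2: create the user's HDFS directory and make milton its owner
sudo -u hdfs hdfs dfs -mkdir -p /user/milton
sudo -u hdfs hdfs dfs -chown milton /user/milton
# Step 3: create the supergroup group
groupadd supergroup
# Step 4: add the user to the group
usermod -a -G supergroup milton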
After you complete this process, specify the user you created in the Run As Username field when you create a replication
schedule.
Note: If you are replicating to or from S3 or ADLS, follow the steps under Hive/Impala Replication
To and From Cloud Storage on page 571 before completing these steps.
a. Use the Name field to provide a unique name for the replication schedule.
b. Use the Source drop-down list to select the cluster with the Hive service you want to replicate.
c. Use the Destination drop-down list to select the destination for the replication. If there is only one Hive
service managed by Cloudera Manager available as a destination, this is specified as the destination. If more
than one Hive service is managed by this Cloudera Manager, select from among them.
d. Based on the type of destination cluster you plan to use, select one of these two options:
• Use HDFS Destination
• Use Cloud Destination
Note: To use cloud storage in the target cluster, you must set up a valid cloud storage
account and verify that the cloud storage has enough space to save the replicated data.
g. Select a Schedule:
Note: The user running the MapReduce job should have read and execute permissions
on the Hive warehouse directory on the source cluster. If you configure the replication job
to preserve permissions, superuser privileges are required on the destination cluster.
i. Specify the Run on peer as Username option if the peer cluster is configured with a different superuser. This
is only applicable while working in a kerberized environment.
6. Select the Resources tab to configure the following:
• Scheduler Pool – (Optional) Enter the name of a resource pool in the field. The value you enter is used by
the MapReduce Service you specified when Cloudera Manager executes the MapReduce job for the replication.
The job specifies the value using one of these properties:
– MapReduce – Fair scheduler: mapred.fairscheduler.pool
– MapReduce – Capacity scheduler: queue.name
– YARN – mapreduce.job.queuename
A sketch of how this property is passed to a job appears after this list.
• Maximum Map Slots and Maximum Bandwidth – Limits for the number of map slots and for bandwidth per
mapper. The default bandwidth limit is 100 MB.
• Replication Strategy – Whether file replication should be static (the default) or dynamic. Static replication
distributes file replication tasks among the mappers up front to achieve a uniform distribution based on file
sizes. Dynamic replication distributes file replication tasks in small sets to the mappers, and as each mapper
processes its tasks, it dynamically acquires and processes the next unallocated set of tasks.
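For reference, the following is a sketch of how such a scheduler property reaches a job when a copy is run by hand
outside Cloudera Manager (Cloudera Manager sets the property for you when it launches the replication job). The pool
name bdr_pool and the cluster paths are hypothetical placeholders:
hadoop distcp -Dmapreduce.job.queuename=bdr_pool hdfs://source-nn:8020/src/path hdfs://dest-nn:8020/dest/path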
7. Select the Advanced tab to specify an export location, modify the parameters of the MapReduce job that will
perform the replication, and set other options. You can select a MapReduce service (if there is more than one in
your cluster) and change the following parameters:
• Uncheck the Replicate HDFS Files checkbox to skip replicating the associated data files.
• If both the source and destination clusters use CDH 5.7.0 or later up to and including 5.11.x, use the
Replicate Impala Metadata drop-down list to select No and avoid redundant replication of Impala metadata.
(This option only displays when supported by both source and destination clusters.) You can select the
following options for Replicate Impala Metadata:
– Yes – replicates the Impala metadata.
– No – does not replicate the Impala metadata.
– Auto – Cloudera Manager determines whether or not to replicate the Impala metadata based on the
CDH version.
To replicate Impala UDFs when the version of CDH managed by Cloudera Manager is 5.7 or lower, see
Replicating Data to Impala Clusters on page 575 for information on when to select this option.
• The Force Overwrite option, if checked, forces overwriting data in the destination metastore if incompatible
changes are detected. For example, if the destination metastore was modified, and a new partition was added
to a table, this option forces deletion of that partition, overwriting the table with the version found on the
source.
Important: If the Force Overwrite option is not set, and the Hive/Impala replication process
detects incompatible changes on the source cluster, Hive/Impala replication fails. This
sometimes occurs with recurring replications, where the metadata associated with an existing
database or table on the source cluster changes over time.
Note: In a Kerberized cluster, the HDFS principal on the source cluster must have read,
write, and execute access to the Export Path directory on the destination cluster.
• Number of concurrent HMS connections - The number of concurrent Hive Metastore connections. These
connections are used to concurrently import and export metadata from Hive. Increasing the number of
threads can improve BDR performance. By default, any new replication schedules will use 5 connections.
If you set the value to 1 or more, BDR uses multi-threading with the number of connections specified. If you
set the value to 0 or fewer, BDR uses single threading and a single connection.
Note that the source and destination clusters must run a Cloudera Manager version that supports concurrent
HMS connections: Cloudera Manager 5.15.0 or higher, or Cloudera Manager 6.1.0 or higher.
• By default, Hive HDFS data files (for example, /user/hive/warehouse/db1/t1) are replicated to a location
relative to "/" (in this example, to /user/hive/warehouse/db1/t1). To override the default, enter a path
in the HDFS Destination Path field. For example, if you enter /ReplicatedData, the data files would be
replicated to /ReplicatedData/user/hive/warehouse/db1/t1.
• Select the MapReduce Service to use for this replication (if there is more than one in your cluster).
• Log Path - An alternative path for the logs.
• Description - A description for the replication schedule.
• Skip Checksum Checks - Whether to skip checksum checks, which are performed by default.
Checksums are used for two purposes:
• To skip replication of files that have already been copied. If Skip Checksum Checks is selected, the
replication job skips copying a file if the file lengths and modification times are identical between the
source and destination clusters. Otherwise, the job copies the file from the source to the destination.
• To redundantly verify the integrity of data. However, checksums are not required to guarantee accurate
transfers between clusters. HDFS data transfers are protected by checksums during transfer and storage
hardware also uses checksums to ensure that data is accurately stored. These two mechanisms work
together to validate the integrity of the copied data.
• Skip Listing Checksum Checks - Whether to skip checksum check when comparing two files to determine
whether they are same or not. If skipped, the file size and last modified time are used to determine if files
are the same or not. Skipping the check improves performance during the mapper phase. Note that if you
select the Skip Checksum Checks option, this check is also skipped.
• Abort on Error - Whether to abort the job on an error. If selected, files copied up to that point remain on
the destination, but no additional files are copied. Abort on Error is off by default.
• Abort on Snapshot Diff Failures - If a snapshot diff fails during replication, BDR uses a complete copy to
replicate data. If you select this option, BDR instead aborts the replication when it encounters an error.
• Delete Policy - Whether files that were on the source should also be deleted from the destination directory.
Options include:
– Keep Deleted Files - Retains the destination files even when they no longer exist at the source. (This is
the default.)
– Delete to Trash - If the HDFS trash is enabled, files are moved to the trash folder. (Not supported when
replicating to S3 or ADLS.)
– Delete Permanently - Uses the least amount of space; use with caution.
• Preserve - Whether to preserve the Block Size, Replication Count, and Permissions as they exist on the
source file system, or to use the settings as configured on the destination file system. By default, settings are
preserved on the source.
Note: You must be running as a superuser to preserve permissions. Use the "Run As
Username" option to ensure that is the case.
• Alerts - Whether to generate alerts for various state changes in the replication workflow. You can alert On
Failure, On Start, On Success, or On Abort (when the replication workflow is aborted).
8. Click Save Schedule.
The replication task appears as a row in the Replications Schedule table. See Viewing Replication Schedules on
page 567.
Note: If your replication job takes a long time to complete, and tables change before the replication
finishes, the replication may fail. Consider making the Hive Warehouse Directory and the directories
of any external tables snapshottable, so that the replication job creates snapshots of the directories
before copying the files. See Using Snapshots with Replication on page 576.
For previously-run replications, the number of replicated UDFs displays on the Replication History page.
Only one job corresponding to a replication schedule can occur at a time; if another job associated with that same
replication schedule starts before the previous one has finished, the second one is canceled.
You can limit the replication jobs that are displayed by selecting filters on the left. If you do not see an expected
schedule, adjust or clear the filters. Use the search box to search the list of schedules for path, database, or table
names.
The Replication Schedules columns are described in the following table.
Column Description
ID An internally generated ID number that identifies the schedule. Provides a convenient way to
identify a schedule.
Click the ID column label to sort the replication schedule table by ID.
Name The unique name you specify when you create a schedule.
Type The type of replication scheduled, either HDFS or Hive.
Source The source cluster for the replication.
Destination The destination cluster for the replication.
Throughput Average throughput per mapper/file of all the files written. The throughput figure does not reflect
the combined throughput of all mappers, and it does not include the time taken to perform a checksum
on a file after the file is written.
Progress The progress of the replication.
Last Run The date and time when the replication last ran. Displays None if the scheduled replication has
not yet been run. Click the date and time link to view the Replication History page for the
replication.
Displays one of the following icons:
• Successful - Displays the date and time of the last run replication.
• Failed - Displays the date and time of a failed replication.
• None - This scheduled replication has not yet run.
• Running - Displays a spinner and bar showing the progress of the replication.
Click the Last Run column label to sort the Replication Schedules table by the last run date.
Next Run The date and time when the next replication is scheduled, based on the schedule parameters
specified for the schedule. Hover over the date to view additional details about the scheduled
replication.
Click the Next Run column label to sort the Replication Schedules table by the next run date.
Objects Displays on the bottom line of each row, depending on the type of replication:
• Hive - A list of tables selected for replication.
• HDFS - A list of paths selected for replication.
Actions The following items are available from the Action button:
• Show History - Opens the Replication History page for a replication. See Viewing Replication
History on page 555.
• Edit Configuration - Opens the Edit Replication Schedule page.
• Dry Run - Simulates a run of the replication task but does not actually copy any files or
tables. After a Dry Run, you can select Show History, which opens the Replication History
page where you can view any error messages and the number and size of files or tables
that would be copied in an actual replication.
• Click Collect Diagnostic Data to open the Send Diagnostic Data screen, which allows you
to collect replication-specific diagnostic data for the last 10 runs of the schedule:
1. Select Send Diagnostic Data to Cloudera to automatically send the bundle to Cloudera
Support. You can also enter a ticket number and comments when sending the bundle.
2. Click Collect and Send Diagnostic Data to generate the bundle and open the
Replications Diagnostics Command screen.
3. When the command finishes, click Download Result Data to download a zip file
containing the bundle.
• Run Now - Runs the replication task immediately.
• Disable | Enable - Disables or enables the replication schedule. No further replications are
scheduled for disabled replication schedules.
• Delete - Deletes the schedule. Deleting a replication schedule does not delete copied files
or tables.
• While a job is in progress, the Last Run column displays a spinner and progress bar, and each stage of the replication
task is indicated in the message beneath the job's row. Click the Command Details link to view details about the
execution of the command.
• If the job is successful, the number of files copied is indicated. If there have been no changes to a file at the source
since the previous job, then that file is not copied. As a result, after the initial job, only a subset of the files may
actually be copied, and this is indicated in the success message.
• If the job fails, the icon displays.
• To view more information about a completed job, select Actions > Show History. See Viewing Replication History
on page 555.
Enabling, Disabling, or Deleting A Replication Schedule
When you create a new replication schedule, it is automatically enabled. If you disable a replication schedule, it can
be re-enabled at a later time.
To enable, disable, or delete a replication schedule:
• Click Actions > Enable|Disable|Delete in the row for a replication schedule.
To enable, disable, or delete multiple replication schedules:
1. Select one or more replication schedules in the table by clicking the check box in the left column of the table.
2. Click Actions for Selected > Enable|Disable|Delete.
The Replication History page displays a table of previously run replication jobs with the following columns:
Column Description
Start Time Time when the replication job started.
Expand the display and show details of the replication. In this screen, you can:
• Click the View link to open the Command Details page, which displays details and
messages about each step in the execution of the command. Expand the display for a
Step to:
– View the actual command string.
– View the Start time and duration of the command.
– Click the Context link to view the service status page relevant to the command.
– Select one of the tabs to view the Role Log, stdout, and stderr for the command.
See Viewing Running and Recent Commands on page 301.
• Click Collect Diagnostic Data to open the Send Diagnostic Data screen, which allows
you to collect replication-specific diagnostic data for this run of the schedule:
1. Select Send Diagnostic Data to Cloudera to automatically send the bundle to
Cloudera Support. You can also enter a ticket number and comments when sending
the bundle.
2. Click Collect and Send Diagnostic Data to generate the bundle and open the
Replications Diagnostics Command screen.
3. When the command finishes, click Download Result Data to download a zip file
containing the bundle.
• (HDFS only) Link to view details on the MapReduce Job used for the replication. See
Viewing and Filtering MapReduce Activities on page 309.
• (Dry Run only) View the number of Replicable Files. Displays the number of files that
would be replicated during an actual replication.
• (Dry Run only) View the number of Replicable Bytes. Displays the number of bytes that
would be replicated during an actual replication.
• Link to download a CSV file containing a Replication Report. This file lists the databases
and tables that were replicated.
• View the number of Errors that occurred during the replication.
• View the number of Impala UDFs replicated. (Displays only for Hive/Impala replications
where Replicate Impala Metadata is selected.)
• Click the link to download a CSV file containing a Download Listing. This file lists the files
and directories that were replicated.
• Click the link to download a CSV file containing Download Status.
• If a user was specified in the Run As Username field when creating the replication job,
the selected user displays.
• View messages returned from the replication job.
Files Skipped Number of files skipped during the replication. The replication process skips files that already
exist in the destination and have not changed.
Important: If AWS S3 access keys are rotated, you must restart the Cloudera Manager server;
otherwise, Hive replication fails.
s3a://S3_bucket_name/path
adl://<accountname>.azuredatalakestore.net/<path>
s3a://S3_bucket_name/path_to_metadata_file
adl://<accountname>.azuredatalakestore.net/<path_to_metadata_file>
6. Complete the configuration of the Hive/Impala replication schedule by following the steps under Configuring
Replication of Hive/Impala Data on page 562, beginning with step 5.f on page 563
Ensure that the following basic permissions are available to provide read-write access to S3 through the S3A connector:
s3:Get*
s3:Delete*
s3:Put*
s3:ListBucket
s3:ListBucketMultipartUploads
s3:AbortMultipartUpload
Note: This page contains references to CDH 5 components or features that have been removed from
CDH 6. These references are only applicable if you are managing a CDH 5 cluster with Cloudera Manager
6. For more information, see Deprecated Items.
You can monitor the progress of a Hive/Impala replication schedule using performance data that you download as a
CSV file from the Cloudera Manager Admin console. This file contains information about the tables and partitions being
replicated, the average throughput, and other details that can help diagnose performance issues during Hive/Impala
replications. You can view this performance data for running Hive/Impala replication jobs and for completed jobs.
To view the performance data for a running Hive/Impala replication schedule:
1. Go to Backup > Replication Schedules.
2. Locate the row for the schedule.
3. Click Performance Reports and select one of the following options:
• HDFS Performance Summary – downloads a summary performance report of the HDFS phase of the running
Hive replication job.
• HDFS Performance Full – downloads a full performance report of the HDFS phase of the running Hive replication
job.
• Hive Performance – downloads a report of Hive performance.
4. To view the data, import the file into a spreadsheet program such as Microsoft Excel.
To view the performance data for a completed Hive/Impala replication schedule:
1. Go to Backup > Replication Schedules.
Note: The option to download the HDFS Replication Report might not appear if the HDFS phase
of the replication skipped all HDFS files because they have not changed, or if the Replicate HDFS
Files option (located on the Advanced tab when creating Hive/Impala replication schedules) is
not selected.
See Table 35: Hive Performance Report Columns on page 574 for a description of the data in the Hive performance
reports.
5. To view performance of the HDFS phase, click Download CSV next to the HDFS Replication Report label and select
one of the following options:
• Listing – a list of files and directories copied during the replication job.
• Status - full status report of files where the status of the replication is one of the following:
– ERROR – An error occurred and the file was not copied.
– DELETED – A deleted file.
– SKIPPED – A file where the replication was skipped because it was up-to-date.
• Error Status Only – full status report, filtered to show files with errors only.
• Deleted Status Only – full status report, filtered to show deleted files only.
• Skipped Status Only – full status report, filtered to show skipped files only.
• Performance – summary performance report.
• Full Performance – full performance report.
See Table 32: HDFS Performance Report Columns on page 558 for a description of the data in the HDFS performance
reports.
6. To view the data, import the file into a spreadsheet program such as Microsoft Excel.
The performance data is collected every two minutes. Therefore, no data is available during the initial execution of a
replication job because not enough samples are available to estimate throughput and other reported data.
The data returned by the CSV files downloaded from the Cloudera Manager Admin console has the following structure:
• If you employ a proxy user with the form user@domain, performance data is not available through the links.
• If the replication job only replicates small files that can be transferred in less than a few minutes, no performance
statistics are collected.
• For replication schedules that specify the Dynamic Replication Strategy, statistics regarding the last file transferred
by a MapReduce job hide previous transfers performed by that MapReduce job.
• Only the last trace of each MapReduce job is reported in the CSV file.
Note: This feature is not available if the source and destination clusters run CDH 5.12 or higher. This
feature replicated legacy Impala UDFs, which are no longer supported. Impala metadata is replicated
as part of regular Hive/Impala replication operations.
Impala metadata replication is performed as a part of Hive replication. Impala replication is only supported between
two CDH clusters. The Impala and Hive services must be running on both clusters.
To enable Impala metadata replication, perform the following tasks:
1. Schedule Hive replication as described in Configuring Replication of Hive/Impala Data on page 562.
2. Confirm that the Replicate Impala Metadata option is set to Yes on the Advanced tab in the Create Hive Replication
dialog.
When you set the Replicate Impala Metadata option to Yes, Impala UDFs (user-defined functions) will be available on
the target cluster, just as on the source cluster. As part of replicating the UDFs, the binaries in which they are defined
are also replicated.
Note: To run queries or execute DDL statements on tables that have been replicated to a destination
cluster, you must run the Impala INVALIDATE METADATA statement on the destination cluster to
prevent queries from failing. See INVALIDATE METADATA Statement
Note: If the source contains UDFs, you must run the INVALIDATE METADATA statement manually
and without any tables specified even if you configure the automatic invalidation.
If you are using external tables in Hive, also make the directories hosting any external tables not stored in the Hive
warehouse directory snapshottable.
Similarly, if you are using Impala and are replicating any Impala tables using Hive/Impala replication, ensure that the
storage locations for the tables and associated databases are also snapshottable. See Enabling and Disabling HDFS
Snapshots on page 609.
Important: Cloudera Backup and Disaster Recovery (BDR) works with clusters in different Kerberos
realms even without a Kerberos realm trust relationship. The Cloudera Manager configuration properties
Trusted Kerberos Realms and Kerberos Trusted Realms are used for Cloudera Manager and CDH
configuration, and are not related to Kerberos realm trust relationships.
If you are using standalone DistCp between clusters in different Kerberos realms, you must configure
a realm trust. For more information, see Distcp between Secure Clusters in Different Kerberos Realms
on page 648.
Ports
When using BDR with Kerberos authentication enabled, BDR requires all the ports listed on the following page: Port
Requirements for Backup and Disaster Recovery on page 540.
Additionally, the port used for the Kerberos KDC Server and KRB5 services must be open to all hosts on the destination
cluster. By default, this is port 88.
Important: If the source and destination clusters are in the same realm but do not use the same KDC
or the KDCs are not part of a unified realm, the replication job will fail.
[realms]
SRC.EXAMPLE.COM = {
kdc = kdc01.src.example.com:88
admin_server = kdc01.example.com:749
default_domain = src.example.com
}
DST.EXAMPLE.COM = {
kdc = kdc01.dst.example.com:88
admin_server = kdc01.dst.example.com:749
default_domain = dst.example.com
}
• Realm mapping for the source cluster domain. You configure these mappings in the [domain_realm] section.
For example:
[domain_realm]
.dst.example.com = DST.EXAMPLE.COM
dst.example.com = DST.EXAMPLE.COM
.src.example.com = SRC.EXAMPLE.COM
src.example.com = SRC.EXAMPLE.COM
2. On the destination cluster, use Cloudera Manager to add the realm of the source cluster to the Trusted Kerberos
Realms configuration property:
a. Go to the HDFS service.
b. Click the Configuration tab.
c. In the search field type Trusted Kerberos to find the Trusted Kerberos Realms property.
d. Click the plus sign icon, and then enter the source cluster realm.
e. Enter a Reason for change, and then click Save Changes to commit the changes.
3. Go to Administration > Settings.
4. In the search field, type domain name.
5. In the Domain Name(s) field, enter any domain or host names you want to map to the destination cluster KDC.
Use the plus sign icon to add as many entries as you need. The entries in this property are used to generate the
domain_realm section in krb5.conf.
6. If domain_realm is configured in the Advanced Configuration Snippet (Safety Valve) for remaining krb5.conf,
remove the entries for it.
7. Enter a Reason for change, and then click Save Changes to commit the changes.
Note: If the source and destination clusters both run Cloudera Manager 5.12 or higher, you do not
need to complete the steps in this section. These additional steps are no longer required for Hive or
Impala replication. If you are using Cloudera Manager 5.11 or lower, complete the steps above in
HDFS, Hive, and Impala Replication on page 577, and then complete the steps in the following section.
Kerberos Recommendations
If Cloudera Manager manages the Kerberos configuration file, Cloudera Manager configures Kerberos correctly for you
and then provides the set of commands that you must manually run to finish configuring the clusters. The following
screen shots show the prompts that Cloudera Manager provides in cases of improper configuration:
Configuration changes:
If Cloudera Manager does not manage the Kerberos configuration file, Cloudera Manager provides the manual steps
required to correct the issue. For example, the following screen shot shows the steps required to properly configure
Kerberos:
For more information about HDFS encryption zones, see HDFS Transparent Encryption. Encryption zones are not supported
in CDH versions 5.1 or lower. Replication of data from CDH 6 encrypted zones to CDH 5 is not supported.
When you configure encryption zones, you also configure a Key Management Server (KMS) to manage encryption keys.
During replication, Cloudera Manager uses TLS/SSL to encrypt the keys when they are transferred from the source
cluster to the destination cluster.
When you configure encryption zones, you also configure a Key Management Server (KMS) to manage encryption keys.
When an HDFS replication command that specifies an encrypted source directory runs, Cloudera Manager temporarily
copies the encryption keys from the source cluster to the destination cluster, using TLS/SSL (if configured for the KMS)
to encrypt the keys. Cloudera Manager then uses these keys to decrypt the encrypted files when they are received
from the source cluster before writing the files to the destination cluster.
Important: When you configure HDFS replication, you must select the Skip Checksum Checks property
to prevent replication failure in the following cases:
• Replications from an encrypted zone on the source cluster to an encrypted zone on a destination
cluster.
• Replications from an encryption zone on the source cluster to an unencrypted zone on the
destination cluster.
• Replications from an unencrypted zone on the source cluster to an encrypted zone on the
destination cluster.
Even when the source and destination directories are both in encryption zones, the data is decrypted as it is read from
the source cluster (using the key for the source encryption zone) and encrypted again when it is written to the destination
cluster (using the key for the destination encryption zone). The data transmission is encrypted if you have configured
encryption for HDFS Data Transfer.
Note: The decryption and encryption steps happen in the same process on the hosts where the
MapReduce jobs that copy the data run. Therefore, data in plain text only exists within the memory
of the Mapper task. If a KMS is in use on either the source or destination clusters, and you are using
encrypted zones for either the source or destination directories, configure TLS/SSL for the KMS to
prevent transferring the key to the mapper task as plain text.
During replication, data travels from the source cluster to the destination cluster using distcp. For clusters that use
encryption zones, configure encryption of KMS key transfers between the source and destination using TLS/SSL. See
Configuring TLS/SSL for the KMS.
To configure encryption of data transmission between source and destination clusters:
• Enable TLS/SSL for HDFS clients on both the source and the destination clusters. For instructions, see Configuring
TLS/SSL for HDFS, YARN and MapReduce. You may also need to configure trust between the SSL certificates on
the source and destination.
• Enable TLS/SSL for the two peer Cloudera Manager Servers. See Manually Configuring TLS Encryption for Cloudera
Manager.
• Encrypt data transfer using HDFS Data Transfer Encryption. See Configuring Encrypted Transport for HDFS.
The following blog post provides additional information about encryption with HDFS:
http://blog.cloudera.com/blog/2013/03/how-to-set-up-a-hadoop-cluster-with-network-encryption/.
Security Considerations
The user you specify with the Run As field when scheduling a replication job requires full access to both the key and
the data directories being replicated. This is not a recommended best practice for KMS management. If you change
permissions in the KMS to enable this requirement, you could accidentally provide access for this user to data in other
encryption zones using the same key. If a user is not specified in the Run As field, the replication runs as the default
user, hdfs.
To access encrypted data, the user must be authorized on the KMS for the encryption zones they need to interact with.
The user you specify with the Run As field when scheduling a replication must have this authorization. The key
administrator must add ACLs to the KMS for that user to prevent authorization failure.
Key transfer using the KMS protocol from source to the client uses the REST protocol, which requires that you configure
TLS/SSL for the KMS. When TLS/SSL is enabled, keys are not transferred over the network as plain text.
See Encryption Mechanisms Overview.
HBase Replication
If your data is already in an HBase cluster, replication is useful for getting the data into additional HBase clusters. In
HBase, cluster replication refers to keeping one cluster state synchronized with that of another cluster, using the
write-ahead log (WAL) of the source cluster to propagate the changes. Replication is enabled at column family granularity.
Before enabling replication for a column family, create the table and all column families to be replicated, on the
destination cluster. Replication is supported both from CDH 5 to CDH 6 and from CDH 6 to CDH 5; the source and
destination clusters do not have to run the same major version of CDH.
Cluster replication uses an active-push methodology. An HBase cluster can be a source (also called active, meaning
that it writes new data), a destination (also called passive, meaning that it receives data using replication), or can fulfill
both roles at once. Replication is asynchronous, and the goal of replication is eventual consistency.
When data is replicated from one cluster to another, the original source of the data is tracked with a cluster ID, which
is part of the metadata. All clusters that have already consumed the data are also tracked. This prevents replication
loops.
At the top of the diagram, the San Jose and Tokyo clusters, shown in red, replicate changes to each other, and each
also replicates changes to a User Data and a Payment Data cluster.
Each cluster in the second row, shown in blue, replicates its changes to the All Data Backup 1 cluster, shown in
grey. The All Data Backup 1 cluster replicates changes to the All Data Backup 2 cluster (also shown in grey),
as well as the Data Analysis cluster (shown in green). All Data Backup 2 also propagates any of its own changes
back to All Data Backup 1.
The Data Analysis cluster runs MapReduce jobs on its data, and then pushes the processed data back to the San
Jose and Tokyo clusters.
Requirements
Before configuring replication, make sure your environment meets the following requirements:
• You must manage ZooKeeper yourself. It must not be managed by HBase, and must be available throughout the
deployment.
• Each host in both clusters must be able to reach every other host, including those in the ZooKeeper cluster.
• Every table that contains families that are scoped for replication must exist on each cluster and have exactly the
same name. If your tables do not yet exist on the destination cluster, see Creating the Empty Table On the
Destination Cluster on page 589.
• HBase version 0.92 or greater is required for complex replication topologies, such as active-active.
Important: You cannot run replication-related HBase commands as an HBase administrator. To run
replication-related HBase commands, you must have HBase user permissions. If ZooKeeper uses
Kerberos, configure HBase Shell to authenticate to ZooKeeper using Kerberos before attempting to
run replication-related commands. No replication-related ACLs are available at this time.
5. On the source cluster, in HBase Shell, add the destination cluster as a peer, using the add_peer command (an
example is sketched after this list). The ID must be a short integer. To compose the CLUSTER_KEY, use the
following template:
hbase.zookeeper.quorum:hbase.zookeeper.property.clientPort:zookeeper.znode.parent
If both clusters use the same ZooKeeper cluster, you must use a different zookeeper.znode.parent, because they
cannot write in the same folder.
6. On the source cluster, configure each column family to be replicated by setting its REPLICATION_SCOPE to 1
in HBase Shell (example commands are included in the sketch after this list).
7. Verify that replication is occurring by examining the logs on the source cluster for replication-related messages.
8. To verify the validity of replicated data, use the included VerifyReplication MapReduce job on the source
cluster, providing it with the ID of the replication peer and table name to verify. Other options are available, such
as a time range or specific families to verify.
The command has the following form:
hbase org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication
[--starttime=timestamp1] [--stoptime=timestamp] [--families=comma separated list of
families] <peerId> <tablename>
The VerifyReplication command prints GOODROWS and BADROWS counters to indicate rows that did and did
not replicate correctly.
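The command bodies for steps 5 and 6 are not shown above. The following is a minimal HBase Shell sketch, assuming a
hypothetical peer ID of 1, a hypothetical ZooKeeper quorum, and an example table and column family; depending on
your HBase version, the older positional form add_peer 'ID', 'CLUSTER_KEY' may also be accepted, and you may need to
disable and re-enable the table around the alter command:
hbase> add_peer '1', CLUSTER_KEY => 'zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase'
hbase> alter 'example_table', {NAME => 'example_family', REPLICATION_SCOPE => '1'}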
Note: HBase peer-to-peer replication from a non-Kerberized cluster to a Kerberized cluster is not
supported.
1. Set up Kerberos on your cluster, as described in Enabling Kerberos Authentication for CDH.
2. If necessary, configure Kerberos cross-realm authentication.
• At the command line, use the list_principals command to list the kdc, admin_server, and
default_domain for each realm.
• Add this information to each cluster using Cloudera Manager. For each cluster, go to HDFS > Configuration >
Trusted Kerberos Realms. Add the target and source. This requires a restart of HDFS.
3. Configure ZooKeeper.
4. Configure the following HDFS parameters on both clusters, in Cloudera Manager or in the listed files if you do not
use Cloudera Manager:
Note:
If you use Cloudera Manager to manage your cluster, do not set these properties directly in
configuration files, because Cloudera Manager will overwrite or ignore these settings. You must
set these properties in Cloudera Manager.
For brevity, the Cloudera Manager setting names are not listed here, but you can search by
property name. For instance, in the HDFS service configuration screen, search for
dfs.encrypt.data.transfer. The Enable Data Transfer Encryption setting is shown. Selecting the
box is equivalent to setting the value to true.
5. Configure the following HBase parameters on both clusters, using Cloudera Manager or in hbase-site.xml if
you do not use Cloudera Manager.
<value>true</value>
</property>
<property>
<name>hbase.ssl.enabled</name>
<value>true</value>
</property>
<property>
<name>hbase.security.authentication</name>
<value>kerberos</value>
</property>
<property>
<name>hbase.security.authorization</name>
<value>true</value>
</property>
<property>
<name>hbase.secure.rpc.engine</name>
<value>true</value>
</property>
6. Add the cluster peers using the simplified add_peer syntax, as described in Add Peer.
7. If you need to add any peers which require custom security configuration, modify the add_peer syntax, using
the following examples as a model.
'hbase.regionserver.keytab.file' =>
'/keytabs/vegas_hbase.keytab',
'hbase.master.keytab.file' =>
'/keytabs/vegas_hbase.keytab'},
TABLE_CFS => {"tbl" => ['cf1']}
Note: For information about adding trusted realms to the cluster, see Adding Trusted Realms to
the Cluster.
3. Enter a Reason for change, and then click Save Changes to commit the changes. Restart the role and service when
Cloudera Manager prompts you to restart.
6. Manually copy the source cluster’s HBase client configuration files to the target cluster where you want
the data to be replicated: copy core-site.xml, hdfs-site.xml, and hbase-site.xml to the target cluster.
Do this for all RegionServers.
7. Go to the target cluster where you want the data to be replicated.
8. Go to HBase > Configuration.
9. Select Scope > (Service-Wide).
10. Locate the HBase Service Advanced Configuration Snippet (Safety Valve) for hbase-site.xml
property or search for it by typing its name in the Search box.
11. Add the following property value:
• Name: hbase.replication.conf.dir
Value: /opt/cloudera/fs_conf
Description: Path to the configuration directory where the source cluster’s configuration files have been
copied. The path for copying the configuration file is [hbase.replication.conf.dir]/[hbase.replication.cluster.id],
for example: /opt/cloudera/fs_conf/source/
12. Restart the HBase service on both clusters to deploy the new configurations.
Important:
• In the target cluster, ensure that you copy the configuration files to the path set in
[hbase.replication.conf.dir]. There, you must create a directory with the
[hbase.replication.cluster.id] name.
• Make sure to set the correct file permissions to hbase user using the command:
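The command itself is not shown above. Assuming the configuration files were copied to the
/opt/cloudera/fs_conf/source directory used in the earlier example, it would look something like this:
chown -R hbase:hbase /opt/cloudera/fs_conf/source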
13. Add the peer to the source cluster as you would with normal replication.
• In the HBase shell, add the target cluster as a peer using the following command:
• Enable the replication for the table to which you will be bulk loading data using the command:
enable_table_replication 'IntegrationTestBulkLoad'
• Alternatively, you can allow replication on a column family using the command:
disable 'IntegrationTestBulkLoad'
alter 'IntegrationTestBulkLoad', {NAME => 'D', REPLICATION_SCOPE => '1'}
enable 'IntegrationTestBulkLoad'
You can verify whether BulkLoad Replication is working in your setup by following the example in this blog post:
https://blog.cloudera.com/blog/2013/09/how-to-use-hbase-bulk-loading-and-why/
Note: This log accumulation is a powerful side effect of the disable_peer command and can be
used to your advantage. See Initiating Replication When Data Already Exists on page 590.
disable_peer("1")
• To re-enable peer 1:
enable_peer("1")
hbase> disable_peer("1")
hbase> disable_table_replication
Already queued edits will be replicated after you use the disable_table_replication command, but new entries
will not. See Understanding How WAL Rolling Affects Replication on page 590.
To start replication again, use the enable_peer command.
1. On the source cluster, describe the table using HBase Shell. The output below has been reformatted for readability.
"CREATE 'cme_users' ,
3. On the destination cluster, paste the command from the previous step into HBase Shell to create the table.
The following diagram shows the consequences of adding and removing peer clusters with unpredictable WAL rolling
occurring. Follow the time line and notice which peer clusters receive which writes. Writes that occurred before the
WAL is rolled are not retroactively replicated to new peers that were not participating in the cluster before the WAL
was rolled.
Follow these instructions to recover HBase data from a replicated cluster in a disaster recovery scenario.
1. Change the value of the column family property REPLICATION_SCOPE on the sink to 0 for each column to be
restored, so that its data will not be replicated during the restore operation (see the sketch after these steps).
2. Change the value of the column family property REPLICATION_SCOPE on the source to 1 for each column to be
restored, so that its data will be replicated.
3. Use the CopyTable or distcp commands to import the data from the backup to the sink cluster, as outlined in
Initiating Replication When Data Already Exists on page 590.
4. Add the sink as a replication peer to the source, using the add_peer command as discussed in Deploying HBase
Replication on page 584. If you used distcp in the previous step, restart or rolling restart both clusters, so that
the RegionServers will pick up the new files. If you used CopyTable, you do not need to restart the clusters. New
data will be replicated as it is written.
5. When restoration is complete, change the REPLICATION_SCOPE values back to their values before initiating the
restoration.
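The following is a minimal HBase Shell sketch of steps 1 and 2 above, assuming a hypothetical table example_table and
column family cf1; disable and re-enable the table around the alter command if your HBase version requires it:
On the sink cluster:
hbase> alter 'example_table', {NAME => 'cf1', REPLICATION_SCOPE => '0'}
On the source cluster:
hbase> alter 'example_table', {NAME => 'cf1', REPLICATION_SCOPE => '1'}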
Note: To use the hbase user in a secure cluster, use Cloudera Manager to add the hbase
user as a YARN whitelisted user. For a new installation, the hbase user is already added to
the whitelisted users. In addition, /user/hbase should exist on HDFS and be owned by the hbase
user, because YARN will create a job staging directory there.
...
org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication$Verifier$Counters
BADROWS=2
CONTENT_DIFFERENT_ROWS=1
GOODROWS=1
ONLY_IN_PEER_TABLE_ROWS=1
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
Counter Description
GOODROWS Number of rows present on both clusters where all values are the same.
CONTENT_DIFFERENT_ROWS The key is the same on both source and destination clusters for a row,
but the value differs.
ONLY_IN_SOURCE_TABLE_ROWS Rows that are only present in the source cluster but not in the
destination cluster.
ONLY_IN_PEER_TABLE_ROWS Rows that are only present in the destination cluster but not in the
source cluster.
BADROWS Total number of rows that differ between the source and destination
clusters; the sum of CONTENT_DIFFERENT_ROWS +
ONLY_IN_SOURCE_TABLE_ROWS + ONLY_IN_PEER_TABLE_ROWS
By default, VerifyReplication compares the entire content of table1 on the source cluster against table1 on
the destination cluster that is configured to use the replication peer peer1.
Use the following options to define the time range, versions, or column families to verify:
Option Description
--starttime=<timestamp> Beginning of the time range, in milliseconds. Time range is forever if no
end time is defined.
--endtime=<timestamp> End of the time range, in milliseconds.
--versions=<versions> Number of cell versions to verify.
--families=<cf1,cf2,..> Families to copy; separated by commas.
You can also restrict verification to rows within a given time range, for example one day.
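A sketch of such a command, assuming hypothetical millisecond timestamps exactly one day apart, the peer ID 1, and
the table1 table described above (depending on your HBase version, the end-of-range option is --endtime or --stoptime):
hbase org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication --starttime=1472499077000 --endtime=1472585477000 1 table1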
Replication Caveats
• Two variables govern replication: hbase.replication as described above under Deploying HBase Replication
on page 584, and a replication znode. Stopping replication (using disable_table_replication as above) sets
the znode to false. Two problems can result:
– If you add a new RegionServer to the active cluster while replication is stopped, its current log will not be
added to the replication queue, because the replication znode is still set to false. If you restart replication
at this point (using enable_peer), entries in the log will not be replicated.
– Similarly, if a log rolls on an existing RegionServer on the active cluster while replication is stopped, the new
log will not be replicated, because the replication znode was set to false when the new log was created.
• In the case of a long-running, write-intensive workload, the destination cluster may become unresponsive if its
meta-handlers are blocked while performing the replication. CDH has three properties to deal with this problem:
– hbase.regionserver.replication.handler.count - the number of replication handlers in the
destination cluster (default is 3). Replication is now handled by separate handlers in the destination cluster
to avoid the above-mentioned sluggishness. Increase it to a high value if the ratio of active to passive
RegionServers is high.
– replication.sink.client.retries.number - the number of times the HBase replication client at the
sink cluster should retry writing the WAL entries (default is 1).
– replication.sink.client.ops.timeout - the timeout for the HBase replication client at the sink cluster
(default is 20 seconds).
• For namespaces, tables, column families, or cells with associated ACLs, the ACLs themselves are not replicated.
The ACLs need to be re-created manually on the target table. This behavior opens up the possibility that the ACLs
could differ between the source and destination clusters.
Snapshots
You can create HBase and HDFS snapshots using Cloudera Manager or by using the command line.
• HBase snapshots allow you to create point-in-time backups of tables without making data copies, and with minimal
impact on RegionServers. HBase snapshots are supported for clusters running CDH 4.2 or higher.
• HDFS snapshots allow you to create point-in-time backups of directories or the entire filesystem without actually
cloning the data. They can improve data replication performance and prevent errors caused by changes to a source
directory. These snapshots appear on the filesystem as read-only directories that can be accessed just like other
ordinary directories.
NEW! View a video about Using Snapshots and Cloudera Manager to Back Up Data.
Note: You can improve the reliability of Data Replication on page 541 by also using snapshots. See
Using Snapshots with Replication on page 576.
Note: You must enable an HDFS directory for snapshots to allow snapshot policies to be created for
that directory. To designate an HDFS directory as snapshottable, follow the procedure in Enabling and
Disabling HDFS Snapshots on page 609.
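If you prefer the command line to Cloudera Manager for this step, the underlying HDFS command, run as the HDFS
superuser against a hypothetical directory, is roughly the following:
hdfs dfsadmin -allowSnapshot /data/dir1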
Existing snapshot policies are shown in a table. See Snapshot Policies Page on page 596.
2. To create a new policy, click Create Snapshot Policy.
3. From the drop-down list, select the service (HDFS or HBase) and cluster for which you want to create a policy.
4. Provide a name for the policy and, optionally, a description.
5. Specify the directories, namespaces or tables to include in the snapshot.
• For an HDFS service, select the paths of the directories to include in the snapshot. The drop-down list allows
you to select only directories that are enabled for snapshotting. If no directories are enabled for snapshotting,
a warning displays.
7. Specify whether Alerts should be generated for various state changes in the snapshot workflow. You can alert on
failure, on start, on success, or when the snapshot workflow is aborted.
8. Click Save Policy.
The new Policy displays on the Snapshot Policies page. See Snapshot Policies Page on page 596.
Column Description
Policy Name The name of the policy.
Cluster The cluster that hosts the service (HDFS or HBase).
Service The service from which the snapshot is taken.
Objects HDFS Snapshots: The directories included in the snapshot.
HBase Snapshots: The tables included in the snapshot.
Last Run The date and time the snapshot last ran. Click the link to view the Snapshots History page. Also
displays the status icon for the last run.
Snapshot Schedule The type of schedule defined for the snapshot: Hourly, Daily, Weekly, Monthly, or Yearly.
Actions A drop-down menu with the following options:
• Show History - Opens the Snapshots History page. See Snapshots History on page 596.
• Edit Configuration - Edit the snapshot policy.
• Delete - Deletes the snapshot policy.
• Enable - Enables running of scheduled snapshot jobs.
• Disable - Disables running of scheduled snapshot jobs.
Snapshots History
The Snapshots History page displays information about Snapshot jobs that have been run or attempted. The page
displays a table of Snapshot jobs with the following columns:
Column Description
Start Time Time when the snapshot job started execution.
Click to display details about the snapshot.
Click the View link to open the Managed scheduled snapshots Command page, which displays details
and messages about each step in the execution of the command.
Paths | Tables Unprocessed HDFS Snapshots: the number of Paths Unprocessed for the snapshot.
HBase Snapshots: the number of Tables Unprocessed for the snapshot.
See Managing HDFS Snapshots on page 608 and Managing HBase Snapshots on page 598 for more information about
managing snapshots.
Orphaned Snapshots
When a snapshot policy includes a limit on the number of snapshots to keep, Cloudera Manager checks the total
number of stored snapshots each time a new snapshot is added, and automatically deletes the oldest existing snapshot
if necessary. When a snapshot policy is edited or deleted, files, directories, or tables that were removed from the policy
may leave "orphaned" snapshots behind that are not deleted automatically because they are no longer associated
with a current snapshot policy. Cloudera Manager never selects these snapshots for automatic deletion because
selection for deletion only occurs when the policy creates a new snapshot containing those files, directories, or tables.
You can delete snapshots manually through Cloudera Manager or by creating a command-line script that uses the
HDFS or HBase snapshot commands. Orphaned snapshots can be hard to locate for manual deletion. Snapshot policies
automatically receive the prefix cm-auto followed by a globally unique identifier (GUID). You can locate all snapshots
for a specific policy by searching for the prefix cm-auto-guid that is unique to that policy.
To avoid orphaned snapshots, delete snapshots before editing or deleting the associated snapshot policy, or record
the identifying name for the snapshots you want to delete. This prefix is displayed in the summary of the policy in the
policy list and appears in the delete dialog box. Recording the snapshot names, including the associated policy prefix,
is necessary because the prefix associated with a policy cannot be determined after the policy has been deleted, and
snapshot names do not contain recognizable references to snapshot policies.
Warning: If you use coprocessors, the coprocessor must be available on the destination cluster before
restoring the snapshot.
To restore a snapshot to a new table, select Restore As from the menu associated with the snapshot, and provide a
name for the new table.
Warning: If you "Restore As" to an existing table (that is, specify a table name that already exists),
the existing table will be overwritten.
Important: When HBase snapshots are stored on, or restored from, Amazon S3, a MapReduce (MRv2)
job is created to copy the HBase table data and metadata. The YARN service must be running on your
Cloudera Manager cluster to use this feature.
To configure HBase to store snapshots on Amazon S3, you must have the following information:
• The access key ID for your Amazon S3 account.
• The secret access key for your Amazon S3 account.
• The path to the directory in Amazon S3 where you want your HBase snapshots to be stored.
Because EC2 throughput is limited on a per-node basis, you can improve the transfer of large snapshots to Amazon
S3 by increasing the number of nodes.
Configuring HBase in Cloudera Manager to Store Snapshots in Amazon S3
Minimum Required Role: Cluster Administrator (also provided by Full Administrator)
Perform the following steps in Cloudera Manager:
1. Open the HBase service page.
2. Select Scope > HBASE (Service-Wide).
3. Select Category > Backup.
4. Type AWS in the Search box.
5. Enter your Amazon S3 access key ID in the field AWS S3 access key ID for remote snapshots.
6. Enter your Amazon S3 secret access key in the field AWS S3 secret access key for remote snapshots.
Important: If AWS S3 access keys are rotated, the Cloudera Manager server must be restarted.
7. Enter the path to the location in Amazon S3 where your HBase snapshots will be stored in the field Amazon S3
Path for Remote Snapshots.
Warning: Do not use the Amazon S3 location defined by the path entered in Amazon S3 Path
for Remote Snapshots for any other purpose, or directly add or delete content there. Doing so
risks corrupting the metadata associated with the HBase snapshots stored there. Use this path
and Amazon S3 location only through Cloudera Manager, and only for managing HBase snapshots.
8. In a terminal window, log in to your Cloudera Manager cluster at the command line and create a /user/hbase
directory in HDFS. Change the owner of the directory to hbase. For example:
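The commands below are a minimal sketch, assuming a non-Kerberos cluster where the hdfs superuser can run commands with sudo:
sudo -u hdfs hdfs dfs -mkdir -p /user/hbase
sudo -u hdfs hdfs dfs -chown hbase /user/hbase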
Note: Amazon S3 imposes a default rate limit per prefix per bucket; throughput can be limited to
3500 requests per second. Consider using different S3 prefixes per table namespace, or per table,
if any of the following applies:
• a large number of tables
• tables with a large number of store files or regions
• a frequent snapshot policy
Configuring the Dynamic Resource Pool Used for Exporting and Importing Snapshots in Amazon S3
Dynamic resource pools are used to control the resources available for MapReduce jobs created for HBase snapshots
on Amazon S3. By default, MapReduce jobs run against the default dynamic resource pool. To choose a different
dynamic resource pool for HBase snapshots stored on Amazon S3, follow these steps:
1. Open the HBase service page.
2. Select Scope > HBASE (Service-Wide).
3. Select Category > Backup.
4. Type Scheduler in the Search box.
5. Enter the name of a dynamic resource pool in the Scheduler pool for remote snapshots in AWS S3 property.
6. Click Save Changes.
HBase Snapshots on Amazon S3 with Kerberos Enabled
Starting with Cloudera Manager 5.8, YARN should by default allow the hbase user to run MapReduce jobs even when
Kerberos is enabled. However, this change only applies to new Cloudera Manager deployments, and not if you have
upgraded from a previous version to Cloudera Manager 5.8 (or higher).
If Kerberos is enabled on your cluster, and YARN does not allow the hbase user to run MapReduce jobs, perform the
following steps:
1. Open the YARN service page in Cloudera Manager.
2. Select Scope > NodeManager.
3. Select Category > Security.
4. In the Allowed System Users property, click the + sign and add hbase to the list of allowed system users.
5. Click Save Changes.
6. Restart the YARN service.
Managing HBase Snapshots on Amazon S3 in Cloudera Manager
Minimum Required Role: BDR Administrator (also provided by Full Administrator)
To take HBase snapshots and store them on Amazon S3, perform the following steps:
1. On the HBase service page in Cloudera Manager, click the Table Browser tab.
2. Select a table in the Table Browser. If any recent local or remote snapshots already exist, they display on the right
side.
3. In the dropdown for the selected table, click Take Snapshot.
4. Enter a name in the Snapshot Name field of the Take Snapshot dialog box.
5. If Amazon S3 storage is configured as described above, the Take Snapshot dialog box Destination section shows
a choice of Local or Remote S3. Select Remote S3.
6. Click Take Snapshot.
While the Take Snapshot command is running, a local copy of the snapshot with a name beginning with cm-tmp
followed by an auto-generated filename is displayed in the Table Browser. This local copy is deleted as soon as
the remote snapshot has been stored in Amazon S3. If the command fails without being completed, the temporary
local snapshot might not be deleted. This copy can be manually deleted or kept as a valid local snapshot. To store
a current snapshot in Amazon S3, run the Take Snapshot command again, selecting Remote S3 as the Destination,
or use the HBase command-line tools to manually export the existing temporary local snapshot to Amazon S3.
Note: You can only configure a policy as Local or Remote S3 at the time the policy is created and
cannot change the setting later. If the setting is wrong, create a new policy.
When you create a snapshot based on a snapshot policy, a local copy of the snapshot is created with a name beginning
with cm-auto followed by an auto-generated filename. The temporary copy of the snapshot is displayed in the Table
Browser and is deleted as soon as the remote snapshot has been stored in Amazon S3. If the snapshot procedure fails
without being completed, the temporary local snapshot might not be deleted. This copy can be manually deleted or
kept as a valid local snapshot. To export the HBase snapshot to Amazon S3, use the HBase command-line tools to
manually export the existing temporary local snapshot to Amazon S3.
Important:
• Follow these command-line instructions on systems that do not use Cloudera Manager.
• This information applies specifically to CDH 6.3.x. See Cloudera Documentation for information
specific to other releases.
Use Cases
• Recovery from user or application errors
– Useful because it may be some time before the database administrator notices the error.
Note:
The database administrator needs to schedule the intervals at which to take and delete
snapshots. Use a script or management tool; HBase does not have this functionality.
– The database administrator may want to save a snapshot before a major application upgrade or change.
Note:
Snapshots are not primarily used for system upgrade protection because they do not roll
back binaries, and would not necessarily prevent bugs or errors in the system or the upgrade.
– Recovery cases:
– Roll back to previous snapshot and merge in reverted data.
– View previous snapshots and selectively merge them into production.
• Backup
– Capture a copy of the database and store it outside HBase for disaster recovery.
– Capture previous versions of data for compliance, regulation, and archiving.
– Export from a snapshot on a live system provides a more consistent view of HBase than CopyTable and
ExportTable.
• Offload work
– Capture, copy, and restore data to another site
– Export data to another cluster
Storage Considerations
Because hfiles are immutable, a snapshot consists of a reference to the files in the table at the moment the snapshot
is taken. No copies of the data are made during the snapshot operation, but copies may be made when a compaction
or deletion is triggered. In this case, if a snapshot has a reference to the files to be removed, the files are moved to an
archive folder, instead of being deleted. This allows the snapshot to be restored in full.
Because no copies are performed, multiple snapshots share the same hfiles, but for tables with many updates and
compactions, each snapshot could have a different set of hfiles.
Configuring and Enabling Snapshots
Snapshots are on by default; to disable them, set the hbase.snapshot.enabled property in hbase-site.xml to
false:
<property>
<name>hbase.snapshot.enabled</name>
<value>false</value>
</property>
To enable snapshots after you have disabled them, set hbase.snapshot.enabled to true.
Note:
If you have taken snapshots and then decide to disable snapshots, you must delete the snapshots
before restarting the HBase master; the HBase master will not start if snapshots are disabled and
snapshots exist.
#!/bin/bash
# Take a snapshot of the table passed as an argument
# Usage: snapshot_script.sh table_name
# Names the snapshot in the format snapshot-YYYYMMDD
# (The body below is an illustrative completion; adjust for your environment.)
TABLE="$1"
SNAPSHOT="snapshot-$(date +%Y%m%d)"
echo "snapshot '${TABLE}', '${SNAPSHOT}'" | hbase shell -n
HBase Shell returns an exit code of 0 on success. A non-zero exit code indicates the possibility of failure, not a definite
failure. In the event of a reported failure, your script should check whether the snapshot was actually created before
taking the snapshot again.
Exporting a Snapshot to Another Cluster
You can export any snapshot from one cluster to another. Exporting the snapshot copies the table's hfiles, logs, and
the snapshot metadata, from the source cluster to the destination cluster. Specify the -copy-from option to copy
from a remote cluster to the local cluster or another remote cluster. If you do not specify the -copy-from option, the
hbase.rootdir in the HBase configuration is used, which means that you are exporting from the current cluster. You
must specify the -copy-to option, to specify the destination cluster.
Note: Snapshots must be enabled on the destination cluster. See Configuring and Enabling Snapshots
on page 603.
Warning: If you use coprocessors, the coprocessor must be available on the destination cluster before
restoring the snapshot.
The ExportSnapshot tool executes a MapReduce job, similar to distcp, to copy files to the other cluster. It works
at the file-system level, so the HBase cluster can be offline.
Run ExportSnapshot as the hbase user or the user that owns the files. If the user, group, or permissions need to
be different on the destination cluster than the source cluster, use the -chuser, -chgroup, or -chmod options as
in the second example below, or be sure the destination directory has the correct permissions. In the following examples,
replace the HDFS server path and port with the appropriate ones for your cluster.
To copy a snapshot called MySnapshot to an HBase cluster srv2 (hdfs://srv2:8020/hbase) using 16 mappers:
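A sketch of the command (replace the HDFS server path, port, and mapper count as appropriate for your cluster):
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot MySnapshot -copy-to hdfs://srv2:8020/hbase -mappers 16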
To export the snapshot and change the ownership of the files during the copy:
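A sketch of the command; the user, group, and mode values shown here are placeholders:
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot MySnapshot -copy-to hdfs://srv2:8020/hbase -chuser MyUser -chgroup MyGroup -chmod 700 -mappers 16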
You can also use the Java -D option in many tools to specify MapReduce or other configuration properties. For example,
the following command copies MY_SNAPSHOT to hdfs://cluster2/hbase using groups of 10 hfiles per mapper:
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot
-Dsnapshot.export.default.map.group=10 -snapshot MY_SNAPSHOT -copy-to
hdfs://cluster2/hbase
To specify a different name for the snapshot on the target cluster, use the -target option.
Restrictions
Warning:
Do not use merge in combination with snapshots. Merging two regions can cause data loss if
snapshots or cloned tables exist for this table.
The merge is likely to corrupt the snapshot and any tables cloned from the snapshot. If the table has
been restored from a snapshot, the merge may also corrupt the table. The snapshot may survive intact
if the regions being merged are not in the snapshot, and clones may survive if they do not share files
with the original table or snapshot. You can use the SnapshotInfo tool (see Information and Debugging
on page 607) to check the status of the snapshot. If the status is BROKEN, the snapshot is unusable.
• If you have enabled the AccessController Coprocessor for HBase, only a global administrator can take,
clone, or restore a snapshot, and these actions do not capture the ACL rights. This means that restoring a table
preserves the ACL rights of the existing table, and cloning a table creates a new table that has no ACL rights until
the administrator adds them.
• Do not take, clone, or restore a snapshot during a rolling restart. Snapshots require RegionServers to be up;
otherwise, the snapshot fails.
Note: This restriction also applies to a rolling upgrade, which can be done only through Cloudera
Manager.
If you are using HBase Replication and you need to restore a snapshot:
Important:
Snapshot restore is an emergency tool; you need to disable the table and table replication to get to
an earlier state, and you may lose data in the process.
If you are using HBase Replication, the replicas will be out of sync when you restore a snapshot. If you need to restore
a snapshot, proceed as follows:
1. Disable the table that is the restore target, and stop the replication.
2. Remove the table from both the master and worker clusters.
3. Restore the snapshot on the master cluster.
4. Create the table on the worker cluster and use CopyTable to initialize it.
Note:
If this is not an emergency (for example, if you know exactly which rows you have lost), you can create
a clone from the snapshot and create a MapReduce job to copy the data that you have lost.
In this case, you do not need to stop replication or disable your main table.
Snapshot Failures
Region moves, splits, and other metadata actions that happen while a snapshot is in progress can cause the snapshot
to fail. The software detects and rejects corrupted snapshot attempts.
Information and Debugging
You can use the SnapshotInfo tool to get information about a snapshot, including status, files, disk usage, and
debugging information.
Examples:
Use the -h option to print usage instructions for the SnapshotInfo utility.
$ hbase org.apache.hadoop.hbase.snapshot.SnapshotInfo -h
Usage: bin/hbase org.apache.hadoop.hbase.snapshot.SnapshotInfo [options]
where [options] are:
-h|-help Show this help and exit.
-remote-dir Root directory that contains the snapshots.
-list-snapshots List all the available snapshots and exit.
-snapshot NAME Snapshot to examine.
-files Files and logs list.
-stats Files and logs stats.
-schema Describe the snapshotted table.
Use the -remote-dir option with the -list-snapshots option to list snapshots located on a remote system.
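For example, a sketch using a placeholder remote snapshot location:
hbase org.apache.hadoop.hbase.snapshot.SnapshotInfo -remote-dir s3a://<bucket>/<snapshot-dir> -list-snapshots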
Use the -snapshot option with the -stats option to display additional statistics about a snapshot.
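For example, using a placeholder snapshot name, a command of this form produces output similar to the line that follows:
hbase org.apache.hadoop.hbase.snapshot.SnapshotInfo -snapshot <snapshot_name> -stats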
1 HFiles (0 in archive), total size 1.0k (100.00% 1.0k shared with the source table)
Use the -schema option with the -snapshot option to display the schema of a snapshot.
Table Descriptor
----------------------------------------
'test', {NAME => 'cf', DATA_BLOCK_ENCODING => 'FAST_DIFF', BLOOMFILTER => 'ROW',
REPLICATION_SCOPE => '0',
COMPRESSION => 'GZ', VERSIONS => '1', TTL => 'FOREVER', MIN_VERSIONS => '0',
KEEP_DELETED_CELLS => 'false',
BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
Use the -files option with the -snapshot option to list information about files contained in a snapshot.
Snapshot Files
----------------------------------------
52.4k test-table/02ba3a0f8964669520cf96bb4e314c60/cf/bdf29c39da2a4f2b81889eb4f7b18107
(archive)
52.4k test-table/02ba3a0f8964669520cf96bb4e314c60/cf/1e06029d0a2a4a709051b417aec88291
(archive)
86.8k test-table/02ba3a0f8964669520cf96bb4e314c60/cf/506f601e14dc4c74a058be5843b99577
(archive)
52.4k test-table/02ba3a0f8964669520cf96bb4e314c60/cf/5c7f6916ab724eacbcea218a713941c4
(archive)
293.4k test-table/02ba3a0f8964669520cf96bb4e314c60/cf/aec5e33a6564441d9bd423e31fc93abb
(archive)
52.4k test-table/02ba3a0f8964669520cf96bb4e314c60/cf/97782b2fbf0743edaacd8fef06ba51e4
(archive)
6 HFiles (6 in archive), total size 589.7k (0.00% 0.0 shared with the source table)
0 Logs, total size 0.0
• Designate HDFS directories to be "snapshottable" so snapshots can be created for those directories. To improve
performance, ensure that you do not enable snapshots at the root directory level.
• Initiate immediate (unscheduled) snapshots of an HDFS directory.
• View the list of saved snapshots currently being maintained. These can include one-off immediate snapshots, as
well as scheduled policy-based snapshots.
• Delete a saved snapshot.
• Restore an HDFS directory or file from a saved snapshot.
• Restore an HDFS directory or file from a saved snapshot to a new directory or file (Restore As).
Before using snapshots, note the following limitations:
• Snapshots that include encrypted directories cannot be restored outside of the zone within which they were
created.
• The Cloudera Manager Admin Console cannot perform snapshot operations (such as create, restore, and delete)
for HDFS paths with encryption-at-rest enabled. This limitation only affects the Cloudera Manager Admin Console
and does not affect CDH command-line tools or actions not performed by the Admin Console, such as BDR
replication which uses command-line tools. For more information about snapshot operations, see the Apache
HDFS snapshots documentation.
Note: Once you enable snapshots for a directory, you cannot enable snapshots on any of its
subdirectories. Snapshots can be taken only on directories that have snapshots enabled.
Taking Snapshots
Note: You can also schedule snapshots to occur regularly by creating a Snapshot Policy.
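As a reference, a snapshot of an enabled (snapshottable) directory can also be taken with the HDFS command-line tools; the directory path and snapshot name below are placeholders:
hdfs dfs -createSnapshot <snapshottable_directory_path> <snapshot_name>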
Deleting Snapshots
1. From the Clusters tab, select your CDH HDFS service.
2. Go to the File Browser tab.
3. Go to the directory with the snapshot you want to delete.
4. In the list of snapshots, locate the snapshot you want to delete and click the drop-down menu next to it.
5. Select Delete.
Restoring Snapshots
Before you restore from a snapshot, ensure that there is adequate disk space.
1. From the Clusters tab, select your CDH HDFS service.
2. Go to the File Browser tab.
3. Go to the directory you want to restore.
4. In the File Browser, click the drop-down menu next to the full file path (to the right of the file browser listings)
and select one of the following:
• Restore Directory From Snapshot
• Restore Directory From Snapshot As...
The Restore Snapshot screen displays.
5. Select Restore Directory From Snapshot As... if you want to restore the snapshot to a different directory. Enter
the directory path to which the snapshot has to be restored. Ensure that there is enough space on HDFS to restore
the files from the snapshot.
Note: If you enter an existing directory path in the Restore Directory From Snapshot As... field,
the directory is overwritten.
• Delete Policy - Whether files that were deleted on the source should also be deleted from the destination
directory. This policy also determines the handling of files in the destination location that are unrelated
to the source. Options include:
– Keep Deleted Files - Retains the destination files even when they no longer exist at the source. (This
is the default.).
– Delete to Trash - If the HDFS trash is enabled, files are moved to the trash folder. (Not supported
when replicating to S3 or ADLS.)
– Delete Permanently - Uses the least amount of space; use with caution.
• Preserve - Whether to preserve the block size, replication count, permissions (including ACLs), and
extended attributes (XAttrs) as they exist on the source file system, or to use the settings as configured
on the destination file system. By default source system settings are preserved. When Permission is
checked, and both the source and destination clusters support ACLs, replication preserves ACLs. Otherwise,
ACLs are not replicated. When Extended attributes is checked, and both the source and destination
clusters support extended attributes, replication preserves them. (This option only displays when both
source and destination clusters support extended attributes.)
If you select one or more of the Preserve options and you are replicating to S3 or ADLS, the values of all of
these items are saved in metadata files on S3 or ADLS. When you replicate from S3 or ADLS to HDFS, you
can select which of these options you want to preserve.
See Replication of Encrypted Data on page 580 and HDFS Transparent Encryption.
BDR Tutorials
Cloudera Backup and Disaster Recovery (BDR) is available with a Cloudera Enterprise license. Enterprise BDR lets you
replicate data from one cluster to another, or from one directory path to another on the same or on a different cluster.
In case of data loss, the backup replica can be used to restore data to the production cluster.
The time to start thinking about how to restore data is long before you might ever need to do so. These BDR tutorials
take you step-by-step through the process of backing up an example production cluster. The example backup replication
schedules are for one-time replication that makes a backup copy of Hive datasets or of HDFS files, respectively, on
another cluster designated as a backup cluster.
The restore processes detailed in each tutorial also take you step-by-step through the process of restoring data using
two different general approaches:
• How To Back Up and Restore Apache Hive Data Using Cloudera Enterprise BDR on page 612 highlights a one-off
data recovery scenario in which you create the replication schedule immediately after a data loss and use it to
restore data.
• How To Back Up and Restore HDFS Data Using Cloudera Enterprise BDR on page 624 shows you how to pre-configure
replication schedules so they are available when needed.
Use either or both of these tutorials to help plan your own backup and restore strategy.
How To Back Up and Restore Apache Hive Data Using Cloudera Enterprise BDR
Cloudera Enterprise Backup and Disaster Recovery (BDR) uses replication schedules to copy data from one cluster to
another, enabling the second cluster to provide a backup for the first.
This tutorial shows you how to configure replication schedules to back up Apache Hive data and to restore data from
the backup cluster when needed.
The example clusters are not configured to use Kerberos, nor do they use external accounts for cloud storage on
Amazon Web Services (AWS). The names of the example production cluster and the example backup cluster have each been
changed from the default "Cluster 1" name to Production DB (Main) and Offsite-Backup, respectively.
The example production cluster contains the Hive default database and an example database, us_fda_fea, which
contains data extracted from the US federal government's open data initiative at data.gov. The us_fda_fea database
contains three tables as shown in Hue Web UI:
As shown in the screenshot below, snapshots have been configured and enabled on the HDFS system path containing
the Hive database files. Using snapshots is one of the Best Practices for Back Up and Restore on page 613. See Using
Snapshots with Replication on page 576 for more information.
The example backup cluster (Offsite-Backup) has not been used for a backup yet, so the Hive default path is empty
as shown below:
Note: Screenshots in this guide show version 5.9 of the Cloudera Manager Admin Console.
The backup and restore processes are configured, managed, and executed using replication schedules. Each replication
schedule identifies a source and a destination for the given replication process. The replication process uses a pull
model. When the replication process runs, the configured destination cluster accesses the given source cluster and
transparently performs all tasks needed to recreate the Hive database and tables on the destination cluster.
The destination cluster handles configuration and running the schedule. Typically, creating a backup replication schedule
takes place on the backup cluster and creating a restore replication schedule takes place on the production cluster.
Thus, as shown in this tutorial, the example production cluster, Production DB (Main), is the source for the backup
replication schedule and the destination for the restore replication schedule.
Creating a Backup
Defining the backup replication schedule starts from the Cloudera Manager Admin Console on the destination cluster.
For this example, the destination cluster is the cluster being used as the backup and the source is the example production
cluster. To create the backup, follow these steps:
• Step 1: Establish a Peer Relationship to the Production Cluster on page 616
• Step 2: Configure the Replication Schedule on page 617
• Step 3: Verify Successful Replication on page 618
Step 1: Establish a Peer Relationship to the Production Cluster
You must have the BDR Administrator or Full Administrator role on both clusters to define a Peer relationship and
perform all subsequent steps.
1. Log in to Cloudera Manager Admin Console on the master node of the backup cluster.
2. Click the Backup tab and select Peers from the menu.
3. On the Peers page, click Add Peers.
4. On the Add Peer page:
a. Peer Name - Enter a meaningful name for the cluster that you want to back up, such as Production DB. This
peer name becomes available in the next step, to be selected as the source for the replication.
b. Peer URL - Enter the URL (including port number) for the Cloudera Manager Admin Console running on the
master node of the cluster.
c. Peer Admin Username - Enter the name of the administrator account for the cluster.
d. Peer Admin Password - Enter the password for the administrator account.
5. Click Add Peer to save your settings, connect to the production cluster, and establish this peer relationship.
The Peers page re-displays, showing the Status column as Connected (note the green check-mark):
You can now create a schedule to replicate Hive files from production to the backup cluster.
Step 2: Configure the Replication Schedule
From the Cloudera Manager Admin Console on the backup cluster:
1. Click the Backup tab and select Replication Schedules from the menu.
2. On the Replication Schedules page, select Create Schedule > Hive Replication.
3. On the Create Hive Replication page, click the General tab to display the default schedule options:
a. Source - Make sure the Hive node selected in the drop-down is the production cluster (the cluster to be
backed up).
b. Destination - This is the cluster you are logged into, the backup cluster. Select the Hive node on the cluster.
c. Databases - Select Replicate All to re-create all Hive databases from the production system on the backup.
Or, deselect Replicate All and enter specific database and table names to back up only selected databases
or tables.
d. Schedule - Immediate. For production environments, change this to Recurring and set a time frame that
allows each backup of the selected dataset to complete. For example, do not set an hourly schedule if
it takes two hours to back up the dataset.
e. Run As Username - Leave as Default.
f. Scheduler Pool - Leave as Default.
4. Click Save Schedule.
The files are replicated from the source cluster to the backup cluster immediately.
Note: When you configure a replication schedule to back up Hive data on a regular basis, make sure
that the schedule allows for each backup to complete. For example, do not create a schedule to back
up every hour if it takes two hours to complete a full backup.
When the process completes, the Replication Schedules page re-displays, showing a green check-mark and time-stamp
in the Last Run column:
When you set up your own schedules in your actual production environment, the Next Run column will likely also
contain a date and time according to your specifications for the schedule.
Step 3: Verify Successful Replication
You can verify that data has been replicated by using Hive commands, the HDFS File Browser, or the Hue Web UI (shown
below):
The Hive database is now on both the production and the backup clusters—the source and destination of the backup
Replication Schedule, respectively.
At this point if the production cluster has a catastrophic data loss, you can use the backup replica to restore the database
to the production cluster.
For example, assume that the us_fda_fea database was inadvertently deleted from the example production cluster as
shown in the Hue Web UI:
Whenever you first discover an issue (data loss, corruption) with production data, immediately disable any existing
backup replication schedules. Disabling the backup replication schedule prevents corrupt or missing data from being
replicated over an existing backup replica, which is why it is the first step in the restore process detailed in the next
set of steps.
Restoring Data from the Backup Cluster
Restoring data from a backup cluster takes place on the production cluster but requires that the backup replication
schedule is first disabled. The process includes these steps:
• Step 1: Disable Backup Replication Schedule on page 620
• Step 2: Establish a Peer Relationship to the Backup Cluster on page 620
• Step 3: Configure the Restore Replication Schedule on page 621
• Step 4: Disable the Replication Schedule on page 622
• Step 5: Verify Successful Replication on page 622
• Step 6: Re-enable the Backup Replication Schedule on page 623
Step 1: Disable Backup Replication Schedule
At the Cloudera Manager Admin Console on the backup cluster:
1. Select Backup > Replication Schedules.
2. On the Replication Schedules page, select the schedule.
3. From the Actions drop-down menu, select Disable.
When the Replication Schedules pages refreshes, the word Disabled displays in the Next Run column for the schedule.
With the backup replication schedule temporarily disabled, move to the production cluster to create and run the
replication schedule to restore the data as detailed in the remaining steps.
Step 2: Establish a Peer Relationship to the Backup Cluster
Log in to Cloudera Manager Admin Console on the master node of the production cluster.
1. Click the Backup tab and select Peers from the menu.
2. On the Peers page, click Add Peers.
3. On the Add Peer page:
a. Peer Name - Enter a meaningful name for the cluster from which to obtain the backup data.
b. Peer URL - Enter the URL for the Cloudera Manager Admin Console (running on the master node of the
cluster).
c. Peer Admin Username - Enter the administrator user name for the backup cluster.
d. Peer Admin Password - Enter the password for the administrator account.
4. Click Add Peer to save your settings. The production cluster connects to the backup cluster, establishes the peer
relationship, and tests the connection.
The Peers page redisplays and lists the peer name, URL, and shows its Status (Connected) as shown below:
Important: Be sure that the source is the cluster where your backup is stored and the destination
is the cluster containing lost or damaged data that you want to replace with the backup.
5. Select Force Overwrite so that the backup cluster's metadata replaces the metadata on the production cluster.
The assumption is that the production cluster's dataset has been corrupted.
Important: The Force Overwrite setting can destroy tables or entries created after the backup
completed. Do not use this setting unless you are certain that you want to overwrite the destination
cluster with data from the source.
The restore process is complete. In a production environment, assuming the restored Hive database and tables are as
you want them in a temporary path, you can re-configure the replication schedule to restore the data to the original
path.
Step 6: Re-enable the Backup Replication Schedule
On the backup cluster, log in to the Cloudera Manager Admin Console.
1. Select Backup > Replication Schedules.
2. On the Replication Schedules page, select the schedule created at the beginning of this process.
3. From the Actions drop-down menu, select Enable.
The restore process is complete.
In actual production environments, create replication schedules that regularly back up your production clusters. To
restore data, create replication schedules as shown in this tutorial.
Alternatively, you can define replication schedules in advance but leave them disabled. See How To Back Up and Restore
HDFS Data Using Cloudera Enterprise BDR on page 624 for details.
See Backup and Disaster Recovery on page 540 and BDR Tutorials on page 612 for more information about Cloudera
Enterprise BDR.
How To Back Up and Restore HDFS Data Using Cloudera Enterprise BDR
Cloudera Enterprise Backup and Disaster Recovery (BDR) uses replication schedules to copy data from one cluster to
another, enabling the second cluster to provide a backup for the first. In case of any data loss, the second cluster—the
backup—can be used to restore data to production.
This tutorial shows you how to create a replication schedule to copy HDFS data from one cluster to another for a
backup, and how to create and test a replication schedule that you can use to restore data when needed in the future.
Creating replication schedules for backup and restore requires:
• A license for Cloudera Enterprise. Cloudera Enterprise BDR is available from the Backup menu of Cloudera Manager
Admin Console when licensed for Enterprise.
• The BDR Administrator or Full Administrator role on the clusters involved (typically, a production cluster and a
backup cluster).
The example clusters are not configured to use Kerberos, nor do they use external accounts for cloud storage on
Amazon Web Services (AWS).
The example production cluster contains nine HDFS files in the /user/cloudera path:
The example backup cluster has not been used as the destination of a replication schedule yet, so the HDFS file system
has no /user/cloudera directory:
Note: Screenshots in this guide show version 5.9 of the Cloudera Manager Admin Console.
The backup and restore processes are configured, managed, and executed using replication schedules. Each replication
schedule identifies a source and a destination for the given replication process. The replication process uses a pull
model. When the replication process runs, the configured destination cluster accesses the given source cluster and
transparently performs all tasks needed to copy the HDFS files to the destination cluster.
The destination cluster handles configuration and running the schedule. Typically, creating a backup replication schedule
takes place on the backup cluster and creating a restore replication schedule takes place on the production cluster.
Thus, as shown in this tutorial, the example production cluster, cloudera-bdr-src-{1..4}.cloud.computers.com, is the
source for the backup replication schedule and the destination for the restore replication schedule.
Backing Up HDFS Files
The backup process begins at the Cloudera Manager Admin Console on the cluster designated as the backup, and
includes these steps:
• Step 1: Establish a Peer Relationship to the Production Cluster on page 626
• Step 2: Configure the Replication Schedule for the Backup on page 627
• Step 3: Verify Successful Replication on page 628
Step 1: Establish a Peer Relationship to the Production Cluster
For a backup, the destination is the backup cluster, and the source is the production cluster.
The cluster establishing the peer relationship gains access to the source cluster and can run the export command, list
HDFS files, and read files for copying them to the destination cluster. These are all the actions performed by the
replication process whenever the defined schedule goes into action.
Defining the replication starts from the Cloudera Manager Admin Console on the backup cluster.
1. Log in to Cloudera Manager Admin Console on the backup cluster.
2. Click the Backup tab and select Peers from the menu.
3. On the Peers page, click Add Peer:
5. Click Add Peer to save your settings, connect to the production cluster, and establish this peer relationship.
After the system establishes and verifies the connection to the peer, the Peers page re-displays, showing the Status
column as Connected (note the green check-mark):
With the peer relationship established from destination to source, create a schedule to replicate HDFS files from the
source (production cluster) to the destination (backup cluster).
Step 2: Configure the Replication Schedule for the Backup
From the Cloudera Manager Admin Console on the backup cluster:
1. Click the Backup tab and select Replication Schedules from the menu.
2. On the Replication Schedules page, select Create Schedule > HDFS Replication.
3. On the Create HDFS Replication page, click the General tab to display the default schedule options:
a. Source - Make sure the cluster node selected in the drop-down is the production cluster (the cluster to be
backed up).
b. Source Path - Specify the directory name on the production cluster holding the files to back up. Use an asterisk
(*) on the directory name to specify that only the explicit directory and no others should be created on the
destination. Without the asterisk, directories may be nested inside a containing directory on the destination.
c. Destination - Select the cluster to use as the target of the replication process, typically, the cluster to which
you have logged in and the cluster to which you want to back up HDFS data.
d. Destination Path - Specify a directory name on the backup cluster.
e. Schedule - Immediate.
f. Run As Username - Leave as Default.
g. Scheduler Pool - Leave as Default.
4. Click Save Schedule.
The files are replicated from the source cluster to the backup cluster immediately.
When the process completes, the Replication Schedules page re-displays, showing a green check-mark and time-stamp
in the Last Run column.
Step 3: Verify Successful Replication
Verify that the HDFS files are now on the backup cluster by using the HDFS File Browser or the hadoop fs -ls
command (shown below):
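For example, assuming the destination path on the backup cluster is /user/cloudera:
hadoop fs -ls /user/cloudera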
The HDFS files are now on both the production cluster and the backup cluster.
You can now use the backup as a source for a restore onto the production system when needed.
Note: This approach lets you step through and test the process prior to using the schedule in a
production environment.
Setting up a schedule for a restore follows the same pattern as setting up the backup, but with all actions initiated
using the Cloudera Manager Admin Console on the production cluster.
To set up and test a replication schedule to restore HDFS from an existing backup copy, follow these steps:
• Step 1: Establish a Peer Relationship to the Backup Cluster on page 629
• Step 2: Configure Replication Schedule to Test the Restore on page 630
• Step 3: Disable the Replication Schedule on page 631
• Step 4: Test the Restore Replication Schedule on page 631
• Step 5: Verify Successful Data Restoration on page 632
Step 1: Establish a Peer Relationship to the Backup Cluster
To restore HDFS files on the production cluster, establish a Peer relationship from the destination to the source for
the restore.
Log in to the Cloudera Manager Admin Console on the production cluster.
1. Click the Backup tab and select Peers from the menu.
2. On the Peers page, click Add Peers.
4. Click Add Peer to save the settings, connect to the backup cluster, and establish the peer relationship.
Once the system connects and tests the peer relationship, the Peers page lists its name, URL, and Status (Connected):
The backup cluster is now available as a peer, for use in a Replication Schedule.
Step 2: Configure Replication Schedule to Test the Restore
The goal in these steps is to create a replication schedule that can be used when needed, in the future, but to leave it
in a disabled state. However, because Replication Schedules cannot be created in a disabled state, you initially set the
date far into the future and then disable the schedule in a subsequent step.
From the Cloudera Manager Admin Console on the production cluster:
1. Click the Backup tab and select Replication Schedules from the menu.
2. On the Replication Schedules page, select Create Schedule > HDFS Replication.
3. On the General settings tab of the Create HDFS Replication page:
a. Source - Make sure the cluster node selected in the drop-down is the backup cluster.
b. Source Path - Specify the directory name that you want to back up. Use an asterisk (*) on the directory name
to specify that only the explicit directory and no others should be created on the destination.
c. Destination - Select the cluster to use as the target for the replication process.
d. Destination Path - Specify a directory name on the backup cluster.
e. Schedule - Set to a time far in the future, so that the schedule does not run as soon as it is saved.
f. Run As Username - Leave as Default.
g. Scheduler Pool - Leave as Default.
You can leave restore Replication Schedules pre-configured and disabled in this way so they are ready to use in the
event of a catastrophic data loss. Before relying on this approach, test the schedule.
Step 4: Test the Restore Replication Schedule
The Replication Schedule defined in the example restores data to a specific directory path identified for the purpose
of restoration (/user/cloudera-restored) rather than targeting the original source directory path.
From the Cloudera Manager Admin Console on the production cluster, with Replication Schedules page displayed:
1. On the Replication Schedules page, select the disabled replication schedule.
2. Select Actions > Run Now.
When the replication process completes, disable the Replication Schedule once again:
1. On the Replication Schedules page, select the replication schedule.
2. Select Actions > Disable .
You can now verify that the files have been replicated to the destination directory path.
Step 5: Verify Successful Data Restoration
To manually verify that your data has been restored to the source cluster, you can use the HDFS File Browser or the
hadoop command-line, as shown here:
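For example, assuming the restore target path /user/cloudera-restored described below:
hadoop fs -ls /user/cloudera-restored
hadoop fs -ls /user/cloudera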
Compare the restored data in /user/cloudera-restored to the data in /user/cloudera to validate that the
restore schedule operates as expected.
At this point, until you need to actually restore production HDFS data files, you can leave the Replication Schedule
disabled.
Important: In a production environment, only one Replication Schedule for a given dataset should
be active at the same time. In this example, the Replication Schedule that created the backup has not
been disabled yet, because the HDFS files from the backup cluster were restored to a different path
on the example production cluster.
In the event of an actual data loss on a production cluster, you should first disable any existing replication schedules
for the affected datasets before activating a replication schedule for the restore, to avoid overwriting existing replicas
on the backup cluster with defective files.
• Step 1: Disable the Backup Replication Schedule on page 633
• Step 2: Edit the Existing Replication Schedule on page 633
• Step 3: Run the Restore Replication Schedule on page 634
• Step 4: Return the Restore Replication Schedule to a Disabled State on page 634
• Step 5: Re-enable the Backup Replication Schedule on page 634
Step 1: Disable the Backup Replication Schedule
Disabling any existing replication schedule for HDFS backups can help prevent the replication of lost or corrupted data
files over existing backups.
At the Cloudera Manager Admin Console on the backup cluster:
1. Select Backup > Replication Schedules.
2. On the Replication Schedules page, select the schedule.
3. From the Actions drop-down menu, select Disable:
When the Replication Schedules pages refreshes, you see Disabled in the Next Run column for the schedule.
Step 2: Edit the Existing Replication Schedule
With the replication schedule disabled, you can edit the replication schedule verified previously (Step 4: Test the
Restore Replication Schedule on page 631) and restore the data to the production cluster.
From the Cloudera Manager Admin Console on the production cluster:
1. Click the Backup tab and select Replication Schedules from the menu.
2. On the Replication Schedules page, select Create Schedule > HDFS Replication.
3. On the General settings tab of the Create HDFS Replication page:
a. Source - Name of the backup cluster from which to pull the data. For this example, the source is the backup
cluster.
b. Source Path - The path on the backup cluster that contains the data you want to restore. Use the asterisk (*)
at the end of the directory name to prevent extraneous sub-directories from being created on the destination.
c. Destination - The name of the cluster on which to restore the data, in which case, the example production
cluster.
d. Destination Path - Directory in which to restore the HDFS data files, in this case, the directory on the example
production system.
e. Schedule - Once.
f. Start Time - Leave set to the future date and time that you originally defined in Step 2: Configure Replication
Schedule to Test the Restore on page 630.
g. Run as Username - Leave set as Default.
3. On the Replication Schedules page, select the appropriate replication schedule to back up the production cluster.
4. From the Actions drop-down menu, select Enable.
This concludes the tutorial. In an actual production environment, you should configure replication schedules to regularly
back up production systems. For restoring files from any backup, you can create and test replication schedules in
advance, as shown in this tutorial.
Alternatively, you can create a replication schedule to restore data specifically when needed. See How To Back Up and
Restore Apache Hive Data Using Cloudera Enterprise BDR on page 612 for details.
See Backup and Disaster Recovery on page 540 and BDR Tutorials on page 612 for more information about Cloudera
Enterprise BDR.
import cm_client
from cm_client.rest import ApiException
from pprint import pprint
TARGET_CM_HOST = "[***destination_cluster***]"
SOURCE_CM_URL = "[***source_cluster***]:7180"
# Configure HTTP basic authorization for the destination Cloudera Manager
cm_client.configuration.username = "[***username***]"
cm_client.configuration.password = "[***password***]"
cm_client.configuration.verify_ssl = False
# Create an instance of the API class
api_host = TARGET_CM_HOST
port = '7183'
api_version = 'v30'
# Construct base URL for API
# http://cmhost:7180/api/v30
api_url = api_host + ':' + port + '/api/' + api_version
api_client = cm_client.ApiClient(api_url)
api_instance = cm_client.CmPeersResourceApi(api_client)
# Define the replication peer that points at the source Cloudera Manager
# (the field values below are illustrative placeholders)
body = cm_client.ApiCmPeer(name="peer_name_from_api",
    url=SOURCE_CM_URL,
    username="[***source_admin_username***]",
    password="[***source_admin_password***]",
    type="REPLICATION")
api_response = api_instance.create_peer(body=body)
The sample above configures an API client for the destination Cloudera Manager and uses it to create the peer.
To implement a similar solution to the example, keep the following guidelines in mind:
• Replace [***destination_cluster***] with the domain name of the destination, for example
target.cm.cloudera.com.
• Replace [***source_cluster***] with the domain name of the source, for example src.cm.cloudera.com.
• The user you specify must possess a role that is capable of creating a peer, such as the Cluster Administrator role.
Step 2. Create the HDFS Replication Schedule
After you have added a peer Cloudera Manager instance that functions as the source, you can create a replication schedule:
import cm_client
from cm_client.rest import ApiException
from pprint import pprint
cm_client.configuration.verify_ssl = False
TARGET_CM_HOST = "[***destination_cluster***]"
body = cm_client.ApiReplicationScheduleList([{
"displayName" : "ScheduleNameFromAPI",
"startTime" : "2021-03-11T18:28:18.684Z",
"interval" : 0,
"intervalUnit" : "MINUTE",
"alertOnStart" : False,
"alertOnSuccess" : False,
"alertOnFail" : False,
"alertOnAbort" : False,
"hdfsArguments" : {
"sourceService" : {
"peerName" : "peer_name_from_api",
"clusterName" : "Cluster 1",
"serviceName" : "HDFS-1"
},
"sourcePath" : "/tmp",
"destinationPath" : "/tmp",
"mapreduceServiceName" : "YARN-1",
"userName" : "d",
"numMaps" : 20,
"dryRun" : False,
"bandwidthPerMap" : 100,
"abortOnError" : False,
"removeMissingFiles" : False,
"preserveReplicationCount" : True,
"preserveBlockSize" : True,
"preservePermissions" : False,
"skipChecksumChecks" : False,
"skipListingChecksumChecks" : False,
"skipTrash" : False,
"replicationStrategy" : "DYNAMIC",
"preserveXAttrs" : False,
"exclusionFilters" : [ ],
"raiseSnapshotDiffFailures" : False
}}])
The example populates the hdfsArguments attributes, such as the source path, destination path, MapReduce service
to use, and others. For the source service, you must provide the HDFS service name and cluster name on the source
Cloudera Manager instance. See the API documentation for the complete list of attributes for
ApiHdfsReplicationArguments.
The schedule list defined above is then submitted to the replications resource to create the HDFS replication schedule,
as sketched below.
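A minimal sketch of that call, assuming the api_client configured in Step 1; the cluster and service names are
placeholders, and create_schedules corresponds to the POST /clusters/{clusterName}/services/{serviceName}/replications
endpoint:
api_instance = cm_client.ReplicationsResourceApi(api_client)
# Create the schedule on the destination cluster's HDFS service
api_response = api_instance.create_schedules("[***destination_cluster_name***]",
    "[***destination_hdfs_service***]", body=body)
pprint(api_response)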
Step 3. Run the Replication Schedule
The replication schedule created in step 2 has a frequency of 1 DAY, so the schedule will run at the initial start time
every day. You can also manually run the schedule using the following:
import cm_client
from cm_client.rest import ApiException
from pprint import pprint
cm_client.configuration.verify_ssl = False
TARGET_CM_HOST = "[***destination_cluster***]"
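A sketch of the manual trigger, assuming an api_client configured as in Step 1; the cluster, schedule, and service
identifiers are placeholders, and run_schedule corresponds to the .../replications/{scheduleId}/run endpoint:
api_instance = cm_client.ReplicationsResourceApi(api_client)
# Trigger the schedule immediately instead of waiting for its next scheduled run
api_response = api_instance.run_schedule("[***cluster_name***]", "[***schedule_id***]",
    "[***service_name***]")
pprint(api_response)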
import cm_client
from cm_client.rest import ApiException
from pprint import pprint
# Configure HTTP basic authorization: basic
cm_client.configuration.username = "[***username***]"
cm_client.configuration.password = "[***password***]"
cm_client.configuration.verify_ssl = False
try:
# Returns a list of commands triggered by a schedule.
api_response = api_instance.read_history([*** cluster_name ***], [*** schedule_id
***], [*** service_name ***], limit=limit, offset=offset, view=view)
pprint(api_response)
except ApiException as e:
print("Exception when calling ReplicationsResourceApi->read_history: %s\n" % e)
try:
# Create a new external account.
api_response = api_instance.create_account(body=body)
pprint(api_response)
except ApiException as e:
print("Exception when calling ExternalAccountsResourceApi->create_account: %s\n" %
e)
CLUSTER_NAME='Cluster-tgt-1'
HDFS_NAME='HDFS-tgt-1'
CLOUD_ACCOUNT='cloudAccount1'
YARN_SERVICE='YARN-1'
hdfs = api_root.get_cluster(CLUSTER_NAME).get_service(HDFS_NAME)
hdfs_cloud_args = ApiHdfsCloudReplicationArguments(None)
hdfs_cloud_args.sourceService = ApiServiceRef(None,
peerName=None,
clusterName=CLUSTER_NAME,
serviceName=HDFS_NAME)
hdfs_cloud_args.sourcePath = '/src/path'
hdfs_cloud_args.destinationPath = 's3a://bucket/target/path/'
hdfs_cloud_args.destinationAccount = CLOUD_ACCOUNT
hdfs_cloud_args.mapreduceServiceName = YARN_SERVICE
The example creates ApiHdfsCloudReplicationArguments, populates it, and creates an HDFS to S3 backup
schedule. In addition to specifying attributes such as the source path and destination path, the example provides
destinationAccount as CLOUD_ACCOUNT and peerName as None in sourceService. The peerName is None since
there is no peer for cloud replication schedules.
hdfs_cloud_args is then used to create an HDFS-S3 replication schedule with a frequency of 1 day, as sketched below.
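A minimal sketch of that call using the older cm_api-style service handle (hdfs) from the example above, assuming the
create_replication_schedule(start, end, interval_unit, interval, paused, arguments) signature; the start time is
illustrative:
import datetime
start = datetime.datetime.now()
# Create a daily schedule (interval unit DAY, interval 1) using the cloud arguments built above
schedule = hdfs.create_replication_schedule(start, None, "DAY", 1, True, hdfs_cloud_args)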
cmd = hdfs.trigger_replication_schedule(schedule.id)
cmd = cmd.wait()
result = hdfs.get_replication_schedule(schedule.id).history[0].hdfsResult
schs = hdfs.get_replication_schedules()
sch = hdfs.get_replication_schedule(schedule_id)
sch = hdfs.delete_replication_schedule(schedule_id)
sch.hdfsArguments.removeMissingFiles = True
sch = hdfs.update_replication_schedule(sch.id, sch)
args = {}
resp = hdfs.collect_replication_diagnostic_data(schedule_id=schedule.id, args=args)
Note: This page contains references to CDH 5 components or features that have been removed from
CDH 6. These references are only applicable if you are managing a CDH 5 cluster with Cloudera Manager
6. For more information, see Deprecated Items.
The distributed copy command, distcp, is a general utility for copying large data sets between distributed filesystems
within and across clusters. You can also use distcp to copy data to and from an Amazon S3 bucket. The distcp
command submits a regular MapReduce job that performs a file-by-file copy.
To see the distcp command options, run the built-in help:
hadoop distcp
Important:
• Do not run distcp as the hdfs user, which is blacklisted for MapReduce jobs by default.
• Do not use Hadoop shell commands (such as cp, copyFromLocal, put, get) for large copying
jobs, or you may experience I/O bottlenecks.
The following section provides the basic syntax for different distcp scenarios. For more information, see the relevant
section on this page.
Copying Between the Same CDH Version
Use the following syntax:
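The form below is a sketch; replace the NameNode hosts, ports, and paths with values for your clusters:
hadoop distcp hdfs://<namenode>:<port>/<source-path> hdfs://<namenode>:<port>/<destination-path>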
For example, the following command copies data from example-source to example-dest:
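A sketch of such a command, assuming example-source and example-dest are the NameNode hosts and the paths are placeholders:
hadoop distcp hdfs://example-source.cloudera.com:8020/user/data hdfs://example-dest.cloudera.com:8020/user/data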
Note the webhdfs prefix for the remote cluster, which should be your source cluster. You must use webhdfs when
the clusters run different major versions. When clusters run the same version, you can use the hdfs protocol for better
performance.
For example, the following command copies data from a lower CDH source cluster named example-source to a
higher CDH version destination cluster named example-dest:
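A sketch, assuming the source NameNode web port is 50070 (adjust the port and paths for your CDH version and data):
hadoop distcp webhdfs://example-source.cloudera.com:50070/user/data hdfs://example-dest.cloudera.com:8020/user/data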
#Copying from S3
hadoop distcp s3a://<bucket>/<data> hdfs://<namenode>/<directory>/
#Copying to S3
hadoop distcp hdfs://<namenode>/<directory> s3a://<bucket>/<data>
This is a basic example of using distcp with S3. For more information, see Using DistCp with Amazon S3 on page 644.
Copying to/from ADLS Gen1 and Gen2
The following syntax for distcp shows how to copy data to/from ADLS Gen1:
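A sketch with placeholder account and path values:
hadoop distcp hdfs://<namenode>:8020/<source-path> adl://<data_lake_store_account>.azuredatalakestore.net/<destination-path>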
To use distcp with ADLS Gen2, use the Gen2 URI instead of the Gen1 URI, for example:
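A sketch with placeholder container, storage account, and path values:
hadoop distcp hdfs://<namenode>:8020/<source-path> abfs://<container>@<storage_account>.dfs.core.windows.net/<destination-path>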
2. In the hdfs-site.xml file in the distcpConf directory, add the nameservice ID for the remote cluster to the
dfs.nameservices property.
Note:
If the remote cluster has the same nameservice ID as the local cluster, change the remote cluster’s
nameservice ID. Nameservice names must be unique.
For example, if the nameservice name for both clusters is nameservice1, change the nameservice
ID of the remote cluster to a different ID, such as externalnameservice:
<property>
<name>dfs.nameservices</name>
<value>nameservice1,externalnameservice</value>
</property>
3. On the remote cluster, find the hdfs-site.xml file and copy the properties that refer to the nameservice ID
to the end of the hdfs-site.xml file in the distcpConf directory you created in step 1:
• dfs.ha.namenodes.<nameserviceID>
• dfs.client.failover.proxy.provider.<remote nameserviceID>
• dfs.ha.automatic-failover.enabled.<remote nameserviceID>
• dfs.namenode.rpc-address.<nameserviceID>.<namenode1>
• dfs.namenode.servicerpc-address.<nameserviceID>.<namenode1>
• dfs.namenode.http-address.<nameserviceID>.<namenode1>
• dfs.namenode.https-address.<nameserviceID>.<namenode1>
• dfs.namenode.rpc-address.<nameserviceID>.<namenode2>
• dfs.namenode.servicerpc-address.<nameserviceID>.<namenode2>
• dfs.namenode.http-address.<nameserviceID>.<namenode2>
• dfs.namenode.https-address.<nameserviceID>.<namenode2>
By default, you can find the hdfs-site.xml file in the /etc/hadoop/conf directory on a node of the remote
cluster.
4. If you changed the nameservice ID for the remote cluster in step 2, update the nameservice ID used in the properties
you copied in step 3 with the new nameservice ID.
The following example shows the properties copied from the remote cluster with the following values:
• A remote nameservice called externalnameservice
• NameNodes called namenode1 and namenode2
• A host named remotecluster.com
<property>
<name>dfs.ha.namenodes.externalnameservice</name>
<value>namenode1,namenode2</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider.externalnameservice</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
<name>dfs.ha.automatic-failover.enabled.externalnameservice</name>
<value>true</value>
</property>
<property>
<name>dfs.namenode.rpc-address.externalnameservice.namenode1</name>
<value>remotecluster.com:8020</value>
</property>
<property>
<name>dfs.namenode.servicerpc-address.externalnameservice.namenode1</name>
<value>remotecluster.com:8022</value>
</property>
<property>
<name>dfs.namenode.http-address.externalnameservice.namenode1</name>
<value>remotecluster.com:20101</value>
</property>
<property>
<name>dfs.namenode.https-address.externalnameservice.namenode1</name>
<value>remotecluster.com:20102</value>
</property>
<property>
<name>dfs.namenode.rpc-address.externalnameservice.namenode2</name>
<value>remotecluster.com:8020</value>
</property>
<property>
<name>dfs.namenode.servicerpc-address.externalnameservice.namenode2</name>
<value>remotecluster.com:8022</value>
</property>
<property>
<name>dfs.namenode.http-address.externalnameservice.namenode2</name>
<value>remotecluster.com:20101</value>
</property>
<property>
<name>dfs.namenode.https-address.externalnameservice.namenode2</name>
<value>remotecluster.com:20102</value>
</property>
At this point, the hdfs-site.xml file in the distcpConf directory should contain the nameservice IDs for both
clusters and the four NameNode entries.
5. Depending on the use case, the options specified when you run the distcp may differ. Here are some examples:
Note: The remote cluster can be either the source or the target. The examples provided specify
the remote cluster as the source.
For example:
If the distcp source or target is in an encryption zone, include the following distcp options: -skipcrccheck
-update. If you do not include these options when the source or target is in an encryption zone, the distcp
command may fail because the CRCs for the files may differ.
For CDH 5.12.0 and later, if you run distcp between clusters that both use HDFS Transparent Encryption, you must
include the exclude parameter.
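A minimal sketch of one such invocation, assuming the distcpConf directory prepared above is used as the client
configuration, the remote nameservice externalnameservice is the source, and both the source and target are in
encryption zones:
export HADOOP_CONF_DIR=<path_to>/distcpConf
hadoop distcp -skipcrccheck -update hdfs://externalnameservice/source/path hdfs://nameservice1/destination/path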
<property>
<name>fs.s3a.access.key</name>
<value>...</value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<value>...</value>
</property>
You can also enter the configurations in the Advanced Configuration Snippet for core-site.xml, which allows
Cloudera Manager to manage this configuration. See Custom Configuration on page 88.
You can also provide the credentials on the command line:
For example:
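A minimal sketch, with placeholder keys, bucket, and paths:
hadoop distcp -Dfs.s3a.access.key=<your_access_key> -Dfs.s3a.secret.key=<your_secret_key> hdfs://<namenode>/<directory> s3a://<bucket>/<data>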
Important: Entering secrets on the command line is inherently insecure. These secrets may be accessed
in log files and other artifacts. Cloudera recommends that you use a credential provider to store
secrets. See Using a Credential Provider to Secure S3 Credentials on page 644.
Note: Using the -diff option with the distcp command requires a DistributedFileSystem on both
the source and destination and is not supported when using distcp to copy data to or from Amazon
S3.
Note: Using a Credential Provider does not work with MapReduce v1 (MRV1).
For example:
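A minimal sketch of provisioning the credentials, assuming an HDFS-backed JCEKS store at a placeholder path:
hadoop credential create fs.s3a.access.key -value <your_access_key> -provider jceks://hdfs/<path_to_credential_store_file>
hadoop credential create fs.s3a.secret.key -value <your_secret_key> -provider jceks://hdfs/<path_to_credential_store_file>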
You can omit the -value option and its value and the command will prompt the user to enter the value.
For more details on the hadoop credential command, see Credential Management (Apache Software
Foundation).
2. Copy the contents of the /etc/hadoop/conf directory to a working directory.
3. Add the following to the core-site.xml file in the working directory:
<property>
<name>hadoop.security.credential.provider.path</name>
<value>jceks://hdfs/path_to_credential_store_file</value>
</property>
4. Set the HADOOP_CONF_DIR environment variable to the location of the working directory:
export HADOOP_CONF_DIR=path_to_working_directory
After completing these steps, you can run the distcp command using the following syntax:
You can also reference the credential store on the command line, without having to enter it in a copy of the
core-site.xml file. You also do not need to set a value for HADOOP_CONF_DIR. Use the following syntax:
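A minimal sketch of both forms, with placeholder paths:
# With the credential provider configured in the copied core-site.xml (HADOOP_CONF_DIR set as in step 4)
hadoop distcp hdfs://<namenode>/<directory> s3a://<bucket>/<data>
# Referencing the credential store directly on the command line
hadoop distcp -Dhadoop.security.credential.provider.path=jceks://hdfs/<path_to_credential_store_file> hdfs://<namenode>/<directory> s3a://<bucket>/<data>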
There are additional options for the distcp command. See DistCp Guide (Apache Software Foundation).
Examples of DistCP Commands Using the S3 Protocol and Hidden Credentials
Copying files to Amazon S3
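A minimal sketch, assuming the S3 credentials are hidden in a credential store that the client configuration
references:
hadoop distcp /user/hdfs/mydata s3a://<bucket>/mydata_backup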
Copying files to Amazon S3 using the -filters option to exclude specified source files
You specify a file name with the -filters option. The referenced file contains regular expressions, one per line,
that define file name patterns to exclude from the distcp job. The pattern specified in the regular expression
should match the fully-qualified path of the intended files, including the scheme (hdfs, webhdfs, s3a, etc.). For
example, the following are valid expressions for excluding files:
hdfs://x.y.z:8020/a/b/c
webhdfs://x.y.z:50070/a/b/c
s3a://bucket/a/b/c
Reference the file containing the filter expressions with the -filters option. For example, a filter file might
contain the following expressions:
.*foo.*
.*/bar/.*
hdfs://x.y.z:8020/tmp/.*
hdfs://x.y.z:8020/tmp1/file1
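A minimal sketch of a distcp invocation that references such a file, assuming it is saved at a placeholder local
path:
hadoop distcp -filters /home/<user>/exclude-list.txt /user/hdfs/mydata s3a://<bucket>/mydata_backup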
For more information about the -filters, -overwrite, and other options, see DistCp Guide: Command Line Options
(Apache Software Foundation).
Note: The following examples use the ADLS Gen1 URI. To use ADLS Gen2 Preview, replace the Gen1
URI with the Gen2 URI:
abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>/<file_name>
1. Configure connectivity to ADLS using one of the methods described in Configuring ADLS Gen1 Connectivity on
page 673 or Configuring ADLS Gen2 Connectivity on page 678.
2. If you are copying data to or from Amazon S3, also configure connectivity to S3 as described above. See Using
DistCp with Amazon S3 on page 644
3. Use the following syntax to define the Hadoop Credstore:
export HADOOP_CONF_DIR=path_to_working_directory
export HADOOP_CREDSTORE_PASSWORD=hadoop_credstore_password
You can also use distcp to copy data between Amazon S3 and Microsoft ADLS.
S3 to ADLS:
ADLS to S3:
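A minimal sketch of both directions, with placeholder bucket, store, and path names:
#S3 to ADLS
hadoop distcp s3a://<bucket>/<data> adl://<store>.azuredatalakestore.net/<path>/
#ADLS to S3
hadoop distcp adl://<store>.azuredatalakestore.net/<path>/ s3a://<bucket>/<data>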
Note that when copying data between these remote filesystems, the data is first copied from the source filesystem
to the local cluster before being copied to the destination filesystem.
<property>
<name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
<value>your_access_key</value>
</property>
Note that in practice, you should never store your Azure access key in cleartext. Protect your Azure credentials
using one of the methods described at Configuring Azure Blob Storage Credentials.
2. Run your distcp jobs using the following syntax:
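A minimal sketch, assuming the wasb scheme and placeholder container, account, and path names:
hadoop distcp wasb://<container>@youraccount.blob.core.windows.net/<path> hdfs://<namenode>/<directory>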
Reference
• Upstream Hadoop documentation on Hadoop Support for Azure
If your environment matches the one described above, use the following table to configure Kerberos delegation tokens
on your cluster so that you can successfully distcp across two secure clusters. Based on the direction of the trust
between the SOURCE and DESTINATION clusters, you can use the
mapreduce.job.hdfs-servers.token-renewal.exclude property to instruct ResourceManagers on either
cluster to skip or perform delegation token renewal for NameNode hosts.
Note: For CDH 5.12.0 and later, you must use the
mapreduce.job.hdfs-servers.token-renewal.exclude parameter if both clusters use the
HDFS Transparent Encryption feature.
Neither SOURCE nor DESTINATION trusts the other: If a common realm is usable (such as Active Directory), set the
mapreduce.job.hdfs-servers.token-renewal.exclude property to a comma-separated list of hostnames of the
NameNodes of the cluster that is not running the distcp job. For example, if you are running the job on the
DESTINATION cluster:
$ hadoop distcp \
-Ddfs.namenode.kerberos.principal.pattern=* \
-Dmapreduce.job.hdfs-servers.token-renewal.exclude=SOURCE-nn-host1,SOURCE-nn-host2 \
hdfs://source-nn-nameservice/source/path \
/destination/path
Important: To use DistCp between two secure clusters in different Kerberos realms, you must use a
single Kerberos principal that can authenticate to both realms. In other words, a Kerberos realm trust
relationship must exist between the source and destination realms. This can be a one-way trust (in
either direction), a bi-directional trust, or even multiple one-way trusts where both the source and
destination realms trust a third realm (such as an Active Directory domain).
If there is no trust relationship between the source and destination realms, you cannot use DistCp to
copy data between the clusters, but you can use Cloudera Backup and Data Recovery (BDR). For more
information, see Enabling Replication Between Clusters with Kerberos Authentication on page 577.
Additionally, both clusters must run a supported JDK version. For information about supported JDK
versions, see Cloudera Enterprise 6 Requirements and Supported Versions.
This section explains how to copy data between two secure clusters in different Kerberos realms:
[realms]
QA.EXAMPLE.COM = {
kdc = kdc01.qa.example.com:88
admin_server = kdc01.qa.example.com:749
}
DEV.EXAMPLE.COM = {
kdc = kdc01.dev.example.com:88
admin_server = kdc01.dev.example.com:749
}
[domain_realm]
.qa.example.com = QA.EXAMPLE.COM
qa.example.com = QA.EXAMPLE.COM
.dev.example.com = DEV.EXAMPLE.COM
dev.example.com = DEV.EXAMPLE.COM
<property>
<name>dfs.namenode.kerberos.principal.pattern</name>
<value>*</value>
</property>
7. Enter a Reason for change, and then click Save Changes to commit the changes.
(If TLS/SSL is enabled) Specify Truststore Properties
The following properties must be configured in the ssl-client.xml file on the client submitting the distcp job to
establish trust between the source and destination clusters.
<property>
<name>ssl.client.truststore.location</name>
<value>path_to_truststore</value>
</property>
<property>
<name>ssl.client.truststore.password</name>
<value>XXXXXX</value>
</property>
<property>
<name>ssl.client.truststore.type</name>
<value>jks</value>
</property>
If launching distcp fails, force Kerberos to use TCP instead of UDP by adding the following parameter to the krb5.conf
file on the client.
[libdefaults]
udp_preference_limit = 1
<property>
<name>ipc.client.fallback-to-simple-auth-allowed</name>
<value>true</value>
</property>
Copying Data between a Secure and an Insecure Cluster using DistCp and WebHDFS
You can use DistCp and WebHDFS to copy data between a secure cluster and an insecure cluster. Note that when doing
this, run the distcp commands from the secure cluster. To do so, complete the following steps:
1. On the secure cluster, set ipc.client.fallback-to-simple-auth-allowed to true in core-site.xml:
<property>
<name>ipc.client.fallback-to-simple-auth-allowed</name>
<value>true</value>
</property>
Alternatively, you can pass this setting as a parameter when you run the distcp command. If you prefer to do that,
skip this setting and continue to step 2.
2. On the insecure cluster, add the secured cluster's realm name to the insecure cluster's configuration:
a. In the Cloudera Manager Admin Console for the insecure cluster, navigate to Clusters > <HDFS cluster>.
b. On the Configuration tab, search for Trusted Kerberos Realms and add the secured cluster's realm name.
Note that this does not require Kerberos to be enabled, but it is a necessary step to allow the simple auth fallback
to happen in the hdfs:// protocol.
c. Save the change.
3. Use commands such as the following from the secure cluster side only:
#This example uses the insecure cluster as the source and the secure cluster as the
destination
distcp webhdfs://<insecure_namenode>:50070 webhdfs://<secure_namenode>:50470
#This example uses the secure cluster as the source and the insecure cluster as the
destination
distcp webhdfs://<secure_namenode>:50470 webhdfs://<insecure_namenode>:50070
#This example uses the insecure cluster as the source and the secure cluster (with TLS
enabled) as the destination cluster. swebhdfs is used instead of webhdfs when TLS is
enabled.
hadoop distcp -D ipc.client.fallback-to-simple-auth-allowed=true
webhdfs://<insecure_namenode>:50070 swebhdfs://<secure_namenode>:50470
#This example uses the secure cluster (with TLS enabled) as the source cluster and the
insecure cluster as the destination. swebhdfs is used instead of webhdfs when TLS is
enabled.
hadoop distcp -D ipc.client.fallback-to-simple-auth-allowed=true
swebhdfs://<secure_namenode>:50470 webhdfs://<insecure_namenode>:50070
Post-migration Verification
After migrating data between the two clusters, it is a good idea to use hadoop fs -ls /basePath to verify the
permissions, ownership and other aspects of your files, and correct any problems before using the files in your new
cluster.
Backing Up Databases
Cloudera recommends that you schedule regular backups of the databases that Cloudera Manager uses to store
configuration, monitoring, and reporting data and for managed services that require a database:
• Cloudera Manager Server - Contains all the information about services you have configured and their role
assignments, all configuration history, commands, users, and running processes. This relatively small database (<
100 MB) is the most important to back up.
Important: When you restart processes, the configuration for each of the services is redeployed
using information saved in the Cloudera Manager database. If this information is not available,
your cluster cannot start or function correctly. You must schedule and maintain regular backups
of the Cloudera Manager database to recover the cluster in the event of the loss of this database.
For more information, see Backing Up Databases on page 652.
• Oozie Server - Contains Oozie workflow, coordinator, and bundle data. Can grow very large.
• Sqoop Server - Contains entities such as the connector, driver, links and jobs. Relatively small.
• Activity Monitor - Contains information about past activities. In large clusters, this database can grow large.
Configuring an Activity Monitor database is only necessary if a MapReduce service is deployed.
• Reports Manager - Tracks disk utilization and processing activities over time. Medium-sized.
• Hive Metastore Server - Contains Hive metadata. Relatively small.
• Hue Server - Contains user account information, job submissions, and Hive queries. Relatively small.
• Sentry Server - Contains authorization metadata. Relatively small.
• Cloudera Navigator Audit Server - Contains auditing information. In large clusters, this database can grow large.
• Cloudera Navigator Metadata Server - Contains authorization, policies, and audit report metadata. Relatively
small.
com.cloudera.cmf.db.name=scm
com.cloudera.cmf.db.user=scm
com.cloudera.cmf.db.password=NnYfWIjlbk
3. Run the following command as root using the parameters from the preceding step:
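A minimal sketch of such a backup command, assuming a MySQL backend and the database name and user shown in the
db.properties excerpt above; the host, port, and output path are placeholders to adjust for your environment:
mysqldump --databases scm --host=<db_host> --port=3306 -u scm -p > /path/to/backups/scm-backup-$(date +%F).sql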
For example, to back up the Activity Monitor database amon created in Creating Databases for Cloudera Software, on
the local host as the root user, with the password amon_password:
To back up the sample Activity Monitor database amon on remote host myhost.example.com as the root user, with
the password amon_password:
For example, to back up the Activity Monitor database amon created in Creating Databases for Cloudera Software, on
the local host as the root user, with the password amon_password:
To back up the sample Activity Monitor database amon on remote host myhost.example.com as the root user, with
the password amon_password:
Component Tasks
Backup and Disaster Recovery
• HDFS Replication To and From Amazon S3
• Hive Replication To and From Amazon S3
Cloudera Navigator
• Cloudera Navigator and S3
• S3 Data Extraction for Navigator
Hue
• How to Enable S3 Cloud Storage
• How to Use S3 as Source or Sink
Hive
• Tuning Apache Hive Performance on the Amazon S3 Filesystem in CDH
Impala
• Using Impala with the Amazon S3 Filesystem
• Specifying Impala Credentials to Access Data in S3
• Specifying Impala Credentials to Access Data in S3 with Cloudera Manager
Spark, YARN, MapReduce, Oozie
• Accessing Data Stored in Amazon S3 through Spark
• Configuring MapReduce to Read/Write with Amazon Web Services
• Configuring Oozie to Enable MapReduce Jobs to Read/Write from Amazon S3
• Using S3 Credentials with YARN, MapReduce, or Spark
• How to Configure a MapReduce Job to Access S3 with an HDFS Credstore on page 663
Cloudera Manager stores these values securely and does not store them in world-readable locations. The credentials
are masked in the Cloudera Manager Admin console, encrypted in the configurations passed to processes managed
by Cloudera Manager, and redacted from the logs.
To access this storage, you define AWS Credentials in Cloudera Manager, and then you add the S3 Connector Service
and configure it to use the AWS credentials.
Consider using the S3Guard feature to address possible issues with the "eventual consistency" guarantee provided by
Amazon for data stored in S3. To use the S3Guard feature, you provision an Amazon DynamoDB for use as an additional
metadata store to improve performance and guarantee that your queries return the most current data. See Configuring
and Managing S3Guard on page 660.
Important:
• If all hosts are configured with IAM Role-based Authentication that allows access to S3 and you
do not want to use S3Guard, you do not need to add the S3 Connector Service.
• When using the More Secure mode, you must have the Sentry service and Kerberos enabled for
the cluster in order to add the S3 Connector Service. For secure operation, Cloudera also recommends
that you enable TLS for Cloudera Manager.
• A cluster cannot use the S3 Connector Service and the ADLS Connector Service at the same time.
You must remove the old connector service before adding a new one. See Removing the ADLS
Connector Service on page 672 or Removing the S3 Connector Service on page 657.
To add the S3 Connector Service using the Cloudera Manager Admin Console:
1. If you have not defined AWS Credentials, add AWS credentials in Cloudera Manager.
2. Go to the cluster where you want to add the Amazon S3 Connector Service.
3. Click Actions > Add Service.
4. Select S3 Connector.
5. Click Continue.
The Add S3 Connector Service to Cluster Name wizard displays.
The wizard checks your configuration for compatibility with S3 and reports any issues. The wizard does not allow
you to continue if you have an invalid configuration. Fix any issues, and then repeat these steps to add the S3
Connector Service.
6. Select a Credentials Protection Policy. (Not applicable when IAM Role-Based Authentication is used.)
Choose one of the following:
• Less Secure
Credentials can be stored in plain text in some configuration files for specific services (currently Hive, Impala,
and Hue) in the cluster.
This configuration is appropriate for unsecure, single-tenant clusters that provide fine-grained access control
for data stored in S3.
• More Secure
Cloudera Manager distributes secrets to a limited set of services (currently Hive, Impala, and Hue) and enables
those services to access S3. It does not distribute these credentials to any other clients or services. See S3
Credentials Security.
Other configurations that are not sensitive, such as the S3Guard configuration, are included in the configuration
of all services and clients as needed.
7. Click Continue.
8. Select previously-defined AWS credentials from the Name drop-down list.
9. Click Continue.
The Restart Dependent Services page displays and indicates the dependent services that need to be restarted.
10. Select Restart Now to restart these services. You can also restart these services later. Hive, Impala, and Hue will
not be able to authenticate with S3 until you restart the services.
11. Click Continue to complete the addition of the Amazon S3 service. If Restart Now is selected, the dependent
services are restarted.
Note: This method of specifying AWS credentials to clients does not completely distribute secrets
securely because the credentials are not encrypted. Use caution when operating in a multi-tenant
environment.
Programmatic
Specify the credentials in the configuration for the job. This option is most useful for Spark jobs.
Make a modified copy of the configuration files
Make a copy of the configuration files and add the S3 credentials:
1. For YARN and MapReduce jobs, copy the contents of the /etc/hadoop/conf directory to a local working
directory under the home directory of the host where you will submit the job. For Spark jobs, copy
/etc/spark/conf to a local directory under the home directory of the host where you will submit the job.
2. Set the permissions for the configuration files appropriately for your environment and ensure that unauthorized
users cannot access sensitive configurations in these files.
3. Add the following to the core-site.xml file within the <configuration> element:
<property>
<name>fs.s3a.access.key</name>
<value>Amazon S3 Access Key</value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<value>Amazon S3 Secret Key</value>
</property>
4. Reference these versions of the configuration files when submitting jobs by running the following command:
• YARN or MapReduce:
• Spark:
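A minimal sketch of these commands, assuming the modified configuration was copied to a directory named
my_custom_config_directory under your home directory:
# YARN or MapReduce
export HADOOP_CONF_DIR=~/my_custom_config_directory
# Spark
export SPARK_CONF_DIR=~/my_custom_config_directory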
Note: If you update the client configuration files from Cloudera Manager, you must repeat these
steps to use the new configurations.
<include xmlns="http://www.w3.org/2001/XInclude"
href="/etc/hadoop/conf/hdfs-site.xml">
<fallback />
</include>
5. Add the following to the core-site.xml file within the <configuration> element:
<property>
<name>fs.s3a.access.key</name>
<value>Amazon S3 Access Key</value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<value>Amazon S3 Secret Key</value>
</property>
6. Reference these versions of the configuration files when submitting jobs by running the following command:
• YARN or MapReduce:
• Spark:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<include xmlns="http://www.w3.org/2001/XInclude"
href="/etc/hadoop/conf/core-site.xml">
<fallback />
</include>
<property>
<name>fs.s3a.access.key</name>
<value>Amazon S3 Access Key</value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<value>Amazon S3 Secret Key</value>
</property>
</configuration>
s3a://bucket_name/path
• HDFS:
hdfs://path
or
/path
For more information about using Impala, Hive, and Spark on S3, see:
• Using Impala with the Amazon S3 Filesystem
• Tuning Apache Hive Performance on the Amazon S3 Filesystem in CDH
• Accessing Data Stored in Amazon S3 through Spark
The following properties configure S3Guard:
• Automatically Create S3Guard Metadata Table (fs.s3a.s3guard.ddb.table.create)
API Name: s3guard_table_auto_create
When Yes is selected, the DynamoDB table that stores the S3Guard metadata is automatically created if it does not
exist. When No is selected and the table does not exist, running the Prune command, queries, or other jobs on S3
will fail.
• S3Guard Metadata Table Name (fs.s3a.s3guard.ddb.table)
API Name: s3guard_table_name
The name of the DynamoDB table that stores the S3Guard metadata. By default, the table is named s3guard-metadata.
• S3Guard Metadata Region Name (fs.s3a.s3guard.ddb.region)
API Name: s3guard_region
The DynamoDB region to connect to for access to the S3Guard metadata. Set this property to a valid region. See
DynamoDB regions.
• S3Guard Metadata Table Read Capacity (fs.s3a.s3guard.ddb.table.capacity.read)
API Name: s3guard_table_capacity_read
Provisioned throughput requirements, in capacity units, for read operations from the DynamoDB table used for the
S3Guard metadata. This value is only used when creating a new DynamoDB table. After the table is created, you can
monitor the throughput and adjust the read capacity using the DynamoDB AWS Management Console. See Provisioned
Throughput.
• S3Guard Metadata Table Write Capacity (fs.s3a.s3guard.ddb.table.capacity.write)
API Name: s3guard_table_capacity_write
Provisioned throughput requirements, in capacity units, for write operations to the DynamoDB table used for the
S3Guard metadata. This value is only used when creating a new DynamoDB table. After the table is created, you can
monitor the throughput and adjust the write capacity as needed using the DynamoDB AWS Management Console. See
Provisioned Throughput.
4. Click Save.
The Connect to Amazon Web Services dialog box displays.
5. To enable cluster access to S3 using the S3 Connector Service, click the Enable for Cluster Name link in the Cluster
Access to S3 section.
Follow the prompts to add the S3 Connector Service. See Adding the S3 Connector Service on page 656 for details.
Note: S3Guard is not supported for Cloud Backup and Restore and Cloudera Navigator Access
to S3.
'Cloudera_Manager_server_URL:port_number/api/vAPI_version_number/externalAccounts/account/Credential_Name/commands/S3GuardPrune'
For example, the following request runs the S3Guard prune command on the data associated with the johnsmith
credential. The response from Cloudera Manager is also displayed (within the curly brackets):
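A minimal sketch of such a request, assuming curl, an administrator account named admin, and API version v19; the
URL path follows the pattern shown above, with placeholder host and port:
curl -X POST -u admin:admin_password 'http://cm-server.example.com:7180/api/v19/externalAccounts/account/johnsmith/commands/S3GuardPrune'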
Python
You can also use a Python script to run the Prune command. See aws.py for the code and usage instructions.
Java
See the Javadoc.
2. Change the permissions of the directory so that only you have access:
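A minimal sketch, assuming the working directory from the previous step is named ~/conf_copy:
chmod -R 700 ~/conf_copy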
3. Add the following to the copy of the core-site.xml file in the working directory:
<property>
<name>hadoop.security.credential.provider.path</name>
<value>jceks://hdfs/user/username/awscreds.jceks</value>
</property>
4. Specify a custom Credstore by running the following command on the client host:
export HADOOP_CREDSTORE_PASSWORD=your_custom_keystore_password
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_CREDSTORE_PASSWORD=your_custom_keystore_password</value>
</property>
<property>
<name>mapred.child.env</name>
<value>HADOOP_CREDSTORE_PASSWORD=your_custom_keystore_password</value>
</property>
<property>
<name>mapreduce.job.redacted-properties</name>
<value>fs.s3a.access.key,fs.s3a.secret.key,yarn.app.mapreduce.am.env,mapred.child.env</value>
</property>
export HADOOP_CONF_DIR=~/path_to_working_directory
You will be prompted to enter the access key and secret key.
8. List the credentials to make sure they were created correctly by running the following command:
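A minimal sketch, assuming the credential store path used in step 3 above:
hadoop credential list -provider jceks://hdfs/user/username/awscreds.jceks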
• distcp
Note: Sqoop import is supported only into S3A (s3a:// protocol) filesystem.
Authentication
You must authenticate to an S3 bucket using Amazon Web Service credentials. There are three ways to pass these
credentials:
• Provide them in the configuration file or files manually.
• Provide them on the sqoop command line.
• Reference a credential store to "hide" sensitive data, so that they do not appear in the console output, configuration
file, or log files.
Amazon S3 Block Filesystem URI example: s3a://bucket_name/path/to/file
S3 credentials can be provided in a configuration file (for example, core-site.xml):
<property>
<name>fs.s3a.access.key</name>
<value>...</value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<value>...</value>
</property>
You can also set up the configurations through Cloudera Manager by adding the configurations to the appropriate
Advanced Configuration Snippet property.
Credentials can be provided through the command line:
For example:
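A minimal sketch, with placeholder connection parameters and bucket name:
sqoop import -Dfs.s3a.access.key=<your_access_key> -Dfs.s3a.secret.key=<your_secret_key> --connect $CONN --username $USER --password $PWD --table $TABLENAME --target-dir s3a://example-bucket/target-directory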
Important: Entering sensitive data on the command line is inherently insecure. The data entered can
be accessed in log files and other artifacts. Cloudera recommends that you use a credential provider
to store credentials. See Using a Credential Provider to Secure S3 Credentials on page 666.
Note: Using a Credential Provider does not work with MapReduce v1 (MRV1).
For example:
You can omit the -value option and its value. When the option is omitted, the command will prompt the user
to enter the value.
For more details on the hadoop credential command, see Credential Management (Apache Software
Foundation).
2. Copy the contents of the /etc/hadoop/conf directory to a working directory.
3. Add the following to the core-site.xml file in the working directory:
<property>
<name>hadoop.security.credential.provider.path</name>
<value>jceks://hdfs/path_to_credential_store_file</value>
</property>
4. Set the HADOOP_CONF_DIR environment variable to the location of the working directory:
export HADOOP_CONF_DIR=path_to_working_directory
After completing these steps, you can run the sqoop command using the following syntax:
Import into a target directory in an Amazon S3 bucket while credentials are stored in a credential store file and its path
is set in the core-site.xml.
sqoop import --connect $CONN --username $USER --password $PWD --table $TABLENAME
--target-dir s3a://example-bucket/target-directory
You can also reference the credential store on the command line, without having to enter it in a copy of the core-site.xml
file. You also do not have to set a value for HADOOP_CONF_DIR. Use the following syntax:
Import into a target directory in an Amazon S3 bucket while credentials are stored in a credential store file and its path
is passed on the command line.
sqoop import
-Dhadoop.security.credential.provider.path=jceks://hdfspath-to-credential-store-file
sqoop import --connect $CONN --username $USER --password $PWD --table $TABLENAME
--target-dir s3a://example-bucket/target-directory
Data from an RDBMS can also be imported into S3 in Sequence or Avro file format.
Parquet import into S3 is also supported if the Parquet Hadoop API based implementation is used, meaning that the
--parquet-configurator-implementation option is set to hadoop. For more information about the Parquet
Hadoop API based implementation, see Importing Data into Parquet Format Using Sqoop.
Example command: Import data into a target directory in an Amazon S3 bucket as Parquet file.
sqoop import --connect $CONN --username $USER --password $PWD --table $TABLENAME
--target-dir s3a://example-bucket/target-directory --as-parquetfile
--parquet-configurator-implementation hadoop
Append Mode
When importing data into a target directory in an Amazon S3 bucket in incremental append mode, the location of the
temporary root directory must be in the same bucket as the directory. For example:
s3a://example-bucket/temporary-rootdir or
s3a://example-bucket/target-directory/temporary-rootdir.
Example command: Import data into a target directory in an Amazon S3 bucket in incremental append mode.
sqoop import --connect $CONN --username $USER --password $PWD --table $TABLE_NAME
--target-dir s3a://example-bucket/target-directory --incremental append --check-column
$CHECK_COLUMN --last-value $LAST_VALUE --temporary-rootdir
s3a://example-bucket/temporary-rootdir
Data from an RDBMS can also be imported into S3 in incremental append mode in Sequence or Avro file format.
Parquet import into S3 in incremental append mode is also supported if the Parquet Hadoop API based implementation
is used, meaning that the --parquet-configurator-implementation option is set to hadoop. For more information
about the Parquet Hadoop API based implementation, see Importing Data into Parquet Format Using Sqoop.
Example command: Import data into a target directory in an Amazon S3 bucket in incremental append mode as Parquet
file.
sqoop import --connect $CONN --username $USER --password $PWD --table $TABLE_NAME
--target-dir s3a://example-bucket/target-directory --incremental append --check-column
$CHECK_COLUMN --last-value $LAST_VALUE --temporary-rootdir
s3a://example-bucket/temporary-rootdir --as-parquetfile
--parquet-configurator-implementation hadoop
Lastmodified Mode
When importing data into a target directory in an Amazon S3 bucket in incremental lastmodified mode, the location
of the temporary root directory must be in the same bucket and in the same directory as the target directory. For
example: s3a://example-bucket/temporary-rootdir in case of s3a://example-bucket/target-directory.
Example command: Import data into a target directory in an Amazon S3 bucket in incremental lastmodified mode.
sqoop import --connect $CONN --username $USER --password $PWD --table $TABLE_NAME
--target-dir s3a://example-bucket/target-directory --incremental lastmodified
--check-column $CHECK_COLUMN --merge-key $MERGE_KEY --last-value $LAST_VALUE
--temporary-rootdir s3a://example-bucket/temporary-rootdir
Parquet import into S3 in incremental lastmodified mode is supported if the Parquet Hadoop API based implementation
is used, meaning that the --parquet-configurator-implementation option is set to hadoop. For more information
about the Parquet Hadoop API based implementation, see Importing Data into Parquet Format Using Sqoop.
Example command: Import data into a target directory in an Amazon S3 bucket in incremental lastmodified mode as
Parquet file.
sqoop import --connect $CONN --username $USER --password $PWD --table $TABLE_NAME
--target-dir s3a://example-bucket/target-directory --incremental lastmodified
--check-column $CHECK_COLUMN --merge-key $MERGE_KEY --last-value $LAST_VALUE
--temporary-rootdir s3a://example-bucket/temporary-rootdir
--as-parquetfile --parquet-configurator-implementation hadoop
Parquet import into an external Hive table backed by S3 is supported if the Parquet Hadoop API based implementation
is used, meaning that the --parquet-configurator-implementation option is set to hadoop. For more information
about the Parquet Hadoop API based implementation, see Importing Data into Parquet Format Using Sqoop.
sqoop import --connect $CONN --username $USER --password $PWD --table $TABLE_NAME
--hive-import --create-hive-table --hs2-url $HS2_URL --hs2-user $HS2_USER --hs2-keytab
$HS2_KEYTAB --hive-table $HIVE_TABLE_NAME --target-dir
s3a://example-bucket/target-directory --external-table-dir
s3a://example-bucket/external-directory
sqoop import --connect $CONN --username $USER --password $PWD --table $TABLE_NAME
--hive-import --create-hive-table --hive-table $HIVE_TABLE_NAME --target-dir
s3a://example-bucket/target-directory --external-table-dir
s3a://example-bucket/external-directory
Create an external Hive table backed by S3 as Parquet file using Hive CLI:
sqoop import --connect $CONN --username $USER --password $PWD --table $TABLE_NAME
--hive-import --create-hive-table --hive-table $HIVE_TABLE_NAME --target-dir
s3a://example-bucket/target-directory --external-table-dir
s3a://example-bucket/external-directory --as-parquetfile
--parquet-configurator-implementation hadoop
sqoop import --connect $CONN --username $USER --password $PWD --table $TABLE_NAME
--hive-import --hs2-url $HS2_URL --hs2-user $HS2_USER --hs2-keytab $HS2_KEYTAB
--target-dir s3a://example-bucket/target-directory --external-table-dir
s3a://example-bucket/external-directory
Import data into an external Hive table backed by S3 using Hive CLI:
sqoop import --connect $CONN --username $USER --password $PWD --table $TABLE_NAME
--hive-import --target-dir s3a://example-bucket/target-directory --external-table-dir
s3a://example-bucket/external-directory
Import data into an external Hive table backed by S3 as Parquet file using Hive CLI:
sqoop import --connect $CONN --username $USER --password $PWD --table $TABLE_NAME
--hive-import --target-dir s3a://example-bucket/target-directory --external-table-dir
s3a://example-bucket/external-directory --as-parquetfile
--parquet-configurator-implementation hadoop
sqoop import
-Dfs.s3a.metadatastore.impl=org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore
-Dfs.s3a.s3guard.ddb.region=$BUCKET_REGION -Dfs.s3a.s3guard.ddb.table.create=true
--connect $CONN --username $USER --password $PWD --table $TABLENAME --target-dir
s3a://example-bucket/target-directory
Component Tasks
DistCp
• Using DistCp with Microsoft Azure (ADLS) on page 646
Hive
• Using Microsoft Azure Data Lake Store (Gen1 and Gen2) with Apache Hive in CDH
Impala
• Using Impala with the Azure Data Lake Store (ADLS)
Oozie
• Configuring Oozie to Enable MapReduce Jobs To Read/Write from Microsoft Azure (ADLS)
Spark, YARN, MapReduce
• Configuring ADLS Gen1 Connectivity on page 673
• Best Practices for Spark Streaming in the Cloud
• Accessing Data Stored in Azure Data Lake Store (ADLS) through Spark
• Using Spark with Azure Data Lake Storage (ADLS)
Sqoop
• Importing Data into Microsoft Azure Data Lake Store Using Sqoop
Important: You cannot use ADLS as a source or destination for Backup and Disaster recovery or to
enable lineage or metadata extraction using Cloudera Navigator.
When you configure credentials using Cloudera Manager, it provides a more secure way to access ADLS using credentials
that are not stored in plain-text files. The client configuration files generated by Cloudera Manager based on configured
services do not include ADLS credentials. Command-line and API clients must manage access to these credentials
outside of Cloudera Manager. Cloudera Manager provides credentials directly to trusted clients such as Hive, the Impala
daemon, and Hue. For access from YARN, MapReduce or Spark, see Configuring ADLS Gen1 Connectivity on page 673.
Important: A cluster cannot use the ADLS Connector Service and the S3 Connector Service at the
same time. You must remove any current connector services before adding a new one. See Removing
the ADLS Connector Service on page 672 or Removing the S3 Connector Service on page 657.
1. In the Cloudera Manager Admin console, go to the cluster where you want to add the ADLS Connector Service.
2. Click Actions > Add Service.
3. Select ADLS Connector.
4. Click Continue.
8. Click Continue.
9. If you have enabled the Hue service, the Additional Configuration for Hue screen displays. Enter the domain name
of the Hue Browser Data Lake Store in the form: store_name.azuredatalakestore.net
10. Click Continue.
The Restart Dependent Services page displays and indicates the dependent services that need to be restarted.
11. Select Restart Now to restart these services. You can also restart these services later.
12. Click Continue to complete the addition of the ADLS Connector Service. If Restart Now is selected, the dependent
services are restarted. The progress of the restart commands displays.
13. When the commands finish executing, click Continue.
14. Click Finish.
You can also delete the ADLS Connector Service from the Cloudera Manager home page for the cluster. See Deleting
Services on page 257.
Microsoft Azure Data Lake Store (ADLS) is a massively scalable distributed file system that can be accessed through an
HDFS-compatible API. ADLS acts as a persistent storage layer for CDH clusters running on Azure. In contrast to Amazon
S3, ADLS more closely resembles native HDFS behavior, providing consistency, file directory structure, and
POSIX-compliant ACLs. See the ADLS documentation for conceptual details.
CDH supports using ADLS as a storage layer for MapReduce2 (MRv2 or YARN), Hive, Hive on Spark, Spark 2.1 and higher,
and Spark 1.6. Other applications are not supported and may not work, even if they use MapReduce or Spark as their
execution engine. Use the steps in this topic to set up a data store to use with these CDH components.
Note the following limitations:
• ADLS is not supported as the default filesystem. Do not set the default file system property (fs.defaultFS) to
an adl:// URI. You can still use ADLS as secondary filesystem while HDFS remains the primary filesystem.
• Hadoop Kerberos authentication is supported, but it is separate from the Azure user used for ADLS authentication.
Important:
While you are creating the service principal, write down the following values, which you will need
in step 4:
• The client id.
• The client secret.
• The refresh URL. To get this value, in the Azure portal, go to Azure Active Directory > App
registrations > Endpoints. In the Endpoints region, copy the OAUTH 2.0 TOKEN ENDPOINT.
This is the value you need for the refresh_URL in step 4.
3. Grant the service principal permission to access the ADLS account. See the Microsoft documentation on
Authorization and access control. Review the section, "Using ACLs for operations on file systems" for information
about granting the service principal permission to access the account.
You can skip the section on RBAC (role-based access control) because RBAC is used for management and you only
need data access.
4. Configure your CDH cluster to access your ADLS account. To access ADLS storage from a CDH cluster, you provide
values for the following properties when submitting jobs:
• Client ID: dfs.adls.oauth2.client.id
• Client Secret: dfs.adls.oauth2.credential
• Refresh URL: dfs.adls.oauth2.refresh.url
There are several methods you can use to provide these properties to your jobs. There are security and other
considerations for each method. Select one of the following methods to access data in ADLS:
• User-Supplied Key for Each Job on page 674
• Single Master Key for Cluster-Wide Access on page 675
• User-Supplied Key stored in a Hadoop Credential Provider on page 675
• Create a Hadoop Credential Provider and reference it in a customized copy of the core-site.xml file for the
service on page 676
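To test the configuration, you can run a simple filesystem command against the store. A minimal sketch, assuming a
placeholder account name and that one of the access methods above is configured:
hadoop fs -ls adl://your_account.azuredatalakestore.net/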
If your configuration is correct, this command lists the files in your account.
2. After successfully testing your configuration, you can access the ADLS account from MRv2, Hive, Hive on Spark,
Spark 1.6, Spark 2.1 and higher, or HBase by using the following URI:
adl://your_account.azuredatalakestore.net
For additional information and examples of using ADLS access with Hadoop components:
• Spark: See Accessing Data Stored in Azure Data Lake Store (ADLS) through Spark
• distcp: See Using DistCp with Microsoft Azure (ADLS) on page 646.
• TeraGen:
export HADOOP_CONF_DIR=path_to_working_directory
export HADOOP_CREDSTORE_PASSWORD=hadoop_credstore_password
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
teragen 1000 adl://jzhugeadls.azuredatalakestore.net/tg
Important: Cloudera recommends that you only use this method for access to ADLS in development
environments or other environments where security is not a concern.
hadoop command
-Ddfs.adls.oauth2.access.token.provider.type=ClientCredential \
-Ddfs.adls.oauth2.client.id=CLIENT ID \
-Ddfs.adls.oauth2.credential='CLIENT SECRET' \
-Ddfs.adls.oauth2.refresh.url=REFRESH URL \
adl://<store>.azuredatalakestore.net/src hdfs://nn/tgt
Important: Cloudera recommends that you only use this method for access to ADLS in development
environments or other environments where security is not a concern.
1. Open the Cloudera Manager Admin Console and go to Cluster Name > Configuration > Advanced Configuration
Snippets.
2. Enter the following in the Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml:
<property>
<name>dfs.adls.oauth2.access.token.provider.type</name>
<value>ClientCredential</value>
</property>
<property>
<name>dfs.adls.oauth2.client.id</name>
<value>CLIENT ID</value>
</property>
<property>
<name>dfs.adls.oauth2.credential</name>
<value>CLIENT SECRET</value>
</property>
<property>
<name>dfs.adls.oauth2.refresh.url</name>
<value>REFRESH URL</value>
</property>
a. Create a password for the Hadoop Credential Provider and export it to the environment:
export HADOOP_CREDSTORE_PASSWORD=password
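A minimal sketch of creating the provider entries, assuming a credential store at the placeholder HDFS path used
later in this section and the ADLS property names shown above:
hadoop credential create dfs.adls.oauth2.client.id -provider jceks://hdfs/user/USER_NAME/adls-cred.jceks -value <CLIENT ID>
hadoop credential create dfs.adls.oauth2.credential -provider jceks://hdfs/user/USER_NAME/adls-cred.jceks -value '<CLIENT SECRET>'
hadoop credential create dfs.adls.oauth2.refresh.url -provider jceks://hdfs/user/USER_NAME/adls-cred.jceks -value <REFRESH URL>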
You can omit the -value option and its value and the command will prompt the user to enter the value.
For more details on the hadoop credential command, see Credential Management (Apache Software
Foundation).
2. Reference the Credential Provider on the command line when submitting jobs:
hadoop command
-Ddfs.adls.oauth2.access.token.provider.type=ClientCredential \
-Dhadoop.security.credential.provider.path=jceks://hdfs/user/USER_NAME/adls-cred.jceks
\
adl://<store>.azuredatalakestore.net/
Create a Hadoop Credential Provider and reference it in a customized copy of the core-site.xml file for
the service
• Advantages: all users can access the ADLS storage
• Disadvantages: you must pass the path to the credential store on the command line.
1. Create a Credential Provider:
a. Create a password for the Hadoop Credential Provider and export it to the environment:
export HADOOP_CREDSTORE_PASSWORD=password
You can omit the -value option and its value and the command will prompt the user to enter the value.
For more details on the hadoop credential command, see Credential Management (Apache Software
Foundation).
2. Copy the contents of the /etc/service/conf directory to a working directory. The service can be one of the
following:
• yarn
• spark
• spark2
Use the --dereference option when copying the file so that symlinks are correctly resolved. For example:
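A minimal sketch, assuming the spark service and a placeholder working directory name:
cp -r --dereference /etc/spark/conf ~/my_custom_config_directory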
<property>
<name>hadoop.security.credential.provider.path</name>
<value>jceks://hdfs/path_to_credential_store_file</value>
</property>
<property>
<name>dfs.adls.oauth2.access.token.provider.type</name>
<value>ClientCredential</value>
</property>
The value of the path_to_credential_store_file should be the same as the value for the --provider option in
the hadoop credential create command described in step 1.
4. Set the HADOOP_CONF_DIR environment variable to the location of the working directory:
export HADOOP_CONF_DIR=path_to_working_directory
export HADOOP_CREDSTORE_PASSWORD=password
You can omit the -value option and its value and the command will prompt the user to enter the value.
For more details on the hadoop credential command, see Credential Management (Apache Software
Foundation).
Important: CDH supports using ADLS Gen2 as a storage layer for MapReduce, Hive on MapReduce,
Hive on Spark, Spark, Oozie, and Impala. Other services such as Hue are not supported and may not
work.
Use the steps in this topic to set up a data store to use with these CDH components.
Connecting your CDH cluster to ADLS Gen2 consists of two parts: configuring an ADLS Gen2 account and connecting
CDH to ADLS Gen2.
Note: The URI scheme for ADLS Gen2 is different from ADLS Gen1. ADLS Gen2 uses the following URI
scheme:
abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>/
By default ADLS Gen2 uses TLS. If you use the fs.azure.always.use.https property to turn off
this behavior, you must specify abfss as the prefix in the URI to use TLS. Otherwise, you can use
abfs.
Limitations
Note the following limitations:
• ADLS is not supported as the default filesystem. Do not set the default file system property (fs.defaultFS) to
an abfss:// URI. You can use ADLS as secondary filesystem while HDFS remains the primary filesystem.
• Hadoop Kerberos authentication is supported, but it is separate from the Azure user used for ADLS authentication.
• Directory and file names should not end with a period. Paths that end in periods can cause inconsistent behavior,
including the period disappearing. For more information, see HADOOP-15860.
For example, run the following command to create a container with a file system (or container) named milton
on the account clouds and the directory path1:
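A minimal sketch, assuming the ABFS connector is permitted to create the remote file system on first use:
hadoop fs -mkdir abfs://milton@clouds.dfs.core.windows.net/path1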
• Name: fs.azure.account.auth.type
Value: OAuth
• Name: fs.azure.account.oauth.provider.type
Value: org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
• Name: fs.azure.account.oauth2.client.id
Value: Provide your <Client_ID>
• Name: fs.azure.account.oauth2.client.secret
Value: Provide your <Client_Secret>
Note: The OAuth 2.0 configuration for ADLS Gen2 must use an Azure Active Directory (v1.0)
endpoint. Microsoft identity platform (v2.0) endpoints are not currently supported. See
https://docs.microsoft.com/en-au/azure/active-directory/develop/azure-ad-endpoint-comparison
for more information about the differences between these two endpoint types.
In addition, you can also provide account-specific keys. To do this, you need to add the following suffix to the key:
.<Account>.dfs.core.windows.net
export HADOOP_CREDSTORE_PASSWORD=password
You can omit the -value option and its value and the command will prompt the user to enter the value.
For more details on the hadoop credential command, see Credential Management (Apache Software
Foundation).
3. Reference the credential provider on the command line when you submit a job:
hadoop <command> \
-Dfs.azure.account.oauth.provider.type=org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider \
-Dhadoop.security.credential.provider.path=jceks://hdfs/user/USER_NAME/adls2keyfile.jceks \
abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>/<file_name>
whereis libssl
2. In the Cloudera Manager Admin Console, search for the following property: Gateway Client Environment Advanced
Configuration Snippet (Safety Valve) for hadoop-env.sh.
3. Add the following parameter to the property:
For example, if the OpenSSL libraries are in /usr/lib64, add the following parameter:
HADOOP_OPTS="-Dorg.wildfly.openssl.path=/usr/lib64 ${HADOOP_OPTS}"
org.wildfly.openssl.SSL init
INFO: WFOPENSSL0002 OpenSSL Version OpenSSL 1.0.1e-fips 11 Feb 2013
The message may differ slightly depending on your operating system and OpenSSL version.
Importing Data into Microsoft Azure Data Lake Store (Gen1 and Gen2) Using Sqoop
Microsoft Azure Data Lake Store (ADLS) is a cloud object store designed for use as a hyper-scale repository for big data
analytic workloads. ADLS acts as a persistent storage layer for CDH clusters running on Azure.
There are two generations of ADLS, Gen1 and Gen2. You can use Apache Sqoop with both generations of ADLS to
efficiently transfer bulk data between these file systems and structured datastores such as relational databases. For
more information on ADLS Gen 1 and Gen 2, see:
• Microsoft ADLS Gen1 documentation
• Microsoft ADLS Gen2 documentation
You can use Sqoop to import data from any relational database that has a JDBC adaptor such as SQL Server, MySQL,
and others, to the ADLS file system.
Note: Sqoop export from the Azure files systems is not supported.
Configuring Sqoop to Import Data into Microsoft Azure Data Lake Storage (ADLS)
Prerequisites
The configuration procedure presumes that you have already set up an Azure account, and have configured an ADLS
Gen1 store or ADLS Gen2 storage account and container. See the following resources for information:
• Microsoft ADLS Gen1 documentation
• Microsoft ADLS Gen2 documentation
• Hadoop Azure Data Lake Support
Authentication
To connect CDH to ADLS with OAuth, you must configure the Hadoop CredentialProvider or core-site.xml directly.
Although configuring the core-site.xml is convenient, it is insecure, because the contents of core-site.xml
configuration file are not encrypted. For this reason, Cloudera recommends using a credential provider. For more
information, see Configuring OAuth in CDH.
You can also pass the credentials by providing them on the Sqoop command line as part of the import command.
sqoop import
-Dfs.azure.account.auth.type=...
-Dfs.azure.account.oauth.provider.type=...
-Dfs.azure.account.oauth2.client.endpoint=...
-Dfs.azure.account.oauth2.client.id=...
-Dfs.azure.account.oauth2.client.secret=...
For example:
sqoop import
-Dfs.azure.account.oauth2.client.endpoint=https://login.microsoftonline.com/$TENANT_ID/oauth2/token
-Dfs.azure.account.oauth2.client.id=$CLIENT_ID
-Dfs.azure.account.oauth2.client.secret=$CLIENT_SECRET
-Dfs.azure.account.auth.type=OAuth
-Dfs.azure.account.oauth.provider.type=org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
If you want to enter a password for the data source, use the -P option in the connection string. If you want to specify
a file where the password is stored, use the --password-file option.
sqoop import
-Dfs.azure.account.auth.type=...
-Dfs.azure.account.oauth.provider.type=...
-Dfs.azure.account.oauth2.client.endpoint=...
-Dfs.azure.account.oauth2.client.id=...
-Dfs.azure.account.oauth2.client.secret=...
--connect... --username... --password... --table... --target-dir... --split-by...
ABFS example:
sqoop import
-Dfs.azure.account.oauth2.client.endpoint=https://login.microsoftonline.com/$TENANT_ID/oauth2/token
-Dfs.azure.account.oauth2.client.id=$CLIENT_ID
-Dfs.azure.account.oauth2.client.secret=$CLIENT_SECRET
-Dfs.azure.account.auth.type=OAuth
-Dfs.azure.account.oauth.provider.type=org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
--connect $CONN --username $USER --password $PWD --table $TABLENAME --target-dir
abfs://$CONTAINER$ACCOUNT.dfs.core.windows.net/$TARGET-DIRECTORY --split-by $COLUMN_NAME
ADLS example:
sqoop import
-Dfs.adl.oauth2.refresh.url=https://login.windows.net/$TENANT_ID/oauth2/token
-Dfs.adl.oauth2.client.id=$CLIENT_ID
-Dfs.adl.oauth2.credential=$CLIENT_SECRET
-Dfs.adl.oauth2.access.token.provider.type=ClientCredential
--connect $CONN --username $USER --password $PWD --table $TABLENAME --target-dir
adl://$TARGET-ADDRESS/$TARGET-DIRECTORY --split-by $COLUMN_NAME
For more information about GCS such as access control and security, see the GCS documentation.
Before you start, review the supported services and limitations.
Supported Services
CDH with GCS as the storage layer supports the following services:
• Hive
• MapReduce
• Spark
Limitations
Note the following limitations with GCS support in CDH:
• Cloudera Manager’s Backup and Disaster Recovery and other management features, such as credential management,
do not support GCS.
• GCS cannot be the default file system for the cluster.
• Services not explicitly listed under the Supported Services on page 684 section, such as Impala, Hue, and Cloudera
Navigator, are not supported.
This section describes how to add the GCS-related properties and distribute them to every node in the cluster.
Alternatively, you can supply them on a per-job basis.
Complete the following steps to add the GCS information to every node in the cluster:
1. In the Cloudera Manager Admin Console, search for the following property that corresponds to the HDFS Service
you want to use with GCS: Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml.
2. Add the following properties:
• Name: fs.gs.impl
Value: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
• Name: fs.AbstractFileSystem.gs.impl
Value: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
• Name: fs.gs.project.id
Value: <project_id>
• Name: fs.gs.auth.service.account.email
Value <client_email>
• Name: fs.gs.auth.service.account.enable
Value: true
• Name: fs.gs.auth.service.account.private.key.id
Value: <private_key_id>
• Name: fs.gs.auth.service.account.private.key
Value: <private_key>
5. Run the following command to see if you can access GCS with the name of an existing bucket in place of
<existing-bucket>:
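A minimal sketch of such a check, with a placeholder bucket name:
hadoop fs -ls gs://<existing-bucket>/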
Share Nothing
In a share nothing architecture, both management and data are completely separate. Nothing is shared between
clusters. This architecture does not provide the benefits of multitenancy, but IT teams might find it appropriate based
on specific operational realities and governance policies.
For example, specific company contracts or policies might force an enterprise IT team to use this model. Another
common example is security and data privacy mandates that restrict the transfer of data sets across geographical
boundaries.
Share Management
A shared management model offers the benefits of reduced operational overhead without sharing cluster resources
or data between groups. This approach is a middle ground, granting some of the benefits of multitenancy while
maintaining isolation at the cluster level. This is the preferred choice for environments where full multitenancy is not
appropriate. For example, enterprises commonly employ this model for purpose-built, high-priority clusters that cannot
risk any performance issues or resource contention, such as an ad serving system or retail personalization, “next offer”
engine. While a multitenant EDH always faces the risks of misconfigured applications or malicious users, this model
mitigates these risks at the cost of data sharing and resource pooling.
Share Data
The shared resource model uses full multitenancy, with all the benefits, from consolidated management to shared data
and resources. It represents the desired end state for many EDH operators. For example, a biotechnology firm can
harness the entire body and insight of research, trial data, and individual perspectives from all its research teams and
other departments by employing a full multitenant EDH to store and analyze its information, greatly accelerating
innovation through transparency and accessibility.
Configuring Security
Once you settle on an isolation model, you can choose the security elements to support your model.
Security for Hadoop is clearly critical in both single tenant and multitenant environments. It establishes the foundation
for trusted data and usage among the various actors in the business environment. Without such trust, enterprises
cannot rely on the resources and information when making business-critical decisions, which in turn undermines the
benefits of operational consolidation and the decreased friction of shared data and insights. Cloudera’s EDH provides
a rich set of tools and frameworks for security. Key elements of this strategy and its systems include:
• Authentication, which proves users are who they say they are.
• Authorization, which determines what users are allowed to see and do.
• Auditing, which determines who did what, and when.
• Data Protection, which encrypts data-at-rest and in-motion.
Cloudera’s EDH offers additional tools, such as network connectivity management and data masking. For further
information on how IT teams can enable enterprise-grade security measures and policies for multitenant clusters, see
Securing Your Enterprise Hadoop Ecosystem. In the context of multitenant administration, security requirements should
also include:
• Delegating Security Management
• Managing Auditor Access
• Managing Data Visibility
Enterprises often adhere to the best practice of “least privilege” and restrict operational access to the minimum data and activity
set required. For these cases, Cloudera Navigator provides a data auditor role that partitions the management rights
to the cluster so that administrators can grant the audit team access only to the informational data needed, mitigating
the impact to operations and security. This approach also answers the common request of audit teams to simplify and
focus the application user interface.
For more information, see Cloudera Navigator Auditing.
Managing Resources
Hadoop's batch processing engine, MapReduce, provides a scheduler framework that administrators can configure
so that multiple simultaneous user jobs share physical resources. In particular, many production environments run
the Fair Scheduler successfully, achieving high utilization while enforcing SLAs through assigned resource minimums.
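As an illustration, per-tenant minimums and weights are expressed in the Fair Scheduler allocation file
(fair-scheduler.xml). The following is a minimal sketch; the queue names and resource figures are hypothetical and
would need to reflect your own tenants and SLAs:

    <?xml version="1.0"?>
    <allocations>
      <!-- Guaranteed floor for a high-priority tenant; spare capacity is shared by weight -->
      <queue name="tenant_a">
        <minResources>20000 mb,20 vcores</minResources>
        <weight>2.0</weight>
      </queue>
      <queue name="tenant_b">
        <minResources>10000 mb,10 vcores</minResources>
        <weight>1.0</weight>
      </queue>
    </allocations>

The minResources settings back each tenant's SLA with a guaranteed floor, while the weights govern how idle capacity
is divided when both queues are busy.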
For more information, see Configuring the Fair Scheduler.
Managing Quotas
Fair distribution of resources is essential for keeping tenants happy and productive.
While resource management systems ensure appropriate access and minimum resource amounts for running
applications, IT administrators must also carefully govern cluster resources in terms of disk usage in a multitenant
environment. Like resource management, disk management is a balance of business objectives and requirements
across a range of user communities. The underlying storage foundation of a Hadoop-based EDH, the Hadoop Distributed
File System (HDFS), supports quota mechanisms that administrators use to manage space usage by cluster tenants.
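For example, assuming a per-tenant directory layout such as /user/tenant_a (a hypothetical path), an administrator
can cap both the number of file and directory names and the raw space a tenant may consume, and then report on
current usage:

    hdfs dfsadmin -setQuota 1000000 /user/tenant_a
    hdfs dfsadmin -setSpaceQuota 20t /user/tenant_a
    hadoop fs -count -q -h /user/tenant_a

Because the space quota is charged against replicated block storage, a 20 TB quota with the default replication factor
of three holds roughly 6.6 TB of user data.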
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
3. Grant of Patent License.
Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide,
non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims
licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their
Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against
any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated
within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under
this License for that Work shall terminate as of the date such litigation is filed.
4. Redistribution.
You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You meet the following conditions:
1. You must give any other recipients of the Work or Derivative Works a copy of this License; and
2. You must cause any modified files to carry prominent notices stating that You changed the files; and
3. You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark,
and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part
of the Derivative Works; and
4. If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute
must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices
that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE
text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along
with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party
notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify
the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or
as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be
construed as modifying the License.
You may add Your own copyright statement to Your modifications and may provide additional or different license
terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as
a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated
in this License.
5. Submission of Contributions.
Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the
Licensor shall be under the terms and conditions of this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement
you may have executed with Licensor regarding such Contributions.
6. Trademarks.
This License does not grant permission to use the trade names, trademarks, service marks, or product names of the
Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing
the content of the NOTICE file.
7. Disclaimer of Warranty.
Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides
its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied,
including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or
FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or
redistributing the Work and assume any risks associated with Your exercise of permissions under this License.
8. Limitation of Liability.
In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required
by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable
to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising
as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss
of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even
if such Contributor has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability.
While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance
of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in
accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any
other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional
liability.
END OF TERMS AND CONDITIONS
http://www.apache.org/licenses/LICENSE-2.0