Module-2 PPT-1

Lesson 1

Hadoop

• Hadoop is an open-source software framework for storing data
and running applications on clusters of commodity hardware.

• It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.
• Hadoop Ecosystem is a platform or a suite which provides
various services to solve the big data problems.
• It includes Apache projects and various commercial tools and
solutions.
• There are four major elements of Hadoop
i.e. HDFS, MapReduce, YARN, and Hadoop Common.
• Most of the tools or solutions are used to supplement or
support these major elements.
• All these tools work collectively to provide services such as absorption (ingestion), analysis, storage, and maintenance of data.
Following are the components that collectively form a Hadoop
ecosystem:
•HDFS: Hadoop Distributed File System
•YARN: Yet Another Resource Negotiator
•MapReduce: Programming based Data Processing
•Spark: In-Memory data processing
•PIG, HIVE: Query based processing of data services
•HBase: NoSQL Database
•Mahout, Spark MLLib: Machine Learning algorithm libraries
•Solr, Lucene: Searching and Indexing
•ZooKeeper: Cluster management
•Oozie: Job Scheduling
Hadoop Platform
• Provides a low-cost, open-source Big Data platform that can also use cloud services
Hadoop
• Terabytes of data can be processed in just a few minutes
• Hadoop enables distributed processing of large datasets (above 10 million bytes) across clusters of computers using a programming model called MapReduce.
Hadoop Distributed File System

Hadoop Data Storage


Hadoop Physical Organisation
Map Reduce Programming Model
YARN based Execution Model
Hadoop System Characteristics
• Scalable
• Self-manageable
• Self-healing
• Distributed file system

Scalability
• Means the system can be scaled up (enhanced) by adding storage and processing units as per the requirements
Self Manageability
• Means that storage and processing resources are created, used, scheduled, and increased or reduced by the system itself
Self Healing
• Means that faults are taken care of by the system itself
• Enables continued functioning and availability of resources
• Software detects and handles failures at the task level, and also enables task execution to continue in case of a communication failure.
Figure 2.1 Core components of Hadoop

•Hadoop Common: The common utilities that support the
other Hadoop modules.
•Hadoop Distributed File System (HDFS): A distributed file
system that provides high-throughput access to application
data.
•Hadoop YARN: A framework for job scheduling and cluster
resource management.
•Hadoop MapReduce: A YARN-based system for parallel
processing of large data sets.
Features of Hadoop Which Makes It Popular
1. Open Source
2. Highly Scalable Cluster
3. Fault Tolerance is Available
4. High Availability is Provided
5. Cost-Effective
6. Hadoop Provides Flexibility
7. Easy to Use
8. Provides Faster Data Processing
Hadoop ecosystem:
•HDFS: Hadoop Distributed File System
•YARN: Yet Another Resource Negotiator
•MapReduce: Programming based Data Processing
•Spark: In-Memory data processing
•PIG, HIVE: Query based processing of data services
•HBase: NoSQL Database
•Mahout, Spark MLLib: Machine Learning algorithm libraries
•Solr, Lucene: Searching and Indexing
•ZooKeeper: Cluster management
•Oozie: Job Scheduling
Figure 2.2 Hadoop main components and
ecosystem components

End of Lesson 1 on
Hadoop

MODULE 2
Chapter 3: Hadoop Distributed File System Basics

Prepared By: Mrs. SHWETHA C H


HDFS Design Features

▪ The HDFS was designed for Big Data processing.

▪ Distributed and parallel computation.


▪ HDFS is not designed as a true parallel file system.

▪ Design assumes a large file write-once/read-many model.

▪ HDFS rigorously restricts data writing to one user at a time.

▪ All additional writes are append-only; there is no random writing to HDFS files.

▪ HDFS design is based on the Google File System (GFS).

▪ HDFS is designed for data streaming, where large amounts of data are read from disk in bulk.

▪ HDFS block size is typically 64 MB to 128 MB.

▪ MapReduce emphasizes moving the computation to the data.

▪ A single server node in the cluster is often both a computation engine and a storage engine for the application.

▪ HDFS has a redundant design that can tolerate system failures.


Discuss the important aspects of HDFS (5 Marks)
HDFS Components

⚫ Explain HDFS components with a neat diagram. (8 Marks)


or
⚫ Explain various system roles in an HDFS deployment. (8 Marks)
HDFS Components
• The design of HDFS is based on two types of nodes

• A Single Name Node


• Multiple DataNode
• Single NameNode manages all the metadata

• No data on the NameNode.

• Minimal Adobe installation

• Master/Slave architecture

• File system namespace operations

• Mapping of blocks to DataNodes

• Slaves serving read and write request


• NameNode: block creation, deletion and replication

• Example – client write Data

• File block Replication

• Acknowledgement

• Reading Data

• Block Report
How a Client Writes Data to DataNodes
How a Client Reads Data from DataNodes
For Performance Reasons

• The mappings between data blocks and physical DataNodes are not kept in persistent storage on the NameNode. (No local caching mechanism)
• The NameNode stores all metadata in memory.
• Block reports are sent every 10 heartbeats.
• In almost all Hadoop deployments, there is a SecondaryNameNode (CheckpointNode).
• It is not an active failover node and cannot replace the primary NameNode in case of its failure.
Secondary NameNode (SNN)
[checkpoint]
Thus the various important roles in HDFS are:
• HDFS uses a master/slave model designed for large file reading or streaming.

• The NameNode is a metadata server or “Data traffic cop”.

• HDFS provides a single namespace that is managed by the NameNode.

• Data is redundantly stored on DataNodes ; there is no data on NameNode.

• SecondaryNameNode performs checkpoints of NameNode file system’s state but is not

a failover node.
HDFS Block replication
• HDFS is a reliable system; it stores multiple copies of data.

• What if a data block corrupts or fails?
• - There is a chance of losing the data permanently.

• How to overcome this problem?
• - Block replication
• When HDFS writes a file, it is replicated across the cluster.

• The amount of replication is based on the value of dfs.replication in the hdfs-site.xml file.

• In Hadoop clusters containing more than eight DataNodes, the replication value is usually set to 3.

• In a Hadoop cluster of eight or fewer DataNodes but more than one DataNode, a replication factor of 2 is adequate.

• For a single machine, like a pseudo-distributed installation, the replication factor is set to 1.

• The HDFS default block size is often 64 MB. In a typical OS, the block size is 4 KB or 8 KB.
• The figure above provides an example of how a file is broken into blocks and replicated across the
cluster.
• In this case, a replication factor of 3 ensures that any one DataNode can fail, the replicated blocks will be available on other nodes, and the blocks will subsequently be re-replicated on other DataNodes.
HDFS Safe Mode
⚫ When the NameNode starts, it enters a read-only safe mode, where blocks cannot be replicated or deleted.

⚫ Safe mode enables the NameNode to perform two important processes:

1. The previous file system state is reconstructed by loading the fsimage file into memory and replaying the edit log.

2. The mapping between blocks and DataNodes is created by waiting for enough of the DataNodes to register so that at least one copy of the data is available.
⚫ HDFS may also enter safe mode for maintenance using the hdfs dfsadmin -safemode command.
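For reference, safe mode can also be queried and toggled explicitly with dfsadmin (a minimal sketch; these commands require HDFS administrator privileges):

$ hdfs dfsadmin -safemode get
$ hdfs dfsadmin -safemode enter
$ hdfs dfsadmin -safemode leave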


Rack Awareness
• It deals with data locality.
• One of the main design goals of Hadoop MapReduce is to move the computation to the data.
• When the YARN scheduler is assigning MapReduce containers to work as mappers, it will try to place containers first on the local machine, then on the same rack, and finally on another rack.
• In addition, the NameNode tries to place replicated data blocks on multiple racks for improved fault tolerance.

• In such a case, an entire rack failure will not cause data loss or stop HDFS from
working.

• HDFS can be made rack-aware by using a user-derived script that enables the
master node to map the network topology of the cluster.

• A default Hadoop installation assumes all the nodes belong to the same rack.
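As an illustration only, such a topology script is just an executable that prints one rack name for each host name or IP address passed to it. A minimal sketch (the rack-map.txt file and its location are hypothetical; the script is registered through the net.topology.script.file.name property in core-site.xml):

#!/bin/bash
# Toy rack-awareness script: look up each host argument in a local map file
# containing "host rack" pairs; print /default-rack for unknown hosts.
MAP=/etc/hadoop/conf/rack-map.txt
for host in "$@"; do
  rack=$(awk -v h="$host" '$1 == h {print $2}' "$MAP")
  echo "${rack:-/default-rack}"
done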
NameNode High Availability
(Figure labels: scalability, better performance)
⚫ The HDFS BackupNode maintains an up-to-date copy of the file system namespace both in memory and on disk.
⚫ A NameNode supports one BackupNode at a time.
HDFS snapshots
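HDFS snapshots are read-only, point-in-time copies of a directory. A minimal sketch (the /data directory is hypothetical): an administrator first marks the directory snapshottable, after which snapshots can be created, listed, and removed:

$ hdfs dfsadmin -allowSnapshot /data
$ hdfs dfs -createSnapshot /data backup-1
$ hdfs dfs -ls /data/.snapshot
$ hdfs dfs -deleteSnapshot /data backup-1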
HDFS User Commands

⚫ The preferred way to interact with HDFS in Hadoop version 2 is through the hdfs command.
List Files in HDFS
❖ To list the files in the root HDFS directory, enter the following

command:

• Syntax: $ hdfs dfs -ls /


• Output:
• Found 2 items
• drwxrwxrwx - yarn hadoop 0 2015-04-29 16:52 /app-logs
• drwxr-xr-x - hdfs hdfs 0 2015-04-21 14:28 /apps

• To list files in your home directory, enter the following command:


• Syntax: $ hdfs dfs -ls
• Output:
• Found 2 items
• drwxr-xr-x - hdfs hdfs 0 2015-05-24 20:06 bin
• drwxr-xr-x - hdfs hdfs 0 2015-04-29 16:52 examples
Make a Directory in HDFS

❖ To make a directory in HDFS, use the following command. As with the -ls command, when no path is supplied, the user's home directory is used.

• Syntax: $ hdfs dfs -mkdir stuff
Copy Files to HDFS

❖ To copy a file from your current local directory into HDFS, use the
following command. If a full path is not supplied, your home directory is
assumed. In this case, the file test is placed in the directory stuff that was
created previously.
• Syntax: $ hdfs dfs -put test stuff

❖ The file transfer can be confirmed by using the -ls


command:

❖ Syntax: $ hdfs dfs -ls stuff

• Output:

• Found 1 items

• -rw-r--r-- 2 hdfs hdfs 12857 2015-05-29 13:12 stuff/test


• Copy Files from HDFS
❖ Files can be copied back to your local file system using the following command.

❖ In this case, the file we copied into HDFS, test, will be copied back to the current local
directory with the name test-local.

• Syntax: $ hdfs dfs -get stuff/test test-local

• Copy Files within HDFS

• The following command will copy a file in HDFS:

• Syntax: $ hdfs dfs -cp stuff/test test.hdfs

• Delete a File within HDFS

❖ The following command will delete the HDFS file test.hdfs that was created previously:

• Syntax: $ hdfs dfs -rm test.hdfs
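Note that, depending on the cluster configuration, deleted files may first be moved to a .Trash directory rather than removed immediately; as an illustration, the -skipTrash option bypasses this behaviour:

• Syntax: $ hdfs dfs -rm -skipTrash test.hdfs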
• Get an HDFS Status Report
• $ hdfs dfsadmin -report
• Configured Capacity: 1503409881088 (1.37 TB)
• Present Capacity: 1407945981952 (1.28 TB)
• DFS Remaining: 1255510564864 (1.14 TB)
• DFS Used: 152435417088 (141.97 GB)
• DFS Used%: 10.83%
• Under replicated blocks: 54
• Blocks with corrupt replicas: 0
• Missing blocks: 0
Using the Web GUI to Monitor
Examples

• This section provides an illustration of using the YARN ResourceManager web GUI to monitor and find information about YARN jobs.

• The Hadoop version 2 YARN ResourceManager web GUI differs significantly from the MapReduce web GUI found in Hadoop version 1.
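For example, the ResourceManager UI can be opened in a browser (assuming the default web port 8088; use your ResourceManager host name in place of "limulus"):

• $ firefox http://limulus:8088/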
Module 2
Chapter 3
Hadoop ecosystem

• Pig: It is a procedural language platform used to develop scripts for MapReduce operations.
• Sqoop: It is used to import and export data to and from HDFS and RDBMS.
• HBase: HBase is a distributed column-oriented database built on top of the Hadoop file system.
• Hive: It is a platform used to develop SQL-type scripts to do MapReduce operations.
• Flume: Used to handle streaming data on top of Hadoop.
• Oozie: Apache Oozie is a workflow scheduler for Hadoop.
Introduction to Pig

• Pig raises the level of abstraction for processing large amounts of data.
• It is a fundamental platform for analyzing large data sets, which consists of a high-level language for expressing data analysis programs.
• It is an open-source platform developed by Yahoo!.
Usage of Apache Pig
▪ Apache Pig is a high-level language that enables programmers to write complex MapReduce transformations using a simple scripting language.
▪ Pig's simple SQL-like scripting language is called Pig Latin, and appeals to developers already familiar with scripting languages and SQL.
▪ Pig Latin (the actual language) defines a set of transformations on a data set such as aggregate, join, and sort.
▪ Pig is often used for extract, transform, and load (ETL) data pipelines and for quick research on raw data.
▪ Apache Pig has several usage modes. The first is a local mode in which all processing is done on the local machine.
▪ The non-local (cluster) modes are MapReduce and Tez.
▪ These modes execute the job on the cluster using either the MapReduce engine or the optimized Tez engine, as shown in the command sketch below.
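A minimal sketch of selecting the execution mode on the command line (myscript.pig is a hypothetical script file, and the Tez mode assumes Tez is installed on the cluster):

$ pig -x local myscript.pig
$ pig -x mapreduce myscript.pig
$ pig -x tez myscript.pig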
MapReduce Parallel Data Flow

The basic steps are (a rough UNIX-pipeline analogy follows the list):

1) Input Splits
2) Map Step
3) Combiner Step
4) Shuffle Step
5) Reduce Step
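As a rough analogy only (not Hadoop code), the flow can be mimicked with a UNIX pipeline that counts words in a hypothetical input.txt: splitting the input into words corresponds to the map step, sorting groups identical keys like the shuffle step, and counting each group corresponds to the reduce step:

$ cat input.txt | tr -s ' ' '\n' | sort | uniq -c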
Pig Example Walk-Through:

▪ This section builds working knowledge of Pig through hands-on experience of creating Pig scripts to carry out essential data operations and tasks.

▪ In this simple example, Pig is used to extract user names from the
/etc/passwd file.

▪ The following example assumes the user is hdfs , but any valid user with access to HDFS can run the
example.

• To begin, copy the passwd file to a working directory for local Pig operation:

$ cp /etc/passwd .
▪ Next, copy the data file into HDFS for Hadoop MapReduce operation:

▪ $ hdfs dfs -put passwd passwd

▪ Confirm the file is in HDFS by entering the following command:

▪ $ hdfs dfs -ls passwd
▪ -rw-r--r--  2 hdfs hdfs  2526 2015-03-17 11:08 passwd

▪ In local Pig operation, all processing is done on the local machine (Hadoop is not used). First, the interactive command line is started: $ pig -x local

▪ If Pig starts correctly, you will see a grunt> prompt.

▪ You may also see a number of INFO messages. Next, enter the commands to load the passwd file and then grab the user name and dump it to the terminal.
▪ Pig commands must end with a semicolon (;).
▪ grunt> A = load 'passwd' using PigStorage(':');
▪ grunt> B = foreach A generate $0 as id;
▪ grunt> dump B;

▪ The processing will start and a list of user names will be printed to
the screen.

▪ To exit the interactive session, enter the command quit.

▪ grunt> quit
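The same three Pig Latin statements can also be run non-interactively by placing them in a script file (a minimal sketch; id.pig is a hypothetical file containing the load, foreach, and dump statements above):

$ pig -x local id.pig
$ pig -x mapreduce id.pig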
USING APACHE HIVE
• Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data
summarization, ad hoc queries, and the analysis of large data sets using a SQL-like language
called HiveQL.
• Hive is considered the de facto standard for interactive SQL queries over petabytes of data
using Hadoop and offers the following features:

1. Tools to enable easy data extraction, transformation, and loading


(ETL)
2. A mechanism to impose structure on a variety of data formats
3. Access to files stored either directly in HDFS or in other data
storage systems such as HBase
4. Query execution via MapReduce and Tez (optimized MapReduce)
• Hive provides users who are already familiar with SQL the capability to query the data on
Hadoop clusters.
To start Hive, simply enter the hive command. If Hive starts correctly, you should get a hive> prompt.
$ hive
(some messages may show up here)
hive>
As a simple test, create and drop a table. Note that Hive commands must end
with a semicolon (;).
hive> CREATE TABLE pokes (foo INT, bar STRING);
OK
Time taken: 1.705 seconds
hive> SHOW TABLES;
OK
pokes
Time taken: 0.174 seconds, Fetched: 1 row(s)
hive> DROP TABLE pokes;
OK
Time taken: 4.038 seconds
First, create a table using the following command:
hive> CREATE TABLE logs(t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';
OK
Time taken: 0.129 seconds
Next, load the data—in this case, from the sample.log file.
Note that the file is found in the local directory and not in HDFS.
hive> LOAD DATA LOCAL INPATH 'sample.log' OVERWRITE INTO
TABLE logs;
Loading data to table default.logs
Table default.logs stats: [numFiles=1, numRows=0, totalSize=99271,
rawDataSize=0]
OK
Time taken: 0.953 seconds
Finally, apply the select step to the file. Note that this invokes a Hadoop MapReduce operation:
hive> SELECT t4 AS sev, COUNT(*) AS cnt FROM logs WHERE t4 LIKE '[%' GROUP BY t4;
Query ID = hdfs_20150327130000_d1e1a265-a5d7-4ed8-b785-2c6569791368
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1427397392757_0001, Tracking URL =
http://norbert:8088/proxy/application_1427397392757_0001/
Kill Command = /opt/hadoop-2.6.0/bin/hadoop job -kill
Cumulative CPU 4.07 sec
MapReduce Total cumulative CPU time: 4 seconds 70 msec
Ended Job = job_1427397392757_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.07 sec HDFS Read: 106384 HDFS Write: 63 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 70 msec
OK
[DEBUG] 434
[ERROR] 3
[FATAL] 1
[INFO] 96
[TRACE] 816
[WARN] 4
Time taken: 32.624 seconds, Fetched: 6 row(s)
To exit Hive, simply type exit;:
hive> exit;
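Hive can also be run non-interactively, either with an inline query or with a script file (shown as an illustration; query.hql is a hypothetical file of HiveQL statements):

$ hive -e 'SHOW TABLES;'
$ hive -f query.hql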
Apache Sqoop to Acquire Relational data with an example.

▪ Sqoop is a tool designed to transfer data between Hadoop and


relational databases.

▪ Sqoop is used to
• import data from a relational database management system (RDBMS) into the Hadoop Distributed File System (HDFS),
• transform the data in Hadoop, and
• export the data back into an RDBMS.


Sqoop import method:
• The data import is done in two steps:
1) Sqoop examines the database to gather the necessary metadata for the data to be imported.
2) A map-only Hadoop job transfers the actual data using the metadata.

▪ The imported data are saved in an HDFS directory.


▪ Sqoop will use the database name for the directory, or the user can
specify any alternative directory where the files should be populated.
▪ By default, these files contain comma delimited fields, with new lines
separating different records.
Sqoop export method:
• Data export from the cluster works in a similar fashion. The export is done in two steps:

1) Examine the database for metadata.

2) A map-only Hadoop job writes the data to the database.

• Sqoop divides the input data set into splits, then uses individual map tasks to
push the splits to the database
Example: The following example shows the use of sqoop:

• Steps:
1. Download Sqoop.
2. Download and load sample MySQL data.
3. Add Sqoop user permissions for the local machine and cluster.
4. Import data from MySQL to HDFS.
5. Export data from HDFS to MySQL.
Step 1: Download Sqoop and Load Sample MySQL Database

• To install sqoop,

• # yum install sqoop sqoop-metastore

• To download database,

• $ wget http://downloads.mysql.com/docs/world_innodb.sql.gz


Next, log into MySQL (assuming you have privileges to create a database) and import the desired database by following these steps:
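A minimal sketch of these steps (assuming the downloaded archive is first unpacked with gunzip and that your MySQL account has the needed privileges):

$ gunzip world_innodb.sql.gz
$ mysql -u root -p
mysql> CREATE DATABASE world;
mysql> USE world;
mysql> SOURCE world_innodb.sql;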
The following MySQL commands will let you see the table details:
mysql> SHOW CREATE TABLE Country;
mysql> SHOW CREATE TABLE City;
mysql> SHOW CREATE TABLE CountryLanguage;
Step 2: Add Sqoop User Permissions for the Local Machine and Cluster.
• In MySQL, add the following privileges for the user sqoop:

•mysql> GRANT ALL PRIVILEGES ON world.* To 'sqoop'@'limulus'


IDENTIFIED BY 'sqoop';

•mysql> GRANT ALL PRIVILEGES ON world.* To 'sqoop'@'10.0.0.%'


IDENTIFIED BY 'sqoop';

• mysql> quit
Next, log in as sqoop to test the permissions:
$ mysql -u sqoop -p
mysql> USE world;
mysql> SHOW TABLES;
+---------------------+
| Tables_in_world |
+---------------------+
| City |
| Country |
| CountryLanguage|
+----------------------+
3 rows in set (0.01 sec)
mysql> quit
Step 3: Import Data Using Sqoop
• To import data, we need to make a directory in HDFS:

• $ hdfs dfs -mkdir sqoop-mysql-import

• The following command imports the Country table into HDFS. The option --table signifies the table to import, --target-dir is the directory created previously, and -m 1 tells Sqoop to use one map task to import the data.

• $ sqoop import --connect jdbc:mysql://limulus/world --username sqoop --password sqoop --table Country -m 1 --target-dir /user/hdfs/sqoop-mysql-import/country

• The file can be viewed using the hdfs dfs -cat command:
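For example (the part-file name below is an assumption; Sqoop map-only imports typically write output files named part-m-00000):

• $ hdfs dfs -cat sqoop-mysql-import/country/part-m-00000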
Step 4: Export Data from HDFS to MySQL

• Sqoop can also be used to export data from HDFS. The first step is to create
tables for exported data.

• There are actually two tables needed for each exported table. The first table
holds the exported data (CityExport), and the second is used for staging the
exported data (CityExportStaging).
Enter the following MySQL commands to create these tables:

mysql> CREATE TABLE `CityExport` ( `ID` int(11) NOT NULL AUTO_INCREMENT,
`Name` char(35) NOT NULL DEFAULT '',
`CountryCode` char(3) NOT NULL DEFAULT '',
`District` char(20) NOT NULL DEFAULT '',
`Population` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`ID`));

mysql> CREATE TABLE `CityExportStaging` ( `ID` int(11) NOT NULL AUTO_INCREMENT,
`Name` char(35) NOT NULL DEFAULT '',
`CountryCode` char(3) NOT NULL DEFAULT '',
`District` char(20) NOT NULL DEFAULT '',
`Population` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`ID`));
• Then use the following command to export the cities data into MySQL:

• $ sqoop --options-file cities-export-options.txt --table CityExport --staging-table CityExportStaging --clear-staging-table -m 1 --export-dir /user/hdfs/sqoop-mysql-import/city
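As a sketch of what cities-export-options.txt might contain, a Sqoop options file simply holds the tool name and its connection arguments, one token per line:

export
--connect
jdbc:mysql://limulus/world
--username
sqoop
--password
sqoop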

• mysql> select * from CityExport limit 10;


Apache Flume to acquire data streams

▪ Apache Flume is an independent agent designed to collect, transport, and store data into
HDFS.
▪ Data transport involves a number of Flume agents that may traverse a series of machines and
locations.
▪ Flume is often used for log files, social media-generated data, email messages, and just about
any continuous data source.
▪ A Flume agent is composed of three components.
o Source: The source component receives data and sends it to a channel. It can send the
data to more than one channel.
o Channel: A channel is a data queue that forwards the source data to the sink
destination.
o Sink: The sink delivers data to destination such as HDFS, a local file, or another
Flume agent.
▪ A Flume agent must have all three of these components defined. A Flume agent can have several sources, channels, and sinks.
▪ A source can write to multiple channels, but a sink can take data from only a single channel.
▪ Data written to a channel remain in the channel until a sink removes the data.
▪ By default, the data in a channel are kept in memory but may be optionally stored on disk
to prevent data loss in the event of a network failure.
▪ As shown in the above figure, Flume agents may be placed in a pipeline, possibly to traverse several machines or domains.
▪ In this Flume pipeline, the sink from one agent is connected to the source of
another.
▪ The data transfer format normally used by Flume is called Apache Avro.
▪ Avro is a data serialization/deserialization system that uses a compact binary format.
▪ The schema is sent as part of the data exchange and is defined using JSON.
▪ Avro also uses remote procedure calls (RPCs) to send data.
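As an illustration only (the agent, source, channel, and sink names below are hypothetical), a minimal Flume agent that listens on a network port and writes events to HDFS is described in a properties file and started with the flume-ng command:

# flume.conf - hypothetical single-agent configuration
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1
a1.channels.c1.type = memory
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/hdfs/flume
a1.sinks.k1.channel = c1

$ flume-ng agent --conf conf --conf-file flume.conf --name a1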
Oozie Example Walk-Through
To run the Oozie MapReduce example job from the oozie-examples/apps/map-reduce directory, enter the following line:
$ oozie job -run -oozie http://limulus:11000/oozie -config job.properties
When Oozie accepts the job, a job ID will be printed:
job: 0000001-150424174853048-oozie-oozi-W
You will need to change the “limulus” host name to match the name of the node running your Oozie server.
The job ID can be used to track and control job progress.
Step 3: Run the Oozie Demo Application
• A more sophisticated example can be found in the demo directory (oozie-examples/apps/demo). This workflow includes MapReduce, Pig, and file system tasks as well as fork, join, decision, action, start, stop, kill, and end nodes.
• Move to the demo directory and edit the job.properties file as described previously. Entering the following
command runs the workflow (assuming the OOZIE_URL environment variable has been set):
• $ oozie job -run -config job.properties
• You can track the job using either the Oozie command-line interface or the Oozie web console. To start the
web console from within Ambari, click on the Oozie service, and then click on the Quick Links pull-down
menu and select Oozie Web UI. Alternatively, you can start the Oozie web UI by connecting to the Oozie
server directly. For example, the following command will bring up the Oozie UI (use your Oozie server host
name in place of “limulus”):
• $ firefox http://limulus:11000/oozie/
A Short Summary of Oozie Job Commands
The following summary lists some of the more commonly encountered Oozie
commands. See the latest documentation at http://oozie.apache.org for more
information. (Note that the examples here assume OOZIE_URL is defined.)
Run a workflow job (returns _OOZIE_JOB_ID_):
$ oozie job -run -config JOB_PROPERTIES
Submit a workflow job (returns _OOZIE_JOB_ID_ but does not start):
$ oozie job -submit -config JOB_PROPERTIES
Start a submitted job:
$ oozie job -start _OOZIE_JOB_ID_
Check a job’s status:
$ oozie job -info _OOZIE_JOB_ID_
Suspend a workflow:
$ oozie job -suspend _OOZIE_JOB_ID_
Resume a workflow:
$ oozie job -resume _OOZIE_JOB_ID_
Rerun a workflow:
$ oozie job -rerun _OOZIE_JOB_ID_ -config JOB_PROPERTIES
Kill a job:
$ oozie job -kill _OOZIE_JOB_ID_
View server logs:
$ oozie job -logs _OOZIE_JOB_ID_
Full logs are available at /var/log/oozie on the Oozie server.
HBase provides a shell for interactive use. To enter the shell, type the following as a user:
$ hbase shell
hbase(main):001:0>
To exit the shell, type exit. Various commands can be conveniently entered from the shell prompt. For instance, the status command provides the
system status:
hbase(main):001:0> status
4 servers, 0 dead, 1.0000 average load
Additional arguments can be added to the status command, including 'simple', 'summary', or 'detailed'. The single quotes are needed for proper
operation. For example, the following command will provide simple status information for the four HBase servers (actual server statistics have been
removed for clarity):
hbase(main):002:0> status 'simple'
4 live servers
n1:60020 1429912048329
n2:60020 1429912040653
limulus:60020 1429912041396
...
n0:60020 1429912042885
...
0 dead servers
Aggregate load: 0, regions: 4
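Beyond status, tables can be created, populated, scanned, and removed from the same shell (a minimal sketch; the table and column-family names are hypothetical):

hbase(main):003:0> create 'testtable', 'cf'
hbase(main):004:0> put 'testtable', 'row1', 'cf:greeting', 'hello'
hbase(main):005:0> scan 'testtable'
hbase(main):006:0> disable 'testtable'
hbase(main):007:0> drop 'testtable'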
Apache HBase Web Interface
• Like many of the Hadoop ecosystem tools, HBase has a web interface. To start the HBase console, shown in Figure 7.11, from within Ambari, click on the HBase service, and then click on the Quick Links pull-down menu and select HBase Master UI. Alternatively, you can connect to the HBase master directly to start the HBase web UI. For example, the following command will bring up the HBase UI (use your HBase master server host name in place of "limulus"):
• $ firefox http://limulus:60010/master-status
