Getting Started With HDP Sandbox
As of January 31, 2021, this tutorial references legacy products that no longer represent Cloudera’s current product offerings.
Introduction
Hello World is often used by developers to familiarize themselves with new concepts by building a simple program. This tutorial aims to achieve a similar purpose by
getting practitioners started with Hadoop and HDP. We will use an Internet of Things (IoT) use case to build your first HDP application.
This tutorial describes how to refine data for a Trucking IoT Data Discovery (aka IoT Discovery) use case using the Hortonworks Data Platform. The IoT Discovery use
case involves vehicles, devices and people moving across a map or similar surface. Your analysis is aimed at linking location information with your analytic data.
For our tutorial we are looking at a use case where we have a truck fleet. Each truck has been equipped to log location and event data. These events are streamed
back to a datacenter where we will be processing the data. The company wants to use this data to better understand risk.
Here is the video of Analyzing Geolocation Data to show you what you’ll be doing in this tutorial.
Prerequisites
Downloaded and deployed the Hortonworks Data Platform (HDP) Sandbox
Go through Learning the Ropes of the HDP Sandbox to become familiar with the Sandbox.
Outline
Concepts to strengthen your foundation in the Hortonworks Data Platform (HDP)
Loading Sensor Data into HDFS
Hive - Data ETL
Spark - Risk Factor
Data Reporting with Zeppelin
Introduction
In this tutorial, we will explore important concepts that will strengthen your foundation in the Hortonworks Data Platform (HDP). Apache Hadoop is a layered framework
for processing and storing massive amounts of data. In our case, Apache Hadoop is delivered as an enterprise solution in the form of HDP. At the base of HDP sits
our data storage environment, the Hadoop Distributed File System. When data files are accessed by Hive, Pig or another coding language, YARN is the data
operating system that enables them to analyze, manipulate or process that data. HDP includes various components that open new opportunities and efficiencies in
healthcare, finance, insurance and other industries that impact people.
Prerequisites
Downloaded and deployed the Hortonworks Data Platform (HDP) Sandbox
Learning the Ropes of the HDP Sandbox
Outline
Hadoop & HDP
HDFS
MapReduce & YARN
Hive and Pig
Further Reading
Apache Hadoop
Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to
quickly gain insight from massive amounts of structured and unstructured data. Numerous Apache Software Foundation projects make up the services required by an
enterprise to deploy, integrate and work with Hadoop. Refer to the blog reference below for more information on Hadoop.
Hortonworks Blog:
o How Apache Hadoop 3 Adds Value Over Apache Hadoop 2.0
Hadoop Common – contains libraries and utilities needed by other Hadoop modules.
Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
Hadoop YARN – a resource-management platform responsible for managing computing resources in clusters and using them for scheduling of users’ applications.
Hadoop MapReduce – a programming model for large scale data processing.
Each project has been developed to deliver an explicit function and each has its own community of developers and individual release cycles. There are five pillars to
Hadoop that make it enterprise ready:
Data Management – Store and process vast quantities of data in a storage layer that scales linearly. Hadoop Distributed File System (HDFS) is the core
technology for the efficient scale out storage layer, and is designed to run across low-cost commodity hardware. Apache Hadoop YARN is the pre-requisite for
Enterprise Hadoop as it provides the resource management and pluggable architecture for enabling a wide variety of data access methods to operate on data
stored in Hadoop with predictable performance and service levels.
o Apache Hadoop YARN – Part of the core Hadoop project, YARN is a next-generation framework for Hadoop data processing extending MapReduce capabilities by
supporting non-MapReduce workloads associated with other programming models.
o HDFS – Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage that is designed to span large clusters of
commodity servers.
Data Access – Interact with your data in a wide variety of ways – from batch to real-time. Apache Hive is the most widely adopted data access technology,
though there are many specialized engines. For instance, Apache Pig provides scripting capabilities, Apache Storm offers real-time processing, Apache HBase
offers columnar NoSQL storage and Apache Accumulo offers cell-level access control. All of these engines can work across one set of data and resources
thanks to YARN and intermediate engines such as Apache Tez for interactive access and Apache Slider for long-running applications. YARN also provides
flexibility for new and emerging data access methods, such as Apache Solr for search and programming frameworks such as Cascading.
o Apache Hive – Built on the MapReduce framework, Hive is a data warehouse that enables easy data summarization and ad-hoc queries via an SQL-like interface for
large datasets stored in HDFS.
o Apache Pig – A platform for processing and analyzing large data sets. Pig consists of a high-level language (Pig Latin) for expressing data analysis programs paired
with the MapReduce framework for processing these programs.
o MapReduce – MapReduce is a framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of
thousands of machines, in a reliable and fault-tolerant manner.
o Apache Spark – Spark is ideal for in-memory data processing. It allows data scientists to implement fast, iterative algorithms for advanced analytics such as clustering
and classification of datasets.
o Apache Storm – Storm is a distributed real-time computation system for processing fast, large streams of data adding reliable real-time data processing capabilities to
Apache Hadoop 2.x
o Apache HBase – A column-oriented NoSQL data storage system that provides random real-time read/write access to big data for user applications.
o Apache Tez – Tez generalizes the MapReduce paradigm to a more powerful framework for executing a complex DAG (directed acyclic graph) of tasks for near real-
time big data processing.
o Apache Kafka – Kafka is a fast and scalable publish-subscribe messaging system that is often used in place of traditional message brokers because of its higher
throughput, replication, and fault tolerance.
o Apache HCatalog – A table and metadata management service that provides a centralized way for data processing systems to understand the structure and location of
the data stored within Apache Hadoop.
o Apache Slider – A framework for deployment of long-running data access applications in Hadoop. Slider leverages YARN’s resource management capabilities to
deploy those applications, to manage their lifecycles and scale them up or down.
o Apache Solr – Solr is the open source platform for searches of data stored in Hadoop. Solr enables powerful full-text search and near real-time indexing on many of the
world’s largest Internet sites.
o Apache Mahout – Mahout provides scalable machine learning algorithms for Hadoop which aids with data science for clustering, classification and batch based
collaborative filtering.
o Apache Accumulo – Accumulo is a high performance data storage and retrieval system with cell-level access control. It is a scalable implementation of Google’s Big
Table design that works on top of Apache Hadoop and Apache ZooKeeper.
Data Governance and Integration – Quickly and easily load data, and manage according to policy. Workflow Manager provides workflows for data governance,
while Apache Flume and Sqoop enable easy data ingestion, as do the NFS and WebHDFS interfaces to HDFS.
o Workflow Management – Workflow Manager allows you to easily create and schedule workflows and monitor workflow jobs. It is based on the Apache Oozie workflow
engine that allows users to connect and automate the execution of big data processing tasks into a defined workflow.
o Apache Flume – Flume allows you to efficiently aggregate and move large amounts of log data from many different sources to Hadoop.
o Apache Sqoop – Sqoop is a tool that speeds and eases movement of data in and out of Hadoop. It provides a reliable parallel load for various, popular enterprise data
sources.
Security – Address requirements of Authentication, Authorization, Accounting and Data Protection. Security is provided at every layer of the Hadoop stack from
HDFS and YARN to Hive and the other Data Access components on up through the entire perimeter of the cluster via Apache Knox.
o Apache Knox – The Knox Gateway (“Knox”) provides a single point of authentication and access for Apache Hadoop services in a cluster. The goal of the project is to
simplify Hadoop security for users who access the cluster data and execute jobs, and for operators who control access to the cluster.
o Apache Ranger – Apache Ranger delivers a comprehensive approach to security for a Hadoop cluster. It provides central security policy administration across the core
enterprise security requirements of authorization, accounting and data protection.
Apache Hadoop can be useful across a range of use cases spanning virtually every vertical industry. It is becoming popular anywhere that you need to store, process,
and analyze large volumes of data. Examples include digital marketing automation, fraud detection and prevention, social network and relationship analysis, predictive
modeling for new drugs, retail in-store behavior analysis, and mobile device location-based marketing. To learn more about Apache Hadoop, watch the following
introduction:
HDFS
A single physical machine eventually saturates its storage capacity as data grows. With this growth comes the need to partition your data across separate
machines. A file system that manages the storage of data across a network of machines is called a distributed file system. HDFS is a core component of
Apache Hadoop and is designed to store large files with streaming data access patterns, running on clusters of commodity hardware. With Hortonworks Data Platform
(HDP), HDFS is now expanded to support heterogeneous storage media within the HDFS cluster.
With the next-generation HDFS data architecture that comes with HDP, HDFS has evolved to provide automated failover with a hot standby and full stack resiliency.
The video provides more clarity on HDFS.
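To see the heterogeneous storage support mentioned above from the command line, HDFS provides a storagepolicies subcommand. A minimal sketch, run from the Sandbox shell (the /tmp/data path is just an example directory used later in this tutorial):
# List the storage policies HDFS supports (HOT, COLD, ALL_SSD, ONE_SSD, ...)
hdfs storagepolicies -listPolicies
# Show the policy currently applied to a directory
hdfs storagepolicies -getStoragePolicy -path /tmp/data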
Ambari Files User View on Hortonworks Sandbox
Ambari Files User View
Ambari Files User View provides a user friendly interface to upload, store and move data. Underlying all components in Hadoop is the Hadoop Distributed File System
(HDFS). This is the foundation of the Hadoop cluster. The HDFS file system manages how the datasets are stored in the Hadoop cluster. It is responsible for
distributing the data across the datanodes, managing replication for redundancy, and administrative tasks like adding, removing and recovering data nodes.
Apache MapReduce
MapReduce is the key algorithm that the Hadoop data processing engine uses to distribute work around a cluster. A MapReduce job splits a large data set into
independent chunks and organizes them into key, value pairs for parallel processing. This parallel processing improves the speed and reliability of the cluster,
returning solutions more quickly and with greater reliability.
The InputFormat divides the input into ranges, and a map task is created for each range of the input. The JobTracker distributes those tasks to the
worker nodes. The output of each map task is partitioned into groups of key-value pairs for each reducer.
The Apache Hadoop projects provide a series of tools designed to solve big data problems. The Hadoop cluster implements a parallel computing cluster using
inexpensive commodity hardware. The cluster is partitioned across many servers to provide a near linear scalability. The philosophy of the cluster design is to bring the
computing to the data. So each datanode will hold part of the overall data and be able to process the data that it holds. The overall framework for the processing
software is called MapReduce. Here’s a short video introduction to MapReduce:
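If you want to see MapReduce in action before moving on, the Sandbox ships with the standard Hadoop example jobs. A hedged sketch, run from the Sandbox shell; the jar path can vary slightly between HDP versions, and the input/output paths below are just examples:
# Put a small text file into HDFS to use as input
hdfs dfs -mkdir -p /tmp/wordcount/input
hdfs dfs -put /etc/hosts /tmp/wordcount/input
# Run the classic word count example as a MapReduce job on YARN
yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar wordcount /tmp/wordcount/input /tmp/wordcount/output
# Inspect the reducer output
hdfs dfs -cat /tmp/wordcount/output/part-r-00000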
Apache YARN (Yet Another Resource Negotiator)
Hadoop HDFS is the data storage layer for Hadoop, and MapReduce was the data processing layer in Hadoop 1.x. However, the MapReduce algorithm, by itself, is not
sufficient for the very wide variety of use cases Hadoop is now employed to solve. YARN was introduced in Hadoop 2.0 as a generic resource-management
and distributed application framework, on which multiple data processing applications, customized for the task at hand, can be implemented. The fundamental idea of
YARN is to split the two major responsibilities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons: a
global ResourceManager and a per-application ApplicationMaster (AM).
The ResourceManager and per-node slave, the NodeManager (NM), form the new, and generic, system for managing applications in a distributed manner.
The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The per-application ApplicationMaster is, in effect,
a framework specific entity and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the
component tasks.
ResourceManager has a pluggable Scheduler, which is responsible for allocating resources to the various running applications subject to familiar constraints of
capacities, queues etc. The Scheduler is a pure scheduler in the sense that it performs no monitoring or tracking of status for the application, offering no guarantees on
restarting failed tasks either due to application failure or hardware failures. The Scheduler performs its scheduling function based on the resource requirements of the
applications; it does so based on the abstract notion of a Resource Container which incorporates resource elements such as memory, CPU, disk, network etc.
NodeManager is the per-machine slave, which is responsible for launching the applications’ containers, monitoring their resource usage (CPU, memory, disk, network)
and reporting the same to the ResourceManager.
The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring for
progress. From the system perspective, the ApplicationMaster itself runs as a normal container.
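You can observe these daemons at work from the command line as well. A brief sketch using the standard YARN CLI, run from the Sandbox shell (the output depends on what is currently running, and the application ID below is only a placeholder):
# List the NodeManagers registered with the ResourceManager
yarn node -list
# List applications currently accepted or running
yarn application -list
# Show details (queue, progress, tracking URL) for one application; replace the placeholder ID
yarn application -status application_1234567890123_0001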
Here is an architectural view of YARN:
One of the crucial implementation details for MapReduce within the new YARN system that should be mentioned is that we have reused the existing
MapReduce framework without any major surgery. This was very important to ensure compatibility for existing MapReduce applications and users. Here is a short
video introduction for YARN.
Apache Hive
Data analysts use Hive to explore, structure and analyze that data, then turn it into business insights. Hive implements a dialect of SQL (HiveQL) that focuses on
analytics and presents a rich set of SQL semantics including OLAP functions, sub-queries, common table expressions and more. Hive allows SQL developers or users
with SQL tools to easily query, analyze and process data stored in Hadoop. Hive also allows programmers familiar with the MapReduce framework to plug in their
custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language.
Hive users have a choice of 3 runtimes when executing SQL queries. Users can choose between Apache Hadoop MapReduce, Apache Tez or Apache Spark
frameworks as their execution backend.
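The execution engine can be chosen per session with a single Hive property. A minimal sketch (mr, tez and spark are the standard values; whether the Spark backend is available depends on how the cluster is configured, and the trucks table queried here is created later in this tutorial):
-- Run subsequent queries on Tez (the HDP default)
SET hive.execution.engine=tez;
-- Or fall back to classic MapReduce for comparison
SET hive.execution.engine=mr;
SELECT COUNT(*) FROM trucks;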
Here are some advantageous characteristics of Hive for enterprise SQL in Hadoop:
Feature – Description
Familiar – Query data with a SQL-based language.
Fast – Interactive response times, even over huge datasets.
Scalable and Extensible – As data variety and volume grows, more commodity machines can be added, without a corresponding reduction in performance.
Components of Hive
HCatalog is a component of Hive. It is a table and storage management layer for Hadoop that enables users with different data processing tools — including Pig and
MapReduce — to more easily read and write data on the grid. HCatalog holds a set of file paths and metadata about data in a Hadoop cluster. This allows scripts and
MapReduce or Tez jobs to be decoupled from data location and metadata such as the schema. Additionally, since HCatalog also supports tools like Hive and Pig, the location
and metadata can be shared between tools. Through HCatalog's open APIs, external tools that want to integrate, such as Teradata Aster, can also leverage file path
location and metadata in HCatalog.
Note: At one point HCatalog was its own Apache project. However, in March, 2013, HCatalog’s project merged with Hive. HCatalog is currently released as part of
Hive.
WebHCat provides a service that you can use to run Hadoop MapReduce (or YARN), Pig, Hive jobs or perform Hive metadata operations using an HTTP (REST style)
interface.
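As a quick illustration, WebHCat exposes Hive metadata over plain HTTP. A hedged example using curl against the Sandbox; WebHCat listens on port 50111 by default, and the host name and user are the Sandbox defaults used elsewhere in this tutorial:
# List the tables in the default database through the WebHCat REST API
curl -s "http://sandbox-hdp.hortonworks.com:50111/templeton/v1/ddl/database/default/table?user.name=maria_dev"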
Apache Pig
Apache Pig allows Apache Hadoop users to write complex MapReduce transformations using a simple scripting language called Pig Latin. Pig translates the Pig Latin
script into MapReduce so that it can be executed within YARN for access to a single dataset stored in the Hadoop Distributed File System (HDFS).
Pig was designed for performing a long series of data operations, making it ideal for three categories of Big Data jobs:
Characteristic – Benefit
Extensible – Pig users can create custom functions to meet their particular processing requirements.
Easily Programmed – Complex tasks involving interrelated data transformations can be simplified and encoded as data flow sequences. Pig programs accomplish huge tasks, but they are easy to write and maintain.
Self-Optimizing – Because the system automatically optimizes execution of Pig jobs, the user can focus on semantics.
Pig can be run in either of two modes:
MapReduce Mode. This is the default mode, which requires access to a Hadoop cluster. The cluster may be pseudo-distributed or fully distributed.
Local Mode. With access to a single machine, all files are installed and run using the local host and file system.
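From the Sandbox shell, the mode is selected with the -x flag when launching Pig. A small sketch (the script name is hypothetical; the interactive Grunt shell starts if no script is given):
# Run a Pig Latin script on the cluster (MapReduce mode, the default)
pig -x mapreduce my_script.pig
# Run the same script against the local file system only
pig -x local my_script.pig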
Further Reading
HDFS is one of the four components of Apache Hadoop; the other three are Hadoop Common, Hadoop YARN and Hadoop MapReduce.
Announcing HDP 3.0
To learn more about YARN, watch the following YARN introduction video.
Hadoop 3 Blogs
Apache Hadoop 3.1 A Giant Leap For Big Data
Apache Ambari is an open source, open community, web-based tool for Hadoop operations, which has been extended via Ambari User Views to
provide a growing list of developer tools as User Views.
Follow this link to learn more about the Ambari features included in HDP.
Hive Blogs:
Hive 3 Overview
Cost-Based Optimizer Makes Apache Hive 0.14 More Than 2.5X Faster
Discover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.next
HIVE 0.14 Cost Based Optimizer (CBO) Technical Overview
5 Ways to Make Your Hive Queries Run Faster
Secure JDBC and ODBC Clients’ Access to HiveServer2
Speed, Scale and SQL: The Stinger Initiative, Apache Hive 12 & Apache Tez
Hive/HCatalog – Data Geeks & Big Data Glue
Tez Blogs
ORC Blogs
HDFS Blogs:
YARN Blogs:
YARN series-1
YARN series-2
Introduction
In this section, you will download the sensor data and load that into HDFS using Ambari User Views. You will get introduced to the Ambari Files User View to manage
files. You can perform tasks like create directories, navigate file systems and upload files to HDFS. In addition, you’ll perform a few other file-related tasks as
well. Once you get the basics, you will create two directories and then load two files into HDFS using the Ambari Files User View.
Prerequisites
This tutorial is part of a series of hands-on tutorials to get you started on HDP using the Hortonworks Sandbox. Please ensure you complete the prerequisites before
proceeding with this tutorial.
Outline
HDFS backdrop
Download and Extract Sensor Data Files
Load the Sensor Data into HDFS
Summary
Further Reading
HDFS backdrop
A single physical machine eventually saturates its storage capacity as data grows. This growth drives the need to partition your data across separate machines. A
file system that manages the storage of data across a network of machines is called a distributed file system. HDFS is a core component of Apache Hadoop and
is designed to store large files with streaming data access patterns, running on clusters of commodity hardware. With Hortonworks Data Platform (HDP), HDFS is now
expanded to support heterogeneous storage media within the HDFS cluster.
o trucks.csv – This data was exported from a relational database; it contains information on truck models, driverid, truckid, and aggregated mileage info.
3. Starting from the root of the HDFS file system, you will see all the files the logged-in user (maria_dev in this case) has permission to view:
4. Navigate to the /tmp/ directory by clicking on the directory links.
5. Create a directory named data. Click the button to create that directory, then navigate to it. The directory path you should see is /tmp/data.
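If you prefer the command line to the Files View, the same directory and uploads can be done with the HDFS shell. A rough equivalent, assuming you have already copied the extracted CSV files onto the Sandbox (file names follow the ones used in this tutorial):
# Create the target directory in HDFS
hdfs dfs -mkdir -p /tmp/data
# Upload the two sensor data files
hdfs dfs -put geolocation.csv trucks.csv /tmp/data/
# Confirm they arrived
hdfs dfs -ls /tmp/data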
Summary
Congratulations! Let’s summarize the skills and knowledge we acquired from this tutorial. We learned that the Hadoop Distributed File System (HDFS) was built to manage
storing data across multiple machines. Now we can upload data into HDFS using Ambari’s Files View.
Further Reading
HDFS
HDFS User Guide
HDFS Architecture Guide
HDP OPERATIONS: HADOOP ADMINISTRATION
Introduction
In this section, you will be introduced to Apache Hive. In the earlier section, we covered how to load data into HDFS. So now you have geolocation and trucks files
stored in HDFS as csv files. In order to use this data in Hive, we will guide you on how to create a table and how to move data into a Hive warehouse, from where it
can be queried. We will analyze this data using SQL queries in Hive User Views and store it as ORC. We will also walk through Apache Tez and how a DAG is created
when you specify Tez as execution engine for Hive. Let's begin...
Prerequisites
This tutorial is part of a series of hands-on tutorials to get you started on HDP using the Hortonworks Sandbox. Please ensure you complete the prerequisites before
proceeding with this tutorial.
Outline
Apache Hive Basics
Become Familiar with Data Analytics Studio
Create Hive Tables
Explore Hive Settings on Ambari Dashboard
Analyze the Trucks Data
Summary
Further Reading
Note that the first row contains the names of the columns.
Click the Create button to complete table creation.
NOTE: For details on these clauses consult the Apache Hive Language Manual.
Following is a visual representation of the Upload table creation process:
1. The target table is created using ORC file format (i.e. Geolocation)
2. A temporary table is created using TEXTFILE file format to store data from the CSV file
3. Data is copied from temporary table to the target (ORC) table
4. Finally, the temporary table is dropped
You can review the SQL statements issued by selecting the Queries tab and reviewing the four most recent jobs, which were the result of using Upload Table.
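Conceptually, the Upload Table feature issues HiveQL along these lines. This is only a simplified sketch: the real statements use the full column list from your CSV and the table names you chose, and the column list shown here is abbreviated for illustration.
-- 1. Target table in ORC format
CREATE TABLE geolocation (truckid STRING, driverid STRING, event STRING) STORED AS ORC;
-- 2. Temporary table in TEXTFILE format matching the CSV layout
CREATE TABLE geolocation_temp (truckid STRING, driverid STRING, event STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
-- 3. Copy rows from the temporary table into the ORC table
INSERT INTO TABLE geolocation SELECT * FROM geolocation_temp;
-- 4. Drop the temporary table
DROP TABLE geolocation_temp;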
Verify New Tables Exist
To verify the tables were defined successfully:
describe formatted {table_name}; – Explore additional metadata about the table. For example, to verify that geolocation is an ORC table, execute the following query:
describe formatted geolocation;
By default, when you create a table in Hive, a directory with the same name gets created in the /warehouse/tablespace/managed/hive folder in HDFS. Using the
Ambari Files View, navigate to that folder. You should see both a geolocation and trucks directory:
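The same check can be done from the shell; a one-line sketch using the warehouse path given above:
# List the Hive managed-table warehouse directory in HDFS
hdfs dfs -ls /warehouse/tablespace/managed/hive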
NOTE: The definition of a Hive table and its associated metadata (i.e., the directory the data is stored in, the file format, what Hive
properties are set, etc.) are stored in the Hive metastore, which on the Sandbox is a MySQL database.
Rename Query Editor Worksheet
Click on the SAVE AS button in the Compose section, enter the name of your query and save it.
Beeline - Command Shell
Try running commands using the command line interface - Beeline. Beeline uses a JDBC connection to connect to HiveServer2. Use the built-in SSH Web Client (aka
Shell-In-A-Box):
1. Connect to Beeline as the hive user:
beeline -u jdbc:hive2://sandbox-hdp.hortonworks.com:10000 -n hive
2. Enter the following Beeline commands to grant the maria_dev user full access to the foodmart and default databases:
grant all on database foodmart to user maria_dev;
grant all on database default to user maria_dev;
!quit
3. Reconnect to Beeline, this time as the maria_dev user:
beeline -u jdbc:hive2://sandbox-hdp.hortonworks.com:10000 -n maria_dev
4. Enter the following Beeline commands to view 10 rows from the foodmart database customer and account tables:
select * from foodmart.customer limit 10;
select * from foodmart.account limit 10;
select * from trucks;
show tables;
!help
!tables
!describe trucks
What did you notice about performance after running Hive queries from the shell?
Queries run from the shell can feel faster because Hive runs the query directly in Hadoop, whereas in DAS the query must be accepted by a REST server before it can be
submitted to Hadoop.
You can get more information on the Beeline from the Hive Wiki.
Beeline is based on SQLLine.
1. Hive Page
2. Hive Configs Tab
3. Hive Settings Tab
4. Version History of Configuration
By default the key configurations are displayed on the first page. If the setting you are looking for is not on this page you can find additional settings in
the Advanced tab:
For example, if we wanted to improve SQL performance, we can use the new Hive vectorization features. These settings can be found and enabled by following these
steps:
As you can see from the green circle above, Enable Vectorization and Map Vectorization are already turned on.
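These Ambari toggles correspond to standard Hive properties, so they can also be enabled per session from a query editor or Beeline. A small sketch; the property names below are the stock Hive vectorization settings, and the Ambari labels map onto them:
-- Vectorize the map-side (scan/filter/aggregate) work
SET hive.vectorized.execution.enabled=true;
-- Vectorize the reduce side as well
SET hive.vectorized.execution.reduce.enabled=true;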
Some key resources to learn more about vectorization and some of the key settings in Hive tuning:
You should see a table that lists each trip made by a truck and driver:
Use the Content Assist to build a query
1. Create a new SQL Worksheet.
2. Start typing in the SELECT SQL command, but only enter the first two letters:
SE
NOTE: Notice content assist shows you some options that start with an “SE”. These shortcuts will be great for when you write a lot of
custom query code.
4. Type in the following query
SELECT truckid, avg(mpg) avgmpg FROM truckmileage GROUP BY truckid;
5. Click the “Save As” button to save the query as “average-mpg”:
6. Notice your query now shows up in the list of “Saved Queries”, which is one of the tabs at the top of the Hive User View.
7. Execute the “average-mpg” query and view its results.
The query results provide a list of average miles per gallon for each truck.
Create Table DriverMileage from Existing truckmileage data
The following CTAS groups the records by driverid and sums the miles. Copy the following DDL into the query editor, then click Execute:
CREATE TABLE DriverMileage
STORED AS ORC
AS
SELECT driverid, sum(miles) totmiles
FROM truckmileage
GROUP BY driverid;
View Data of DriverMileage
To view the data generated by CTAS above, execute the following query:
SELECT * FROM drivermileage;
We will use these results to calculate each truck driver's risk factor in the next section, so let's store them in HDFS: export the query results as a CSV file
and store it at /tmp/data/drivermileage.
Then open your web shell client:
sudo -u hdfs hdfs dfs -chown maria_dev:hdfs /tmp/data/drivermileage.csv
Next, navigate to HDFS as maria_dev and give permission to other users to use this file:
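If you would rather grant that access from the shell than from the Files View, a command along these lines will do it (read permission for other users is all the next section needs; the path is the one used above):
# Allow other users to read the exported CSV
sudo -u hdfs hdfs dfs -chmod o+r /tmp/data/drivermileage.csv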
Summary
Congratulations! Let’s summarize some of the Hive commands we learned to process, filter and manipulate the geolocation and trucks data. We can now create Hive tables
with CREATE TABLE and Upload Table. We learned how to change the file format of tables to ORC, so Hive is more efficient at reading, writing and processing this
data. We learned to retrieve data with the SELECT statement and to create a new filtered table with CTAS.
Further Reading
Augment your Hive foundation with the following resources:
Apache Hive
Hive LLAP enables sub second SQL on Hadoop
Programming Hive
Hive Language Manual
HDP DEVELOPER: APACHE PIG AND HIVE
Introduction
In this tutorial you will be introduced to Apache Zeppelin and learn how to visualize data using Zeppelin.
Prerequisites
This tutorial is part of a series of hands-on tutorials to get you started on HDP using the Hortonworks Sandbox. Please ensure you complete the prerequisites before
proceeding with this tutorial.
Outline
Apache Zeppelin
Create a Zeppelin Notebook
Download the Data
Execute a Hive Query
Build Charts Using Zeppelin
Summary
Further Reading
Apache Zeppelin
Apache Zeppelin provides a powerful web-based notebook platform for data analysis and discovery. Behind the scenes it supports Spark distributed contexts as well
as other language bindings on top of Spark.
In this tutorial we will use Apache Zeppelin to run SQL queries on the geolocation, trucks, and riskfactor data that we collected earlier and visualize the results
through graphs and charts.
Note: if you completed the Spark - Risk Factor section successfully, advance to Execute a Hive Query.
Enter the following code to create a Spark temporary view:
%spark2
// Get (or create) the shared SparkSession; Zeppelin also exposes it as the predefined "spark" variable
val hiveContext = org.apache.spark.sql.SparkSession.builder().getOrCreate()
// Load the risk factor CSV exported earlier into a DataFrame
val riskFactorDataFrame = spark.read.format("csv").option("header", "true").load("hdfs:///tmp/data/riskfactor.csv")
// Register the DataFrame as a temporary view so it can be queried with SQL
riskFactorDataFrame.createOrReplaceTempView("riskfactor")
// Sanity check: display the first 15 rows
hiveContext.sql("SELECT * FROM riskfactor LIMIT 15").show()
2. After clicking on a chart, we can view extra advanced settings to tailor the view of the data we want.
3. Click settings to open the advanced chart features.
4. To make a chart with riskfactor.driverid and riskfactor.riskfactor SUM, drag the table relations into the boxes as shown in the image below.
5. You should now see an image like the one below.
6. If you hover on the peaks, each will give the driverid and riskfactor.
7. Try experimenting with the different types of charts as well as dragging and dropping the different table fields to see what kind of results you can obtain.
8. Let's try a different query to find which cities and states contain the drivers with the highest risk factors.
%sql
SELECT a.driverid, a.riskfactor, b.city, b.state
FROM riskfactor a, geolocation b where a.driverid=b.driverid
9. After changing a few of the settings, we can figure out which cities have the highest risk factors. Try changing the chart settings by clicking the scatterplot icon.
Then make sure that a.driverid is in the xAxis field, a.riskfactor is in the yAxis field, and b.city is in the group field. The chart should look similar to the
following.
You can hover over the highest point to determine which driver has the highest risk factor and in which cities.
Summary
Great, now we know how to query and visualize data using Apache Zeppelin. We can leverage Zeppelin—along with our newly gained knowledge of Hive and Spark—
to solve real world problems in new creative ways.
Further Reading
Zeppelin on HDP
Apache Zeppelin Docs
Zeppelin Homepage