
BIG DATA SECURITY

ASSIGNMENT
Write short notes on the following Hadoop ecosystem components:
Pig

1. Apache Pig is a platform for analyzing large data sets. It consists of a high-level language
(Pig Latin) for expressing data analysis programs, together with the infrastructure for
evaluating those programs.
2. The structure of Pig programs allows them to be heavily parallelized, which is what enables
them to handle very large data sets.
3. Pig was initially developed by Yahoo! for its data scientists who were using Hadoop.
4. It was conceived so that the focus would be on analyzing large data sets rather than on
writing map and reduce functions.
5. This allows users to concentrate on what they want to do rather than on how it is done.
6. In addition, Pig can be extended with functions written in other languages such as Java and
Python, and Pig statements can themselves be embedded in those languages (a minimal Java
sketch follows this list).
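
As a sketch of point 6, the Java driver below embeds a short Pig Latin word-count script
through Pig's PigServer API. The input file, field names and output directory are placeholder
assumptions for illustration, not details taken from this document.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigWordCount {
        public static void main(String[] args) throws Exception {
            // Run the Pig Latin statements on the cluster (ExecType.LOCAL would run them locally).
            PigServer pig = new PigServer(ExecType.MAPREDUCE);
            // The script states WHAT to compute; Pig plans the map and reduce stages itself.
            pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
            pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
            pig.registerQuery("grouped = GROUP words BY word;");
            pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words) AS n;");
            // Storing the result is what triggers the underlying MapReduce job(s).
            pig.store("counts", "wordcount_output");
        }
    }

The same statements could be run unchanged from the Grunt shell or a .pig script file;
embedding them in Java is only one of the integration options mentioned above.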

Hive

1. Apache Hive is open source data warehouse software for reading, writing and managing
large data sets stored directly in the Apache Hadoop Distributed File System (HDFS) or in
other data storage systems such as Apache HBase.
2. Hive enables SQL developers to write Hive Query Language (HQL) statements that are
similar to standard SQL statements for data query and analysis.
3. It is designed to make MapReduce programming easier because you don’t have to know
and write lengthy Java code. Instead, you write queries more simply in HQL, and Hive then
creates the map and reduce functions for you.
4. Included with the installation of Hive is the Hive metastore, which enables you to apply a
table structure onto large amounts of unstructured data. Once you create a Hive table,
defining the columns, rows, data types, etc., all of this information is stored in the metastore
and becomes part of the Hive architecture.
5. Other tools such as Apache Spark and Apache Pig can then access the data described in the metastore.
6. As with any database management system (DBMS), you can run your Hive queries from a
command-line interface (known as the Hive shell), or from a Java™ Database Connectivity
(JDBC) or Open Database Connectivity (ODBC) application, using the Hive JDBC/ODBC
drivers (a minimal JDBC sketch follows this list).
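
To make point 6 concrete, here is a minimal sketch of querying Hive over JDBC from Java.
The HiveServer2 URL, credentials and the web_logs table are assumptions for illustration
only; the Hive JDBC driver (org.apache.hive.jdbc.HiveDriver) must be on the classpath.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
        public static void main(String[] args) throws Exception {
            // HiveServer2 endpoint; host, port and database are placeholders.
            String url = "jdbc:hive2://localhost:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = conn.createStatement()) {
                // HQL reads like standard SQL; Hive compiles it into the underlying map and reduce work.
                ResultSet rs = stmt.executeQuery(
                    "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page");
                while (rs.next()) {
                    System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
                }
            }
        }
    }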

HBase
1. HBase is a column-oriented non-relational database management system that runs on top
of Hadoop Distributed File System (HDFS). HBase provides a fault-tolerant way of storing
sparse data sets, which are common in many big data use cases. It is well suited for real-time
data processing or random read/write access to large volumes of data.
2. Unlike relational database systems, HBase does not support a structured query language like
SQL; in fact, HBase isn’t a relational data store at all. HBase applications are written in Java™
much like a typical Apache MapReduce application. HBase does support writing applications
in Apache Avro, REST and Thrift.
3. An HBase system is designed to scale linearly. It comprises a set of standard tables with rows
and columns, much like a traditional database. Each table must have an element defined as a
primary key (the row key), and all access attempts to HBase tables must use this key (see the
Java sketch after this list).
4. Avro, as a component, supports a rich set of primitive data types including: numeric, binary
data and strings; and a number of complex types including arrays, maps, enumerations and
records. A sort order can also be defined for the data.
5. HBase relies on ZooKeeper for high-performance coordination. ZooKeeper is built into
HBase, but if you’re running a production cluster, it’s suggested that you have a dedicated
ZooKeeper cluster that’s integrated with your HBase cluster.
6. HBase works well with Hive, a query engine for batch processing of big data, to enable fault-
tolerant big data applications.
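
A minimal Java client sketch of the row-key-based access described above. The table name,
column family and values are illustrative assumptions, and the cluster settings (such as the
ZooKeeper quorum) are expected to come from hbase-site.xml on the classpath.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBasePutGet {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("users"))) {  // "users" is a placeholder table
                // Every read and write is addressed by the row key ("user-1001" here).
                Put put = new Put(Bytes.toBytes("user-1001"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"), Bytes.toBytes("a@example.com"));
                table.put(put);

                Result result = table.get(new Get(Bytes.toBytes("user-1001")));
                System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"))));
            }
        }
    }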

Sqoop

1. Apache Sqoop is part of the Hadoop ecosystem. Because large volumes of data had to be
transferred from relational database systems into Hadoop, a dedicated tool was needed to do
this task quickly.
2. This is where Apache Sqoop came into the picture; it is now used extensively to transfer
data from relational databases to the Hadoop ecosystem for MapReduce processing and other
workloads.
3. Transferring data brings a set of requirements that must be taken care of: the data has to
remain consistent, it should be prepared for provisioning the downstream pipeline, and users
should keep the consumption of production system resources under control, among other
things.
4. A MapReduce application cannot directly access data residing in external relational
databases.
5. Having cluster nodes read from those databases directly would also expose the source
system to the risk of excessive load generated by the cluster.
6. The Sqoop tool helps import structured data from relational databases, NoSQL systems and
even enterprise data warehouses, making it easy to move data from external systems into
HDFS (see the sketch after this list).
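
Sqoop is normally driven from the command line, but it can also be invoked from Java through
its runTool entry point, as in the sketch below (assuming Sqoop 1 on the classpath). The JDBC
URL, credentials, table and target directory are placeholders, not values from this document.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.sqoop.Sqoop;

    public class SqoopImportExample {
        public static void main(String[] args) throws Exception {
            // Equivalent to: sqoop import --connect ... --table employees --target-dir /data/employees -m 4
            String[] importArgs = {
                "import",
                "--connect", "jdbc:mysql://db-host:3306/payroll",  // placeholder source database
                "--username", "etl_user",
                "--password", "secret",
                "--table", "employees",                            // placeholder table
                "--target-dir", "/data/employees",                 // HDFS destination
                "-m", "4"                                          // 4 parallel map tasks
            };
            int exitCode = Sqoop.runTool(importArgs, new Configuration());
            System.exit(exitCode);
        }
    }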
