+ The common large object types in Sqoop are BLOB and CLOB. If an object is less than 16 MB, it is stored inline with the rest of the data. Larger objects are stored in files in a subdirectory named "_lobs" of the import target directory. That data is then materialized in memory for processing. If the inline LOB limit is set to zero (0), all large objects are placed in external storage.
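As a sketch of the behaviour described above, the following command (connection string, table name, and target directory are made up for illustration) forces every large object into external storage by setting the inline LOB limit to zero:

```shell
sqoop import \
  --connect jdbc:mysql://db.example.com/corp \
  --table DOCUMENTS \
  --inline-lob-limit 0 \
  --target-dir /user/hadoop/documents
```

With `--inline-lob-limit 0`, no LOB is stored inline; all of them land in the `_lobs` subdirectory of the target directory.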
https://bdcc.santechz.com/unit-4-pig-sqoop-and-hive | BDCC - 4 - Pig, Sqoop and Hive
Example:

sqoop import --connect jdbc:mysql://db.one.com/corp --table COMPANY_EMP --where "start_date > '2016-07-20'"
Sqoop supports importing data into the following services:
+ HDFS
+ Hive
+ HBase
+ HCatalog
+ Accumulo
Sqoop needs the JDBC driver of the database for interaction.
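For instance, importing directly into Hive instead of plain HDFS only requires extra flags (the connection details below are illustrative, not from the source):

```shell
sqoop import \
  --connect jdbc:mysql://db.example.com/corp \
  --table COMPANY_EMP \
  --hive-import \
  --hive-table corp.company_emp
```

The `--hive-import` flag makes Sqoop generate and load a matching Hive table after the HDFS import completes.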
Apache Hive
+ Apache Hive is data warehouse software that lets you read, write, and manage huge datasets stored in a distributed environment using SQL. It is possible to project structure onto data that is in storage. Users can connect to Hive using a JDBC driver and a command-line tool.
+ Hive is an open system. We can use Hive for analyzing and querying large datasets with a language similar to SQL.
+ Hive supports ACID transactions: ACID stands for Atomicity, Consistency, Isolation, and Durability. ACID transactions are provided at the row level.
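Row-level ACID operations require a transactional table. A minimal HiveQL sketch (the table name and values are illustrative) looks like:

```sql
-- Transactional tables must be stored as ORC.
CREATE TABLE emp (id INT, name STRING, salary FLOAT)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Row-level UPDATE and DELETE are only allowed on transactional tables.
UPDATE emp SET salary = salary * 1.1 WHERE id = 42;
DELETE FROM emp WHERE id = 7;
```

On a non-transactional table, the same UPDATE and DELETE statements are rejected.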
+ Hive is not considered a full database. The design rules and regulations of Hadoop and HDFS put restrictions on what Hive can do.
+ Hive is most suitable for data warehouse applications that:
+ analyze relatively static data;
+ do not require fast response times;
+ have no rapidly changing data.
[Figure: Hive architecture, showing the Driver with its Compiler, Optimizer, and Executor components]
How does Hive work?
+ Hive was created to allow non-programmers familiar with SQL to work with petabytes of data, using a
SQL-like interface called HiveQL. Traditional relational databases are designed for interactive queries
on small to medium datasets and do not process huge datasets well. Hive instead uses batch
processing so that it works quickly across a very large distributed database. Hive transforms HiveQL
queries into MapReduce or Tez jobs that run on Apache Hadoop’s distributed job scheduling
framework, Yet Another Resource Negotiator (YARN). It queries data stored in a distributed storage
solution, like the Hadoop Distributed File System (HDFS) or Amazon S3. Hive stores its database and
table metadata in a metastore, which is a database or file backed store that enables easy data
abstraction and discovery.
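As a rough illustration of the batch model described above, this Python sketch (not Hive's actual code) shows how a HiveQL aggregate such as `SELECT dept, COUNT(*) FROM emp GROUP BY dept` decomposes into the map, shuffle, and reduce phases that Hive compiles it into:

```python
from collections import defaultdict

# Toy rows standing in for an HDFS-backed table: (name, dept)
rows = [("ann", "eng"), ("bob", "eng"), ("cy", "hr"), ("di", "hr"), ("ed", "eng")]

def map_phase(rows):
    # Map: emit a (dept, 1) pair for every input record.
    for _, dept in rows:
        yield dept, 1

def shuffle(pairs):
    # Shuffle: group all emitted values by key before reducing.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the 1s per key, i.e. COUNT(*) ... GROUP BY dept.
    return {dept: sum(values) for dept, values in groups.items()}

result = reduce_phase(shuffle(map_phase(rows)))
print(result)  # {'eng': 3, 'hr': 2}
```

In a real cluster the map and reduce phases run as distributed YARN tasks over HDFS blocks, but the data flow is the same.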
+ Hive includes HCatalog, which is a table and storage management layer that reads data from the Hive
metastore to facilitate seamless integration between Hive, Apache Pig, and MapReduce. By using the
metastore, HCatalog allows Pig and MapReduce to use the same data structures as Hive, so that the
metadata doesn’t have to be redefined for each engine. Custom applications or third-party
integrations can use WebHCat, which is a RESTful API for HCatalog to access and reuse Hive
metadata.
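As an example of the WebHCat REST interface mentioned above (the hostname is a placeholder; 50111 is the conventional default port), listing the tables of the default database looks like:

```shell
# Query HCatalog metadata over WebHCat's REST API.
curl -s 'http://webhcat-host:50111/templeton/v1/ddl/database/default/table?user.name=hive'
```

The response is JSON describing the tables, which a custom application can consume without talking to Hive directly.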
Compiled by Aaron Stanislaus Johns
| CHARACTERISTICS | APACHE HIVE | APACHE HBASE |
| --- | --- | --- |
| Description | SQL-like query engine designed for high-volume data stores. Multiple file formats are supported. | Low-latency distributed key-value store with custom query capabilities. Data is stored in a column-oriented format. |
| Processing Type | Batch processing using Apache Tez or MapReduce compute frameworks. | Real-time processing. |
| Latency | Medium to high, depending on the responsiveness of the compute engine. The distributed execution model provides superior performance compared to monolithic query systems, like RDBMS, for the same data volumes. | Low, but it can be inconsistent. Structural limitations of the HBase architecture can result in latency spikes under intense write loads. |
| Hadoop Integration | Runs on top of Hadoop, with Apache Tez or MapReduce for processing and HDFS or Amazon S3 for storage. | Runs on top of HDFS. |
| SQL Support | Provides SQL-like querying capabilities with HiveQL. | No SQL support on its own. You can use Apache Phoenix for SQL capabilities. |
| Schema | Defined schema for all tables. | Schema-free. |
| Data Types | Supports structured and unstructured data. Provides native support for common SQL data types, like INT, FLOAT, and VARCHAR. | Supports unstructured data only. The user defines mappings of data fields to Java-supported data types. |