BDCC 2.4: Pig, Sqoop and Hive

Apache Pig

+ Apache Pig is a platform for managing and analyzing large data sets. It consists of a high-level programming layer for expressing data analysis programs, together with the infrastructure to evaluate those programs. A key advantage of Pig is that it handles parallel processing easily, which makes it well suited to very large amounts of data. Programs on this platform are written in the textual language Pig Latin.

+ Pig Latin features:
  + Simple programming: it is easy to code, execute and manage.
  + Better optimization: the system can automatically optimize execution.
  + Extensibility: it can be extended to achieve highly specific processing tasks.

+ Pig can be used for the following purposes:
  + ETL data pipelines
  + Research on raw data
  + Iterative processing

Pig data types

+ Map: a set of key-value pairs, where each key is a chararray and each value may be of any Pig data type, including complex types.
  Example: ['city'#'Mumbai', 'pin'#400001], where city and pin are keys mapped to their values.

+ Tuple: an ordered, fixed-length collection of fields, each of which may have its own data type.

+ Bag: an unordered collection of tuples, written with the tuples separated by commas.
  Example: {('Bangalore', 560001), ('Mysore', 570001), ('Mumbai', 400001)}

LOAD function

+ The LOAD function loads data from the file system; it is a relational operator. The first step in a data-flow script is to specify the input, which is done with the LOAD keyword. The LOAD syntax is:

  LOAD 'mydata' [USING function] [AS schema];

  Example:

  A = LOAD 'abc.txt';
  A = LOAD 'abc.txt' USING PigStorage('\t');
  A = LOAD 'abc.txt' USING PigStorage('\t') AS (city:chararray, pin:int);

Apache Sqoop

+ Apache Sqoop is a tool that is extensively used to transfer large amounts of data from Hadoop to relational database servers and vice versa. Sqoop can import various types of data from Oracle, MySQL and other such databases.

+ Important Sqoop control commands for importing RDBMS data:
  + Append: append data to an existing dataset in HDFS (--append).
  + Columns: select the columns to import from the table (--columns).

+ The common large-object types in Sqoop are BLOB and CLOB. If an object is smaller than 16 MB, it is stored inline with the rest of the data. Larger objects are temporarily stored in a subdirectory named _lobs, and the data is then materialized in memory for processing. If the lob limit is set to zero (0), large objects are always stored externally rather than inline.

  Example:

  sqoop import --connect jdbc:mysql://db.one.com/corp --table COMPANY_EMP --where "start_date > '2016-07-20'"

+ Sqoop supports importing data into the following services:
  + HDFS
  + Hive
  + HBase
  + HCatalog
  + Accumulo

+ Sqoop needs a JDBC driver of the database for interaction.
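As a small, hedged sketch of the Append control command described above (the connection string, table, check column and target directory are placeholder values), an incremental import that appends only newly added rows to an existing HDFS dataset might look like this:

  # Import only rows whose id exceeds the last value seen, appending them
  # to the existing dataset; all names here are hypothetical.
  sqoop import \
    --connect jdbc:mysql://db.one.com/corp \
    --table COMPANY_EMP \
    --incremental append \
    --check-column id \
    --last-value 100 \
    --target-dir /user/hadoop/company_emp

After each run Sqoop reports the new high-water mark for the check column; if the command is saved as a Sqoop job, that value is stored and reused automatically on the next run.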
Apache Hive

+ Apache Hive is data warehouse software that lets you read, write and manage huge volumes of data stored in a distributed environment using SQL. It is possible to project structure onto data that is already in storage. Users can connect to Hive using a JDBC driver or a command-line tool.

+ Hive is an open-source system. We can use Hive for analyzing and querying large datasets with a language similar to SQL.

+ Hive supports ACID (Atomicity, Consistency, Isolation, Durability) transactions, which are provided at the row level.

+ Hive is not considered a full database. The design rules and constraints of Hadoop and HDFS place restrictions on what Hive can do.

+ Hive is most suitable for data warehouse applications with:
  + relatively static data to analyze;
  + no need for fast response times;
  + no rapid changes in the data.

[Hive architecture diagram: Driver (compiler, optimizer, executor)]

How does Hive work?

+ Hive was created to allow non-programmers familiar with SQL to work with petabytes of data, using a SQL-like interface called HiveQL. Traditional relational databases are designed for interactive queries on small to medium datasets and do not process huge datasets well. Hive instead uses batch processing, so that it works efficiently across a very large distributed database. Hive transforms HiveQL queries into MapReduce or Tez jobs that run on Apache Hadoop's distributed job-scheduling framework, Yet Another Resource Negotiator (YARN). It queries data stored in a distributed storage solution, like the Hadoop Distributed File System (HDFS) or Amazon S3. Hive stores its database and table metadata in a metastore, which is a database- or file-backed store that enables easy data abstraction and discovery.

+ Hive includes HCatalog, a table and storage management layer that reads data from the Hive metastore to facilitate seamless integration between Hive, Apache Pig and MapReduce. By using the metastore, HCatalog allows Pig and MapReduce to use the same data structures as Hive, so that the metadata does not have to be redefined for each engine. Custom applications or third-party integrations can use WebHCat, which is a RESTful API for HCatalog, to access and reuse Hive metadata.

Apache Hive vs Apache HBase

+ Description
  + Apache Hive: SQL-like query engine designed for high-volume data stores; multiple file formats are supported.
  + Apache HBase: low-latency distributed key-value store with custom query capabilities; data is stored in a column-oriented format.

+ Processing type
  + Apache Hive: batch processing using the Apache Tez or MapReduce compute frameworks.
  + Apache HBase: real-time processing.

+ Latency
  + Apache Hive: medium to high, depending on the responsiveness of the compute engine; the distributed execution model provides superior performance compared to monolithic query systems, like an RDBMS, for the same data volumes.
  + Apache HBase: low, but it can be inconsistent; structural limitations of the HBase architecture can result in latency spikes under intense write loads.

+ Hadoop integration
  + Apache Hive: runs on top of Hadoop, with Apache Tez or MapReduce for processing and HDFS or Amazon S3 for storage.
  + Apache HBase: runs on top of Hadoop, using HDFS for storage.

+ SQL support
  + Apache Hive: provides SQL-like querying capabilities with HiveQL.
  + Apache HBase: no SQL support on its own; Apache Phoenix can be used for SQL capabilities.

+ Schema
  + Apache Hive: defined schema for all tables.
  + Apache HBase: schema-free.

+ Data types
  + Apache Hive: supports structured and unstructured data; provides native support for common SQL data types, like INT, FLOAT and VARCHAR.
  + Apache HBase: supports unstructured data only; the user defines mappings of data fields to Java-supported data types.
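As a minimal HiveQL sketch of the row-level ACID support described above (the table and column names are placeholders, and the cluster must have Hive transactions enabled), an updatable table is created as a transactional ORC table:

  -- Transactional (ACID) tables must be stored as ORC and marked transactional.
  CREATE TABLE company_emp (id INT, name STRING, city STRING)
  STORED AS ORC
  TBLPROPERTIES ('transactional' = 'true');

  -- Row-level operations made possible by ACID support:
  INSERT INTO company_emp VALUES (1, 'Asha', 'Mumbai');
  UPDATE company_emp SET city = 'Bangalore' WHERE id = 1;
  DELETE FROM company_emp WHERE id = 1;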

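To illustrate the WebHCat interface mentioned above, here is a hedged sketch assuming a WebHCat server on its default port 50111; the host name, user name and table are placeholder values:

  # List the tables in the default database through WebHCat (HCatalog's REST API).
  curl -s 'http://hive-host.example.com:50111/templeton/v1/ddl/database/default/table?user.name=hadoop'

  # Describe a single table.
  curl -s 'http://hive-host.example.com:50111/templeton/v1/ddl/database/default/table/company_emp?user.name=hadoop'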