A short introduction to Apache Hadoop Hive: what it is, what it can do, how we could use it to connect a Hadoop cluster to business intelligence tools, and how to create management reports from our Hadoop cluster data.
This document outlines the concepts and techniques of Domain-Driven Design (DDD). It begins with basic concepts like the ubiquitous language and domain model. It then covers strategic design patterns such as bounded contexts and context mapping. Next, it discusses tactical design building blocks like entities, aggregates, and repositories. Finally, it briefly introduces related patterns like CQRS, event sourcing, and event-driven architectures. The document is intended to provide an overview of DDD from basic concepts to advanced patterns in both the strategic and tactical spheres.
This PowerPoint slide deck is from the presentation given at the Microsoft center in Waltham, MA, titled Leading Practices and Insights for Managing Data Integration Initiatives.
Topics covered include:
Key Drivers
Approaches and Strategy
Tools and Products
Useful Case Studies
Success Factors
This document discusses cloud computing, big data, Hadoop, and data analytics. It begins with an introduction to cloud computing, explaining its benefits like scalability, reliability, and low costs. It then covers big data concepts like the 3 Vs (volume, variety, velocity), Hadoop for processing large datasets, and MapReduce as a programming model. The document also discusses data analytics, describing different types like descriptive, diagnostic, predictive, and prescriptive analytics. It emphasizes that insights from analyzing big data are more valuable than raw data. Finally, it concludes that cloud computing can enhance business efficiency by enabling flexible access to computing resources for tasks like big data analytics.
The document discusses Google's Knowledge Graph, which was introduced in 2012. The Knowledge Graph enhances search results by incorporating information from sources like Wikipedia to provide structured information about search topics. It aims to understand search queries better and provide relevant information without users needing to click through to other sites. The Knowledge Graph displays information in a more visual way on the right side of search results and could impact ad placement. It facilitates finding related information and benefits users and advertisers by providing more specific results.
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2L4rPmM
This CloudxLab Basics of RDD tutorial helps you to understand Basics of RDD in detail. Below are the topics covered in this tutorial:
1) What is RDD - Resilient Distributed Datasets
2) Creating RDD in Scala
3) RDD Operations - Transformations & Actions
4) RDD Transformations - map() & filter()
5) RDD Actions - take() & saveAsTextFile()
6) Lazy Evaluation & Instant Evaluation
7) Lineage Graph
8) flatMap and Union
9) Scala Transformations - Union
10) Scala Actions - saveAsTextFile(), collect(), take() and count()
11) More Actions - reduce()
12) Can We Use reduce() for Computing Average?
13) Solving Problems with Spark
14) Compute Average and Standard Deviation with Spark
15) Pick Random Samples From a Dataset using Spark
This document summarizes a presentation about semantic technologies for big data. It discusses how semantic technologies can help address challenges related to the volume, velocity, and variety of big data. Specific examples are provided of large semantic datasets containing billions of triples and semantic applications that have integrated and analyzed disparate data sources. Semantic technologies are presented as a good fit for addressing big data's variety, and research is making progress in applying them to velocity and volume as well.
HDFS is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks.
This document provides an overview of AWS Lake Formation and related services for building a secure data lake. It discusses how Lake Formation provides a centralized management layer for data ingestion, cleaning, security and access. It also describes how Lake Formation integrates with services like AWS Glue, Amazon S3 and ML transforms to simplify and automate many data lake tasks. Finally, it provides an example workflow for using Lake Formation to deduplicate data from various sources and grant secure access for analysis.
Modernizing to a Cloud Data Architecture – Databricks
Organizations with on-premises Hadoop infrastructure are bogged down by system complexity, unscalable infrastructure, and the increasing burden on DevOps to manage legacy architectures. Costs and resource utilization continue to go up while innovation has flatlined. In this session, you will learn why, now more than ever, enterprises are looking for cloud alternatives to Hadoop and are migrating off of the architecture in large numbers. You will also learn how elastic compute models’ benefits help one customer scale their analytics and AI workloads and best practices from their experience on a successful migration of their data and workloads to the cloud.
DRAFT: Extend industry Well-Architected Frameworks to focus on data and business outcomes. Adding data to the cloud framework will resolve the fragmented approaches that customers are struggling with regarding data placement across various cloud providers.
Scale Your Mission-Critical Applications With Neo4j Fabric and Clustering Arc... – Neo4j
This document discusses how Neo4j 5's Clustering and Fabric features can help organizations operate Neo4j databases at large scale. Clustering allows elastic horizontal scaling of resources across multiple servers to support more and larger databases. Fabric enables querying across databases, including those sharded across clusters. Two financial use cases will be presented to illustrate how Clustering and Fabric can support real-time decision making across business graphs and make multi-terabyte datasets more manageable through sharding.
INTERFACE by apidays 2023 - How APIs are fueling the growth of 5G and MEC – apidays
The document discusses how 5G and mobile edge computing (MEC) are fueling growth through the use of APIs. It describes how MEC processes data closer to devices at the network edge for improved performance. 5G impacts latency and other factors. APIs allow dynamic interactions between networks, MEC, software, and devices to support new technologies. The 5G Future Forum aims to accelerate 5G and MEC solutions through API development and specifications that are interoperable across networks.
Unified Big Data Processing with Apache Spark (QCON 2014) – Databricks
This document discusses Apache Spark, a fast and general engine for big data processing. It describes how Spark generalizes the MapReduce model through its Resilient Distributed Datasets (RDDs) abstraction, which allows efficient sharing of data across parallel operations. This unified approach allows Spark to support multiple types of processing, like SQL queries, streaming, and machine learning, within a single framework. The document also outlines ongoing developments like Spark SQL and improved machine learning capabilities.
Embarking on building a modern data warehouse in the cloud can be an overwhelming experience due to the sheer number of products that can be used, especially when the use cases for many products overlap others. In this talk I will cover the use cases of many of the Microsoft products that you can use when building a modern data warehouse, broken down into four areas: ingest, store, prep, and model & serve. It’s a complicated story that I will try to simplify, giving blunt opinions of when to use what products and the pros/cons of each.
The document provides an agenda and slides for a presentation on architectural considerations for data warehousing with Hadoop. The presentation discusses typical data warehouse architectures and challenges, how Hadoop can complement existing architectures, and provides an example use case of implementing a data warehouse with Hadoop using the Movielens dataset. Key aspects covered include ingestion of data from various sources using tools like Flume and Sqoop, data modeling and storage formats in Hadoop, processing the data using tools like Hive and Spark, and exporting results to a data warehouse.
Data Warehousing Trends, Best Practices, and Future Outlook – James Serra
Over the last decade, the 3Vs of data - Volume, Velocity & Variety - have grown massively. The Big Data revolution has completely changed the way companies collect, analyze & store data. Advancements in cloud-based data warehousing technologies have empowered companies to fully leverage big data without heavy investments in terms of both time and resources. But that doesn’t mean building and managing a cloud data warehouse isn’t accompanied by challenges. From deciding on a service provider to designing the architecture, deploying a data warehouse tailored to your business needs is a strenuous undertaking. Looking to deploy a data warehouse to scale your company’s data infrastructure, or still on the fence? In this presentation you will gain insights into the current Data Warehousing trends, best practices, and future outlook. Learn how to build your data warehouse with the help of real-life use-cases and a discussion of commonly faced challenges. In this session you will learn:
- Choosing the best solution - Data Lake vs. Data Warehouse vs. Data Mart
- Choosing the best Data Warehouse design methodologies: Data Vault vs. Kimball vs. Inmon
- Step by step approach to building an effective data warehouse architecture
- Common reasons for the failure of data warehouse implementations and how to avoid them
Presentation on Data Mesh: the paradigm shift is a new type of ecosystem architecture - a shift left towards a modern distributed architecture that treats domain-specific data as a product (“data-as-a-product”), enabling each domain to handle its own data pipelines.
This document describes Dynatrace's full-stack application monitoring solution. It can automatically monitor entire application stacks from the user experience down to code level. Dynatrace provides a unified real-time model called Smartscape that maps out the entire environment and all transaction dependencies. It also uses artificial intelligence for anomaly detection since environmental complexity is too much for humans to fully analyze. Deployments can be either SaaS or on-premises to provide flexibility. Dynatrace can monitor dynamic container environments across all major platforms.
The document summarizes Apache Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It describes the key components of Hadoop including the Hadoop Distributed File System (HDFS) which stores data reliably across commodity hardware, and the MapReduce programming model which allows distributed processing of large datasets in parallel. The document provides an overview of HDFS architecture, data flow, fault tolerance, and other aspects to enable reliable storage and access of very large files across clusters.
This document discusses how Rabobank, a Dutch bank, is applying network analytics to enhance its know your customer (KYC) and anti-money laundering (AML) processes. It describes building a graph model with 250 million nodes and 1 billion relations from customer data. Network features like risk triangles and communities are generated and used to identify and rank potentially risky customer cases for AML experts to review. Initial results were promising and a follow-up project was started to further develop ethical network analytics for KYC/AML monitoring.
Behind the Buzzword: Understanding Customer Data Platforms in the Light of Pr... – Rising Media Ltd.
Customer Data Platform (CDP) systems are the newest answer to an old question: how to assemble a complete view of each customer. This session explores the reality of what CDPs can and cannot do, how CDPs differ from other systems, the types of CDP systems available, and how to find the right CDP for your purpose, especially with regard to data science projects and predictive modeling. You will come away with a clear understanding of where CDP fits into the larger data management landscape, what distinguishes CDP from older approaches to customer data management, and the state of the CDP industry in Europe.
Choosing the Right Graph Database to Succeed in Your Project – Ontotext
The document discusses choosing the right graph database for projects. It describes Ontotext, a provider of graph database and semantic technology products. It outlines use cases for graph databases in areas like knowledge graphs, content management, and recommendations. The document then examines Ontotext's GraphDB semantic graph database product and how it can address key use cases. It provides guidance on choosing a GraphDB option based on project stage from learning to production.
Big data is generated from a variety of sources like web data, purchases, social networks, sensors, and IoT devices. Telecom companies process exabytes of data daily, including call detail records, network configuration data, and customer information. This big data is analyzed to enhance customer experience through personalization, predict churn, and optimize networks. Analytics also helps with operations, data monetization through services, and identifying new revenue streams from IoT and M2M data. Frameworks like Hadoop and MapReduce are used to analyze this big data across clusters in a distributed manner for faster insights.
The document outlines an agenda for a workshop on building a graph solution using a digital twin data set. It includes sections on logistics, introductions, explaining the use case of a digital twin for a rail network, modeling the graph database solution, building the solution, and a question and answer period. Key aspects covered include an overview of Neo4j's graph database capabilities, modeling the domain entities and relationships, and exploring sample data related to operational points, sections, and points of interest for a rail network digital twin use case.
This slide deck was shared by Mr. Minh Tran, KMS's Software Architect, at the "Java-Trends and Career Opportunities" seminar of the Information Technology Center of HCMC University of Science.
The document summarizes the transition of a company's reporting processes from MySQL to Hadoop and Hive. It discusses moving batch reporting jobs from MySQL to running MapReduce jobs on Hadoop to generate reports from log files stored in HDFS. It also notes some lessons learned, such as using Hive for SQL-like queries and Tableau for visualization. The cluster used for these reporting processes consisted of 22 nodes handling 11 reporting jobs processing 1TB of data and 5GB daily.
The document provides an overview of Hadoop and the Hadoop ecosystem. It discusses the history of Hadoop, how big data is defined in terms of volume, velocity, variety and veracity. It then explains what Hadoop is, the core components of HDFS and MapReduce, how Hadoop is used for distributed processing of large datasets, and how Hadoop compares to traditional RDBMS. The document also outlines other tools in the Hadoop ecosystem like Pig, Hive, HBase and gives a brief demo.
This document provides an introduction and overview of Apache Hive, including what it is, its architecture and components, how it is used in production, and performance considerations. Hive is an open source data warehouse system for Hadoop that allows users to query data using SQL-like language and scales to petabytes of data. It works by compiling queries into a directed acyclic graph of MapReduce jobs for execution. The document outlines Hive's architecture, components like the metastore and Thrift server, and how organizations use it for log processing, data mining and business intelligence tasks.
This document provides an introduction and overview of Apache Hive. It discusses how Hive originated at Facebook to manage large amounts of data stored in Oracle databases. It then defines what Hive is, how it works by compiling SQL queries into MapReduce jobs, and its architecture. Key components of Hive like its data model, metastore, and commands for creating tables and loading data are summarized.
Hortonworks Technical Workshop: Interactive Query with Apache Hive – Hortonworks
Apache Hive is the defacto standard for SQL queries over petabytes of data in Hadoop. It is a comprehensive and compliant engine that offers the broadest range of SQL semantics for Hadoop, providing a powerful set of tools for analysts and developers to access Hadoop data. The session will cover the latest advancements in Hive and provide practical tips for maximizing Hive Performance.
Audience: Developers, Architects and System Engineers from the Hortonworks Technology Partner community.
Recording: https://hortonworks.webex.com/hortonworks/lsr.php?RCID=7c8f800cbbef256680db14c78b871f97
The document describes Apache Hive hooks, which allow intercepting function calls or events during query execution in Hive. It provides details on the different hook points in Hive, including pre-execution, post-execution, and failure hooks. It also explains how to configure hooks by setting hook properties and the jar paths for hook implementations. Finally, it outlines the interfaces and contexts provided to hooks at each stage of query processing in Hive.
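For context, hooks like these are wired in through Hive configuration properties; a minimal sketch, assuming a pre-execution hook class of your own (the class name here is hypothetical):
hive> SET hive.exec.pre.hooks=com.example.MyPreHook;
The jar containing the hook class must be available to Hive, typically via the auxiliary jars path.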
Hortonworks tech workshop: in-memory processing with Spark – Hortonworks
Apache Spark offers unique in-memory capabilities and is well suited to a wide variety of data processing workloads including machine learning and micro-batch processing. With HDP 2.2, Apache Spark is a fully supported component of the Hortonworks Data Platform. In this session we will cover the key fundamentals of Apache Spark and operational best practices for executing Spark jobs along with the rest of Big Data workloads. We will also provide a working example to showcase micro-batch and machine learning processing using Apache Spark.
This document provides an introduction to Apache Hive, including:
- What Apache Hive is and its key features like SQL support and rich data types
- An overview of Hive's architecture and how it works within the Hadoop ecosystem
- Where Hive is useful, such as for log processing, and not useful, like for online transactions
- Examples of companies that use Hive
- An introduction to the Hive Query Language (HQL) with examples of creating tables, loading data, queries, and more.
Apache Hive is a data warehouse infrastructure built on top of Hadoop. It allows users to query large datasets stored in Hadoop file systems using a SQL-like language called HiveQL. Hive converts queries into a series of MapReduce jobs that are executed on Hadoop. It stores table data and partitions in HDFS directories with table metadata stored separately. The Hive CLI provides an interface for users to issue HiveQL queries and manage tables, databases and partitions.
Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. It presents a SQL-like interface for querying data stored in various databases and file systems that integrate with Hadoop. The document provides links to Hive documentation, tutorials, presentations and other resources for learning about and using Hive. It also includes a table describing common Hive CLI commands and their usage.
This document provides an introduction to Apache Hive, including:
- Hive allows for data warehousing and analysis of large datasets stored in Hadoop through use of the HiveQL query language, which is automatically translated to MapReduce jobs.
- Key advantages of Hive include its higher-level query language, which simplifies working with large data, and a lower learning curve compared to Pig or MapReduce. However, updating data can be complicated due to HDFS's write-once storage model, and Hive has high query latency.
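One way to see the translation described above is Hive's EXPLAIN command, which prints the plan of MapReduce stages generated for a query (the table name here is illustrative):
hive> EXPLAIN SELECT count(*) FROM customer;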
This document provides an introduction to Hadoop, including its ecosystem, architecture, key components like HDFS and MapReduce, characteristics, and popular flavors. Hadoop is an open source framework that efficiently processes large volumes of data across clusters of commodity hardware. It consists of HDFS for storage and MapReduce as a programming model for distributed processing. A Hadoop cluster typically has a single namenode and multiple datanodes. Many large companies use Hadoop to analyze massive datasets.
This document summarizes the social learning theories of Cornell Montgomery and Albert Bandura. Montgomery proposed that social learning occurs in four stages: close contact, imitation of superiors, understanding of concepts, and behavior. Bandura studied how individuals imitate behaviors that are reinforced or abandoned depending on their outcomes. The author observed children in a classroom imitating the behaviors of other children and of the teacher, which confirms that they imitate those in their social circle.
AshokaHub - A cloud-based social networking platform using Ruby on Rails – Neev Technologies
Neev built a cloud-based web application that has emerged as one of the world’s largest social networking platforms for social entrepreneurs to connect, discuss, share, innovate and help each other. Catering to a global audience, the application supports 12 languages. The social platform has an in-built search feature that allows any profile or discussion to be searched based on tags, relevance, type of activity, etc.
China port and harbor industry market forecast and investment strategy report... – Qianzhan Intelligence
This document provides an overview and analysis of China's port and harbor industry from 2011-2017. It discusses the development environment, status, construction, operation, and regional development of the industry. It also analyzes international port development, market competition within China, and the competitiveness and future patterns of container ports. The report aims to help industry players understand trends, seize opportunities, and make strategic decisions.
If "digital" were a "philosophy" instead of just the online channel, then we can use the insights gathered from digital channels to make ALL marketing and advertising better.
China small appliance industry production and marketing demand and investment... – Qianzhan Intelligence
This document provides a summary of the China Small Appliance Industry Production and Marketing Demand and Investment Forecast Report for 2013-2017. It discusses the development status and trends of China's small appliance industry. Some key points include:
- China has become one of the most important bases for small appliance production in the world. The production and market size of China's small appliance industry has grown significantly from 2011-2015.
- The small appliance market is experiencing steady growth driven by increasing household income and consumption in China as well as the country's strong production capabilities. Rural small appliance markets are also beginning to take off.
- Competition in the industry is intense with over 5,000 companies but less than 100 major
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2xkCd84
This CloudxLab Introduction to Hive tutorial helps you to understand Hive in detail. Below are the topics covered in this tutorial:
1) Hive Introduction
2) Why Do We Need Hive?
3) Hive - Components
4) Hive - Limitations
5) Hive - Data Types
6) Hive - Metastore
7) Hive - Warehouse
8) Accessing Hive using Command Line
9) Accessing Hive using Hue
10) Tables in Hive - Managed and External
11) Hive - Loading Data From Local Directory
12) Hive - Loading Data From HDFS
13) S3 Based External Tables in Hive
14) Hive - Select Statements
15) Hive - Aggregations
16) Saving Data in Hive
17) Hive Tables - DDL - ALTER
18) Partitions in Hive
19) Views in Hive
20) Load JSON Data
21) Sorting & Distributing - Order By, Sort By, Distribute By, Cluster By
22) Bucketing in Hive
23) Hive - ORC Files
24) Connecting to Tableau using Hive
25) Analyzing MovieLens Data using Hive
26) Hands-on demos on CloudxLab
Hive is a data warehousing infrastructure based on Hadoop. Hadoop provides massive scale out and fault tolerance capabilities for data storage and processing (using the map-reduce programming paradigm) on commodity hardware.
Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called Hive QL, which is based on SQL and which enables users familiar with SQL to do ad-hoc querying, summarization and data analysis easily. At the same time, Hive QL also allows traditional map/reduce programmers to be able to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.
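As an illustration of that plug-in point, HiveQL's TRANSFORM clause streams rows through an external user script acting as a custom mapper; a minimal sketch, where the script name and columns are hypothetical (the script would first be shipped to the cluster with ADD FILE):
hive> SELECT TRANSFORM (name, age) USING 'python my_mapper.py'
AS (name_out, age_out) FROM customer;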
PolyBase allows SQL Server 2016 to query data residing in Hadoop and Azure Blob Storage. It provides a unified query experience using T-SQL. To use PolyBase, you configure external data sources and file formats, create external tables, then run T-SQL queries against those tables. The PolyBase engine handles distributing parts of the query to Hadoop for parallel processing when possible for improved performance. Monitoring DMVs help troubleshoot and tune PolyBase queries.
The document introduces the Windows Azure HDInsight Service, which provides a managed Hadoop service on Windows Azure. It discusses big data and Hadoop, describes the components included in HDInsight like HDFS, MapReduce, Pig and Hive. It provides examples of using Pig, Hive and Sqoop with HDInsight and explains how HDInsight is administered through the management portal.
This document provides an overview of Hive, including:
- What Hive is and how it enables SQL-like querying of data stored in HDFS folders
- The key components of Hive's architecture like the metastore, optimizer, and executor
- How Hive queries are compiled and executed using frameworks like MapReduce, Tez, and Spark
- A comparison of Hive to traditional RDBMS systems and how they differ
- Steps for getting started with Hive including loading sample data and creating Hive projects
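The execution framework mentioned in the overview can be chosen per session through a configuration property; a minimal sketch, assuming Tez is installed on the cluster:
hive> SET hive.execution.engine=tez;
Setting the property to mr or spark instead runs the same queries on MapReduce or Spark, where those engines are configured.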
This document provides an overview of a Hadoop workshop presented by Chris Harris. It discusses core Hadoop technologies like HDFS, MapReduce, Pig, Hive, and HCatalog. It explains what these technologies are used for, how they work, and provides examples of commands and usage. The goal is to help attendees understand the essential components of the Hadoop ecosystem and how they can access and analyze large datasets.
Hadoop is an open-source framework for storing and processing large datasets in a distributed computing environment. It allows for the storage and analysis of datasets that are too large for single servers. The document discusses several key Hadoop components including HDFS for storage, MapReduce for processing, HBase for column-oriented storage, Hive for SQL-like queries, Pig for data flows, and Sqoop for data transfer between Hadoop and relational databases. It provides examples of how each component can be used and notes that Hadoop is well-suited for large-scale batch processing of data.
Hive was introduced to allow users to run SQL-like queries on large datasets stored in Hadoop. It provides a data warehouse solution built on Hadoop that allows easy data summarization, querying, and analysis of big data stored in HDFS. Hive uses HDFS for storage but stores metadata about databases and tables in MySQL or Derby databases. It allows users to run queries using HiveQL, which is similar to SQL, without needing to write complex MapReduce programs.
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da... – MongoDB
Drawn from Think Big's experience on real-world client projects, Think Big Academy Director and Principal Architect Jeffrey Breen will review specific ways to integrate NoSQL databases into Hadoop-based Big Data systems: preserving state in otherwise stateless processes; storing pre-computed metrics and aggregates to enable interactive analytics and reporting; and building a secondary index to provide low latency, random access to data stored on the high-latency HDFS. A working example of secondary indexing is presented in which MongoDB is used to index web site visitor locations from Omniture clickstream data stored on HDFS.
This is the Day-4 lab exercise for CGI group webinar series. It primarily includes demonstrations on Hive, Analytics and other tools on the Cloudera Hadoop Platform.
Analysis of historical movie data by BHADRA – Bhadra Gowdra
A recommendation system provides the facility to understand a person's taste and automatically find new, desirable content for them based on the patterns between their likes and ratings of different items. In this paper, we have proposed a recommendation system, built using the Hadoop Framework, for the large amount of data available on the web in the form of ratings, reviews, opinions, complaints, remarks, feedback, and comments about any item (product, event, individual, or service).
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. It was originally developed by Facebook.
Laurent Leturgez discusses connecting Oracle and Hadoop to allow them to exchange data. He outlines several tools that can be used, including Sqoop for importing and exporting data between Oracle and Hadoop, Spark for running analytics on Hadoop, and various connectors like ODBC connectors and Oracle Big Data connectors. He also discusses using Oracle Big Data SQL and the Gluent Data Platform to query data across Oracle and Hadoop.
SQL Server 2016 introduces new features for business intelligence and reporting. PolyBase allows querying data across SQL Server and Hadoop using T-SQL. Integration Services has improved support for AlwaysOn availability groups and incremental package deployment. Reporting Services adds HTML5 rendering, PowerPoint export, and the ability to pin report items to Power BI dashboards. Mobile Report Publisher enables developing and publishing mobile reports.
The session covers how to get started building big data solutions in Azure. Azure provides different Hadoop clusters for the Hadoop ecosystem. The session covers the basics of HDInsight clusters, including Apache Hadoop, HBase, Storm and Spark, and how to integrate with HDInsight in .NET using different Hadoop integration frameworks and libraries. The session is a jump start for engineers and DBAs with RDBMS experience who are looking to start working with and developing Hadoop solutions. It is demo driven and covers the basics of the Hadoop open source products.
Hadoop is an open source software project that allows distributed processing of large datasets across computer clusters. It was developed based on research from Google and has two main components - the Hadoop Distributed File System (HDFS) which reliably stores data in a distributed manner, and MapReduce which allows parallel processing of this data. Hadoop is scalable, cost effective, and fault tolerant for processing terabytes of data on commodity hardware. It is commonly used for batch processing of large unstructured datasets.
This presentation gives an overview of the Apache Airavata project. It explains Apache Airavata in terms of its architecture, data models and user interface.
Links for further information and connecting
http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
https://nz.linkedin.com/pub/mike-frampton/20/630/385
https://open-source-systems.blogspot.com/
This presentation gives an overview of the Apache MADlib AI/ML project. It explains Apache MADlib in terms of its functionality, its architecture and dependencies, and also gives an SQL example.
Links for further information and connecting
http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
https://nz.linkedin.com/pub/mike-frampton/20/630/385
https://open-source-systems.blogspot.com/
This presentation gives an overview of the Apache MXNet AI project. It explains Apache MXNet in terms of its architecture, ecosystem, languages and the generic problems that the architecture attempts to solve.
Links for further information and connecting
http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
https://nz.linkedin.com/pub/mike-frampton/20/630/385
https://open-source-systems.blogspot.com/
This presentation gives an overview of the Apache Gobblin project. It explains Apache Gobblin in terms of its architecture, data sources/sinks and its work unit processing.
Links for further information and connecting
http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
https://nz.linkedin.com/pub/mike-frampton/20/630/385
https://open-source-systems.blogspot.com/
This presentation gives an overview of the Apache Singa AI project. It explains Apache Singa in terms of its architecture, distributed training and functionality.
Links for further information and connecting
http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
https://nz.linkedin.com/pub/mike-frampton/20/630/385
https://open-source-systems.blogspot.com/
This presentation gives an overview of the Apache Ranger project. It explains Apache Ranger in terms of its architecture, security, audit and plugin features.
Links for further information and connecting
http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
https://nz.linkedin.com/pub/mike-frampton/20/630/385
https://open-source-systems.blogspot.com/
This presentation gives an overview of the OrientDB database project. It explains OrientDB in terms of its functionality, its indexing and architecture. It examines the ETL functionality as well as the UI available.
Links for further information and connecting
http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
https://nz.linkedin.com/pub/mike-frampton/20/630/385
https://open-source-systems.blogspot.com/
This presentation gives an overview of the Prometheus project. It explains Prometheus in terms of its visualisation, time series processing capabilities and architecture. It also examines its query language, PromQL.
Links for further information and connecting
http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
https://nz.linkedin.com/pub/mike-frampton/20/630/385
https://open-source-systems.blogspot.com/
This presentation gives an overview of the Apache Tephra project. It explains Tephra in terms of Phoenix, HBase and HDFS. It examines the project architecture and configuration.
Links for further information and connecting
http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
https://nz.linkedin.com/pub/mike-frampton/20/630/385
https://open-source-systems.blogspot.com/
Kudu is an open source column-oriented data store that integrates with the Hadoop ecosystem to provide fast processing of online analytical processing (OLAP) workloads. It scales to large datasets and clusters, with a master-tablet server architecture providing fault tolerance and high availability. Kudu uses a columnar storage format and supports various column types, configurations, and partitioning strategies to optimize performance and distribution of data and loads.
Apache Bahir provides streaming connectors and SQL data sources for Apache Spark and Apache Flink in a centralized location. It contains connectors for ActiveMQ, Akka, Flume, InfluxDB, Kudu, Netty, Redis, CouchDB, Cloudant, MQTT, and Twitter. Bahir is an important project because it enables reuse of extensions and saves time and money compared to recreating connectors. Though small, it covers multiple Spark and Flink extensions with the potential for future extensions. The project is currently active with regular updates to the GitHub repository and comprehensive documentation for its connectors.
This presentation gives an overview of the Apache Arrow project. It explains the Arrow project in terms of its in-memory structure, its purpose, language interfaces and supporting projects.
Links for further information and connecting
http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
https://nz.linkedin.com/pub/mike-frampton/20/630/385
https://open-source-systems.blogspot.com/
This presentation gives an overview of the JanusGraph DB project. It explains the JanusGraph database in terms of its architecture, storage backends, capabilities and community.
Links for further information and connecting
http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
https://nz.linkedin.com/pub/mike-frampton/20/630/385
https://open-source-systems.blogspot.com/
This presentation gives an overview of the Apache Ignite project. It explains Ignite in relation to its architecture, scalability, caching, datagrid and machine learning abilities.
Links for further information and connecting
http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
https://nz.linkedin.com/pub/mike-frampton/20/630/385
https://open-source-systems.blogspot.com/
This presentation gives an overview of the Apache Samza project. It explains Samza's stream processing capabilities as well as its architecture, users, use cases etc.
Links for further information and connecting
http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
https://nz.linkedin.com/pub/mike-frampton/20/630/385
https://open-source-systems.blogspot.com/
This presentation gives an overview of the Apache Flink project. It explains Flink in terms of its architecture, use cases and the manner in which it works.
Links for further information and connecting
http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
https://nz.linkedin.com/pub/mike-frampton/20/630/385
https://open-source-systems.blogspot.com/
Apache Edgent is an open source programming model and runtime for analyzing data and events at edge devices. It allows processing data at the edge to save money by only sending essential data from devices. Edgent provides connectors for various data sources and sinks and can be used for IoT, embedded in application servers, and for monitoring machines. The edge refers to devices, gateways, and sensors at the network boundary that provide potential data. Edgent applications follow a common structure of getting a provider, creating a topology, composing processing graphs, and submitting it for execution.
CouchDB is an open-source document-oriented NoSQL database that stores data in JSON format. It provides ACID support through multi-version concurrency control and a crash-only design that ensures data integrity even if the database or servers crash. CouchDB supports single node or clustered deployments and uses bidirectional replication to synchronize data across nodes. It prioritizes availability and partition tolerance according to the CAP theorem.
Apache Mesos is a cluster manager that provides resource sharing and isolation. It allows multiple distributed systems like Hadoop, Spark, and Storm to run on the same pool of nodes. Mesos introduces resource sharing to improve cluster utilization and application performance. It uses a master/slave architecture with fault tolerance and has APIs for developers in C++, Java, and Python.
Pentaho is an open-source business intelligence system that offers analytics, visual data integration, OLAP, reports, dashboards, data mining, and ETL capabilities. It includes both a server and client components, which are available for Windows, Linux, and Mac OSX. The server provides analytics, dashboarding, reporting, and data access services, while the client offers data integration, big data support, report design, data mining, metadata management, and other tools. Pentaho also has an extensive library of plugins and supports visual drag-and-drop development of ETL jobs and integration with Hadoop for big data analytics.
An introduction to Apache Hadoop Hive
1. Apache Hadoop Hive
● What is it?
● Architecture
● Related Projects
● Hive DDL
● Hive DML
● HiveQL Examples
● Business Intelligence
2. Hive – What is it?
● A data warehouse for Hadoop
● Open source, written in Java
● Holds metadata in a relational database
● Allows SQL-like queries
● Supports “big data” data sets
● Offers built-in and user-defined functions
● Has indexing
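● A first taste of those SQL-like queries and built-in functions; a minimal sketch, where the customer table and its columns are illustrative (they reappear in the examples later in this deck):
hive> SELECT name, upper(name) FROM customer LIMIT 5;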
4. Hive – Architecture
● Given an existing HDFS and Hadoop cluster
● Then add Hive and the metadata structure
● Use Flume and Sqoop to move data
● Use Hive LOAD DATA command to load from flat files
● Use ODBC for connectivity to your BI layer
5. Hive – Related Projects
● Apache Flume – move large data sets to Hadoop
● Apache Sqoop – command-line tool to move RDBMS data to Hadoop
● Apache HBase – non-relational database
● Apache Pig – analyse large data sets
● Apache Oozie – workflow scheduler
● Apache Mahout – machine learning and data mining
● Apache Hue – Hadoop user interface
● Apache ZooKeeper – distributed configuration and coordination
7. Hive - DDL
● Alter table
hive> ALTER TABLE customer ADD COLUMNS (age INT);
● Drop table
hive> DROP TABLE customer;
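● These DDL examples assume the customer table already exists; a minimal sketch of creating it, with illustrative column names matching those used elsewhere in this deck:
hive> CREATE TABLE customer (name STRING, sdate STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';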
8. Hive - DML
● Loading flat files into Hive
hive> LOAD DATA LOCAL INPATH './data/home/x1a.txt' OVERWRITE
INTO TABLE customer;
● No verification of incoming data – Hive applies the schema on read
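● For files already in HDFS, drop the LOCAL keyword; without OVERWRITE, the rows are appended (the path here is illustrative):
hive> LOAD DATA INPATH '/data/incoming/x1b.txt' INTO TABLE customer;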
9. HiveQL Examples
● HiveQL, an SQL-like language
hive> SELECT a.age FROM customer a WHERE a.sdate = '2008-08-15';
selects the age column for a single date from the table but doesn't store the result
hive> INSERT OVERWRITE DIRECTORY '/data/hdfs_file'
SELECT a.* FROM customer a WHERE a.sdate='2008-08-15';
writes the selected customer rows to an HDFS directory
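● HiveQL also covers aggregation and grouping; for example, counting rows per date (same illustrative table as above):
hive> SELECT a.sdate, count(*) FROM customer a GROUP BY a.sdate;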
10. Hive – Business Intelligence
● Use ODBC to connect Hive to your BI layer
● Now you can use BI tools like Business Objects
– Create a universe over the Hive instance
– Create reports against the universe
– Create ad hoc queries against the universe
11. Contact Us
● Feel free to contact us at
– www.semtech-solutions.co.nz
– [email protected]
● We offer IT project consultancy
● We are happy to hear about your problems
● You can pay for just those hours that you need to solve your problems