Apache Hadoop: A Natural Choice for Data-Intensive, Multi-Format Log Processing
Date: 22nd April 2011
Authored and Compiled By: Hitendra Kumar
Hadoop Framework: A Brief Background

A framework that can be installed on a commodity Linux cluster to permit large-scale distributed data analysis. The initial version was created in 2004 by Doug Cutting, and the project has since attracted a broad and rapidly growing user community. Hadoop provides the robust, fault-tolerant Hadoop Distributed File System (HDFS), inspired by Google's file system, as well as a Java-based API that allows parallel processing across the nodes of the cluster using the Map-Reduce paradigm, allowing:
- Distributed processing of large data sets
- Pluggable user code running in a generic framework

Code written in other languages, such as Python and C, can be used through Hadoop Streaming, a utility that lets users create and run jobs with any executables as the mapper and/or the reducer. Hadoop comes with Job and Task Trackers that keep track of program execution across the nodes of the cluster.

A natural choice for:
- Data-intensive log processing
- Web search indexing
- Ad-hoc queries
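As a minimal sketch of the Streaming model just described, the script below implements a word-count mapper and reducer as one executable. The file name, job invocation in the docstring, and the map/reduce role argument are all illustrative, not a prescribed layout:

```python
#!/usr/bin/env python
"""Word-count mapper and reducer for use with Hadoop Streaming.

Illustrative invocation (jar path and HDFS paths are placeholders):
  hadoop jar hadoop-streaming.jar \
      -mapper "wordcount.py map" -reducer "wordcount.py reduce" \
      -input logs/ -output counts/
"""
import sys
from itertools import groupby


def mapper(lines):
    # Map step: emit one tab-separated "word<TAB>1" record per word.
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word


def reducer(lines):
    # Reduce step: Hadoop delivers records sorted by key, so
    # consecutive lines with the same word can be summed with groupby.
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))


if __name__ == "__main__" and len(sys.argv) > 1:
    step = mapper if sys.argv[1] == "map" else reducer
    for record in step(sys.stdin):
        print(record)
```

Because Streaming only contracts on stdin/stdout lines, the same pair of functions works unchanged whether Hadoop runs them on one node or a thousand.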
Hadoop Framework: Leveraging Hadoop over RDBMS

Accelerating nightly batch business processes. Since Hadoop scales linearly, internal or external on-demand cloud farms can dynamically handle shrinking performance windows and take on volumes that an RDBMS just can't easily deal with.

Storage of extremely high volumes of enterprise data. The Hadoop Distributed File System is a marvel in itself and can hold extremely large data sets safely, long term, on commodity hardware, data that otherwise couldn't be stored or handled easily in a relational database. HDFS creates a natural, reliable, and easy-to-use backup environment for almost any amount of data at reasonable prices, considering that it's essentially a high-speed online data storage environment.

Improving the scalability of applications. Very low-cost commodity hardware can power Hadoop clusters, since redundancy and fault tolerance are built into the software, instead of relying on expensive enterprise hardware or proprietary software alternatives.

Use of Java for data processing instead of SQL. Hadoop is a Java platform and can be used by just about anyone fluent in the language (other language options are becoming available via APIs).

Producing just-in-time feeds for dashboards and business intelligence.

Handling urgent, ad hoc requests for data. While certainly expensive enterprise data warehousing software can do this, Hadoop is a strong performer when it comes to quickly asking and answering urgent questions involving extremely large datasets.

Turning unstructured data into relational data. While ETL tools and bulk-load applications work well with smaller datasets, few can approach the data volume and performance that Hadoop can.

Taking on tasks that require massive parallelism. Hadoop has been known to scale out to thousands of nodes in production environments.

Moving existing algorithms, code, frameworks, and components to a highly distributed computing environment.
Hadoop Processing: How It Works

[Diagram] High-volume enterprise data (XML, logs, CSV, SQL, objects/JSON, binary) flows into the Hadoop Distributed File System (HDFS) running on a commodity server cloud (scale-out). Map tasks are created over the data and a Map-Reduce process runs; results are imported into an RDBMS and consumed by reporting, dashboards, and BI applications.
Hadoop Processing: Map-Reduce Algorithm

Automatic and efficient parallelization/distribution. MapReduce is extremely popular for analyzing large datasets in cluster environments; its success stems from hiding the details of parallelization, fault tolerance, and load balancing behind a simple programming framework.

Widely accepted by the community, MapReduce is preferable over a parallel RDBMS for log processing. Big Web 2.0 companies such as Facebook, Yahoo and Google are examples. Traditional enterprise RDBMS customers, such as JP Morgan Chase, VISA, The New York Times and China Mobile, have started investigating and embracing MapReduce. More than 80 companies and organizations are listed as users of Hadoop for data analytics, log and event processing, and similar workloads. IBM has engaged with a number of enterprise customers to prototype novel Hadoop-based solutions over massive amounts of structured and unstructured data for their business analytics applications.

Why MapReduce suits log processing: China Mobile gathers 5-8 TB of call records per day; at Facebook, almost 6 TB of new log data is collected every day, with 1.7 PB of log data accumulated over time. First, just formatting and loading that much data into a parallel RDBMS in a timely manner is a challenge. Second, log records do not always follow the same schema, which makes the lack of a rigid schema in MapReduce a feature rather than a shortcoming. Third, all the log records within a time period are typically analyzed together, making simple scans preferable to index scans. Fourth, log processing can be very time consuming, so it is important to keep the analysis job going even in the event of failures. Finally, joining log data with all kinds of reference data in MapReduce has emerged as an important part of analytic operations for enterprise customers as well as Web 2.0 companies.
Hadoop Processing: Map-Reduce Algorithm (continued)
There are separate Map and Reduce steps, each done in parallel, each operating on sets of key-value pairs. Program execution is divided into a Map and a Reduce stage, separated by data transfer between nodes in the cluster, giving this workflow: Input -> Map() -> Copy()/Sort() -> Reduce() -> Output.

In the first stage, a node executes a Map function on a section of the input data. Map output is a set of records in the form of key-value pairs, stored on that node. The records for any given key (possibly spread across many nodes) are aggregated at the node running the Reducer for that key. This involves data transfer between machines. The second (Reduce) stage is blocked from progressing until all the data from the Map stage has been transferred to the appropriate machine. The Reduce stage produces another set of key-value pairs as final output.

This is a simple programming model, restricted to key-value pairs, but a surprising number of tasks and algorithms fit into this framework. Also, while Hadoop is currently used primarily for batch analysis of very large data sets, nothing precludes its use for computationally intensive analyses, e.g., the Mahout machine learning project described below.
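The Input -> Map -> Copy/Sort -> Reduce workflow above can be traced in a single-process sketch. This is illustrative only; real Hadoop distributes each stage across cluster nodes, and the log format here (host, method, bytes) is invented:

```python
"""Single-process trace of the Input -> Map -> Copy/Sort -> Reduce
workflow, summing bytes transferred per host from toy log lines."""
from collections import defaultdict


def map_fn(line):
    # Map stage: one log line -> (host, bytes-transferred) pair.
    host, _method, size = line.split()
    yield host, int(size)


def reduce_fn(host, sizes):
    # Reduce stage: aggregate every value observed for one key.
    yield host, sum(sizes)


def run_job(lines):
    # Shuffle: group intermediate pairs by key, standing in for the
    # Copy()/Sort() transfer between cluster nodes.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    results = {}
    for key in sorted(groups):          # reducers see keys in sorted order
        for out_key, out_value in reduce_fn(key, groups[key]):
            results[out_key] = out_value
    return results


log = ["10.0.0.1 GET 512", "10.0.0.2 GET 128", "10.0.0.1 POST 64"]
print(run_job(log))   # {'10.0.0.1': 576, '10.0.0.2': 128}
```

The blocking behavior described above corresponds to the fact that `run_job` cannot start any `reduce_fn` call until every `map_fn` output has been placed into its group.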
Hadoop Processing: Components

HDFS: Hadoop Distributed File System.
HBase: modeled on Google's BigTable database; adds a distributed, fault-tolerant, scalable database built on top of the HDFS file system.
Hive: data-flow language and data warehouse framework on top of Hadoop.
Pig: high-level data-flow language (Pig Latin) and execution framework whose compiler produces sequences of Map/Reduce programs.
ZooKeeper: a distributed, highly available coordination service, providing primitives such as distributed locks that can be used for building distributed applications.
Sqoop: a tool for efficiently moving data between relational databases and HDFS.
Hadoop Processing: HDFS File System

There are some drawbacks to HDFS use. HDFS handles continuous updates (write-many workloads) less well than a traditional relational database management system. Also, HDFS cannot be mounted directly onto the existing operating system, so getting data into and out of the HDFS file system can be awkward.

In addition to Hadoop itself, there are multiple open source projects built on top of Hadoop. Major projects, described below: Hive, Pig, Cascading, HBase.
Hadoop Processing: Hive Framework and Hive QL

Hive is a data warehouse framework built on top of Hadoop. Developed at Facebook, it is used for ad hoc querying with an SQL-type query language, and also for more complex analysis. Users define tables and columns; data is loaded into and retrieved through these tables. Hive QL, an SQL-like query language, is used to create summaries, reports, and analyses. Hive queries launch MapReduce jobs. Hive is designed for batch processing, not online transaction processing; unlike HBase (see below), Hive does not offer real-time queries.
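To make "Hive queries launch MapReduce jobs" concrete, the sketch below shows, conceptually, how a Hive QL aggregate decomposes into a map function keyed on the GROUP BY column and a reduce function applying the aggregate. The table and column names are invented for illustration, and this is a simplification of Hive's actual query planner:

```python
"""Conceptual decomposition of a Hive QL aggregate into MapReduce.

The illustrative query:
    SELECT status, COUNT(*) FROM web_logs GROUP BY status;
"""
from collections import defaultdict


def map_row(row):
    # Emit the GROUP BY column as key, 1 as the COUNT(*) contribution.
    yield row["status"], 1


def reduce_group(status, counts):
    # Apply the aggregate over all contributions for one group.
    yield status, sum(counts)


def execute(rows):
    groups = defaultdict(list)
    for row in rows:
        for key, value in map_row(row):
            groups[key].append(value)
    return dict(result
                for key in groups
                for result in reduce_group(key, groups[key]))


web_logs = [{"status": 200}, {"status": 404}, {"status": 200}]
print(execute(web_logs))   # {200: 2, 404: 1}
```

The batch-oriented nature of Hive noted above follows directly from this shape: every query pays the cost of a full map-shuffle-reduce pass rather than an indexed lookup.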
Hadoop Processing: Hive, Why?

Needed where a multi-petabyte warehouse is required.
Files are insufficient data abstractions; tables, schemas, partitions, and indices are needed.
SQL is highly popular.
Need for an open data format (RDBMSs have closed data formats) with a flexible schema.
Hive is a Hadoop subproject!
Hadoop Processing: Pig, a High-Level Data-Flow Language

Pig is a high-level data-flow language (Pig Latin) and execution framework whose compiler produces sequences of Map/Reduce programs for execution within Hadoop. Pig is designed for batch processing of data. Pig's infrastructure layer consists of a compiler that turns (relatively short) Pig Latin programs into sequences of MapReduce programs. Pig is a Java client-side application that users install locally; nothing is altered on the Hadoop cluster itself. Grunt is the Pig interactive shell.
Hadoop Processing: Mahout, Extensions to Hadoop Programming

Hadoop is not just for large-scale data processing. Mahout is an Apache project for building scalable machine learning libraries, with most algorithms built on top of Hadoop. Current algorithm focus areas of Mahout are clustering, classification, data mining (frequent itemsets), and evolutionary programming. Mahout clustering and classifier algorithms have direct relevance in bioinformatics, for example for clustering large gene expression data sets and as classifiers for biomarker identification. For the growing community of Python users in bioinformatics, Pydoop, a Python MapReduce and HDFS API for Hadoop that allows complete MapReduce applications to be written in Python, is available.
Hadoop Processing: HBase, a Distributed, Fault-Tolerant and Scalable DB

HBase, modeled on Google's BigTable database, adds a distributed, fault-tolerant, scalable database built on top of the HDFS file system, with random real-time read/write access to data. Each HBase table is stored as a multidimensional sparse map, with rows and columns, each cell having a time stamp. A cell value is uniquely identified by (Table, Row, Column-Family:Column, Timestamp) -> Value.

HBase has its own Java client API, and tables in HBase can be used both as an input source and as an output target for MapReduce jobs through TableInput/TableOutputFormat. There is no HBase single point of failure. HBase uses ZooKeeper, another Hadoop subproject, for management of partial failures.

All table accesses are by the primary key. Secondary indices are possible through additional index tables; programmers need to denormalize and replicate. There is no SQL query language in base HBase; however, there is a Hive/HBase integration project that allows Hive QL statements to access HBase tables for both reading and inserting.

A table is made up of regions. Each region is defined by a startKey and endKey, may live on a different node, and is made up of several HDFS files and blocks, each of which is replicated by Hadoop. Columns can be added on the fly to tables, with only the parent column families being fixed in a schema. Each cell is tagged by column family and column name, so programs can always identify what type of data item a given cell contains.

In addition to being able to scale to petabyte-size data sets, note the ease of integrating disparate data sources into a small number of HBase tables when building a data workspace, with different columns possibly defined (on the fly) for different rows in the same table. Such facility is also important. (See the biological integration discussion below.)
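The (Row, Column-Family:Column, Timestamp) -> Value addressing scheme above can be modeled as nested sparse maps. This toy class is only a sketch of the data model, not of HBase's storage engine; the row and column names are invented:

```python
"""Toy model of HBase's cell addressing: a sparse multidimensional
map where (row, "family:column", timestamp) identifies a value."""
from collections import defaultdict


class ToyTable:
    def __init__(self):
        # {row: {"family:column": {timestamp: value}}}; sparsity falls
        # out naturally, since absent cells are simply missing keys.
        self.cells = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, timestamp):
        self.cells[row][column][timestamp] = value

    def get(self, row, column):
        # As in HBase, a plain read returns the newest version.
        versions = self.cells[row][column]
        return versions[max(versions)] if versions else None


t = ToyTable()
t.put("row1", "profile:name", "alice", timestamp=1)
t.put("row1", "profile:name", "alicia", timestamp=2)
print(t.get("row1", "profile:name"))   # alicia
```

Note how "columns can be added on the fly" costs nothing here: a new "family:column" string is just a new dictionary key, which is exactly the flexibility the slide attributes to HBase's schema.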
In addition to HBase, other scalable random access databases are now available. HadoopDB is a hybrid of MapReduce and a standard relational database system. HadoopDB uses PostgreSQL for the database layer (one PostgreSQL instance per data chunk per node), Hadoop for the communication layer, and an extended version of Hive for the translation layer.
Hadoop Processing: HadoopDB Architecture

A Database Connector that connects Hadoop with the single-node database systems.
A Data Loader that partitions data and manages parallel loading of data into the database systems.
A Catalog that tracks locations of different data chunks, including those replicated across multiple nodes.
The SQL-MapReduce-SQL (SMS) planner, which extends Hive to provide a SQL interface to HadoopDB.
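The Data Loader's job of splitting rows into per-node chunks can be sketched as a hash partitioner. The chunking key, row shape, and node count below are invented for illustration, and this ignores the replication the Catalog would also have to track:

```python
"""Hash-partitioning sketch for a loader that splits rows into
per-node chunks, as HadoopDB's Data Loader does."""
import hashlib


def node_for(key, num_nodes):
    # A stable hash (not Python's per-process hash()) so every loader
    # process agrees on where a given key lives.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes


def partition(rows, key_field, num_nodes):
    # One chunk per node; each chunk would be bulk-loaded into that
    # node's local PostgreSQL instance.
    chunks = {n: [] for n in range(num_nodes)}
    for row in rows:
        chunks[node_for(row[key_field], num_nodes)].append(row)
    return chunks


rows = [{"user": "u%d" % i, "bytes": 100 * i} for i in range(6)]
chunks = partition(rows, "user", num_nodes=3)
```

Partitioning on the join or GROUP BY key this way is what lets the SMS planner push work down to the single-node databases instead of shuffling everything through MapReduce.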
Example System (Web Portal)

Terabytes of data are populated to centralized storage and processed every weekend!
Web Portal: High-Level Architecture (using Hadoop, Solr and Lucene for back-end data processing)

Features:
- Pluggable portal components (portlets); functional aggregation and deployment as portlets
- Portlets exposed as web services: pluggable, interactive, user-facing web services; portlets deployed as independent WAR files; portlet web services can be consumed by other portals
- Integration UI to provision real-time integration with external systems via web and other channels
- Provisioning of admin features based on roles and level of access

[Diagram] The portal application (with application set-up and UI adaptation) sits on a core framework (logging, exceptions, rule engine, analytics, auditing), with an administration module (role management, monitoring, control, report configurations), a reporting/business intelligence module (analysis, metrics, trends), application integration services (application integration, portlet integration rules, data sources, business applications), a real-time integration module (JMS, MQ, JDBC channels) to external apps, security, and back-end infrastructure and business services.
Web Portal: Deployment Landscape

[Diagram] HTTP requests pass through a load balancer to the web portal servers (Apache Web Server with the Tomcat mod_jk plug-in, fronting JBoss J2EE application servers hosting the J2EE application, the portal, portal web services, and jBPM). The application servers connect over JDBC, via a sharding function, to load-balanced DB servers, with Hadoop handling the back-end processing.
Example: AOL Advertising Platform
http://www.cloudera.com/blog/2011/02/an-emerging-data-management-architectural-pattern-behind-interactive-web-application/
AOL Advertising: Business Case and Solution

AOL Advertising runs one of the largest online ad serving operations, serving billions of impressions each month to hundreds of millions of people. AOL faced three major data management challenges in building its ad serving platform:

- How to analyze billions of user-related events, presented as a mix of structured and unstructured data, to infer demographic, psychographic and behavioral characteristics that are encapsulated into hundreds of millions of "cookie profiles"
- How to make hundreds of millions of cookie profiles available to the ad targeting platform with sub-millisecond, random-read latency
- How to keep the user profiles fresh and current

The solution was to integrate two data management systems: one optimized for high-throughput data analysis (the "analytics" system), the other for low-latency random access (the "transactional" system). After analyzing alternatives, the final architecture paired Cloudera's Distribution for Apache Hadoop (CDH) with Membase.
Thank You!
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
ย 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
ย 
AI Changes Everything โ€“ Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything โ€“ Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything โ€“ Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything โ€“ Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
ย 
Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
ย 
Big Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur MorganBig Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur Morgan
Arthur Morgan
ย 
AI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global TrendsAI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global Trends
InData Labs
ย 
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
ย 
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
organizerofv
ย 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
ย 
Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025
Splunk
ย 
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptxIncreasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Anoop Ashok
ย 
Semantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AISemantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AI
artmondano
ย 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
ย 
Build Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For DevsBuild Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For Devs
Brian McKeiver
ย 
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdfSAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
Precisely
ย 
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Impelsys Inc.
ย 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
ย 
TrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business ConsultingTrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business Consulting
Trs Labs
ย 

Hadoop: A Natural Choice for Data-Intensive Log Processing

  • 1. Apache Hadoop: A Natural Choice for Data-Intensive Multiformat Log Processing. Date: 22nd April 2011. Authored and compiled by: Hitendra Kumar
  • 2. Hadoop Framework: A Brief Background. A framework that can be installed on a commodity Linux cluster to permit large-scale distributed data analysis. The initial version was created in 2004 by Doug Cutting, and the project has since built a broad and rapidly growing user community. Hadoop provides the robust, fault-tolerant Hadoop Distributed File System (HDFS), inspired by Google's file system, as well as a Java-based API that allows parallel processing across the nodes of the cluster using the Map-Reduce paradigm: distributed processing of large data sets, with pluggable user code running in a generic framework. Code written in other languages, such as Python and C, can be used through Hadoop Streaming, a utility that allows users to create and run jobs with any executables as the mapper and/or the reducer. Hadoop comes with Job and Task Trackers that keep track of program execution across the nodes of the cluster. A natural choice for: data-intensive log processing, web search indexing, and ad-hoc queries.
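The Streaming contract mentioned above is simple: the mapper reads raw input lines and writes tab-separated key/value lines, Hadoop sorts them by key, and the reducer reads the sorted lines and aggregates. A minimal word-count sketch of that contract (hypothetical code; on a real cluster the two functions would be separate scripts reading sys.stdin, passed to hadoop-streaming.jar):

```python
from itertools import groupby

def map_lines(lines):
    """Streaming mapper: emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reduce_lines(sorted_lines):
    """Streaming reducer: input arrives sorted by key, so equal keys
    are adjacent and can be summed in a single pass."""
    parsed = (line.split("\t") for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(v) for _, v in group)}"

def word_count(lines):
    # Hadoop performs the sort between the two phases; simulate it here.
    return list(reduce_lines(sorted(map_lines(lines))))
```

For example, `word_count(["to be or not to be"])` yields the counts for each word as tab-separated lines, just as a streaming job's output files would contain.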
  • 3. Hadoop Framework: Leveraging Hadoop for High Performance over RDBMS. Accelerating nightly batch business processes: since Hadoop scales linearly, internal or external on-demand cloud farms can dynamically handle shrinking performance windows and take on larger-volume situations that an RDBMS can't easily deal with. Storage of extremely high volumes of enterprise data: the Hadoop Distributed File System is a marvel in itself and can hold extremely large data sets safely on commodity hardware, long term, that otherwise couldn't be stored or handled easily in a relational database. HDFS creates a natural, reliable, and easy-to-use backup environment for almost any amount of data at a reasonable price, considering that it is essentially a high-speed online data storage environment. Improving the scalability of applications: very low-cost commodity hardware can power Hadoop clusters, since redundancy and fault tolerance are built into the software rather than into expensive proprietary enterprise hardware or software alternatives. Use of Java for data processing instead of SQL: Hadoop is a Java platform and can be used by just about anyone fluent in the language (other language options are becoming available via APIs). Producing just-in-time feeds for dashboards and business intelligence. Handling urgent, ad hoc requests for data: while expensive enterprise data warehousing software can certainly do this, Hadoop is a strong performer when it comes to quickly asking and answering urgent questions involving extremely large datasets. Turning unstructured data into relational data: while ETL tools and bulk-load applications work well with smaller datasets, few can approach the data volume and performance that Hadoop can. Taking on tasks that require massive parallelism: Hadoop has been known to scale out to thousands of nodes in production environments. Moving existing algorithms, code, frameworks, and components to a highly distributed computing environment.
  • 4. Hadoop Processing: How It Works. [Architecture diagram] Enterprise high-volume data in-flow (XML logs, CSV, SQL, objects/JSONs, binary) lands in the Hadoop Distributed File System (HDFS) on a commodity server cloud (scale out); a Map-Reduce process (map creation, then reduce) runs over it; results are consumed via RDBMS import by reporting, dashboards, and BI applications.
  • 5. Hadoop Processing: Map-Reduce Algorithm. Automatic and efficient parallelization/distribution; extremely popular for analyzing large datasets in cluster environments. Its success stems from hiding the details of parallelization, fault tolerance, and load balancing behind a simple programming framework. Widely accepted by the community: MapReduce is preferable over a parallel RDBMS for log processing. Examples: big Web 2.0 companies like Facebook, Yahoo and Google. Traditional enterprise customers of RDBMSs, such as JP Morgan Chase, VISA, The New York Times and China Mobile, have started investigating and embracing MapReduce. More than 80 companies and organizations are listed as users of Hadoop for data analytics, log-event processing, and similar workloads. IBM has engaged with a number of enterprise customers to prototype novel Hadoop-based solutions on massive amounts of structured and unstructured data for their business analytics applications. China Mobile gathers 5-8 TB of call records per day; Facebook collects almost 6 TB of new log data every day, with 1.7 PB of log data accumulated over time. First, just formatting and loading that much data into a parallel RDBMS in a timely manner is a challenge. Second, the log records do not always follow the same schema, which makes the lack of a rigid schema in MapReduce a feature rather than a shortcoming. Third, all the log records within a time period are typically analyzed together, making simple scans preferable to index scans. Fourth, log processing can be very time-consuming, so it is important to keep the analysis job going even in the event of failures. Joining log data with all kinds of reference data in MapReduce has emerged as an important part of analytic operations for enterprise customers as well as Web 2.0 companies.
  • 6. Hadoop Processing: Map-Reduce Algorithm (continued).
  • 7. Hadoop Processing: Map-Reduce Algorithm (continued). There are separate Map and Reduce steps, each done in parallel and each operating on sets of key-value pairs. Program execution is divided into a Map and a Reduce stage, separated by data transfer between nodes in the cluster, giving this workflow: Input -> Map() -> Copy()/Sort() -> Reduce() -> Output. In the first stage, a node executes a Map function on a section of the input data; Map output is a set of records in the form of key-value pairs, stored on that node. The records for any given key, possibly spread across many nodes, are aggregated at the node running the Reducer for that key; this involves data transfer between machines. The Reduce stage is blocked from progressing until all the data from the Map stage has been transferred to the appropriate machine. The Reduce stage produces another set of key-value pairs as final output. This is a simple programming model, restricted to key-value pairs, but a surprising number of tasks and algorithms fit into this framework. Also, while Hadoop is currently used primarily for batch analysis of very large data sets, nothing precludes its use for computationally intensive analyses, e.g., the Mahout machine learning project described below.
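The Input -> Map() -> Copy()/Sort() -> Reduce() -> Output workflow above can be simulated in a few lines of Python. This is an illustrative sketch only, not Hadoop itself, and the log format and hour-counting task are invented for the example:

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # Map stage: each input record yields zero or more (key, value) pairs.
    mapped = [kv for rec in records for kv in map_fn(rec)]
    # Copy/Sort stage: group all values by key, as the shuffle does when
    # routing every occurrence of a key to the node running its reducer.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce stage: one call per key over its complete list of values.
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}

# Example task: count log events per hour from 'HH:MM:SS message' lines.
def map_fn(line):
    yield line[:2], 1          # key = hour prefix of the timestamp

def reduce_fn(hour, counts):
    return sum(counts)
```

For instance, running three log lines from hours 09 and 10 through `run_mapreduce` produces a per-hour event count, mirroring how a real job's reducers each see all values for their keys.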
  • 8. Hadoop Processing Components. HDFS, the Hadoop Distributed File System. HBase, modeled on Google's BigTable database, adds a distributed, fault-tolerant, scalable database built on top of the HDFS file system. Hive, a data-flow language and data warehouse framework on top of Hadoop. Pig, a high-level data-flow language (Pig Latin) and execution framework whose compiler produces sequences of Map/Reduce programs. ZooKeeper, a distributed, highly available coordination service providing primitives such as distributed locks that can be used for building distributed applications. Sqoop, a tool for efficiently moving data between relational databases and HDFS.
  • 9. Hadoop Processing: HDFS File System. There are some drawbacks to HDFS use: HDFS handles continuous updates (write-many workloads) less well than a traditional relational database management system, and HDFS cannot be directly mounted onto an existing operating system, so getting data into and out of the HDFS file system can be awkward. In addition to Hadoop itself, there are multiple open source projects built on top of Hadoop; the major ones, described below, are Hive, Pig, Cascading, and HBase.
  • 10. Hadoop Processing: Hive Framework and Hive QL. Hive is a data warehouse framework built on top of Hadoop. Developed at Facebook, it is used for ad hoc querying with a SQL-type query language, and also for more complex analysis. Users define tables and columns; data is loaded into and retrieved through these tables. Hive QL, a SQL-like query language, is used to create summaries, reports, and analyses. Hive queries launch MapReduce jobs. Hive is designed for batch processing, not online transaction processing; unlike HBase (see below), Hive does not offer real-time queries.
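The point that Hive queries launch MapReduce jobs can be made concrete with a toy lowering: a GROUP BY aggregation becomes a map phase that emits the grouping key and a reduce phase that aggregates per key. This is a hypothetical illustration, not Hive's actual compiler:

```python
from collections import defaultdict

# Toy lowering of a Hive QL query such as
#   SELECT status, COUNT(*), SUM(bytes) FROM logs GROUP BY status
# into a map phase (emit the grouping key) and a reduce phase (aggregate).
def map_phase(rows):
    for row in rows:
        yield row["status"], row["bytes"]

def reduce_phase(pairs):
    # The shuffle would deliver each key's values to one reducer;
    # here we aggregate (count, total_bytes) per key in one pass.
    agg = defaultdict(lambda: (0, 0))
    for key, nbytes in pairs:
        count, total = agg[key]
        agg[key] = (count + 1, total + nbytes)
    return dict(agg)

def group_by_status(rows):
    return reduce_phase(map_phase(rows))
```

Each result entry maps a status value to its (row count, byte total), the same shape of answer the SQL query would return as rows.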
  • 11. Hadoop Processing: Hive, Why? Needed where a multi-petabyte warehouse is required. Files are insufficient data abstractions; tables, schemas, partitions, and indices are needed. SQL is highly popular. There is a need for an open data format with a flexible schema (RDBMSs have a closed data format). Hive is a Hadoop subproject!
  • 12. Hadoop Processing: Pig, a High-Level Data-Flow Language. Pig is a high-level data-flow language (Pig Latin) and execution framework whose compiler produces sequences of Map/Reduce programs for execution within Hadoop. Pig is designed for batch processing of data. Pig's infrastructure layer consists of a compiler that turns (relatively short) Pig Latin programs into sequences of MapReduce programs. Pig is a Java client-side application that users install locally; nothing is altered on the Hadoop cluster itself. Grunt is the Pig interactive shell.
  • 13. Hadoop Processing: Mahout, Extensions to Hadoop Programming. Hadoop is not just for large-scale data processing. Mahout is an Apache project for building scalable machine learning libraries, with most algorithms built on top of Hadoop. Current algorithm focus areas of Mahout: clustering, classification, data mining (frequent itemsets), and evolutionary programming. Mahout clustering and classifier algorithms have direct relevance in bioinformatics, for example for clustering large gene expression data sets and as classifiers for biomarker identification. For the growing community of Python users in bioinformatics, Pydoop, a Python MapReduce and HDFS API for Hadoop that allows complete MapReduce applications to be written in Python, is available.
  • 14. Hadoop Processing: HBase, a Distributed, Fault-Tolerant and Scalable DB. HBase, modeled on Google's BigTable database, adds a distributed, fault-tolerant, scalable database built on top of the HDFS file system, with random real-time read/write access to data. Each HBase table is stored as a multidimensional sparse map, with rows and columns, each cell having a timestamp. A cell value is uniquely identified by (Table, Row, Column-Family:Column, Timestamp) -> Value. HBase has its own Java client API, and tables in HBase can be used both as an input source and as an output target for MapReduce jobs through TableInputFormat/TableOutputFormat. There is no HBase single point of failure. HBase uses ZooKeeper, another Hadoop subproject, to manage partial failures. All table accesses are by the primary key; secondary indices are possible through additional index tables, but programmers need to denormalize and replicate. There is no SQL query language in base HBase; however, a Hive/HBase integration project allows Hive QL statements to access HBase tables for both reading and inserting. A table is made up of regions; each region is defined by a startKey and endKey, may live on a different node, and is made up of several HDFS files and blocks, each of which is replicated by Hadoop. Columns can be added on the fly to tables, with only the parent column families being fixed in a schema. Each cell is tagged by column family and column name, so programs can always identify what type of data item a given cell contains. In addition to scaling to petabyte-size data sets, note the ease of integrating disparate data sources into a small number of HBase tables when building a data workspace, with different columns possibly defined (on the fly) for different rows in the same table. Such facility is also important (see the biological integration discussion below).
In addition to HBase, other scalable random-access databases are now available. HadoopDB is a hybrid of MapReduce and a standard relational database system. HadoopDB uses PostgreSQL for the database layer (one PostgreSQL instance per data chunk per node), Hadoop for the communication layer, and an extended version of Hive as a translation layer.
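The (Table, Row, Column-Family:Column, Timestamp) -> Value addressing described above can be modeled as a versioned map. The class and method names below are invented for illustration; this is a toy model of the data model, not the HBase client API:

```python
from collections import defaultdict

class ToyTable:
    """Toy model of HBase cell addressing: (row, 'family:qualifier')
    maps to a list of (timestamp, value) versions, and reads default
    to the newest version, as in HBase."""

    def __init__(self):
        self._cells = defaultdict(list)   # (row, column) -> [(ts, value), ...]

    def put(self, row, column, value, ts):
        self._cells[(row, column)].append((ts, value))
        self._cells[(row, column)].sort(reverse=True)  # newest version first

    def get(self, row, column, ts=None):
        versions = self._cells.get((row, column), [])
        if ts is None:                    # no timestamp: return latest value
            return versions[0][1] if versions else None
        for vts, value in versions:       # newest version at or before ts
            if vts <= ts:
                return value
        return None
```

Writing two versions of `info:name` for the same row and reading with and without a timestamp shows the time-travel behavior the cell coordinates make possible.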
  • 15. Hadoop Processing: HadoopDB Architecture. HadoopDB comprises: a Database Connector that connects Hadoop with the single-node database systems; a Data Loader that partitions data and manages parallel loading of data into the database systems; a Catalog that tracks the locations of different data chunks, including those replicated across multiple nodes; and the SQL-MapReduce-SQL (SMS) planner, which extends Hive to provide a SQL interface to HadoopDB.
  • 16. Example System (Web Portal): terabytes of data populated to centralized storage and processed every weekend!
  • 17. Web Portal, High-Level Architecture (using Hadoop, Solr and Lucene for back-end data processing). [Architecture diagram] Features: pluggable portal components (portlets); functional aggregation and deployment as portlets; exposing portlets as web services; pluggable, interactive, user-facing web services; portlets deployed as independent WAR files; portlet web services consumable by other portals; an integration UI to provision real-time integration with external systems via web and other channels; and provisioning of admin features based on roles and level of access. Modules shown: Role Management and Administration (monitoring, control, report configurations); Reporting and Business Intelligence (analysis, metrics, trends); Application Integration Services (application integration, portlet integration, rules, data sources, business applications); Real-Time Integration (JMS, MQ, JDBC channels to the back end and external apps with UI adaptation); the MyASUP Portal Application Set-Up core framework (logging, exceptions, rule engine, analytics, auditing); and security.
  • 18. Hadoop Processing: Web Portal Deployment Landscape. [Deployment diagram] A load balancer fronts the web portal servers (Apache Web Server with the Tomcat mod_jk plug-in, plus a J2EE application server); JBoss hosts the J2EE application, the portal, the portal web service, and jBPM; the portal reaches the DB servers over HTTP and JDBC, with a sharding function and load balancing in front of the databases.
  • 19. Example: AOL Advertising Platform. http://www.cloudera.com/blog/2011/02/an-emerging-data-management-architectural-pattern-behind-interactive-web-application/
  • 20. Hadoop Processing: AOL Advertising, Business Case and Solution. AOL Advertising runs one of the largest online ad serving operations, serving billions of impressions each month to hundreds of millions of people. AOL faced three major data management challenges in building its ad serving platform: how to analyze billions of user-related events, presented as a mix of structured and unstructured data, to infer the demographic, psychographic and behavioral characteristics that are encapsulated in hundreds of millions of "cookie profiles"; how to make hundreds of millions of cookie profiles available to the ad targeting platform with sub-millisecond random-read latency; and how to keep the user profiles fresh and current. The solution was to integrate two data management systems: one optimized for high-throughput data analysis (the "analytics" system), the other for low-latency random access (the "transactional" system). After analyzing alternatives, the final architecture paired Cloudera Distribution for Apache Hadoop (CDH) with Membase.