HBase is a distributed, scalable, big data store modeled after Google's Bigtable. The document outlines the key aspects of HBase, including that it uses HDFS for storage, Zookeeper for coordination, and can optionally use MapReduce for batch processing. It describes HBase's architecture with a master server distributing regions across multiple region servers, which store and serve data from memory and disks.
This document provides a set of icons for visualizing Hadoop architectures and components. It includes icons for generic data center infrastructure, Hadoop projects and tools, and examples of how they can be used in diagrams. Download links are provided for Visio, Omnigraffle stencils, and high resolution PNG files of the icons. Styles, typography guidelines and additional example diagrams are also included.
Intro to HBase Internals & Schema Design (for HBase users) - alexbaranau
This document provides an introduction to HBase internals and schema design for HBase users. It discusses the logical and physical views of HBase, including how tables are split into regions and stored across region servers. It covers best practices for schema design, such as using row keys efficiently and avoiding redundancy. The document also briefly discusses advanced topics like coprocessors and compression. The overall goal is to help HBase users optimize performance and scalability based on its internal architecture.
This document provides an overview of Apache Hadoop and HBase. It begins with an introduction to why big data is important and how Hadoop addresses storing and processing large amounts of data across commodity servers. The core components of Hadoop, HDFS for storage and MapReduce for distributed processing, are described. An example MapReduce job is outlined. The document then introduces the Hadoop ecosystem, including Apache HBase for random read/write access to data stored in Hadoop. Real-world use cases of Hadoop at companies like Yahoo, Facebook and Twitter are briefly mentioned before addressing questions.
HBase Read High Availability Using Timeline Consistent Region Replicas - enissoz
This document summarizes a talk on implementing timeline consistency for HBase region replicas. It introduces the concept of region replicas, where each region has multiple copies hosted on different servers. The primary accepts writes, while secondary replicas are read-only. Reads from secondaries return possibly stale data. The talk outlines the implementation of region replicas in HBase, including updates to the master, region servers, and IPC. It discusses data replication approaches and next steps to implement write replication using the write-ahead log. The goal is to provide high availability for reads in HBase while tolerating single-server failures.
The document proposes using MapReduce jobs to perform scans over HBase snapshots. Snapshots provide immutable data from HBase tables. The MapReduce jobs would bypass region servers and scan snapshot files directly for improved performance. An initial implementation called TableSnapshotInputFormat is described which restores snapshot data and runs scans in parallel across map tasks. The implementation addresses security and performance aspects. An API for client-side scanning of snapshots is also proposed to allow snapshot scans outside of MapReduce.
Hw09 Practical HBase Getting The Most From Your HBase Install - Cloudera, Inc.
The document summarizes two presentations about using HBase as a database. It discusses the speakers' experiences using HBase at Stumbleupon and Streamy to replace MySQL and other relational databases. Some key points covered include how HBase provides scalability, flexibility, and cost benefits over SQL databases for large datasets.
Chicago Data Summit: Apache HBase: An Introduction - Cloudera, Inc.
Apache HBase is an open source distributed data-store capable of managing billions of rows of semi-structured data across large clusters of commodity hardware. HBase provides real-time random read-write access as well as integration with Hadoop MapReduce, Hive, and Pig for batch analysis. In this talk, Todd will provide an introduction to the capabilities and characteristics of HBase, comparing and contrasting it with traditional database systems. He will also introduce its architecture and data model, and present some example use cases.
The document summarizes the HBase 1.0 release which introduces major new features and interfaces including a new client API, region replicas for high availability, online configuration changes, and semantic versioning. It describes goals of laying a stable foundation, stabilizing clusters and clients, and making versioning explicit. Compatibility with earlier versions is discussed and the new interfaces like ConnectionFactory, Connection, Table and BufferedMutator are introduced along with examples of using them.
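A minimal sketch of what using those 1.0 interfaces can look like (not taken from the presentation itself); the table name "example_table", the column family "cf", and the qualifier "col" are invented placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBase10ClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // A Connection is heavyweight and thread-safe: create one and share it.
    try (Connection connection = ConnectionFactory.createConnection(conf)) {
      TableName name = TableName.valueOf("example_table");  // hypothetical table
      // Table is a lightweight, non-thread-safe handle for reads and writes.
      try (Table table = connection.getTable(name)) {
        Put put = new Put(Bytes.toBytes("row1"));
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
        table.put(put);
        Result result = table.get(new Get(Bytes.toBytes("row1")));
        System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"))));
      }
      // BufferedMutator batches mutations client-side for higher write throughput.
      try (BufferedMutator mutator = connection.getBufferedMutator(name)) {
        mutator.mutate(new Put(Bytes.toBytes("row2"))
            .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value2")));
        mutator.flush();
      }
    }
  }
}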
HBase can be an intimidating beast for someone considering its adoption. For what kinds of workloads is it well suited? How does it integrate into the rest of my application infrastructure? What are the data semantics upon which applications can be built? What are the deployment and operational concerns? In this talk, I'll address each of these questions in turn. As supporting evidence, both high-level application architecture and internal details will be discussed. This is an interactive talk: bring your questions and your use-cases!
With the public confession of Facebook, HBase is on everyone's lips when it comes to the discussion around the new "NoSQL" area of databases. In this talk, Lars will introduce and present a comprehensive overview of HBase. This includes the history of HBase, the underlying architecture, available interfaces, and integration with Hadoop.
Hadoop World 2011: Advanced HBase Schema Design - Lars George, Cloudera - Cloudera, Inc.
"While running a simple key/value based solution on HBase usually requires an equally simple schema, it is less trivial to operate a different application that has to insert thousands of records per second.
This talk will address the architectural challenges when designing for either read or write performance imposed by HBase. It will include examples of real world use-cases and how they can be implemented on top of HBase, using schemas that optimize for the given access patterns. "
Five major tips to maximize performance on a 200+ SQL HBase/Phoenix cluster - mas4share
This document provides five tips for maximizing performance on an HBase/Phoenix SQL cluster with over 1.2 billion records distributed across 200 nodes. The top tips are: 1) Use SQL hints when querying indexes to reduce unnecessary metadata checks; 2) Aggressively use memory allocation, allocating at least 3x the data size; 3) Manually split tables into regions that are similarly sized, without over-splitting; 4) Favor scale-out over scale-up by adding more smaller nodes rather than fewer larger nodes; and 5) Optimize other configuration settings like compaction thresholds. The document includes details on the cluster configuration, query load testing and results.
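As a rough illustration of the first tip, the sketch below issues a hinted query through Phoenix's JDBC driver; the ZooKeeper quorum "zk-host:2181", the table "events", and the index "events_by_user_idx" are invented names, not details of the cluster described above.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PhoenixHintSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181")) {
      // The /*+ INDEX(...) */ hint asks Phoenix to use a specific secondary index
      // instead of re-deriving the choice on every query.
      String sql = "SELECT /*+ INDEX(events events_by_user_idx) */ event_id, event_time "
          + "FROM events WHERE user_id = ?";
      try (PreparedStatement stmt = conn.prepareStatement(sql)) {
        stmt.setLong(1, 42L);
        try (ResultSet rs = stmt.executeQuery()) {
          while (rs.next()) {
            System.out.println(rs.getString("event_id") + " @ " + rs.getTimestamp("event_time"));
          }
        }
      }
    }
  }
}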
This document discusses tuning HBase and HDFS for performance and correctness. Some key recommendations include:
- Enable HDFS sync on close and sync behind writes for correctness on power failures.
- Tune HBase compaction settings like blockingStoreFiles and compactionThreshold based on whether the workload is read-heavy or write-heavy.
- Size RegionServer machines based on disk size, heap size, and number of cores to optimize for the workload.
- Set client and server RPC chunk sizes like hbase.client.write.buffer to 2MB to maximize network throughput.
- Configure various garbage collection settings in HBase, such as -Xmn512m and -XX:+UseCMSInit… (a configuration sketch covering a few of these recommendations follows this list).
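A hedged sketch of how a few of the recommendations above could be expressed in code; the values simply echo the summary or common defaults rather than tuned advice, and in practice these settings normally live in hdfs-site.xml and hbase-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class TuningSketch {
  public static Configuration tunedConf() {
    Configuration conf = HBaseConfiguration.create();
    // Correctness on power failure: ask DataNodes to sync on close / behind writes.
    conf.setBoolean("dfs.datanode.synconclose", true);
    conf.setBoolean("dfs.datanode.sync.behind.writes", true);
    // Compaction pressure: adjust depending on whether the workload is read-heavy or write-heavy.
    conf.setInt("hbase.hstore.blockingStoreFiles", 10);
    conf.setInt("hbase.hstore.compactionThreshold", 3);
    // Client-side write buffer of 2 MB to batch RPCs, as suggested above.
    conf.setLong("hbase.client.write.buffer", 2L * 1024 * 1024);
    return conf;
  }
}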
Speaker: Jesse Anderson (Cloudera)
As optional pre-conference prep for attendees who are new to HBase, this talk will offer a brief Cliff's Notes-level talk covering architecture, API, and schema design. The architecture section will cover the daemons and their functions, the API section will cover HBase's GET, PUT, and SCAN classes; and the schema design section will cover how HBase differs from an RDBMS and the amount of effort to place on schema and row-key design.
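For readers new to that API, here is a small hedged sketch of the SCAN side (GET and PUT look similar); the "users" table, the "info" family, and the row-key range are invented for the example.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("users"))) {
      // Scans walk rows in sorted row-key order between a start (inclusive)
      // and stop (exclusive) key, which is why row-key design matters so much.
      Scan scan = new Scan(Bytes.toBytes("user_0001"), Bytes.toBytes("user_0100"));
      scan.addFamily(Bytes.toBytes("info"));
      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result row : scanner) {
          System.out.println(Bytes.toString(row.getRow()));
        }
      }
    }
  }
}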
HBase and HDFS: Understanding FileSystem Usage in HBase - enissoz
This document discusses file system usage in HBase. It provides an overview of the three main file types in HBase: write-ahead logs (WALs), data files, and reference files. It describes durability semantics, IO fencing techniques for region server recovery, and how HBase leverages data locality through short circuit reads, checksums, and block placement hints. The document is intended to help readers understand HBase's interactions with HDFS for tuning IO performance.
HBase Accelerated introduces an in-memory flush and compaction pipeline for HBase to improve performance of real-time workloads. By keeping data in memory longer and avoiding frequent disk flushes and compactions, it reduces I/O and improves read and scan latencies. Evaluation on workloads with high update rates and small working sets showed the new approach significantly outperformed the default HBase implementation by serving most data from memory. Work is ongoing to further optimize the in-memory representation and memory usage.
This document summarizes a presentation about optimizing HBase performance through caching. It discusses how baseline tests showed low cache hit rates and CPU/memory utilization. Reducing the table block size improved cache hits but increased overhead. Adding an off-heap bucket cache to store table data minimized JVM garbage collection latency spikes and improved memory utilization by caching frequently accessed data outside the Java heap. Configuration parameters for the bucket cache are also outlined.
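A hedged sketch of the kind of configuration involved, assuming an HBase 1.x-era cluster; the cache sizes are placeholders, and real deployments would set these in hbase-site.xml together with an appropriate -XX:MaxDirectMemorySize for the region server JVM.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class BucketCacheSketch {
  public static Configuration withOffHeapCache() {
    Configuration conf = HBaseConfiguration.create();
    // L1 on-heap block cache as a fraction of the region server heap.
    conf.setFloat("hfile.block.cache.size", 0.2f);
    // L2 bucket cache kept off the Java heap so cached blocks do not add GC pressure.
    conf.set("hbase.bucketcache.ioengine", "offheap");
    // Bucket cache capacity; here 4096 MB, a placeholder value for illustration.
    conf.setInt("hbase.bucketcache.size", 4096);
    return conf;
  }
}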
Big data refers to large datasets that are difficult to process using traditional database management tools. Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. It provides reliable data storage with the Hadoop Distributed File System (HDFS) and high-performance parallel data processing using MapReduce. The Hadoop ecosystem includes components like HDFS, MapReduce, Hive, Pig, and HBase that provide distributed data storage, processing, querying and analysis capabilities at scale.
HBase Advanced Schema Design - Berlin Buzzwords - June 2012 - larsgeorge
While running a simple key/value based solution on HBase usually requires an equally simple schema, it is less trivial to operate a different application that has to insert thousands of records per second. This talk will address the architectural challenges when designing for either read or write performance imposed by HBase. It will include examples of real world use-cases and how they can be implemented on top of HBase, using schemas that optimize for the given access patterns.
http://berlinbuzzwords.de/sessions/advanced-hbase-schema-design
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ... - HBaseCon
Phoenix has evolved to become a full-fledged relational database layer over HBase data. We'll discuss the fundamental principles of how Phoenix pushes the computation to the server and why this leads to performance enabling direct support of low-latency applications, along with some major new features. Next, we'll outline our approach for transaction support in Phoenix, a work in-progress, and discuss the pros and cons of the various approaches. Lastly, we'll examine the current means of integrating Phoenix with the rest of the Hadoop ecosystem.
This document summarizes a talk about Facebook's use of HBase for messaging data. It discusses how Facebook migrated data from MySQL to HBase to store metadata, search indexes, and small messages in HBase for improved scalability. It also outlines performance improvements made to HBase, such as for compactions and reads, and future plans such as cross-datacenter replication and running HBase in a multi-tenant environment.
HBase Data Modeling and Access Patterns with Kite SDK - HBaseCon
This document discusses the Kite SDK and how it provides a higher-level API for developing Hadoop data applications. It introduces the Kite Datasets module, which defines a unified storage interface for datasets. It describes how Kite implements partitioning strategies to map data entities to storage partitions, and column mappings to define how data fields are stored in HBase tables. The document provides examples of using Kite datasets to randomly access and update data stored in HBase.
This document summarizes key Hadoop configuration parameters that affect MapReduce job performance and provides suggestions for optimizing these parameters under different conditions. It describes the MapReduce workflow and phases, defines important parameters like dfs.block.size, mapred.compress.map.output, and mapred.tasktracker.map/reduce.tasks.maximum. It explains how to configure these parameters based on factors like cluster size, data and task complexity, and available resources. The document also discusses other performance aspects like temporary space, JVM tuning, and reducing reducer initialization overhead.
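A hedged sketch of setting the MRv1-era parameters named above on a Hadoop Configuration; the values are placeholders for illustration rather than recommendations.

import org.apache.hadoop.conf.Configuration;

public class MapRedTuningSketch {
  public static Configuration tunedJobConf() {
    Configuration conf = new Configuration();
    // HDFS block size (here 128 MB); larger blocks generally mean fewer, larger input splits.
    conf.setLong("dfs.block.size", 128L * 1024 * 1024);
    // Compress intermediate map output to cut shuffle I/O and network traffic.
    conf.setBoolean("mapred.compress.map.output", true);
    // Concurrent task slots per TaskTracker, sized to the node's cores and memory.
    conf.setInt("mapred.tasktracker.map.tasks.maximum", 8);
    conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 4);
    return conf;
  }
}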
Jesse Anderson (Smoking Hand)
This early-morning session offers an overview of what HBase is, how it works, its API, and considerations for using HBase as part of a Big Data solution. It will be helpful for people who are new to HBase, and also serve as a refresher for those who may need one.
HBaseCon 2013: Compaction Improvements in Apache HBase - Cloudera, Inc.
This document discusses improvements to compaction in Apache HBase. It begins with an overview of what compactions are and how they improve read performance in HBase. It then describes the default compaction algorithm and improvements made, including exploring selection and off-peak compactions. The document also covers making compactions more pluggable and enabling tuning on a per-table/column family basis. Finally, it proposes algorithms for different scenarios, such as level and stripe compactions, to improve compaction performance.
Apache HBase is the Hadoop open source, distributed, versioned storage manager well suited for random, realtime read/write access. This talk will give an overview of how HBase achieves random I/O, focusing on the storage layer internals: starting from how the client interacts with Region Servers and the Master, it goes into WAL, MemStore, compaction, and on-disk format details, then looks at how the storage is used by features like snapshots, and at how it can be improved to gain flexibility, performance, and space efficiency.
The document discusses Facebook's use of HBase to store messaging data. It provides an overview of HBase, including its data model, performance characteristics, and how it was a good fit for Facebook's needs due to its ability to handle large volumes of data, high write throughput, and efficient random access. It also describes some enhancements Facebook made to HBase to improve availability, stability, and performance. Finally, it briefly mentions Facebook's migration of messaging data from MySQL to their HBase implementation.
The document provides an overview of the state of the Apache HBase database project. It discusses the project goals of availability, stability, and scalability. It also summarizes the mature codebase, active development areas like region replicas and ProcedureV2, and the growing ecosystem of SQL interfaces and other Hadoop components integrated with HBase. Recent releases include 1.1.2 which improved scanners and introduced quotas and throttling, and the 1.0 release which adopted semantic versioning and added region replicas.
This document provides an overview and best practices for operating HBase clusters. It discusses HBase and Hadoop architecture, how to set up an HBase cluster including Zookeeper and region servers, high availability considerations, scaling the cluster, backup and restore processes, and operational best practices around hardware, disks, OS, automation, load balancing, upgrades, monitoring and alerting. It also includes a case study of a 110 node HBase cluster.
This presentation describes how to efficiently load data into Hive. I cover partitioning, predicate pushdown, ORC file optimization and different loading schemes
Apache Cassandra is a free, distributed, open source, and highly scalable NoSQL database that is designed to handle large amounts of data across many commodity servers. It provides high availability with no single point of failure, linear scalability, and tunable consistency. Cassandra's architecture allows it to spread data across a cluster of servers and replicate across multiple data centers for fault tolerance. It is used by many large companies for applications that require high performance, scalability, and availability.
This document introduces HBase, an open-source, non-relational, distributed database modeled after Google's BigTable. It describes what HBase is, how it can be used, and when it is applicable. Key points include that HBase stores data in columns and rows accessed by row keys, integrates with Hadoop for MapReduce jobs, and is well-suited for large datasets, fast random access, and write-heavy applications. Common use cases involve log analytics, real-time analytics, and messages-centered systems.
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce - Cloudera, Inc.
Most developers are familiar with the topic of “database design”. In the relational world, normalization is the name of the game. How do things change when you’re working with a scalable, distributed, non-SQL database like HBase? This talk will cover the basics of HBase schema design at a high level and give several common patterns and examples of real-world schemas to solve interesting problems. The storage and data access architecture of HBase (row keys, column families, etc.) will be explained, along with the pros and cons of different schema decisions.
Performance and Fault Tolerance for the Netflix API - Ben Christensen
The document discusses Netflix's API architecture and how it achieves fault tolerance and high performance. It describes how the API is composed of dozens of dependencies that could each fail independently. It then outlines Netflix's approaches to avoid any single dependency taking down the entire application, including using fallbacks, failing silently, failing fast, shedding load, aggressive timeouts, semaphores, separate threads, and circuit breakers. It provides examples of how over 10 billion dependency commands can be executed per day with over 1 billion incoming requests.
Dimitris Bertsimas, Allison O'Hair, and Sanjay Sarma issued a verified certificate to Rafael Pires da Silva Mendes for successfully completing the online course 15.071x: The Analytics Edge through MITx and edX. The certificate can be verified online at the provided URL to confirm its authenticity.
Rafael Mendes completed a course in Web Intelligence and Big Data from December 02, 2013. The course taught machine learning and parallel programming techniques for analyzing large datasets from sources like social media and genomics. The certificate was issued by Dr. Gautam Shroff, Vice President and Chief Scientist of Tata Consultancy Services' Innovation Labs.
The Evolution of a Relational Database Layer over HBase - DataWorks Summit
Apache Phoenix is a SQL query layer over Apache HBase that allows users to interact with HBase through JDBC and SQL. It transforms SQL queries into native HBase API calls for efficient parallel execution on the cluster. Phoenix provides metadata storage, SQL support, and a JDBC driver. It is now a top-level Apache project after originally being developed at Salesforce. The speaker discussed Phoenix's capabilities like joins and subqueries, new features like HBase 1.0 support and functional indexes, and future plans like improved optimization through Calcite and transaction support.
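To make the "SQL over HBase through JDBC" idea concrete, here is a hedged sketch; the quorum address and the metrics table are invented, and the statements are ordinary Phoenix DDL/DML rather than anything specific to the talk.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixJdbcSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181")) {
      conn.setAutoCommit(true);  // commit each UPSERT immediately for this small example
      try (Statement stmt = conn.createStatement()) {
        // Phoenix turns this DDL/DML into native HBase API calls under the hood.
        stmt.execute("CREATE TABLE IF NOT EXISTS metrics ("
            + "host VARCHAR NOT NULL, ts TIMESTAMP NOT NULL, value DOUBLE "
            + "CONSTRAINT pk PRIMARY KEY (host, ts))");
        stmt.executeUpdate("UPSERT INTO metrics VALUES ('web01', NOW(), 0.75)");
        try (ResultSet rs = stmt.executeQuery(
            "SELECT host, MAX(value) FROM metrics GROUP BY host")) {
          while (rs.next()) {
            System.out.println(rs.getString(1) + " -> " + rs.getDouble(2));
          }
        }
      }
    }
  }
}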
The document discusses Gluster FS, an open source distributed file system for big data. It provides an overview of Gluster FS's architecture, which distributes data across commodity servers and storage bricks. The summary describes how Gluster FS allows storage to be easily scaled out by adding additional servers and bricks, provides high availability through data replication, and supports various clients including native, NFS, HDFS and SMB/CIFS. Real-world examples are given of how the presenter's company uses Gluster FS to store and retrieve large Lucene indexes totaling over 200GB per month.
This document contains information about Apache HBase including links to documentation pages, JIRA issues, and discussions on using HBase. It provides configuration examples for viewing HFile contents, explains how Bloom filters are used in HBase, includes an overview of the HBase data model and comparisons with RDBMS. It also shows an example Git diff of modifying the HBase heap size configuration and provides links to guides on using HBase and documentation on region splitting and merging.
We start by looking at distributed database features that impact latency. Then we take a deeper look at the HBase read and write paths with a focus on request latency. We examine the sources of latency and how to minimize them.
If you've used a modern, interactive map such as Google or Bing Maps, you've consumed "map tiles". Map tiles are small images rendering a piece of the mosaic that is the whole map. Using conventional means, rendering tiles for the whole globe at multiple resolutions is a huge data processing effort. Even highly optimized, it spans a couple TBs and a few days of computation. Enter Hadoop. In this talk, I'll show you how to generate your own custom tiles using Hadoop. There will be pretty pictures.
Big Data – HBase, integrando hadoop, bi e dw; Montando o seu big data Cloude... - Flavio Fonte, PMP, ITIL
The document describes HBase, a column-oriented NoSQL database that stores data in Hadoop. It also discusses options for building Big Data environments, such as Cloudera, Hortonworks, and Pivotal, which offer supported Hadoop distributions.
HBase is a NoSQL database that stores data in HDFS in a distributed, scalable, reliable way for big data. It is column-oriented and optimized for random read/write access to big data in real-time. HBase is not a relational database and relies on HDFS. Common use cases include flexible schemas, high read/write rates, and real-time analytics. Apache Phoenix provides a SQL interface for HBase, allowing SQL queries, joins, and familiar constructs to manage data in HBase tables.
Based on "HBase, dances on the elephant back" presentation success I have prepared its update for JavaDay 2014 Kyiv. Again, it is about the product which revolutionary changes everything inside Hadoop infrastructure: Apache HBase. But here focus is shifted to integration and more advanced topics keeping presentation yet understandable for technology newcomers.
This document provides an overview and configuration instructions for Hadoop, Flume, Hive, and HBase. It begins with an introduction to each tool, including what problems they aim to solve and high-level descriptions of how they work. It then provides step-by-step instructions for downloading, configuring, and running each tool on a single node or small cluster. Specific configuration files and properties are outlined for core Hadoop components as well as integrating Flume, Hive, and HBase.
The document discusses the evolution of Hadoop from versions 1.0 to 2.0. Key limitations of Hadoop 1.0 included lack of horizontal scalability, single points of failure, and tight coupling between components. Hadoop 2.0 addressed these issues by introducing YARN for decoupling compute from storage and enabling multiple job types beyond MapReduce. Other improvements in 2.0 included high availability, sharing between jobs, and running non-Java frameworks.
The document discusses using HBase and R for fast time-series analytics of large datasets. It begins with an introduction to NoSQL and HBase, describing its key features. It then discusses a use case of analyzing sensor data from airport vehicles. The data model designs HBase to store the hierarchical time-series data. Hive and storing on HDFS directly are considered but rejected in favor of HBase. The presentation concludes with an overview of using the rhbase package to interface R with HBase.
The document provides information on various components of the Hadoop ecosystem including Pig, Zookeeper, HBase, Spark, and Hive. It discusses how HBase offers random access to data stored in HDFS, allowing for faster lookups than HDFS alone. It describes the architecture of HBase including its use of Zookeeper, storage of data in regions on region servers, and secondary indexing capabilities. Finally, it summarizes Hive and how it allows SQL-like queries on large datasets stored in HDFS or other distributed storage systems using MapReduce or Spark jobs.
Hive - rajsigh020
Hive stores its schema in a database (the metastore) and the processed data in HDFS.
It provides an SQL-like language for querying called HiveQL or HQL.
It is familiar, fast, scalable, and extensible. Hive is a data warehouse infrastructure tool to process structured data in Hadoop (used for structured and semi-structured data analysis and processing). It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
Initially Hive was developed by Facebook; later the Apache Software Foundation took it up and developed it further as an open source project under the name Apache Hive. It is used by different companies. For example, Amazon uses it in Amazon Elastic MapReduce.
Hive is not: a relational database.
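As a hedged sketch of what that SQL-like querying looks like in practice, the example below submits HiveQL through the HiveServer2 JDBC driver; the host, port, and page_views table are invented placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlSketch {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection("jdbc:hive2://hive-host:10000/default");
         Statement stmt = conn.createStatement()) {
      // The schema goes into the metastore; the table data itself lives in HDFS.
      stmt.execute("CREATE TABLE IF NOT EXISTS page_views "
          + "(user_id STRING, url STRING, view_time TIMESTAMP) STORED AS ORC");
      try (ResultSet rs = stmt.executeQuery(
          "SELECT url, COUNT(*) AS views FROM page_views GROUP BY url ORDER BY views DESC LIMIT 10")) {
        while (rs.next()) {
          System.out.println(rs.getString("url") + ": " + rs.getLong("views"));
        }
      }
    }
  }
}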
HBase 1.0 is the new stable major release, and the start of "semantic versioned" releases. We will cover new features, changes in behavior and requirements, source/binary and wire compatibility details, and upgrading. We'll also dive deep into the new standardized client API in 1.0, which establishes a separation of concerns, encapsulates what is needed from how it's delivered, and guarantees future compatibility while freeing the implementation to evolve.
This document provides an overview of the Hadoop/MapReduce/HBase framework and its applications in bioinformatics. It discusses Hadoop and its components, how MapReduce programs work, HBase which enables random access to Hadoop data, related projects like Pig and Hive, and examples of applications in bioinformatics and benchmarking of these systems.
The document discusses Facebook's use of HBase as the database storage engine for its messaging platform. It provides an overview of HBase, including its data model, architecture, and benefits like scalability, fault tolerance, and simpler consistency model compared to relational databases. The document also describes Facebook's contributions to HBase to improve performance, availability, and achieve its goal of zero data loss. It shares Facebook's operational experiences running large HBase clusters and discusses its migration of messaging data from MySQL to a de-normalized schema in HBase.
This document contains information about HBase concepts and configurations. It discusses different modes of HBase operation including standalone, pseudo-distributed, and distributed modes. It also covers basic prerequisites for running HBase like Java, SSH, DNS, NTP, ulimit settings, and Hadoop for distributed mode. The document explains important HBase configuration files like hbase-site.xml, hbase-default.xml, hbase-env.sh, log4j.properties, and regionservers. It provides details on column-oriented versus row-oriented databases and discusses optimizations that can be made through configuration settings.
The document provides an introduction to NoSQL and HBase. It discusses what NoSQL is, the different types of NoSQL databases, and compares NoSQL to SQL databases. It then focuses on HBase, describing its architecture and components like HMaster, regionservers, Zookeeper. It explains how HBase stores and retrieves data, the write process involving memstores and compaction. It also covers HBase shell commands for creating, inserting, querying and deleting data.
HBase is a distributed column-oriented database built on top of Hadoop that provides random real-time read/write access to big data stored in Hadoop. It uses a master server to assign regions to region servers and Zookeeper to track servers and coordinate tasks. HBase allows users to perform CRUD operations on tables through its shell interface using commands like create, put, get, and scan.
Apache hadoop, hdfs and map reduce Overview - Nisanth Simon
This document provides an overview of Apache Hadoop, HDFS, and MapReduce. It describes how Hadoop uses a distributed file system (HDFS) to store large amounts of data across commodity hardware. It also explains how MapReduce allows distributed processing of that data by allocating map and reduce tasks across nodes. Key components discussed include the HDFS architecture with NameNodes and DataNodes, data replication for fault tolerance, and how the MapReduce engine works with a JobTracker and TaskTrackers to parallelize jobs.
HBase is a distributed, scalable, big data store that is built on top of HDFS. It is a column-oriented NoSQL database that provides fast lookups and updates for large tables. Key features include scalability, automatic failover, consistent reads/writes, sharding of tables, and Java and REST APIs for client access. HBase is not a replacement for an RDBMS as it does not support SQL, joins, or relations between tables.
HBase is a distributed, scalable, big data NoSQL database that runs on top of Hadoop HDFS. It is a column-oriented database that allows for fast lookups and updates of large tables. Key components include the HMaster that manages metadata, RegionServers that store and serve data in regions or tables, and Zookeeper which provides coordination services.
This document discusses integrating Apache Hive with Apache HBase. It provides an overview of Hive and HBase, the motivation for integrating the two systems, and how the integration works. Specifically, it covers how the schema and data types are mapped between Hive and HBase, how filters can be pushed down from Hive to HBase to optimize queries, bulk loading data from Hive into HBase, and security aspects of the integrated system. The document is intended to provide background and technical details on using Hive and HBase together.
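A hedged sketch of the schema-mapping side of that integration, issued over the same HiveServer2 JDBC route as above; the table names and the column mapping are made up, and the exact options can vary by Hive and HBase version.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveHBaseMappingSketch {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection("jdbc:hive2://hive-host:10000/default");
         Statement stmt = conn.createStatement()) {
      // Map a Hive table onto an existing HBase table: the HBase row key becomes the
      // Hive "key" column and the cf:count column becomes the Hive "views" column.
      stmt.execute("CREATE EXTERNAL TABLE hbase_page_views (key STRING, views BIGINT) "
          + "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' "
          + "WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:count') "
          + "TBLPROPERTIES ('hbase.table.name' = 'page_views')");
    }
  }
}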
Big Data is a collection of large and complex data sets that cannot be processed using regular database management tools or processing applications. A lot of challenges such as capture, curation, storage, search, sharing, analysis, and visualization can be encountered while handling Big Data. On the other hand the Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Big Data certification is one of the most recognized credentials of today.
For more details Click http://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
The document provides an overview of Apache Hadoop and related big data technologies. It discusses Hadoop components like HDFS for storage, MapReduce for processing, and HBase for columnar storage. It also covers related projects like Hive for SQL queries, ZooKeeper for coordination, and Hortonworks and Cloudera distributions.
This document provides an overview of Apache Phoenix, including:
- What Phoenix is and how it provides a SQL interface for Apache HBase
- The current state of Phoenix including SQL support, secondary indexes, and optimizations
- New features in Phoenix 4.4 like functional indexes, user defined functions, and integration with Spark
The presentation covers the evolution and capabilities of Phoenix as a relational layer for HBase that transforms SQL queries into native HBase API calls.
The document discusses different types of block caches in HBase including LruBlockCache, SlabCache, and BucketCache. It explains that block caching improves performance by storing frequently accessed blocks in faster memory rather than slower disk storage. Each block cache has its own configuration options and memory usage characteristics. Benchmark results show that the off-heap BucketCache provides strong performance due to its use of off-heap memory for the L2 cache.
The document discusses the HBase client API for connecting to HBase clusters from applications like webapps. It describes the Java, Ruby, Python, and Thrift client interfaces as well as examples of using scans and puts with these interfaces. It also briefly mentions the REST client interface and some other alternative client libraries like asynchbase and Orderly.
This document introduces Pig, an open source platform for analyzing large datasets that sits on top of Hadoop. It provides an example of using Pig Latin to find the top 5 most visited websites by users aged 18-25 from user and website data. Key points covered include who uses Pig, how it works, performance advantages over MapReduce, and upcoming new features. The document encourages learning more about Pig through online documentation and tutorials.
Introduction to Hadoop, HBase, and NoSQL - Nick Dimiduk
The document is a presentation on NoSQL databases given by Nick Dimiduk. It begins with an introduction of the speaker and their background. The presentation then covers what NoSQL is not, the motivations for NoSQL databases, an overview of Hadoop and its components, and a description of HBase as a structured, distributed database built on Hadoop.
Apache HBase 1.0 Release
1. Apache HBase 1.0 Release
Nick Dimiduk, Hortonworks
@xefyr / n10k.com
February 20, 2015
2. Release 1.0
"The theme of (eventual) 1.0 release is to become a stable base for future 1.x series of releases. 1.0 release will aim to achieve at least the same level of stability of 0.98 releases without introducing too many new features."
Enis Söztutar, HBase 1.0 Release Manager
3. Agenda
• A Brief History of HBase
• What is HBase
• Major Changes for 1.0
• Upgrade Path
8. WHAT IS HBASE
HBase architecture in 5 minutes or less
9. Data Model
[Diagram: Table A illustrating the cell coordinates (rowkey, column family, column qualifier, timestamp, value). Row "a" spans column families cf1 and cf2: cf1 holds the qualifiers "bar" and "foo" with multiple timestamped versions ("hello", 7, "world", 22, 13.6), while cf2 holds the qualifier "2011-07-04" with value "fourth of July" and the qualifier 1.0001 with value "almost the loneliest number". Row "b" holds the qualifier "thumb" in cf2 with value [3.6 kb png data]. The figure labels the Rows and Column Families.]
10. Logical Architecture
Table A (row keys a through p) is split by key range into Region 1, Region 2, Region 3, and Region 4, and those regions are distributed across region servers:
• Region Server 7 hosts Table A Region 1, Table A Region 2, Table G Region 1070, Table L Region 25
• Region Server 86 hosts Table A Region 3, Table C Region 30, Table F Region 160, Table F Region 776
• Region Server 367 hosts Table A Region 4, Table C Region 17, Table E Region 52, Table P Region 1116
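Which server hosts which region is decided by the master, but the region boundaries themselves can be chosen up front by pre-splitting the table. A hedged sketch, assuming a hypothetical table "A" split into four key ranges as in the diagram:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {

      HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("A"));
      desc.addFamily(new HColumnDescriptor("cf1"));

      // Three split points produce four regions: (-inf,"e"), ["e","i"), ["i","m"), ["m",+inf).
      byte[][] splitKeys = {
          Bytes.toBytes("e"),
          Bytes.toBytes("i"),
          Bytes.toBytes("m")
      };
      admin.createTable(desc, splitKeys);
    }
  }
}
```

The master then balances these regions across the available region servers, much like the assignment shown on the slide.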
11. Physical Architecture
Given that the underlying data is stored in HDFS, which is available to all clients as a single namespace, all RegionServers have access to the same persisted files in the file system and can therefore host any region (figure 3.8). By physically collocating DataNodes and RegionServers, you can use the data locality property; that is, RegionServers can theoretically read and write to the local DataNode as the primary DataNode.
You may wonder where the TaskTrackers are in this scheme of things. In some HBase deployments, the MapReduce framework isn’t deployed at all if the workload is primarily random reads and writes. In other deployments, where the MapReduce processing is also a part of the workloads, TaskTrackers, DataNodes, and HBase RegionServers can run together.
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host.
[Diagram: an HBase Client talks to the Master, ZooKeeper, and a row of collocated RegionServer/DataNode pairs; the NameNode fronts HDFS, which sits beneath HBase.]
13. Stability: Co-Locate Meta with Master
• Simplify, improve region assignment reliability
– Fewer components involved in updating “truth”
• Master embeds a RegionServer
– Will host only system tables
– Baby step towards combining RS/Master into a single hbase daemon
• Backup masters unchanged
– Can be configured to host user tables while in standby
• Plumbing is all there, off by default
http://issues.apache.org/jira/browse/HBASE-10569
14. Availability: Region Replicas
• Multiple RegionServers host a Region
– One is “primary”, others are “replicas”
– Only primary accepts writes
• Client reads against primary only or any
– Results marked as appropriate
• Baby step toward quorum reads, writes
http://issues.apache.org/jira/browse/HBASE-10070
http://www.slideshare.net/HBaseCon/features-session-1
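The “primary only or any” choice and the “marked as appropriate” result surface in the client API roughly as sketched below, assuming a hypothetical table "A" that was created with region replication enabled. The default consistency is STRONG (primary only); TIMELINE lets any replica answer, and possibly stale results are flagged.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Consistency;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TimelineReadExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("A"))) {

      // TIMELINE consistency allows a secondary replica to serve the read.
      Get get = new Get(Bytes.toBytes("a"));
      get.setConsistency(Consistency.TIMELINE);

      Result result = table.get(get);
      // Results served by a secondary replica are marked as possibly stale.
      System.out.println("stale? " + result.isStale());
    }
  }
}
```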
16. New and Noteworthy
• Greatly expanded hbase.apache.org/book.html
• Truncate table shell command
• Automatic tuning of global MemStore and BlockCache sizes
• Basic backpressure mechanism
• BucketCache easier to configure
• Compressed BlockCache
• Pluggable replication endpoint
• A Dockerfile to easily run HBase from source
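The new truncate support is also reachable from the Java Admin API, not just the shell. A hedged sketch, assuming a hypothetical table "A" (the shell equivalent is `truncate 'A'`, or `truncate_preserve 'A'` to keep existing region boundaries):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class TruncateExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {

      TableName name = TableName.valueOf("A");
      admin.disableTable(name);         // the table must be disabled before truncation
      admin.truncateTable(name, true);  // true = preserve the existing region split points
    }
  }
}
```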
17. Under the Covers
• ZooKeeper abstractions
• Meta table used for assignment
• Cell-based read/write path
• Combining mvcc/seqid
• Sundry security, tags, labels improvements
18. Groundwork for 2.0
• More, Smaller Regions
– Millions, 1G or less
– Less write amplification
– Splitting hbase:meta
• Performance
– More off-heap
– Less resource contention
– Faster region failover/recovery
– Multiple WALs
– QoS/Quotas/Multi-tenancy
• Rigging
– Faster, more intelligent assignment
– Procedure bus
– Resumable, query-able operations
• Other possibilities
– Quorum/consensus reads, writes?
– Hydrabase, multi-DC consensus?
– Streaming RPCs?
– High level coprocessor API
19. Semantic Versioning
• Major/Minor/Patch version numbers
– Only major/minor pre-1.0
• Dimensions
– Client/Server wire compatibility
– Server/Server wire and feature compatibility
– API compatibility
– ABI compatibility
• Proposal up for a vote
http://s.apache.org/hbase-semver
21. Online/Wire Compatibility
• Direct migration from 0.94 supported
– Looks a lot like the upgrade from 0.94 to 0.96: requires downtime
– Not tested yet, will be before release
• RPC is backward-compatible to 0.96
– Enables mixing clients and servers across versions
– So long as no new features are enabled
• Rolling upgrade "out of the box" from 0.98
• Rolling upgrade "with some massaging" from 0.96
– I.e., 0.96 cannot read HFileV3, the new default
– Not tested yet, will be before release
22. Client Application Compatibility
• API is backward compatible to 0.96
– No code change required
– You’ll start getting new deprecation warnings
– We recommend you start using new APIs
• ABI is NOT backward compatible
– Cannot drop current application jars onto new runtime
– Recompile your application vs. 1.0 jars
– Just like 0.96 to 0.98 upgrade
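The “start using new APIs” recommendation mostly concerns connection and table handles: the old HTable-centric entry points are deprecated in 1.0 in favor of ConnectionFactory, Connection, and the Table interface. A hedged sketch of 1.0-style client code, assuming a hypothetical table "A":

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class NewClientApiExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();

    // Pre-1.0 style (still compiles against 1.0, but deprecated):
    //   HTable table = new HTable(conf, "A");

    // 1.0 style: one shared Connection hands out lightweight Table instances.
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("A"))) {
      Result r = table.get(new Get(Bytes.toBytes("a")));
      System.out.println(r.isEmpty() ? "row not found" : "row found");
    }
  }
}
```

Because the ABI changed, existing application jars need to be recompiled against the 1.0 client jars even where the source compiles unchanged.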
23. Hadoop Versions
• Hadoop 1.x is NOT supported
– Bite the bullet; you’ll enjoy the performance benefits
• Hadoop 2.x only
– Most thoroughly tested on 2.4.x, 2.5.x
– Probably works on 2.2.x, 2.3.x, but less thoroughly tested
https://hbase.apache.org/book/configuration.html#hadoop
24. Java Versions
• JDK 6 is NOT supported!
• JDK 7 is the target runtime
• JDK 8 support is experimental
https://hbase.apache.org/book/configuration.html#hadoop
25. 1.0.0 RCs Available Now!
• Release Candidate voting has commenced
• Last chance to catch show-stopping bugs
RELEASE CANDIDATES NOT FOR PRODUCTION USE
• Try out the new features
• Help us test your upgrade path
• Be a part of history in the making!
• 1.0.0rc5 available 2015-02-19
http://search-hadoop.com/m/DHED40Ih5n
26. Thanks!
[Book cover: HBase in Action (Manning), Nick Dimiduk and Amandeep Khurana, foreword by Michael Stack]
hbaseinaction.com
Nick Dimiduk
github.com/ndimiduk
@xefyr
n10k.com
http://www.apache.org/dyn/closer.cgi/hbase/