My experience in Spark tuning. All tests were run in a production environment (a 600+ node Hadoop cluster). The tuning results are most useful for Spark SQL use cases.
More about Hadoop
www.beinghadoop.com
https://www.facebook.com/hadoopinfo
This PPT gives information about:
the complete Hadoop architecture
how a user request is processed in Hadoop
the NameNode
the DataNode
the JobTracker
the TaskTracker
Hadoop post-installation configuration
Hadoop Summit 2012 | Optimizing MapReduce Job Performance | Cloudera, Inc.
Optimizing MapReduce job performance is often seen as something of a black art. In order to maximize performance, developers need to understand the inner workings of the MapReduce execution framework and how they are affected by various configuration parameters and MR design patterns. The talk will illustrate the underlying mechanics of job and task execution, including the map side sort/spill, the shuffle, and the reduce side merge, and then explain how different job configuration parameters and job design strategies affect the performance of these operations. Though the talk will cover internals, it will also provide practical tips, guidelines, and rules of thumb for better job performance. The talk is primarily targeted towards developers directly using the MapReduce API, though will also include some tips for users of higher level frameworks.
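As a concrete illustration of the kind of configuration parameters the talk covers, below is a hedged mapred-site.xml fragment using the Hadoop 2 property names for the map-side sort buffer, the merge factor, and the reduce-side parallel shuffle copies. The values are placeholders for illustration, not recommendations:

```xml
<configuration>
  <!-- Size (MB) of the in-memory buffer used for the map-side sort/spill -->
  <property>
    <name>mapreduce.task.io.sort.mb</name>
    <value>256</value>
  </property>
  <!-- Number of streams merged at once during the on-disk merge -->
  <property>
    <name>mapreduce.task.io.sort.factor</name>
    <value>64</value>
  </property>
  <!-- Parallel fetchers a reduce task uses to copy map output in the shuffle -->
  <property>
    <name>mapreduce.reduce.shuffle.parallelcopies</name>
    <value>10</value>
  </property>
</configuration>
```

Raising the sort buffer trades task memory for fewer spills; the right values depend on the job and the cluster, which is exactly why the talk frames this as rules of thumb rather than fixed settings.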
Apache Hadoop: design and implementation. Lecture in the Big data computing course (http://twiki.di.uniroma1.it/twiki/view/BDC/WebHome), Department of Computer Science, Sapienza University of Rome.
Hadoop Summit Europe 2014: Apache Storm Architecture | P. Taylor Goetz
Storm is an open-source distributed real-time computation system. It uses a distributed messaging system to reliably process streams of data. The core abstractions in Storm are spouts, which are sources of streams, and bolts, which are basic processing elements. Spouts and bolts are organized into topologies that represent the flow of data. Storm provides fault tolerance through message acknowledgments, which guarantee at-least-once processing. Trident, a high-level, micro-batch-oriented abstraction built on Storm, adds exactly-once semantics and supports operations like aggregations, joins, and state management.
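The spout/bolt/topology structure described above can be sketched as a toy pipeline. This is plain Python for illustration, not the real Storm API; the class names here are invented:

```python
from collections import defaultdict

class SentenceSpout:
    """Toy 'spout': a source that emits a fixed stream of tuples."""
    def __init__(self, sentences):
        self.sentences = sentences

    def emit(self):
        for sentence in self.sentences:
            yield sentence

class SplitBolt:
    """Toy 'bolt': splits each incoming sentence into word tuples."""
    def process(self, sentence):
        for word in sentence.split():
            yield word

class CountBolt:
    """Toy 'bolt': keeps a running count per word."""
    def __init__(self):
        self.counts = defaultdict(int)

    def process(self, word):
        self.counts[word] += 1

def run_topology(spout, split_bolt, count_bolt):
    # The 'topology' wires spout -> split -> count, mimicking Storm's dataflow graph.
    for sentence in spout.emit():
        for word in split_bolt.process(sentence):
            count_bolt.process(word)
    return dict(count_bolt.counts)

counts = run_topology(
    SentenceSpout(["storm storm streams", "streams of data"]),
    SplitBolt(),
    CountBolt(),
)
```

In real Storm the components run distributed across workers and tuples are acked back to the spout; this sketch only shows the shape of the dataflow.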
The document discusses big data and distributed computing. It provides examples of the large amounts of data generated daily by organizations like the New York Stock Exchange and Facebook. It explains how distributed computing frameworks like Hadoop use multiple computers connected via a network to process large datasets in parallel. Hadoop's MapReduce programming model and HDFS distributed file system allow users to write distributed applications that process petabytes of data across commodity hardware clusters.
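The MapReduce programming model mentioned above can be shown in miniature. Below is a hedged, single-process simulation of the map, shuffle, and reduce phases using the classic max-temperature example; the record format and function names are invented for illustration:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # map: parse each "year,temperature" record into a (year, temp) pair
    for line in lines:
        year, temp = line.split(",")
        yield year, int(temp)

def shuffle(pairs):
    # shuffle: the framework sorts map output by key and groups the values
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]

def reduce_phase(grouped):
    # reduce: emit the maximum temperature observed for each year
    for year, temps in grouped:
        yield year, max(temps)

records = ["1949,30", "1950,22", "1949,11", "1950,25"]
result = dict(reduce_phase(shuffle(map_phase(records))))
```

In real Hadoop the map and reduce functions run in parallel across the cluster, and the shuffle moves data over the network; the pipeline shape, however, is exactly this.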
Apache Cassandra is an open-source distributed database designed to handle large amounts of data across commodity servers in a highly available manner without single points of failure. It uses a gossip protocol for cluster membership and a Dynamo-inspired architecture to provide availability and partition tolerance, while supporting eventual consistency.
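The Dynamo-inspired partitioning mentioned above can be illustrated with a toy token ring. This is a minimal sketch, assuming MD5 tokens and a fixed replication factor; it ignores virtual nodes, rack awareness, and everything else a real Cassandra cluster handles:

```python
import hashlib
from bisect import bisect_right

class Ring:
    """Toy token ring: each node owns a token; a key is stored on the first
    node whose token follows the key's token, plus the next replicas-1 nodes."""
    def __init__(self, nodes, replicas=2):
        self.replicas = replicas
        self.ring = sorted((self._token(n), n) for n in nodes)

    def _token(self, key):
        # MD5 stands in for Cassandra's partitioner
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def owners(self, key):
        tokens = [t for t, _ in self.ring]
        i = bisect_right(tokens, self._token(key)) % len(self.ring)
        # walk clockwise around the ring to pick the replica set
        return [self.ring[(i + k) % len(self.ring)][1]
                for k in range(self.replicas)]

ring = Ring(["node-a", "node-b", "node-c"], replicas=2)
owners = ring.owners("user:42")
```

Because ownership is derived from the key's token, any node can route a request without central coordination, which is what removes the single point of failure.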
This document outlines and compares two NameNode high availability (HA) solutions for HDFS: the AvatarNode used by Facebook and the BackupNode used by Yahoo. AvatarNode provides a complete hot standby with failover times of seconds, using an active-passive pair coordinated through ZooKeeper. BackupNode has limitations, including restart times of 25+ minutes and support for only two-machine failover. While it provides a hot standby for the namespace, block reports are sent only to the active NameNode, making it a semi-hot standby solution. The document also briefly mentions other experimental HA solutions for HDFS.
This document summarizes a study on FlumeBase, a system for processing streaming data using SQL queries. It describes FlumeBase's architecture, including how it integrates with Flume and uses SQL queries to define streams, flows, and flow elements for aggregating data. The document notes some potential issues with FlumeBase regarding window alignment, deployment integration with Flume, and code maturity.
This document introduces Flume and Flive. It summarizes that Flume is a distributed data collection system that can easily extend to new data formats and scales linearly as new nodes are added. It discusses Flume's core concepts of events, flows, nodes, and reliability features. It then introduces Flive, an enhanced version of Flume developed by Hanborq that provides improved performance, functionality, manageability, and integration with Hugetable.
Hadoop Streaming allows any executable or script to be used as a MapReduce job. It works by launching the executable or script as a separate process and communicating with it via stdin and stdout. The executable or script receives key-value pairs in a predefined format and outputs new key-value pairs that are collected. Hadoop Streaming uses PipeMapper and PipeReducer to adapt the external processes to the MapReduce framework. It provides a simple way to run MapReduce jobs without writing Java code.
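The stdin/stdout contract described above is easy to sketch. Below is a hedged, self-contained simulation of a Streaming word-count job: the mapper and reducer follow the tab-separated key/value convention, and an in-process sort stands in for the framework's shuffle:

```python
import io
from itertools import groupby

def mapper(stdin, stdout):
    # Streaming mapper: read raw input lines, write "key\tvalue" lines.
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word}\t1\n")

def reducer(stdin, stdout):
    # Streaming reducer: input arrives sorted by key; sum the values per key.
    pairs = (line.rstrip("\n").split("\t") for line in stdin)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(value) for _, value in group)
        stdout.write(f"{key}\t{total}\n")

# Simulate the framework: map, sort (the shuffle), then reduce.
map_out = io.StringIO()
mapper(io.StringIO("hadoop streaming\nhadoop jobs\n"), map_out)
sorted_lines = sorted(map_out.getvalue().splitlines(keepends=True))
red_out = io.StringIO()
reducer(io.StringIO("".join(sorted_lines)), red_out)
word_counts = red_out.getvalue()
```

In a real job the two functions would live in separate scripts handed to the hadoop-streaming jar via its -mapper and -reducer options; this is what PipeMapper and PipeReducer feed over stdin and collect from stdout.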
HBase is an open source, distributed, sorted key-value store modeled after Google's BigTable. It uses HDFS for storage and provides random read/write access to large datasets. Data is stored in tables with rows sorted by key and columns grouped into column families. The master coordinates region servers that host regions, the distributed units of data. Clients locate data regions and directly communicate with region servers to read and write data.
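The data model described above — rows sorted by key, cells grouped into column families — can be illustrated with a toy in-memory table. This is a sketch of the model, not the HBase client API; names like ToyHTable are invented:

```python
class ToyHTable:
    """Toy model of an HBase table: rows kept sorted by key; each cell is
    addressed by (row key, "family:qualifier")."""
    def __init__(self, families):
        # Column families are declared up front, as in a real HBase schema
        self.families = set(families)
        self.rows = {}

    def put(self, row, family, qualifier, value):
        assert family in self.families, "column family must be declared up front"
        self.rows.setdefault(row, {})[f"{family}:{qualifier}"] = value

    def get(self, row):
        # Random read by row key
        return self.rows.get(row, {})

    def scan(self, start, stop):
        # Range scan over the sorted row keys, [start, stop)
        for key in sorted(self.rows):
            if start <= key < stop:
                yield key, self.rows[key]

table = ToyHTable(["info"])
table.put("row-001", "info", "name", "alice")
table.put("row-002", "info", "name", "bob")
table.put("row-010", "info", "name", "carol")
hits = [key for key, _ in table.scan("row-001", "row-005")]
```

The key-sorted layout is what makes range scans cheap, and it is also how HBase splits a table into regions: each region serves one contiguous slice of the row-key space.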
This document discusses the versioning conventions and history of Hadoop releases. It notes that features were occasionally developed on branches off the trunk codeline and that some releases included features from different branches, causing confusion. It also summarizes the status of Hadoop 1.0, which unified many previously separated features, and the versioning of the Cloudera CDH distribution in relation to Apache Hadoop releases.
Hadoop MapReduce Introduction and Deep Insight | Hanborq Inc.
Hadoop's next-generation MapReduce introduces YARN, which separates cluster resource management from application execution. YARN adds a global ResourceManager and per-node NodeManagers to manage resources; applications run as ApplicationMasters with containers on the nodes. This improves scalability and fault tolerance and allows application paradigms beyond MapReduce. Optimization techniques for MapReduce include tuning buffer sizes, enabling sort avoidance when sorted output is unnecessary, and using Netty and batch fetching to speed up the shuffle.
The document provides an overview of the Hadoop Distributed File System (HDFS). It describes HDFS's master-slave architecture with a single NameNode master and multiple DataNode slaves. The NameNode manages filesystem metadata and data placement, while DataNodes store data blocks. The document outlines HDFS components like the SecondaryNameNode, DataNodes, and how files are written and read. It also discusses high availability solutions, operational tools, and the future of HDFS.
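The metadata/data split described above can be sketched with a toy NameNode. This is a hedged illustration: the tiny block size, round-robin placement, and class names are invented stand-ins for HDFS's real 128 MB blocks and rack-aware placement policy:

```python
import itertools

class ToyNameNode:
    """Toy NameNode: holds only metadata -- which blocks make up a file and
    which 'DataNodes' hold each block. Block contents live on the DataNodes."""
    BLOCK_SIZE = 4  # bytes, tiny for illustration (HDFS default is 128 MB)

    def __init__(self, datanodes, replication=2):
        self.datanodes = datanodes
        self.replication = replication
        self.files = {}            # path -> [block ids]
        self.block_locations = {}  # block id -> [datanode names]
        self._ids = itertools.count()
        self._rr = itertools.cycle(range(len(datanodes)))

    def create(self, path, size):
        # Split the file into fixed-size blocks and place each block's replicas
        nblocks = -(-size // self.BLOCK_SIZE)  # ceiling division
        blocks = []
        for _ in range(nblocks):
            block_id = next(self._ids)
            start = next(self._rr)
            # round-robin stands in for HDFS's rack-aware placement policy
            replicas = [self.datanodes[(start + r) % len(self.datanodes)]
                        for r in range(self.replication)]
            self.block_locations[block_id] = replicas
            blocks.append(block_id)
        self.files[path] = blocks
        return blocks

namenode = ToyNameNode(["dn1", "dn2", "dn3"])
blocks = namenode.create("/logs/app.log", size=10)
```

A client read follows the same split: ask the NameNode for the block locations, then fetch each block directly from a DataNode, which is why the NameNode never becomes a data-path bottleneck.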
Hanborq Optimizations on Hadoop MapReduce | Hanborq Inc.
A Hanborq optimized Hadoop Distribution, especially with high performance of MapReduce. It's the core part of HDH (Hanborq Distribution with Hadoop for Big Data Engineering).