Geek camp

Sep 6, 2010

This document provides an introduction to and overview of Hadoop. It gives a brief history of Hadoop, from its origins in 2005, inspired by Google's GFS and MapReduce papers, to its heavy promotion by Yahoo! since 2006. It then explains why Hadoop suits 'Big Data' applications: it handles petabyte-scale data and is presented as scalable, robust, and secure. It covers use cases such as analytics, reporting, filtering, and machine learning on log files, user behavior data, and other structured or semi-structured sources. Finally, it outlines the available tools: the native Java APIs, Pig, Hive, Pipes for C and C++, and Streaming for other languages.

Hi!

● I work at

● Involved with Hadoop for 2+ years

Brief History of Hadoop
● 2005 -
● Inspired by the GFS and MapReduce papers published by Google
● Promoted heavily by Yahoo! since 2006
● Today, the de facto standard in 'Big Data' computing

Why?
● 'Big Data'
● How big? - petabyte scale
● Scalable
● Robust
● Secure!

When To Use It
● Can you use Hadoop to do X?
● Is your problem 'embarrassingly' parallel?
● Workflow?
– Dependent/Independent Tasks
● Data/CPU intensive?
● Can you use Hadoop to do X in the Clouds?
● Depends on where your data is

Why To Use It?
● Ad hoc analysis
● Semi/structured data
– Log files
– Text
– CSV, XML, anything really
– RDBMS
– NoSQL!

Use Cases
● Analytics
● User behavior
● Reporting
● Filtering
● Machine Learning
● Just storing your data

Just From The Logs
● Suppose you run a website
● User breakdown by browsers
● Location
● Understanding user session
– How long do they use it?
– Who are the active users?
– What part of my app do they use the most?
– What part of my app is user X's fav?
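
As a rough illustration of the questions above, the sketch below answers a few of them (browser breakdown, most active users, each user's most-visited page) with a map step, a grouping step, and a reduce step. It runs in plain Python on a handful of records rather than on a cluster. The tab-separated log format, the sample data, and all function names are assumptions made for this example; they do not come from the talk.

```python
from collections import defaultdict

# Hypothetical log format: user_id, browser, page, unix_timestamp (tab-separated).
SAMPLE_LOG = [
    "alice\tFirefox\t/inbox\t100",
    "alice\tFirefox\t/search\t160",
    "bob\tChrome\t/inbox\t120",
    "alice\tFirefox\t/inbox\t400",
    "carol\tChrome\t/settings\t130",
]

def map_record(line):
    """Map step: emit (key, value) pairs for one log line."""
    user, browser, page, _timestamp = line.split("\t")
    yield ("browser", browser), 1        # user breakdown by browser
    yield ("user_hits", user), 1         # who are the active users?
    yield ("user_page", (user, page)), 1 # which part does user X use most?

def shuffle(pairs):
    """Group values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(groups):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}

def analyze(lines):
    """Run map, shuffle, and reduce over an iterable of log lines."""
    pairs = (pair for line in lines for pair in map_record(line))
    return reduce_counts(shuffle(pairs))

counts = analyze(SAMPLE_LOG)
```

Because every log line is mapped independently of the others, this kind of counting is a natural fit for the 'embarrassingly parallel' case: the map step can be spread across many machines, and only the grouped counts need to be brought together.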

Tools
● Native Hadoop APIs – Java
● Streaming – Perl, Python, Ruby, or any language, as long as it supports 'stdin' and 'stdout'
● Pig
● Hive
● Pipes – C and C++
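
To make the Streaming option concrete, here is a minimal sketch of a mapper and reducer that talk to the framework only through 'stdin' and 'stdout'. It assumes tab-separated log lines with the browser name in the second field, and it relies on the Streaming convention that reducer input arrives sorted by key. The script name `browser_count.py` and the `map`/`reduce` command-line arguments are hypothetical choices for this example.

```python
import sys
from itertools import groupby

def mapper(lines):
    """Emit 'browser<TAB>1' per log line (browser assumed to be field 2)."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            yield f"{fields[1]}\t1"

def reducer(lines):
    """Sum the counts for each key; input must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"

ROLES = {"map": mapper, "reduce": reducer}

# Run as "python browser_count.py map" or "python browser_count.py reduce".
if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1] in ROLES:
    for out in ROLES[sys.argv[1]](sys.stdin):
        print(out)
```

The same script can be tried without a cluster by piping a log file through `python browser_count.py map | sort | python browser_count.py reduce`, where `sort` stands in for the framework's shuffle. On a cluster, the two commands would be passed to the Hadoop Streaming jar through its `-mapper` and `-reducer` options, together with `-input` and `-output` paths.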

Don't Wait
● Hadoop
● hadoop.apache.org
● Cloudera tutorials on Hadoop
● Books

1. Intro to Hadoop, Jaideep Dhok
2. Hi!
3. Outline
4. Brief History of Hadoop
5. The Buzz
6. Why?
7. Scalability
8. When To Use It
9. Why To Use It?
10. Use Cases
11. Just From The Logs
12. Tools
13. Ecosystem
14. Don't Wait
15. Questions?
16. Thank You!
