A walk-thru of core Hadoop, the ecosystem tools, and Hortonworks Data Platform (HDP) followed by code examples in MapReduce (Java and C#), Pig, and Hive.
Presented at the Atlanta .NET User Group meeting in July 2014.
10 best practices and design principles to create effective dashboards using Tableau. View the webinar video recording to hear the narrated version of the good, the bad…and the downright ugly in dashboard design: http://www.senturus.com/resources/10-best-practices-for-tableau-dashboard-design/.
Senturus, a business analytics consulting firm, has a resource library with hundreds of free recorded webinars, trainings, demos and unbiased product reviews. Take a look and share them with your colleagues and friends: http://www.senturus.com/resources/.
Big Data Ecosystem at LinkedIn. Keynote talk at Big Data Innovators Gathering...Mitul Tiwari
LinkedIn has a large professional network with 360M members. They build data-driven products using members' rich profile data. To do this, they ingest online data into offline systems using Apache Kafka. The data is then processed using Hadoop, Spark, Samza and Cubert to compute features and train models. Results are moved back online using Voldemort and Kafka. For example, People You May Know recommendations are generated by triangle closing in Hadoop and Cubert to count common connections faster. Site speed is monitored in real-time using Samza to join logs from different services.
The document discusses factors to consider when selecting a NoSQL database management system (DBMS). It provides an overview of different NoSQL database types, including document databases, key-value databases, column databases, and graph databases. For each type, popular open-source options are described, such as MongoDB for document databases, Redis for key-value, Cassandra for columnar, and Neo4j for graph databases. The document emphasizes choosing a NoSQL solution based on application needs and recommends commercial support for production systems.
The document discusses three papers related to data warehouse design.
Paper 1 presents the X-META methodology, which addresses developing a first data warehouse project and integrates metadata creation and management into the development process. It proposes starting with a pilot project and defines three iteration types.
Paper 2 proposes extending the ER conceptual data model to allow modeling of multi-dimensional aggregated entities. It includes entity types for basic dimensions, simple aggregations, and multi-dimensional aggregated entities.
Paper 3 presents a comprehensive UML-based method for designing all phases of a data warehouse, from source data to implementation. It defines four schemas - operational, conceptual, storage, and business - and the mappings between them. It also provides steps
Max De Marzi gave an introduction to graph databases using Neo4j as an example. He discussed trends in big, connected data and how NoSQL databases like key-value stores, column families, and document databases address these trends. However, graph databases are optimized for interconnected data by modeling it as nodes and relationships. Neo4j is a graph database that uses a property graph data model and allows querying and traversal through its Cypher query language and Gremlin scripting language. It is well-suited for domains involving highly connected data like social networks.
This document provides information about Tableau, a data visualization software. It discusses Tableau's prerequisites, products, and architecture. Tableau allows users to easily connect to various data sources and transform data into interactive visualizations and dashboards. Key Tableau concepts covered include data sources, worksheets, dashboards, stories, filters, marks, color and size properties. The document also explains Tableau's desktop and server products, and the stages of importing data, analyzing it, and sharing results.
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Simplilearn
The document provides information about Hadoop training. It discusses the need for Hadoop in today's data-heavy world. It then describes what Hadoop is, its ecosystem including HDFS for storage and MapReduce for processing. It also discusses YARN and provides a bank use case. It further explains the architecture and working of HDFS and MapReduce in processing large datasets in parallel across clusters.
This document discusses efficient analysis of big data using the MapReduce framework. It introduces the challenges of analyzing large and complex datasets, and describes how MapReduce addresses these challenges through its map and reduce functions. MapReduce allows distributed processing of big data across clusters of computers using a simple programming model.
This presentation describes how to efficiently load data into Hive. I cover partitioning, predicate pushdown, ORC file optimization and different loading schemes
The document discusses Hive's new ACID (atomicity, consistency, isolation, durability) functionality which allows for updating and deleting rows in Hive tables. Key points include Hive now supporting SQL commands like INSERT, UPDATE and DELETE; storing changes in delta files and using transaction IDs; and running minor and major compactions to consolidate delta files. Future work may include multi-statement transactions, updating/deleting in streaming ingest, Parquet support, and adding MERGE statements.
Hive was initially developed by Facebook to manage large amounts of data stored in HDFS. It uses a SQL-like query language called HiveQL to analyze structured and semi-structured data. Hive compiles HiveQL queries into MapReduce jobs that are executed on a Hadoop cluster. It provides mechanisms for partitioning, bucketing, and sorting data to optimize query performance.
Apache Hive is a data warehouse software built on top of Hadoop that allows users to query data stored in various databases and file systems using an SQL-like interface. It provides a way to summarize, query, and analyze large datasets stored in Hadoop distributed file system (HDFS). Hive gives SQL capabilities to analyze data without needing MapReduce programming. Users can build a data warehouse by creating Hive tables, loading data files into HDFS, and then querying and analyzing the data using HiveQL, which Hive then converts into MapReduce jobs.
All about Big Data components and the best tools to ingest, process, store and visualize the data.
This is a keynote from the series "by Developer for Developers" powered by eSolutionsGrup.
Apache Sqoop efficiently transfers bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop helps offload certain tasks (such as ETL processing) from the EDW to Hadoop for efficient execution at a much lower cost. Sqoop can also be used to extract data from Hadoop and export it into external structured datastores. Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres, and HSQLDB
This document provides an overview of Tableau, a data visualization software. It outlines the agenda for the presentation, which will cover connecting to data, visual analytics with Tableau, dashboards and stories, calculations, and mapping capabilities. Tableau allows users to connect to various data sources, transform raw data into interactive visualizations, and share dashboards or publish them online. It is a leading tool for data analysis and visualization.
This document discusses Spark shuffle, which is an expensive operation that involves data partitioning, serialization/deserialization, compression, and disk I/O. It provides an overview of how shuffle works in Spark and the history of optimizations like sort-based shuffle and an external shuffle service. Key concepts discussed include shuffle writers, readers, and the pluggable block transfer service that handles data transfer. The document also covers shuffle-related configuration options and potential future work.
The landscape for storing your big data is quite complex, with several competing formats and different implementations of each format. Understanding your use of the data is critical for picking the format. Depending on your use case, the different formats perform very differently. Although you can use a hammer to drive a screw, it isn’t fast or easy to do so.
The use cases that we’ve examined are:
* reading all of the columns
* reading a few of the columns
* filtering using a filter predicate
* writing the data
Furthermore, different kinds of data have distinct properties. We've used three real schemas:
* the NYC taxi data http://tinyurl.com/nyc-taxi-analysis
* the Github access logs http://githubarchive.org
* a typical sales fact table with generated data
Finally, the value of having open source benchmarks that are available to all interested parties is hugely important and all of the code is available from Apache.
- DynamoDB is a fully managed NoSQL database service by Amazon that provides fast and predictable performance with seamless scalability.
- It uses an eventually consistent, distributed architecture to store data across multiple servers and provides automatic scaling of read and write throughput capacity.
- DynamoDB uses vector clocks to track multiple versions of data that may exist due to asynchronous replication and eventual consistency, applying both syntactic and semantic reconciliation of data conflicts.
Hadoop DFS consists of HDFS for storage and MapReduce for processing. HDFS provides massive storage, fault tolerance through data replication, and high throughput access to data. It uses a master-slave architecture with a NameNode managing the file system namespace and DataNodes storing file data blocks. The NameNode ensures data reliability through policies that replicate blocks across racks and nodes. HDFS provides scalability, flexibility and low-cost storage of large datasets.
This document discusses PySpark DataFrames. It notes that DataFrames can be constructed from various data sources and are conceptually similar to tables in a relational database. The document explains that DataFrames allow richer optimizations than RDDs due to avoiding context switching between Java and Python. It provides links to resources that demonstrate how to create DataFrames, perform queries using DataFrame APIs and Spark SQL, and use an example flight data DataFrame.
The document summarizes the history and evolution of non-relational databases, known as NoSQL databases. It discusses early database systems like MUMPS and IMS, the development of the relational model in the 1970s, and more recent NoSQL databases developed by companies like Google, Amazon, Facebook to handle large, dynamic datasets across many servers. Pioneering systems like Google's Bigtable and Amazon's Dynamo used techniques like distributed indexing, versioning, and eventual consistency that influenced many open-source NoSQL databases today.
Architecting Agile Data Applications for ScaleDatabricks
Data analytics and reporting platforms historically have been rigid, monolithic, hard to change and have limited ability to scale up or scale down. I can’t tell you how many times I have heard a business user ask for something as simple as an additional column in a report and IT says it will take 6 months to add that column because it doesn’t exist in the datawarehouse. As a former DBA, I can tell you the countless hours I have spent “tuning” SQL queries to hit pre-established SLAs. This talk will talk about how to architect modern data and analytics platforms in the cloud to support agility and scalability. We will include topics like end to end data pipeline flow, data mesh and data catalogs, live data and streaming, performing advanced analytics, applying agile software development practices like CI/CD and testability to data applications and finally taking advantage of the cloud for infinite scalability both up and down.
Facing trouble in distinguishing Big Data, Hadoop & NoSQL as well as finding connection among them? This slide of Savvycom team can definitely help you.
Enjoy reading!
Hive and Apache Tez: Benchmarked at Yahoo! ScaleDataWorks Summit
This document discusses benchmarking Hive at Yahoo scale. Some key points:
- Hive is the fastest growing product on Yahoo's Hadoop clusters which process 750k jobs per day across 32500 nodes.
- Benchmarking was done using TPC-H queries on 100GB, 1TB, and 10TB datasets stored in ORC format.
- Significant performance improvements were seen over earlier Hive versions, with 18x speedup over Hive 0.10 on text files for the 100GB dataset.
- Average query time was reduced from 530 seconds to 28 seconds for the 100GB dataset, and from 729 seconds to 172 seconds for the 1TB dataset.
This document provides an overview of database basics and concepts for business analysts. It covers topics such as the need for databases, different types of database management systems (DBMS), data storage in tables, common database terminology, database normalization, SQL queries including joins and aggregations, and database design concepts.
This document provides an overview and instructions for using Tableau software for data visualization and analysis. It describes Tableau as a tool for simplifying data into understandable formats via dashboards and worksheets. Steps are outlined for connecting a CSV file on demographic data to Tableau, creating a map visualization showing populations by state in India, and differences between live and extract connections. Basic concepts like dimensions, measures, and different methods for creating visualizations through drag and drop or double clicking are also summarized.
An Intro to NoSQL Databases -- NoSQL databases will not become the new dominators. Relational will still be popular, and used in the majority of situations. They, however, will no longer be the automatic choice. (source : http://martinfowler.com/)
Transformation Processing Smackdown; Spark vs Hive vs PigLester Martin
This document provides an overview and comparison of different data transformation frameworks including Apache Pig, Apache Hive, and Apache Spark. It discusses features such as file formats, source to target mappings, data quality checks, and core processing functionality. The document contains code examples demonstrating how to perform common ETL tasks in each framework using delimited, XML, JSON, and other file formats. It also covers topics like numeric validation, data mapping, and performance. The overall purpose is to help users understand the different options for large-scale data processing in Hadoop.
This presentation describes how to efficiently load data into Hive. I cover partitioning, predicate pushdown, ORC file optimization and different loading schemes
The document discusses Hive's new ACID (atomicity, consistency, isolation, durability) functionality which allows for updating and deleting rows in Hive tables. Key points include Hive now supporting SQL commands like INSERT, UPDATE and DELETE; storing changes in delta files and using transaction IDs; and running minor and major compactions to consolidate delta files. Future work may include multi-statement transactions, updating/deleting in streaming ingest, Parquet support, and adding MERGE statements.
Hive was initially developed by Facebook to manage large amounts of data stored in HDFS. It uses a SQL-like query language called HiveQL to analyze structured and semi-structured data. Hive compiles HiveQL queries into MapReduce jobs that are executed on a Hadoop cluster. It provides mechanisms for partitioning, bucketing, and sorting data to optimize query performance.
Apache Hive is a data warehouse software built on top of Hadoop that allows users to query data stored in various databases and file systems using an SQL-like interface. It provides a way to summarize, query, and analyze large datasets stored in Hadoop distributed file system (HDFS). Hive gives SQL capabilities to analyze data without needing MapReduce programming. Users can build a data warehouse by creating Hive tables, loading data files into HDFS, and then querying and analyzing the data using HiveQL, which Hive then converts into MapReduce jobs.
All about Big Data components and the best tools to ingest, process, store and visualize the data.
This is a keynote from the series "by Developer for Developers" powered by eSolutionsGrup.
Apache Sqoop efficiently transfers bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop helps offload certain tasks (such as ETL processing) from the EDW to Hadoop for efficient execution at a much lower cost. Sqoop can also be used to extract data from Hadoop and export it into external structured datastores. Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres, and HSQLDB
This document provides an overview of Tableau, a data visualization software. It outlines the agenda for the presentation, which will cover connecting to data, visual analytics with Tableau, dashboards and stories, calculations, and mapping capabilities. Tableau allows users to connect to various data sources, transform raw data into interactive visualizations, and share dashboards or publish them online. It is a leading tool for data analysis and visualization.
This document discusses Spark shuffle, which is an expensive operation that involves data partitioning, serialization/deserialization, compression, and disk I/O. It provides an overview of how shuffle works in Spark and the history of optimizations like sort-based shuffle and an external shuffle service. Key concepts discussed include shuffle writers, readers, and the pluggable block transfer service that handles data transfer. The document also covers shuffle-related configuration options and potential future work.
The landscape for storing your big data is quite complex, with several competing formats and different implementations of each format. Understanding your use of the data is critical for picking the format. Depending on your use case, the different formats perform very differently. Although you can use a hammer to drive a screw, it isn’t fast or easy to do so.
The use cases that we’ve examined are:
* reading all of the columns
* reading a few of the columns
* filtering using a filter predicate
* writing the data
Furthermore, different kinds of data have distinct properties. We've used three real schemas:
* the NYC taxi data http://tinyurl.com/nyc-taxi-analysis
* the Github access logs http://githubarchive.org
* a typical sales fact table with generated data
Finally, the value of having open source benchmarks that are available to all interested parties is hugely important and all of the code is available from Apache.
- DynamoDB is a fully managed NoSQL database service by Amazon that provides fast and predictable performance with seamless scalability.
- It uses an eventually consistent, distributed architecture to store data across multiple servers and provides automatic scaling of read and write throughput capacity.
- DynamoDB uses vector clocks to track multiple versions of data that may exist due to asynchronous replication and eventual consistency, applying both syntactic and semantic reconciliation of data conflicts.
Hadoop DFS consists of HDFS for storage and MapReduce for processing. HDFS provides massive storage, fault tolerance through data replication, and high throughput access to data. It uses a master-slave architecture with a NameNode managing the file system namespace and DataNodes storing file data blocks. The NameNode ensures data reliability through policies that replicate blocks across racks and nodes. HDFS provides scalability, flexibility and low-cost storage of large datasets.
This document discusses PySpark DataFrames. It notes that DataFrames can be constructed from various data sources and are conceptually similar to tables in a relational database. The document explains that DataFrames allow richer optimizations than RDDs due to avoiding context switching between Java and Python. It provides links to resources that demonstrate how to create DataFrames, perform queries using DataFrame APIs and Spark SQL, and use an example flight data DataFrame.
The document summarizes the history and evolution of non-relational databases, known as NoSQL databases. It discusses early database systems like MUMPS and IMS, the development of the relational model in the 1970s, and more recent NoSQL databases developed by companies like Google, Amazon, Facebook to handle large, dynamic datasets across many servers. Pioneering systems like Google's Bigtable and Amazon's Dynamo used techniques like distributed indexing, versioning, and eventual consistency that influenced many open-source NoSQL databases today.
Architecting Agile Data Applications for ScaleDatabricks
Data analytics and reporting platforms historically have been rigid, monolithic, hard to change and have limited ability to scale up or scale down. I can’t tell you how many times I have heard a business user ask for something as simple as an additional column in a report and IT says it will take 6 months to add that column because it doesn’t exist in the datawarehouse. As a former DBA, I can tell you the countless hours I have spent “tuning” SQL queries to hit pre-established SLAs. This talk will talk about how to architect modern data and analytics platforms in the cloud to support agility and scalability. We will include topics like end to end data pipeline flow, data mesh and data catalogs, live data and streaming, performing advanced analytics, applying agile software development practices like CI/CD and testability to data applications and finally taking advantage of the cloud for infinite scalability both up and down.
Facing trouble in distinguishing Big Data, Hadoop & NoSQL as well as finding connection among them? This slide of Savvycom team can definitely help you.
Enjoy reading!
Hive and Apache Tez: Benchmarked at Yahoo! ScaleDataWorks Summit
This document discusses benchmarking Hive at Yahoo scale. Some key points:
- Hive is the fastest growing product on Yahoo's Hadoop clusters which process 750k jobs per day across 32500 nodes.
- Benchmarking was done using TPC-H queries on 100GB, 1TB, and 10TB datasets stored in ORC format.
- Significant performance improvements were seen over earlier Hive versions, with 18x speedup over Hive 0.10 on text files for the 100GB dataset.
- Average query time was reduced from 530 seconds to 28 seconds for the 100GB dataset, and from 729 seconds to 172 seconds for the 1TB dataset.
This document provides an overview of database basics and concepts for business analysts. It covers topics such as the need for databases, different types of database management systems (DBMS), data storage in tables, common database terminology, database normalization, SQL queries including joins and aggregations, and database design concepts.
This document provides an overview and instructions for using Tableau software for data visualization and analysis. It describes Tableau as a tool for simplifying data into understandable formats via dashboards and worksheets. Steps are outlined for connecting a CSV file on demographic data to Tableau, creating a map visualization showing populations by state in India, and differences between live and extract connections. Basic concepts like dimensions, measures, and different methods for creating visualizations through drag and drop or double clicking are also summarized.
An Intro to NoSQL Databases -- NoSQL databases will not become the new dominators. Relational will still be popular, and used in the majority of situations. They, however, will no longer be the automatic choice. (source : http://martinfowler.com/)
Transformation Processing Smackdown; Spark vs Hive vs PigLester Martin
This document provides an overview and comparison of different data transformation frameworks including Apache Pig, Apache Hive, and Apache Spark. It discusses features such as file formats, source to target mappings, data quality checks, and core processing functionality. The document contains code examples demonstrating how to perform common ETL tasks in each framework using delimited, XML, JSON, and other file formats. It also covers topics like numeric validation, data mapping, and performance. The overall purpose is to help users understand the different options for large-scale data processing in Hadoop.
This document discusses Robert Vandehey's experience moving from C#/.NET to Hadoop/MongoDB at Rovi Corporation. The three main points are:
1) Rovi was experiencing slow ETL and data loading times from their SQL databases into their applications. They implemented a solution using Hadoop to extract, transform, and load the data, and MongoDB to serve the data to applications.
2) Some challenges in the transition included moving the .NET development team to Linux/Java, backwards compatibility for web services, and writes to MongoDB overwhelming disks.
3) Lessons learned include using Hadoop and Pig for as much processing as possible, properly sizing and configuring MongoDB for writes from H
Stream Processing using Apache Spark and Apache KafkaAbhinav Singh
This document provides an agenda for a session on Apache Spark Streaming and Kafka integration. It includes an introduction to Spark Streaming, working with DStreams and RDDs, an example of word count streaming, and steps for integrating Spark Streaming with Kafka including creating topics and producers. The session will also include a hands-on demo of streaming word count from Kafka using CloudxLab.
Interest is growing in the Apache Spark community in using Deep Learning techniques and in the Deep Learning community in scaling algorithms with Apache Spark. A few of them to note include:
· Databrick’s efforts in scaling Deep learning with Spark
· Intel announcing the BigDL: A Deep learning library for Spark
· Yahoo’s recent efforts to opensource TensorFlowOnSpark
In this lecture we will discuss the key use cases and developments that have emerged in the last year in using Deep Learning techniques with Spark.
Boosting spark performance: An Overview of TechniquesAhsan Javed Awan
This document provides an overview of techniques to boost Spark performance, including:
1) Phase 1 focused on memory management, code generation, and cache-aware algorithms which provided 5-30x speedups
2) Phase 2 focused on whole-stage code generation and columnar in-memory support which are now enabled by default in Spark 2.0+
3) Additional techniques discussed include choosing an optimal garbage collector, using multiple small executors, exploiting data locality, disabling hardware prefetchers, and keeping hyper-threading on.
Interest in Deep Learning has been growing in the past few years. With advances in software and hardware technologies, Neural Networks are making a resurgence. With interest in AI based applications growing, and companies like IBM, Google, Microsoft, NVidia investing heavily in computing and software applications, it is time to understand Deep Learning better!
In this lecture, we will discuss the basics of Neural Networks and discuss how Deep Learning Neural networks are different from conventional Neural Network architectures. We will review a bit of mathematics that goes into building neural networks and understand the role of GPUs in Deep Learning. We will also get an introduction to Autoencoders, Convolutional Neural Networks, Recurrent Neural Networks and understand the state-of-the-art in hardware and software architectures. Functional Demos will be presented in Keras, a popular Python package with a backend in Theano. This will be a preview of the QuantUniversity Deep Learning Workshop that will be offered in 2017.
How to Use Apache Zeppelin with HWX HDBHortonworks
Part five in a five-part series, this webcast will be a demonstration of the integration of Apache Zeppelin and Pivotal HDB. Apache Zeppelin is a web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more. This webinar will demonstrate the configuration of the psql interpreter and the basic operations of Apache Zeppelin when used in conjunction with Hortonworks HDB.
This document provides an agenda for a presentation on using machine learning with Apache Spark. The presentation introduces Apache Spark and its architecture, Scala notebooks in Spark, machine learning components in Spark, pipelines for machine learning tasks, and examples of regression, classification, clustering, collaborative filtering, and model tuning in Spark. It aims to provide a hands-on, example-driven introduction to applying machine learning using Apache Spark without deep dives into Spark architecture or algorithms.
- The document discusses using Ansible to deploy Hortonworks Data Platform (HDP) clusters.
- It demonstrates how to use Ansible playbooks to provision AWS infrastructure and install HDP on a 6-node cluster in about 20 minutes with just a few configuration file modifications and running two scripts.
- The deployment time can be optimized by adjusting the number and size of nodes, with larger instance types and more master nodes decreasing installation time.
The path to a Modern Data Architecture in Financial ServicesHortonworks
Delivering Data-Driven Applications at the Speed of Business: Global Banking AML use case.
Chief Data Officers in financial services have unique challenges: they need to establish an effective data ecosystem under strict governance and regulatory requirements. They need to build the data-driven applications that enable risk and compliance initiatives to run efficiently. In this webinar, we will discuss the case of a global banking leader and the anti-money laundering solution they built on the data lake. With a single platform to aggregate structured and unstructured information essential to determine and document AML case disposition, they reduced mean time for case resolution by 75%. They have a roadmap for building over 150 data-driven applications on the same search-based data discovery platform so they can mitigate risks and seize opportunities, at the speed of business.
The document discusses Spark job failures and Spark/YARN architecture. It describes a Spark job failure due to a task failing 4 times with a NumberFormatException when parsing a string. It then explains that Spark jobs are divided into stages made up of tasks, and the entire job fails if a stage fails. The document also provides an overview of the Spark and YARN architectures, showing how Spark jobs are submitted to and run via the YARN resource manager.
H2O Distributed Deep Learning by Arno Candel 071614Sri Ambati
Deep Learning R Vignette Documentation: https://github.com/0xdata/h2o/tree/master/docs/deeplearning/
Deep Learning has been dominating recent machine learning competitions with better predictions. Unlike the neural networks of the past, modern Deep Learning methods have cracked the code for training stability and generalization. Deep Learning is not only the leader in image and speech recognition tasks, but is also emerging as the algorithm of choice in traditional business analytics.
This talk introduces Deep Learning and implementation concepts in the open-source H2O in-memory prediction engine. Designed for the solution of enterprise-scale problems on distributed compute clusters, it offers advanced features such as adaptive learning rate, dropout regularization and optimization for class imbalance. World record performance on the classic MNIST dataset, best-in-class accuracy for eBay text classification and others showcase the power of this game changing technology. A whole new ecosystem of Intelligent Applications is emerging with Deep Learning at its core.
About the Speaker: Arno Candel
Prior to joining 0xdata as Physicist & Hacker, Arno was a founding Senior MTS at Skytree where he designed and implemented high-performance machine learning algorithms. He has over a decade of experience in HPC with C++/MPI and had access to the world's largest supercomputers as a Staff Scientist at SLAC National Accelerator Laboratory where he participated in US DOE scientific computing initiatives. While at SLAC, he authored the first curvilinear finite-element simulation code for space-charge dominated relativistic free electrons and scaled it to thousands of compute nodes.
He also led a collaboration with CERN to model the electromagnetic performance of CLIC, a ginormous e+e- collider and potential successor of LHC. Arno has authored dozens of scientific papers and was a sought-after academic conference speaker. He holds a PhD and Masters summa cum laude in Physics from ETH Zurich.
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
NVIDIA compute GPUs and software toolkits are key drivers behind major advancements in machine learning. Of particular interest is a technique called "deep learning", which utilizes what are known as Convolution Neural Networks (CNNs) having landslide success in computer vision and widespread adoption in a variety of fields such as autonomous vehicles, cyber security, and healthcare. In this talk is presented a high level introduction to deep learning where we discuss core concepts, success stories, and relevant use cases. Additionally, we will provide an overview of essential frameworks and workflows for deep learning. Finally, we explore emerging domains for GPU computing such as large-scale graph analytics, in-memory databases.
https://tech.rakuten.co.jp/
Video: https://www.youtube.com/watch?v=kkOG_aJ9KjQ
This talk gives details about Spark internals and an explanation of the runtime behavior of a Spark application. It explains how high level user programs are compiled into physical execution plans in Spark. It then reviews common performance bottlenecks encountered by Spark users, along with tips for diagnosing performance problems in a production application.
Impala is an open source SQL query engine for Apache Hadoop that allows real-time queries on large datasets stored in HDFS and other data stores. It uses a distributed architecture where an Impala daemon runs on each node and coordinates query planning and execution across nodes. Impala allows SQL queries to be run directly against files stored in HDFS and other formats like Avro and Parquet. It aims to provide high performance for both analytical and transactional workloads through its C++ implementation and avoidance of MapReduce.
This document discusses big data analytics and SQL on Hadoop. It begins with an introduction to big data analysis and different approaches for analyzing big data, including traditional analytic databases, MapReduce, and SQL on Hadoop. The document then focuses on SQL on Hadoop, explaining that it refers to a new generation of analytic databases built on Hadoop. It provides examples of SQL on Hadoop systems like Hive, Impala, and Presto. The rest of the document discusses Google's Dremel and Tenzing systems for interactively analyzing large datasets using SQL.
The document provides information about Hive and Pig, two frameworks for analyzing large datasets using Hadoop. It compares Hive and Pig, noting that Hive uses a SQL-like language called HiveQL to manipulate data, while Pig uses Pig Latin scripts and operates on data flows. The document also includes code examples demonstrating how to use basic operations in Hive and Pig like loading data, performing word counts, joins, and outer joins on sample datasets.
This document provides an introduction and overview of Apache Hadoop. It discusses how Hadoop provides the ability to store and analyze large datasets in the petabyte range across clusters of commodity hardware. It compares Hadoop to other systems like relational databases and HPC and describes how Hadoop uses MapReduce to process data in parallel. The document outlines how companies are using Hadoop for applications like log analysis, machine learning, and powering new data-driven business features and products.
Big Data Warehousing: Pig vs. Hive ComparisonCaserta
In a recent Big Data Warehousing Meetup in NYC, Caserta Concepts partnered with Datameer to explore big data analytics techniques. In the presentation, we made a Hive vs. Pig Comparison. For more information on our services or this presentation, please visit www.casertaconcepts.com or contact us at info (at) casertaconcepts.com.
http://www.casertaconcepts.com
Hadoop is an open-source software framework that allows for the distributed processing of large data sets across clusters of computers. It reliably stores and processes gobs of information across many commodity computers. Key components of Hadoop include the HDFS distributed file system for high-bandwidth storage, and MapReduce for parallel data processing. Hadoop can deliver data and run large-scale jobs reliably in spite of system changes or failures by detecting and compensating for hardware problems in the cluster.
Apache Hadoop is a popular open-source framework for storing and processing large datasets across clusters of computers. It includes Apache HDFS for distributed storage, YARN for job scheduling and resource management, and MapReduce for parallel processing. The Hortonworks Data Platform is an enterprise-grade distribution of Apache Hadoop that is fully open source.
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2tcloudcomputing-tw
The presentation is designed for those interested in Hadoop technology, and can enhance your knowledge in Hadoop, such as community history, current development status, features of services, distributed computing framework and scenario of big data development in Enterprise.
Overview of Big data, Hadoop and Microsoft BI - version1Thanh Nguyen
Big Data and advanced analytics are critical topics for executives today. But many still aren't sure how to turn that promise into value. This presentation provides an overview of 16 examples and use cases that lay out the different ways companies have approached the issue and found value: everything from pricing flexibility to customer preference management to credit risk analysis to fraud protection and discount targeting. For the latest on Big Data & Advanced Analytics: http://mckinseyonmarketingandsales.com/topics/big-data
Overview of big data & hadoop version 1 - Tony NguyenThanh Nguyen
Overview of Big data, Hadoop and Microsoft BI - version1
Big Data and Hadoop are emerging topics in data warehousing for many executives, BI practices and technologists today. However, many people still aren't sure how Big Data and existing Data warehouse can be married and turn that promise into value. This presentation provides an overview of Big Data technology and how Big Data can fit to the current BI/data warehousing context.
http://www.quantumit.com.au
http://www.evisional.com
This document provides an overview of Hadoop and Big Data. It begins with introducing key concepts like structured, semi-structured, and unstructured data. It then discusses the growth of data and need for Big Data solutions. The core components of Hadoop like HDFS and MapReduce are explained at a high level. The document also covers Hadoop architecture, installation, and developing a basic MapReduce program.
This document discusses the Stinger initiative to improve the performance of Apache Hive. Stinger aims to speed up Hive queries by 100x, scale queries from terabytes to petabytes of data, and expand SQL support. Key developments include optimizing Hive to run on Apache Tez, the vectorized query execution engine, cost-based optimization using Optiq, and performance improvements from the ORC file format. The goals of Stinger Phase 3 are to deliver interactive query performance for Hive by integrating these technologies.
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for womenmaharajothip1
This document provides an overview of Hadoop, an open-source software framework for distributed storage and processing of large datasets across commodity hardware. It discusses Hadoop's history and goals, describes its core architectural components including HDFS, MapReduce and their roles, and gives examples of how Hadoop is used at large companies to handle big data.
The document provides an overview of Apache Hadoop and related big data technologies. It discusses Hadoop components like HDFS for storage, MapReduce for processing, and HBase for columnar storage. It also covers related projects like Hive for SQL queries, ZooKeeper for coordination, and Hortonworks and Cloudera distributions.
Big Data Hoopla Simplified - TDWI Memphis 2014Rajan Kanitkar
The document provides an overview and quick reference guide to big data concepts including Hadoop, MapReduce, HDFS, YARN, Spark, Storm, Hive, Pig, HBase and NoSQL databases. It discusses the evolution of Hadoop from versions 1 to 2, and new frameworks like Tez and YARN that allow different types of processing beyond MapReduce. The document also summarizes common big data challenges around skills, integration and analytics.
* The file size is 1664MB
* HDFS block size is usually 128MB by default in Hadoop 2.0
* To calculate number of blocks required: File size / Block size
* 1664MB / 128MB = 13 blocks
* 8 blocks have been uploaded successfully
* So remaining blocks = Total blocks - Uploaded blocks = 13 - 8 = 5
If another client tries to access/read the data while the upload is still in progress, it will only be able to access the data from the 8 blocks that have been uploaded so far. The remaining 5 blocks of data will not be available or visible to other clients until the full upload is completed. HDFS follows write-once semantics, so partial
The document discusses Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It describes how Hadoop addresses the growing volume, variety and velocity of big data through its core components: HDFS for storage, and MapReduce for distributed processing. Key features of Hadoop include scalability, flexibility, reliability and economic viability for large-scale data analytics.
Big Data raises challenges about how to process such vast pool of raw data and how to aggregate value to our lives. For addressing these demands an ecosystem of tools named Hadoop was conceived.
This document discusses building big data solutions using Microsoft's HDInsight platform. It provides an overview of big data and Hadoop concepts like MapReduce, HDFS, Hive and Pig. It also describes HDInsight and how it can be used to run Hadoop clusters on Azure. The document concludes by discussing some challenges with Hadoop and the broader ecosystem of technologies for big data beyond just Hadoop.
The document discusses how Hadoop can be used for interactive and real-time data analysis. It notes that the amount of digital data is growing exponentially and will reach 40 zettabytes by 2020. Traditional data systems are struggling to manage this new data. Hadoop provides a solution by tying together inexpensive servers to act as one large computer for processing big data using various Apache projects for data access, governance, security and operations. Examples show how Hadoop can be used to analyze real-time streaming data from sensors on trucks to monitor routes, vehicles and drivers.
Big Data Developers Moscow Meetup 1 - sql on hadoopbddmoscow
This document summarizes a meetup about Big Data and SQL on Hadoop. The meetup included discussions on what Hadoop is, why SQL on Hadoop is useful, what Hive is, and introduced IBM's BigInsights software for running SQL on Hadoop with improved performance over other solutions. Key topics included HDFS file storage, MapReduce processing, Hive tables and metadata storage, and how BigInsights provides a massively parallel SQL engine instead of relying on MapReduce.
This webinar series covers Apache Kafka and Apache Storm for streaming data processing. Also, it discusses new streaming innovations for Kafka and Storm included in HDP 2.2
Mieke Jans is a Manager at Deloitte Analytics Belgium. She learned about process mining from her PhD supervisor while she was collaborating with a large SAP-using company for her dissertation.
Mieke extended her research topic to investigate the data availability of process mining data in SAP and the new analysis possibilities that emerge from it. It took her 8-9 months to find the right data and prepare it for her process mining analysis. She needed insights from both process owners and IT experts. For example, one person knew exactly how the procurement process took place at the front end of SAP, and another person helped her with the structure of the SAP-tables. She then combined the knowledge of these different persons.
Just-in-time: Repetitive production system in which processing and movement of materials and goods occur just as they are needed, usually in small batches
JIT is characteristic of lean production systems
JIT operates with very little “fat”
GenAI for Quant Analytics: survey-analytics.aiInspirient
Pitched at the Greenbook Insight Innovation Competition as apart of IIEX North America 2025 on 30 April 2025 in Washington, D.C.
Join us at survey-analytics.ai!
Thingyan is now a global treasure! See how people around the world are search...Pixellion
We explored how the world searches for 'Thingyan' and 'သင်္ကြန်' and this year, it’s extra special. Thingyan is now officially recognized as a World Intangible Cultural Heritage by UNESCO! Dive into the trends and celebrate with us!
AI Competitor Analysis: How to Monitor and Outperform Your CompetitorsContify
AI competitor analysis helps businesses watch and understand what their competitors are doing. Using smart competitor intelligence tools, you can track their moves, learn from their strategies, and find ways to do better. Stay smart, act fast, and grow your business with the power of AI insights.
For more information please visit here https://www.contify.com/
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
1. Hadoop Demystified
What is it? How does Microsoft fit in?
and… of course… some demos!
Presentation for ATL .NET User Group
(July, 2014)
Lester Martin
Page 1
2. Agenda
• Hadoop 101
–Fundamentally, What is Hadoop?
–How is it Different?
–History of Hadoop
• Components of the Hadoop Ecosystem
• MapReduce, Pig, and Hive Demos
–Word Count
–Open Georgia Dataset Analysis
Page 2
3. Connection before Content
• Lester Martin
• Hortonworks – Professional Services
• [email protected]
• http://about.me/lestermartin (links to blog, github, twitter, LI, FB, etc)
Page 3
5. The Need for Hadoop
• Store and use all types of data
• Process ALL the data; not just a sample
• Scalability to 1000s of nodes
• Commodity hardware
Page 5
6. Relational Database vs. Hadoop
Relational Hadoop
Required on write schema Required on Read
Reads are fast speed Writes are fast
Standards and structure governance Loosely structured
Limited, no data processing processing Processing coupled with data
Structured data types Multi and unstructured
Interactive OLAP Analytics
Complex ACID Transactions
Operational Data Store
best fit use Data Discovery
Processing unstructured data
Massive storage/processing
P
7. Fundamentally, a Simple Algorithm
1. Review stack of quarters
2. Count each year that ends
in an even number
Page 7
9. Distributed Algorithm – Map:Reduce
Page 9
Map
(total number of quarters)
Reduce
(sum each person’s total)
10. A Brief History of Apache Hadoop
Page 10
2013
Focus on INNOVATION
2005: Hadoop created
at Yahoo!
Focus on OPERATIONS
2008: Yahoo team extends focus to
operations to support multiple
projects & growing clusters
Yahoo! begins to
Operate at scale
Enterprise
Hadoop
Apache Project
Established
Hortonworks
Data Platform
2004 2008 2010 20122006
STABILITY
2011: Hortonworks created to focus on
“Enterprise Hadoop“. Starts with 24
key Hadoop engineers from Yahoo
12. HDP: Enterprise Hadoop Platform
Page 12
Hortonworks
Data Platform (HDP)
• The ONLY 100% open source
and complete platform
• Integrates full range of
enterprise-ready services
• Certified and tested at scale
• Engineered for deep
ecosystem interoperability
OS/VM Cloud Appliance
PLATFORM
SERVICES
HADOOP
CORE
Enterprise Readiness
High Availability, Disaster
Recovery, Rolling Upgrades,
Security and Snapshots
HORTONWORKS
DATA PLATFORM (HDP)
OPERATIONAL
SERVICES
DATA
SERVICES
HDFS
SQOOP
FLUME
NFS
LOAD &
EXTRACT
WebHDFS
KNOX*
OOZIE
AMBARI
FALCON*
YARN
MAP
TEZREDUCE
HIVE &
HCATALOG
PIGHBASE
15. Hive
• Data warehousing package built on top of Hadoop
• Bringing structure to unstructured data
• Query petabytes of data with HiveQL
• Schema on read
1
•
•
–
–
16. Hive: SQL-Like Interface to Hadoop
• Provides basic SQL functionality using MapReduce to
execute queries
• Supports standard SQL clauses
INSERT INTO
SELECT
FROM … JOIN … ON
WHERE
GROUP BY
HAVING
ORDER BY
LIMIT
• Supports basic DDL
CREATE/ALTER/DROP TABLE, DATABASE
Page 17
17. Hortonworks Investment
in Apache Hive
Batch AND Interactive SQL-IN-Hadoop
Stinger Initiative
A broad, community-based effort to
drive the next generation of HIVE
Page 18
Stinger Phase 3
• Hive on Apache Tez
• Query Service (always on)
• Buffer Cache
• Cost Based Optimizer (Optiq)
Stinger Phase 1:
• Base Optimizations
• SQL Types
• SQL Analytic Functions
• ORCFile Modern File Format
Stinger Phase 2:
• SQL Types
• SQL Analytic Functions
• Advanced Optimizations
• Performance Boosts via YARN
Speed
Improve Hive query performance by 100X to
allow for interactive query times (seconds)
Scale
The only SQL interface to Hadoop designed
for queries that scale from TB to PB
Goals:
…70% complete
in 6 months…all IN Hadoop
SQL
Support broadest range of SQL semantics for
analytic applications running against Hadoop
18. Stinger: Enhancing SQL Semantics
Page 19
Hive SQL Datatypes Hive SQL Semantics
INT SELECT, LOAD, INSERT from query
TINYINT/SMALLINT/BIGINT Expressions in WHERE and HAVING
BOOLEAN GROUP BY, ORDER BY, SORT BY
FLOAT Sub-queries in FROM clause
DOUBLE GROUP BY, ORDER BY
STRING CLUSTER BY, DISTRIBUTE BY
TIMESTAMP ROLLUP and CUBE
BINARY UNION
DECIMAL LEFT, RIGHT and FULL INNER/OUTER JOIN
ARRAY, MAP, STRUCT, UNION CROSS JOIN, LEFT SEMI JOIN
CHAR Windowing functions (OVER, RANK, etc.)
VARCHAR INTERSECT, EXCEPT, UNION DISTINCT
DATE Sub-queries in HAVING
Sub-queries in WHERE (IN/NOT IN,
EXISTS/NOT EXISTS
Hive 0.10
Hive 12
Hive 0.11
Compete Subset
Hive 13
19. Pig
• Pig was created at Yahoo! to analyze data in HDFS without writing
Map/Reduce code.
• Two components:
– SQL like processing language called “Pig Latin”
– PIG execution engine producing Map/Reduce code
• Popular uses:
– ETL at scale (offloading)
– Text parsing and processing to Hive or HBase
– Aggregating data from multiple sources
•
•
•
20. Pig
Sample Code to find dropped call data:
4G_Data = LOAD ‘/archive/FDR_4G.txt’ using TextLoader();
Customer_Master = LOAD ‘masterdb.customer_data’ using
HCatLoader();
4G_Data_Full = JOIN 4G_Data by customerID, CustomerMaster by
customerID;
X = FILTER 4G_Data_Full BY State == ‘call_dropped’;
•
•
•
22. Powering the Modern Data Architecture
HADOOP 2.0
Multi Use Data Platform
Batch, Interactive, Online, Streaming, …
Page 23
Interact with all data in
multiple ways simultaneously
Redundant, Reliable Storage
HDFS 2
Cluster Resource Management
YARN
Standard SQL
Processing
Hive
Batch
MapReduce
Interactive
Tez
Online Data
Processing
HBase, Accumulo
Real Time Stream
Processing
Storm
others
…
HADOOP 1.0
HDFS 1
(redundant, reliable storage)
MapReduce
(distributed data processing
& cluster resource management)
Single Use System
Batch Apps
Data Processing
Frameworks
(Hive, Pig, Cascading, …)
23. Word Counting Time!!
Hadoop’s “Hello Whirled” Example
A quick refresher of core elements of
Hadoop and then code walk-thrus with
Java MapReduce and Pig
Page 25
24. Core Hadoop Concepts
• Applications are written in high-level code
–Developers need not worry about network programming, temporal
dependencies or low-level infrastructure
• Nodes talk to each other as little as possible
–Developers should not write code which communicates between
nodes
–“Shared nothing” architecture
• Data is spread among machines in advance
–Computation happens where the data is stored, wherever possible
– Data is replicated multiple times on the system for increased
availability and reliability
Page 26
25. Hadoop: Very High-Level Overview
• When data is loaded in the system, it is split into
“blocks”
–Typically 64MB or 128MB
• Map tasks (first part of MapReduce) work on relatively
small portions of data
–Typically a single block
• A master program allocates work to nodes such that a
Map tasks will work on a block of data stored locally
on that node whenever possible
–Many nodes work in parallel, each on their own part of the overall
dataset
Page 27
26. Fault Tolerance
• If a node fails, the master will detect that failure and
re-assign the work to a different node on the system
• Restarting a task does not require communication
with nodes working on other portions of the data
• If a failed node restarts, it is automatically added back
to the system and assigned new tasks
• If a nodes appears to be running slowly, the master
can redundantly execute another instance of the same
task
–Results from the first to finish will be used
–Known as “speculative execution”
Page 28
27. Hadoop Components
• Hadoop consists of two core components
–The Hadoop Distributed File System (HDFS)
–MapReduce
• Many other projects based around core Hadoop (the
“Ecosystem”)
–Pig, Hive, Hbase, Flume, Oozie, Sqoop, Datameer, etc
• A set of machines running HDFS and MapReduce is
known as a Hadoop Cluster
–Individual machines are known as nodes
–A cluster can have as few as one node, as many as several
thousand
– More nodes = better performance!
Page 29
28. Hadoop Components: HDFS
• HDFS, the Hadoop Distributed File System, is
responsible for storing data on the cluster
• Data is split into blocks and distributed across
multiple nodes in the cluster
–Each block is typically 64MB (the default) or 128MB in size
• Each block is replicated multiple times
–Default is to replicate each block three times
–Replicas are stored on different nodes
– This ensures both reliability and availability
Page 30
30. HDFS *is* a File System
• Screenshot for “Name Node UI”
Page 32
31. Accessing HDFS
• Applications can read and write HDFS files directly via
a Java API
• Typically, files are created on a local filesystem and
must be moved into HDFS
• Likewise, files stored in HDFS may need to be moved
to a machine’s local filesystem
• Access to HDFS from the command line is achieved
with the hdfs dfs command
–Provides various shell-like commands as you find on Linux
–Replaces the hadoop fs command
• Graphical tools available like the Sandbox’s Hue File
Browser and Red Gate’s HDFS Explorer
Page 33
32. hdfs dfs Examples
• Copy file foo.txt from local disk to the user’s directory
in HDFS
–This will copy the file to /user/username/fooHDFS.txt
• Get a directory listing of the user’s home directory in
HDFS
• Get a directory listing of the HDFS root directory
Page 34
hdfs dfs –put fooLocal.txt fooHDFS.txt
hdfs dfs –ls
hdfs dfs –ls /
33. hdfs dfs Examples (continued)
• Display the contents of a specific HDFS file
• Move that file back to the local disk
• Create a directory called input under the user’s home
directory
• Delete the HDFS directory input and all its contents
Page 35
hdfs dfs –cat /user/fred/fooHDFS.txt
hdfs dfs –mkdir input
hdfs dfs –rm –r input
hdfs dfs –get /user/fred/fooHDFS.txt barLocal.txt
34. Hadoop Components: MapReduce
• MapReduce is the system used to process data in the
Hadoop cluster
• Consists of two phases: Map, and then Reduce
–Between the two is a stage known as the shuffle and sort
• Each Map task operates on a discrete portion of the
overall dataset
–Typically one HDFS block of data
• After all Maps are complete, the MapReduce system
distributes the intermediate data to nodes which
perform the Reduce phase
–Source code examples and live demo coming!
Page 36
35. Features of MapReduce
• Hadoop attempts to run tasks on nodes which hold
their portion of the data locally, to avoid network
traffic
• Automatic parallelization, distribution, and fault-
tolerance
• Status and monitoring tools
• A clean abstraction for programmers
–MapReduce programs are usually written in Java
– Can be written in any language using Hadoop Streaming
– All of Hadoop is written in Java
–With “housekeeping” taken care of by the framework, developers
can concentrate simply on writing Map and Reduce functions
Page 37
38. MapReduce: The Mapper
• The Mapper reads data in the form of key/value pairs
(KVPs)
• It outputs zero or more KVPs
• The Mapper may use or completely ignore the input
key
–For example, a standard pattern is to read a line of a file at a time
– The key is the byte offset into the file at which the line starts
– The value is the contents of the line itself
– Typically the key is considered irrelevant with this pattern
• If the Mapper writes anything out, it must in the form
of KVPs
–This “intermediate data” is NOT stored in HDFS (local storage only
without replication)
Page 40
39. MapReducer: The Reducer
• After the Map phase is over, all the intermediate
values for a given intermediate key are combined
together into a list
• This list is given to a Reducer
–There may be a single Reducer, or multiple Reducers
–All values associated with a particular intermediate key are
guaranteed to go to the same Reducer
–The intermediate keys, and their value lists, are passed in sorted
order
• The Reducer outputs zero or more KVPs
–These are written to HDFS
–In practice, the Reducer often emits a single KVP for each input
key
Page 41
40. MapReduce Example: Word Count
• Count the number of occurrences of each word in a
large amount of input data
Page 42
map(String input_key, String input_value)
foreach word in input_value:
emit(w,1)
reduce(String output_key, Iter<int> intermediate_vals)
set count = 0
foreach val in intermediate_vals:
count += val
emit(output_key, count)
41. MapReduce Example: Map Phase
Page 43
• Input to the Mapper
• Ignoring the key
– It is just an offset
• Output from the Mapper
• No attempt is made to optimize
within a record in this example
– This is a great use case for a
“Combiner”
(8675, ‘I will not eat
green eggs and ham’)
(8709, ‘I will not eat
them Sam I am’)
(‘I’, 1), (‘will’, 1),
(‘not’, 1), (‘eat’, 1),
(‘green’, 1), (‘eggs’, 1),
(‘and’, 1), (‘ham’, 1),
(‘I’, 1), (‘will’, 1),
(‘not’, 1), (‘eat’, 1),
(‘them’, 1), (‘Sam’, 1),
(‘I’, 1), (‘am’, 1)
42. MapReduce Example: Reduce Phase
Page 44
• Input to the Reducer
• Notice keys are sorted and
associated values for same key
are in a single list
– Shuffle & Sort did this for us
• Output from the Reducer
• All done!
(‘I’, [1, 1, 1])
(‘Sam’, [1])
(‘am’, [1])
(‘and’, [1])
(‘eat’, [1, 1])
(‘eggs’, [1])
(‘green’, [1])
(‘ham’, [1])
(‘not’, [1, 1])
(‘them’, [1])
(‘will’, [1, 1])
(‘I’, 3)
(‘Sam’, 1)
(‘am’, 1)
(‘and’, 1)
(‘eat’, 2)
(‘eggs’, 1)
(‘green’, 1)
(‘ham’, 1)
(‘not’, 2)
(‘them’, 1)
(‘will’, 2)
43. Code Walkthru & Demo Time!!
• Word Count Example
–Java MapReduce
–Pig
Page 45
45. Dataset: Open Georgia
• Salaries & Travel Reimbursements
–Organization
– Local Boards of Education
– Several Atlanta-area districts; multiple years
– State Agencies, Boards, Authorities and Commissions
– Dept of Public Safety; 2010
Page 47
46. Format & Sample Data
Page 48
NAME (String) TITLE (String)
SALARY
(float)
ORG TYPE
(String)
ORG (String) YEAR (int)
ABBOTT,DEEDEE W
GRADES 9-12
TEACHER
52,122.10 LBOE
ATLANTA INDEPENDENT
SCHOOL SYSTEM
2010
ALLEN,ANNETTE D
SPEECH-LANGUAGE
PATHOLOGIST
92,937.28 LBOE
ATLANTA INDEPENDENT
SCHOOL SYSTEM
2010
BAHR,SHERREEN T GRADE 5 TEACHER 52,752.71 LBOE
COBB COUNTY SCHOOL
DISTRICT
2010
BAILEY,ANTOINETT
E R
SCHOOL
SECRETARY/CLERK
19,905.90 LBOE
COBB COUNTY SCHOOL
DISTRICT
2010
BAILEY,ASHLEY N
EARLY INTERVENTION
PRIMARY TEACHER
43,992.82 LBOE
COBB COUNTY SCHOOL
DISTRICT
2010
CALVERT,RONALD
MARTIN
STATE PATROL (SP) 51,370.40 SABAC
PUBLIC SAFETY, DEPARTMENT
OF
2010
CAMERON,MICHAE
L D
PUBLIC SAFETY TRN
(AL)
34,748.60 SABAC
PUBLIC SAFETY, DEPARTMENT
OF
2010
DAAS,TARWYN
TARA
GRADES 9-12
TEACHER
41,614.50 LBOE
FULTON COUNTY BOARD OF
EDUCATION
2011
DABBS,SANDRA L
GRADES 9-12
TEACHER
79,801.59 LBOE
FULTON COUNTY BOARD OF
EDUCATION
2011
E'LOM,SOPHIA L
IS PERSONNEL -
GENERAL ADMIN
75,509.00 LBOE
FULTON COUNTY BOARD OF
EDUCATION
2012
EADDY,FENNER R SUBSTITUTE 13,469.00 LBOE
FULTON COUNTY BOARD OF
EDUCATION
2012
EADY,ARNETTA A ASSISTANT PRINCIPAL 71,879.00 LBOE
FULTON COUNTY BOARD OF
EDUCATION
2012
47. Simple Use Case
• For all loaded State of Georgia salary information
–Produce statistics for each specific job title
– Number of employees
– Salary breakdown
– Minimum
– Maximum
– Average
–Limit the data to investigate
– Fiscal year 2010
– School district employees
Page 49
48. Code Walkthru & Demo; Part Deux!
• Word Count Example
–Java MapReduce
–Pig
–Hive
Page 50
49. Demo Wrap-Up
• All code, test data, wiki pages, and blog posting can
be found, or linked to, from
–https://github.com/lestermartin/hadoop-exploration
• This deck can be found on SlideShare
–http://www.slideshare.net/lestermartin
• Questions?
Page 51
50. Thank You!!
• Lester Martin
• Hortonworks – Professional Services
• [email protected]
• http://about.me/lestermartin (links to blog, github, twitter, LI, FB, etc)
Page 52
Editor's Notes
#6: Hadoop fills several important needs in your data storage and processing infrastructure
Store and use all types of data: Allows semi-structured, unstructured and structured data to be processed in a way to create new insights of significant business value.
Process all the data: Instead of looking at samples of data or small sections of data, organizations can look at large volumes of data to get new perspective and make business decisions with higher degree of accuracy.
Scalability: Reducing latency in business is critical for success. The massive scalability of Big Data systems allow organizations to process massive amounts of data in a fraction of the time required for traditional systems.
Commodity hardware: Self-healing, extremely scalable, highly available environment with cost-effective commodity hardware.
#7: KEY CALLOUT: Schema on Read
IMPORTANT NOTE: Hadoop is not meant to replace your relational database. Hadoop is for storing Big Data, which is often the type of data that you would otherwise not store in a database due to size or cost constraints You will still have your database for relational, transactional data.
#11: I can’t really talk about Hortonworks without first taking a moment to talk about the history of Hadoop.
What we now know of as Hadoop really started back in 2005, when the team at yahoo! – started to work on a project that to build a large scale data storage and processing technology that would allow them to store and process massive amounts of data to underpin Yahoo’s most critical application, Search. The initial focus was on building out the technology – the key components being HDFS and MapReduce – that would become the Core of what we think of as Hadoop today, and continuing to innovate it to meet the needs of this specific application.
By 2008, Hadoop usage had greatly expanded inside of Yahoo, to the point that many applications were now using this data management platform, and as a result the team’s focus extended to include a focus on Operations: now that applications were beginning to propagate around the organization, sophisticated capabilities for operating it at scale were necessary. It was also at this time that usage began to expand well beyond Yahoo, with many notable organizations (including Facebook and others) adopting Hadoop as the basis of their large scale data processing and storage applications and necessitating a focus on operations to support what as by now a large variety of critical business applications.
In 2011, recognizing that more mainstream adoption of Hadoop was beginning to take off and with an objective of facilitating it, the core team left – with the blessing of Yahoo – to form Hortonworks. The goal of the group was to facilitate broader adoption by addressing the Enterprise capabilities that would would enable a larger number of organizations to adopt and expand their usage of Hadoop.
[note: if useful as a talk track, Cloudera was formed in 2008 well BEFORE the operational expertise of running Hadoop at scale was established inside of Yahoo]
#18: SQL is a query language
Declarative, what not how
Oriented around answering a question
Requires uniform schema
Requires metadata
Known by everyone
A great choice for answering queries, building reports, use with automated tools
#20: With Hive and Stinger we are focused on enabling the SQL ecosystem and to do that we’ve put Hive on a clear roadmap to SQL compliance.
That includes adding critical datatypes like character and date types as well as implementing common SQL semantics seen in most databases.
#35: “hdfs dfs” is the *new* “hadoop fs”
Blank acts like ~
#36: These two slides were just to make folks feel at home with CLI access to HDFS
#48: See https://martin.atlassian.net/wiki/x/FwAvAQ for more details
Surely not the typical Volume/Velocity/Variety definition of “Big Data”, but gives us a controlled environment to do some simple prototyping and validating with
#49: See https://martin.atlassian.net/wiki/x/NYBmAQ for more details
#51: See https://martin.atlassian.net/wiki/x/FwAvAQ for more information