Class lecture by Prof. Raj Jain on Big Data. The talk covers Why Big Data Now?, Big Data Applications, ACID Requirements, Terminology, Google File System, BigTable, MapReduce, MapReduce Optimization, Story of Hadoop, Hadoop, Apache Hadoop Tools, Apache Other Big Data Tools, Other Big Data Tools, Analytics, Types of Databases, Relational Databases and SQL, Non-relational Databases, NewSQL Databases, Columnar Databases. A video recording is available on YouTube.
Big data is high-volume, high-velocity, and high-variety data that is difficult to process using traditional data management tools. It is characterized by the 3Vs: volume (the amount of data is growing exponentially), velocity (data streams in as it is generated, in real time), and variety (data comes from many different sources and formats). The document discusses big data analytics techniques to gain insights from large and complex datasets and provides examples of big data sources and applications.
This document provides an introduction to big data, including its key characteristics of volume, velocity, and variety. It describes different types of big data technologies like Hadoop, MapReduce, HDFS, Hive, and Pig. Hadoop is an open source software framework for distributed storage and processing of large datasets across clusters of computers. MapReduce is a programming model used for processing large datasets in a distributed computing environment. HDFS provides a distributed file system for storing large datasets across clusters. Hive and Pig provide data querying and analysis capabilities for data stored in Hadoop clusters using SQL-like and scripting languages respectively.
This document discusses the concept of big data. It defines big data as massive volumes of structured and unstructured data that are difficult to process using traditional database techniques due to their size and complexity. It notes that big data has the characteristics of volume, variety, and velocity. The document also discusses Hadoop as an implementation of big data and how various industries are generating large amounts of data.
This document provides an overview of big data. It begins by defining big data and noting that it first emerged in the early 2000s among online companies like Google and Facebook. It then discusses the three key characteristics of big data: volume, velocity, and variety. The document outlines the large quantities of data generated daily by companies and sensors. It also discusses how big data is stored and processed using tools like Hadoop and MapReduce. Examples are given of how big data analytics can be applied across different industries. Finally, the document briefly discusses some risks and benefits of big data, as well as its impact on IT jobs.
- Big data refers to large volumes of data from various sources that is analyzed to reveal patterns, trends, and associations.
- The evolution of big data has seen it grow from just volume, velocity, and variety to also include veracity, variability, visualization, and value.
- Analyzing big data can provide hidden insights and competitive advantages for businesses by finding trends and patterns in large amounts of structured and unstructured data from multiple sources.
New-age big data technologies include predictive analytics, NoSQL databases, search and knowledge discovery, stream analytics, in-memory data fabric, data virtualization, and more.
This document provides case studies on how several companies leverage big data, including Google, GE, Cornerstone, and Microsoft. The Google case study describes how Google processes billions of search queries daily and uses this data to continuously improve its search algorithms. The GE case study outlines how GE collects vast amounts of sensor data from power turbines, jet engines, and other industrial equipment to optimize operations and efficiency. The Cornerstone case study examines how Cornerstone uses employee data to help clients predict retention and performance. Finally, the Microsoft case study discusses how Microsoft has positioned itself as a major player in big data and offers data hosting and analytics services.
I have collected information for beginners to provide an overview of big data and Hadoop, which will help them understand the basics and give them a head start.
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...Simplilearn
This presentation about Big Data will help you understand how Big Data evolved over the years, what is Big Data, applications of Big Data, a case study on Big Data, 3 important challenges of Big Data and how Hadoop solved those challenges. The case study talks about Google File System (GFS), where you’ll learn how Google solved its problem of storing increasing user data in early 2000. We’ll also look at the history of Hadoop, its ecosystem and a brief introduction to HDFS which is a distributed file system designed to store large volumes of data and MapReduce which allows parallel processing of data. In the end, we’ll run through some basic HDFS commands and see how to perform wordcount using MapReduce. Now, let us get started and understand Big Data in detail.
Below topics are explained in this Big Data presentation for beginners:
1. Evolution of Big Data
2. Why Big Data?
3. What is Big Data?
4. Challenges of Big Data
5. Hadoop as a solution
6. MapReduce algorithm
7. Demo on HDFS and MapReduce
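As a rough illustration of the wordcount demo mentioned above (not the presentation's own code), here is a minimal Hadoop Streaming version in Python; the script name, HDFS paths, and streaming-jar location vary by installation and are assumptions.

#!/usr/bin/env python3
# wordcount_streaming.py -- mapper and reducer for Hadoop Streaming.
# Run as mapper:  python3 wordcount_streaming.py map
# Run as reducer: python3 wordcount_streaming.py reduce
import sys
from itertools import groupby

def mapper():
    # Emit "word<TAB>1" for every word on stdin.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Hadoop sorts the mapper output by key, so equal words arrive together.
    pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()

With input copied into HDFS (for example, hdfs dfs -put input.txt /user/demo/input, a placeholder path), the same script would be passed to the hadoop-streaming jar via -files, with -mapper "python3 wordcount_streaming.py map" and -reducer "python3 wordcount_streaming.py reduce" plus -input and -output arguments. You can also test it locally by piping text through the map and reduce steps with a sort in between.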
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, YARN, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro schema, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, its architecture, sources, sinks, channels, and configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distributed datasets (RDDs) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL: creating, transforming, and querying DataFrames (a short PySpark sketch follows this list)
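As a rough companion to objectives 10-15, the sketch below shows the kind of PySpark code such a course builds up to. It is a minimal local example, not course material: the application name, data, and column names are invented.

# A minimal local PySpark sketch: an RDD word count and a Spark SQL query.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("course-sketch").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Objectives 10-11: functional-style transformations on an RDD.
lines = sc.parallelize(["big data with spark", "spark and hadoop", "big data"])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
print(counts.collect())

# Objective 15: a DataFrame and a Spark SQL query over it.
sales = spark.createDataFrame(
    [("north", 120.0), ("south", 80.0), ("north", 45.5)],
    ["region", "amount"])
sales.createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()

spark.stop()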
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
1) Big data is being generated from many sources like web data, e-commerce purchases, banking transactions, social networks, science experiments, and more. The volume of data is huge and growing exponentially.
2) Big data is characterized by its volume, velocity, variety, and value. It requires new technologies and techniques for capture, storage, analysis, and visualization.
3) Analyzing big data can provide valuable insights but also poses challenges related to cost, integration of diverse data types, and shortage of data science experts. New platforms and tools are being developed to make big data more accessible and useful.
This document provides an overview of Hadoop, an open-source framework for storing and processing large datasets across clusters of commodity servers. It discusses how Hadoop addresses the challenges of big data by moving computation to the data through its MapReduce programming model and storing data across clusters using its Hadoop Distributed File System (HDFS). Key components of Hadoop include the NameNode for metadata, DataNodes for data storage, and a JobTracker that coordinates tasks on TaskTrackers. Hadoop allows for scalable, fault-tolerant distributed computing of big data.
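To make the description above a little more concrete, everyday interaction with HDFS goes through the hdfs dfs shell; the short Python sketch below simply wraps a few of those calls. It assumes a working Hadoop client on the PATH, and the paths and file name are placeholders.

# Minimal sketch: drive the HDFS shell from Python (assumes `hdfs` is on PATH).
import subprocess

def hdfs(*args: str) -> str:
    """Run an `hdfs dfs` subcommand and return its stdout."""
    result = subprocess.run(["hdfs", "dfs", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

# The NameNode records the metadata; the blocks themselves land on DataNodes.
hdfs("-mkdir", "-p", "/user/demo/input")                    # placeholder path
hdfs("-put", "-f", "local_data.txt", "/user/demo/input/")   # placeholder file
print(hdfs("-ls", "/user/demo/input"))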
Big Data Information Architecture PowerPoint Presentation SlideSlideTeam
The document appears to be a presentation on big data. It includes slides on what big data is, facts about the size and growth of big data, sources of big data, the 3Vs and 5Vs models of big data, differences between small and big data, objectives and technologies of big data, the big data workflow, forms of big data, the data analytics process, impacts of big data, benefits of big data, the future of big data, and opportunities and challenges of big data. Additional slides provide templates that are editable for topics such as mission, team, about us, goals, comparisons, financials, quotes, dashboards, and locations.
This document discusses the rise of big data and how the volume of data being created is growing exponentially, with 2.5 quintillion bytes created daily from various sources like sensors, social media, images, videos and purchases. It outlines how traditional databases and data analytics are struggling to handle this unstructured data, leading to the emergence of new solutions like Hadoop. It also explores how new roles like data scientists are emerging to help organizations extract value from all this big data through advanced analytics.
Vikas Samant is a big data and data science engineer who works with Entrench Electronics and Pentaho. He provides an overview of big data, defining it as large volumes of structured, semi-structured, and unstructured data that businesses must process daily. He describes the key characteristics of big data using the 3Vs - volume, variety, and velocity, and sometimes a fourth V of veracity. The document then discusses data structures, data science, the data science process, and provides examples of big data use cases like optimizing funnel conversion, behavioral analytics, customer segmentation, and fraud detection. It concludes with an overview of big data technologies, vendors, what Hadoop is, and why Hadoop is widely adopted.
Big data refers to terabytes or larger datasets that are generated daily and stored across multiple machines in different formats. Analyzing this data is challenging due to its size, format diversity, and distributed storage. Moving the data or code during analysis can overload networks. MapReduce addresses this by bringing the code to the data instead of moving the data, significantly reducing network traffic. It uses HDFS for scalable and fault-tolerant storage across clusters.
Big data refers to massive amounts of structured and unstructured data that is difficult to process using traditional databases. It is characterized by volume, variety, velocity, and veracity. Major sources of big data include social media posts, videos uploaded, app downloads, searches, and tweets. Trends in big data include increased use of sensors, tools for non-data scientists, in-memory databases, NoSQL databases, Hadoop, cloud storage, machine learning, and self-service analytics. Big data has applications in banking, media, healthcare, energy, manufacturing, education, and transportation for tasks like fraud detection, personalized experiences, reducing costs, predictive maintenance, measuring teacher effectiveness, and traffic control.
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...Geoffrey Fox
Motivating Introduction to MOOC on Big Data from an applications point of view https://bigdatacoursespring2014.appspot.com/course
Course says:
Geoffrey motivates the study of X-informatics by describing data science and clouds. He starts with striking examples of the data deluge from research, business, and the consumer. The growing number of jobs in data science is highlighted. He describes industry trends in both clouds and big data.
He introduces the cloud computing model developed at amazing speed by industry. The four paradigms of scientific research are described, with growing importance placed on the data-oriented fourth paradigm. He covers three major X-informatics areas: physics, e-commerce, and web search, followed by a broad discussion of cloud applications. Parallel computing in general and particular features of MapReduce are described. He comments on data science education and the benefits of using MOOCs.
The Pros and Cons of Big Data in an ePatient WorldPYA, P.C.
PYA Principal Dr. Kent Bottles, who is also PYA Analytics’ Chief Medical Officer, presented “The Pros and Cons of Big Data in an ePatient World” at the ePatient Connections 2013 conference.
Big data - what, why, where, when and howbobosenthil
The document discusses big data, including what it is, its characteristics, and architectural frameworks for managing it. Big data is defined as data that exceeds the processing capacity of conventional database systems due to its large size, speed of creation, and unstructured nature. The architecture for managing big data is demonstrated through Hadoop technology, which uses a MapReduce framework and open source ecosystem to process data across multiple nodes in parallel.
The document discusses different types of data. It defines data as information that has been converted into a format suitable for processing by computers, usually binary digital form. Data types represent the kind of data that can be processed in a computer program, such as numeric, alphanumeric, or decimal. The main types of data discussed are strings, characters, integers, and floating point numbers.
Big data analytics involves analyzing large and complex datasets. There are different types and orders of analytics, including first order analytics of individual data points and second order analytics involving relationships between data points. Examples of second order analytics are basket analysis of related purchased items, collaborative filtering to make recommendations, and social network analysis to understand user connections. Popular platforms for big data include Hadoop for storage and MapReduce for distributed processing, while newer technologies like Spark are gaining popularity. Understanding users and their relationships is key to predicting their needs and behavior through analytics.
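As a toy illustration of the "second order" analytics mentioned above, the short pure-Python sketch below does a naive basket analysis, counting which items appear together in the same purchase; the baskets are invented.

# Toy basket analysis: count item pairs that co-occur in the same purchase.
from collections import Counter
from itertools import combinations

baskets = [                      # invented example transactions
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"beer", "chips"},
    {"bread", "jam"},
]

pair_counts = Counter()
for basket in baskets:
    # Every unordered pair within a basket counts as one co-occurrence.
    pair_counts.update(combinations(sorted(basket), 2))

# The most frequent pairs are candidate "customers also bought" suggestions.
for pair, count in pair_counts.most_common(3):
    print(pair, count)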
An outline of how Moneytree uses Amazon SWF to coordinate our backend aggregation workflow. Focuses on how to run a large scale distributed system with a few developers while still sleeping at night.
This document provides an overview of big data concepts including definitions of big data, sources of big data, and uses of big data analytics. It discusses technologies used for big data including Hadoop, MapReduce, Hive, Mahout, MATLAB, and Revolution R. It also addresses challenges around big data such as lack of standardization and extracting meaningful insights from large datasets.
After the computing industry got started, a new problem quickly emerged: how do you operate these machines, and how do you program them? The development of operating systems was relatively slow compared to the advances in hardware. The first systems were primitive but slowly got better as demand for computing power increased. The ideas behind graphical user interfaces, or GUIs ("gooey"), go back to Doug Engelbart's Demo of the Century. However, this did not have much impact on the computer industry at first. One company, though, the photocopier maker Xerox, explored these ideas at its Palo Alto Research Center (PARC). Steve Jobs of Apple and Bill Gates of Microsoft took notice, and Apple introduced first the Apple Lisa and then the Macintosh. In this lecture we look at lessons from the development of software and see how our business theories apply.
In this lecture we look at lessons from the development of algorithms and software, and see how our business theories apply.
In the second part we look at where software is going, namely artificial intelligence. Recent developments in AI are causing an AI boom, and new AI applications are appearing all the time. We look at machine learning and deep learning to get an understanding of the current trends.
Ernestas Sysojevas. Hadoop Essentials and EcosystemVolha Banadyseva
This document provides information about Hadoop training delivered by DATA MINER, a training and consultancy company located in Vilnius, Lithuania. It introduces the trainer, Ernestas Sysojevas, and his qualifications. It then provides details about DATA MINER and its role as an exclusive Cloudera Training Partner in several countries. The document discusses the growing Hadoop market and increasing job trends related to Hadoop. It also briefly outlines the main roles involved in Hadoop like system administrators, developers, data analysts, data scientists, and data stewards as well as the typical skills and responsibilities of each.
Introduction to Hadoop - The EssentialsFadi Yousuf
This document provides an introduction to Hadoop, including:
- A brief history of Hadoop and how it was created to address limitations of relational databases for big data.
- An overview of core Hadoop concepts like its shared-nothing architecture and using computation near storage.
- Descriptions of HDFS for distributed storage and MapReduce as the original programming framework.
- How the Hadoop ecosystem has grown to include additional frameworks like Hive, Pig, HBase and tools like Sqoop and Zookeeper.
- A discussion of YARN which separates resource management from job scheduling in Hadoop.
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesCloudera, Inc.
This session will provide an executive overview of the Apache Hadoop ecosystem, its basic concepts, and its real-world applications. Attendees will learn how organizations worldwide are using the latest tools and strategies to harness their enterprise information to solve business problems and the types of data analysis commonly powered by Hadoop. Learn how various projects make up the Apache Hadoop ecosystem and the role each plays to improve data storage, management, interaction, and analysis. This is a valuable opportunity to gain insights into Hadoop functionality and how it can be applied to address compelling business challenges in your agency.
The document provides an overview of Apache Hadoop and how it addresses challenges with traditional data architectures. It discusses how Hadoop provides a distributed storage and processing framework to allow businesses to store all of their data in its native format and access it via different engines. The key components of Hadoop include the Hadoop Distributed File System (HDFS) for storage and Yet Another Resource Negotiator (YARN) for distributed computing. Hadoop addresses issues around cost, speed, and the ability to leverage both new and cold data with modern data applications.
Join Cloudera’s founder and Chief Scientist, Jeff Hammerbacher, as he describes ten common problems that are being solved with Apache Hadoop.
A replay of the webinar can be viewed here:
https://www1.gotomeeting.com/register/719074008
Part 1: Lambda Architectures: Simplified by Apache KuduCloudera, Inc.
3 Things to Learn About:
* The concept of lambda architectures
* The Hadoop ecosystem components involved in lambda architectures
* The advantages and disadvantages of lambda architectures
Copyright trolls have emerged as opportunistic entities that aggressively pursue copyright infringement lawsuits against individuals, not to protect creative works, but rather to profit through coercing settlements. The article discusses the rise of copyright trolls and their abusive tactics, such as targeting large numbers of "John Doe" defendants and exploiting high statutory damages to pressure settlements. While a victory against a troll lawsuit in Oregon reduced the financial viability of the troll business model, the problem persists nationwide due to imbalances in copyright law promoted by entertainment industry lobbying.
The most flexible, reliable, and secure (in terms of investing money) housing program in the world of MLM business.
It has no equivalents. Simple entry conditions. Instant accruals. A minimal withdrawal amount.
Be it big or small, we are there for all. Ultimate satisfaction is assured in your wedding. Our collective experience is more than 25 years, with the successful completion of more than 500 weddings across India and abroad. Be it a destination wedding or a local one, we are at your service with improvised ideas, keeping in mind the given budget. Our professional and dedicated team always works within any given time frame.
Lavina Rana is a graphic designer from Singapore who was raised in Adelaide, Australia. She has a bachelor's degree in photography and a graduate diploma in graphic design. Her interests include typography, photography, and art direction. Her goal is to become an art director. The document includes samples of her magazine layouts, posters, print designs and wedding invitation designs showcasing her skills and interests in graphic design, photography, and combining images and text.
7th SASMA Business Security Conference – Your Business Challenges Today & Tom...Sebastian Blazkiewicz
7th SASMA Business Security Conference – Your Business Challenges Today & Tomorrow
Everyone is welcome! The 7th edition of the SASMA Business Security Conference 2015 is coming: 26-27 November 2015, Warsaw.
It is a unique opportunity to broaden perspectives, get an outline of the industry's future share of the market, and learn about innovative and highly effective practices in the field of business security.
Mohankumar Soman Menon has over 38 years of experience in finance roles. He worked his way up from Accounts Assistant to Deputy General Manager of Finance at a reputed electrical engineering company. He implemented ERP systems like SAP and developed financial systems. Currently, he is a consultant specializing in optimizing accounting procedures and systems.
This document lists the licensing portfolio for 2012-13, including Royal Mail, Dark Bunny Tees, Character Options, Mr. Potato Head, Titan, Eaglemoss, Pyramid Posters, Steiff, Paul Lamond, Danilo, Lovarsi, Winning Moves, and Titan again.
The document discusses language development and literacy in children. It explains that children learn language naturally through imitation and reinforcement from their parents. It also describes the stages of linguistic development and the components of reading. Finally, it emphasizes the importance of children developing the ability to read and write through instruction at school, in an environment rich in printed materials.
A manual is a guide of instructions for using a device or correcting problems. A quality manual specifies a company's mission, vision, and quality policy. It describes the structure of the quality management system and includes an organization's policy, objectives, responsibilities, and procedures. A process map graphically represents an organization's processes and their internal and external relationships.
Haiku Deck is a presentation platform that allows users to create Haiku-style slideshows. The document encourages the reader to get started creating their own Haiku Deck presentation on SlideShare by providing a link to do so. It aims to inspire the reader to try out Haiku Deck's unique presentation style.
This presentation provides an overview of Memphis' Strong Cities, Strong Communities initiative and how it uses a triad of 311, performance metrics, and CitiStat to improve government services. It summarizes how Memphis transformed its Mayor's Call Center into a 311 system, highlights examples of using 311 data and performance metrics in CitiStat meetings to address issues like curbside trash complaints, and outlines next steps to further develop these tools. The presentation aims to demonstrate how these public administration strategies can be connected to drive continuous improvement in service delivery.
This document contains the resume of Waqas Naeem, including his contact information, objective, personal details, education history, work experience, skills, and computer skills. He is currently working as a Valve Technician in Abu Dhabi and is looking for career opportunities where he can demonstrate and grow his mechanical and technical skills. He has 5 years of experience in various mechanical fields including valve servicing, pressure testing, plant shutdowns, and fabrication work.
Here is how to install Windows 8, complete with pictures (agus)
Here is a summary of how to install Windows 8 in three sentences:
First, insert the Windows 8 installation DVD while the computer is booting and choose to boot from the CD/DVD-ROM. Then follow the installation process by entering the product key and selecting a partition for the installation. Once the process is complete, create a user account to access the Windows 8 start screen.
Mindfulness and Marketing - Slides from the SMAU Padova workshop - 11 March 2016BrioWeb
The slides from the workshop on how mindfulness, that is, the ability to listen to ourselves and the world around us, can help our marketing strategy.
This document discusses big data and Hadoop frameworks for managing large volumes of data. It begins with an overview of how data generation has increased exponentially from employees to users to machines. Next, it discusses the history of big data technologies like Google File System and MapReduce, which were combined to create Hadoop. The document then covers sources of big data, challenges of big data, and how Hadoop provides a solution through distributed processing and its core components like HDFS and MapReduce. Finally, data processing techniques with traditional databases versus Hadoop are compared.
The document summarizes Aginity's efforts to build a 10 terabyte database application using $5,682.10 worth of commodity hardware. They constructed a 9-box server farm with off-the-shelf components to test leading database systems like MapReduce, in-database analytics, and MPP on a scale that previously would have cost $2.2 million. The goal was to build similar big data capabilities on a smaller budget for their research lab to experiment with different technologies.
Hadoop was born out of the need to process big data. Today, data is being generated like never before, and it is becoming difficult to store and process this enormous volume and large variety of data; Big Data technology was created to cope with this. Today the Hadoop software stack is the go-to framework for large-scale, data-intensive storage and compute solutions for big data analytics applications. The beauty of Hadoop is that it is designed to process large volumes of data on clustered commodity computers working in parallel. Distributing data that is too large across the nodes in a cluster solves the problem of having data sets too large to be processed on a single machine.
Introduction to big data – convergences.saranya270513
Big data is high-volume, high-velocity, and high-variety data that is too large for traditional databases to handle. The volume of data is growing exponentially due to more data sources like social media, sensors, and customer transactions. Data now streams in continuously in real-time rather than in batches. Data also comes in more varieties of structured and unstructured formats. Companies use big data to gain deeper insights into customers and optimize business processes like supply chains through predictive analytics.
- Big data
- Hadoop
- NoSQL databases and their types: column-oriented, document-oriented, and map-based
- MapReduce example (a minimal in-memory sketch follows this list)
- Big data analytics case study
- Case study with R
- Retail and finance case study
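The "MapReduce example" topic above is easiest to grasp as the three phases the model is named for. The following is a minimal in-memory imitation in plain Python, assuming nothing about the actual slides; a real Hadoop job runs the same map/shuffle/reduce flow distributed across a cluster.

# In-memory imitation of MapReduce's three phases on a word-count problem.
from collections import defaultdict

documents = ["big data tools", "big data and hadoop", "hadoop tools"]

# Map: each record becomes a list of (key, value) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group all values that share a key (the framework does this for you).
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce: collapse each key's values into one result.
reduced = {key: sum(values) for key, values in grouped.items()}
print(reduced)   # {'big': 2, 'data': 2, 'tools': 2, 'and': 1, 'hadoop': 2}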
The document discusses big data and its key characteristics known as the 5Vs: volume, velocity, variety, variability, and value. It provides examples of how different companies and industries deal with large volumes of data from various sources in real-time. Big data technologies like Hadoop, HDFS, MapReduce, Cassandra, and MongoDB are helping companies analyze and gain insights from both structured and unstructured data across industries like retail, finance, and social media. Data scientists use tools, techniques and programming languages to understand trends and patterns in large, complex data sets.
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of computers. It allows for the reliable, scalable and distributed processing of large datasets. Hadoop consists of Hadoop Distributed File System (HDFS) for storage and Hadoop MapReduce for processing vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. HDFS stores data reliably across machines in a Hadoop cluster and MapReduce processes data in parallel by breaking the job into smaller fragments of work executed across cluster nodes.
This document provides an overview of big data by exploring its definition, origins, characteristics and applications. It defines big data as large data sets that cannot be processed by traditional software tools due to size and complexity. The creator of big data is identified as Doug Laney who in 2001 defined the 3Vs of big data - volume, velocity and variety. A variety of sectors are discussed where big data is used including social media, science, retail and government. The document concludes by stating we are in the age of big data due to new capabilities to analyze large data sets quickly and cost effectively.
This document provides an overview of big data by exploring its definition, origins, characteristics and applications. It defines big data as large datasets that cannot be processed by traditional software tools due to size and complexity. The document traces the development of big data to the early 2000s and identifies the 3 V's of big data as volume, velocity and variety. It also discusses how big data is classified and the technologies used to analyze it. Finally, the document provides examples of domains where big data is utilized, such as social media, science, and retail, before concluding on the revolutionary potential of big data.
This document discusses big data, where it comes from, and how it is processed and analyzed. It notes that everything we do online now leaves a digital trace as data. This "big data" includes huge volumes of structured, semi-structured, and unstructured data from various sources like social media, sensors, and the internet of things. Traditional computing cannot handle such large datasets, so technologies like MapReduce, Hadoop, HDFS, and NoSQL databases were developed to distribute the work across clusters of machines and process the data in parallel.
As its name suggests, the most common characteristic associated with big data is its high volume: the enormous amount of data that is available for collection, produced by a variety of sources and devices on a continuous basis.
Big data velocity refers to the speed at which data is generated. Today, data is often produced in real time or near real time, and it must therefore be processed, accessed, and analyzed at the same rate to have any meaningful impact. Veracity concerns trustworthiness: big data can be messy, noisy, and error-prone, which makes it difficult to control its quality and accuracy. Large datasets can be unwieldy and confusing, while smaller datasets could present an incomplete picture. The higher the veracity of the data, the more trustworthy it is.
Topics
What is Big Data?
Big data refers to extremely large and diverse collections of structured, unstructured, and semi-structured data that continues to grow exponentially over time. These datasets are so huge and complex in volume, velocity, and variety, that traditional data management systems cannot store, process, and analyze them.
The amount and availability of data is growing rapidly, spurred on by digital technology advancements, such as connectivity, mobility, the Internet of Things (IoT), and artificial intelligence (AI). As data continues to expand and proliferate, new big data tools are emerging to help companies collect, process, and analyze data at the speed needed to gain the most value from it.
Big data describes large and diverse datasets that are huge in volume and also rapidly grow in size over time. Big data is used in machine learning, predictive modeling, and other advanced analytics to solve business problems and make informed decisions.
Read on to learn the definition of big data, some of the advantages of big data solutions, common big data challenges, and how Google Cloud is helping organizations build their data clouds to get more value from their data.
Big data examples
Data can be a company’s most valuable asset. Using big data to reveal insights can help you understand the areas that affect your business—from market conditions and customer purchasing behaviors to your business processes.
Here are some big data examples that are helping transform organizations across every industry:
- Tracking consumer behavior and shopping habits to deliver hyper-personalized retail product recommendations tailored to individual customers
- Monitoring payment patterns and analyzing them against historical customer activity to detect fraud in real time (a toy sketch follows this list)
- Combining data and information from every stage.
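As a highly simplified illustration of the fraud-detection example above (not any vendor's actual method; the data and the three-sigma threshold are assumptions), one can flag payments that deviate sharply from a customer's historical pattern:

# Naive fraud screen: flag payments far from a customer's historical mean.
from statistics import mean, stdev

history = [42.0, 55.0, 38.0, 61.0, 47.0]   # invented past payment amounts
new_payments = [52.0, 480.0]               # invented incoming payments

mu, sigma = mean(history), stdev(history)
for amount in new_payments:
    z = (amount - mu) / sigma              # how many std-devs from "normal"
    flag = "REVIEW" if abs(z) > 3 else "ok"
    print(f"{amount:8.2f}  z={z:5.1f}  {flag}")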
This document provides an overview of big data concepts including:
- Mohamed Magdy's background and credentials in big data engineering and data science.
- Definitions of big data, the three V's of big data (volume, velocity, variety), and why big data analytics is important.
- Descriptions of Hadoop, HDFS, MapReduce, and YARN - the core components of Hadoop architecture for distributed storage and processing of big data.
- Explanations of HDFS architecture, data blocks, high availability in HDFS 2/3, and erasure coding in HDFS 3.
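The erasure-coding point in the last bullet is easiest to see with a little arithmetic. Under classic 3x replication, HDFS stores three full copies of every block; under the common Reed-Solomon RS(6,3) policy in HDFS 3, it stores six data cells plus three parity cells per stripe. A quick back-of-the-envelope comparison (the file size below is arbitrary):

# Storage overhead: 3x replication vs. Reed-Solomon RS(6,3) erasure coding.
file_gb = 6.0                           # arbitrary logical file size

replicated = file_gb * 3                # three full copies of every block
erasure_coded = file_gb * (6 + 3) / 6   # 6 data + 3 parity cells per stripe

print(f"replication   : {replicated:.1f} GB stored ({replicated / file_gb:.1f}x)")
print(f"RS(6,3) coding: {erasure_coded:.1f} GB stored ({erasure_coded / file_gb:.1f}x)")
# Replication with factor 3 survives the loss of two copies; RS(6,3) survives
# the loss of any three of its nine cells, yet halves the storage overhead.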
This document discusses the paradigm shift in data integration due to growing amounts of data from various sources. It outlines 5 principles and 5 capabilities of modern data integration, which takes processing to where the data lives, leverages multiple platforms, moves data point-to-point, manages rules centrally, and allows changes using existing logic. A case study shows how a bank migrated data to Hadoop in 3 weeks using these principles, lowering costs by 50% compared to traditional ETL. Looking ahead, real-time data access will become more important for businesses.
This presentation is entirely about big data analytics. It explains in detail the three key characteristics of big data, including why and where it can be used, how it is evaluated, what kinds of tools are used to store the data, and how it has impacted the IT industry, along with some applications and risk factors.
This document provides an overview of big data presented by five individuals. It defines big data, discusses its three key characteristics of volume, velocity and variety. It explains how big data is stored, selected and processed using techniques like Hadoop and MapReduce. Examples of big data sources and tools are provided. Applications of big data across various industries are highlighted. Both the risks and benefits of big data are summarized. The future growth of big data and its impact on IT is also outlined.
IRJET - Survey Paper on Map Reduce Processing using HADOOPIRJET Journal
This document summarizes a survey paper on MapReduce processing using Hadoop. It discusses how big data is growing rapidly due to factors like the internet and social media. Traditional databases cannot handle big data. Hadoop uses MapReduce and HDFS to store and process extremely large datasets across commodity servers in a distributed manner. HDFS stores data in a distributed file system, while MapReduce allows parallel processing of that data. The paper describes the MapReduce process and its core functions like map, shuffle, reduce. It explains how Hadoop provides advantages like scalability, cost effectiveness, flexibility and parallel processing for big data.
Big Data Analytics Quick Research Guide by Arthur MorganArthur Morgan
This is a Quick Research Guide (QRG).
QRGs include the following:
- A brief, high-level overview of the QRG topic.
- A milestone timeline for the QRG topic.
- Links to various free online resource materials to provide a deeper dive into the QRG topic.
- Conclusion and a recommendation for at least two books available in the SJPL system on the QRG topic.
QRGs planned for the series:
- Artificial Intelligence QRG
- Quantum Computing QRG
- Big Data Analytics QRG
- Spacecraft Guidance, Navigation & Control QRG (coming 2026)
- UK Home Computing & The Birth of ARM QRG (coming 2027)
Any questions or comments?
- Please contact Arthur Morgan at [email protected].
100% human made.
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...Aqusag Technologies
In late April 2025, a significant portion of Europe, particularly Spain, Portugal, and parts of southern France, experienced widespread, rolling power outages that continue to affect millions of residents, businesses, and infrastructure systems.
Spark is a powerhouse for large datasets, but when it comes to smaller data workloads, its overhead can sometimes slow things down. What if you could achieve high performance and efficiency without the need for Spark?
At S&P Global Commodity Insights, having a complete view of global energy and commodities markets enables customers to make data-driven decisions with confidence and create long-term, sustainable value. 🌍
Explore delta-rs + CDC and how these open-source innovations power lightweight, high-performance data applications beyond Spark! 🚀
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfSoftware Company
Explore the benefits and features of advanced logistics management software for businesses in Riyadh. This guide delves into the latest technologies, from real-time tracking and route optimization to warehouse management and inventory control, helping businesses streamline their logistics operations and reduce costs. Learn how implementing the right software solution can enhance efficiency, improve customer satisfaction, and provide a competitive edge in the growing logistics sector of Riyadh.
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...SOFTTECHHUB
I started my online journey with several hosting services before stumbling upon Ai EngineHost. At first, the idea of paying one fee and getting lifetime access seemed too good to pass up. The platform is built on reliable US-based servers, ensuring your projects run at high speeds and remain safe. Let me take you step by step through its benefits and features as I explain why this hosting solution is a perfect fit for digital entrepreneurs.
Technology Trends in 2025: AI and Big Data AnalyticsInData Labs
At InData Labs, we have been keeping an ear to the ground, looking out for AI-enabled digital transformation trends coming our way in 2025. Our report will provide a look into the technology landscape of the future, including:
-Artificial Intelligence Market Overview
-Strategies for AI Adoption in 2025
-Anticipated drivers of AI adoption and transformative technologies
-Benefits of AI and Big data for your business
-Tips on how to prepare your business for innovation
-AI and data privacy: Strategies for securing data privacy in AI models, etc.
Download your free copy now and implement the key findings to improve your business.
Linux Support for SMARC: How Toradex Empowers Embedded DevelopersToradex
Toradex brings robust Linux support to SMARC (Smart Mobility Architecture), ensuring high performance and long-term reliability for embedded applications. Here’s how:
• Optimized Torizon OS & Yocto Support – Toradex provides Torizon OS, a Debian-based easy-to-use platform, and Yocto BSPs for customized Linux images on SMARC modules.
• Seamless Integration with i.MX 8M Plus and i.MX 95 – Toradex SMARC solutions leverage NXP’s i.MX 8M Plus and i.MX 95 SoCs, delivering power efficiency and AI-ready performance.
• Secure and Reliable – With Secure Boot, over-the-air (OTA) updates, and LTS kernel support, Toradex ensures industrial-grade security and longevity.
• Containerized Workflows for AI & IoT – Support for Docker, ROS, and real-time Linux enables scalable AI, ML, and IoT applications.
• Strong Ecosystem & Developer Support – Toradex offers comprehensive documentation, developer tools, and dedicated support, accelerating time-to-market.
With Toradex’s Linux support for SMARC, developers get a scalable, secure, and high-performance solution for industrial, medical, and AI-driven applications.
Do you have a specific project or application in mind where you're considering SMARC? We can help with a free compatibility check and support you in achieving a quick time-to-market.
For more information: https://www.toradex.com/computer-on-modules/smarc-arm-family
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025BookNet Canada
Book industry standards are evolving rapidly. In the first part of this session, we’ll share an overview of key developments from 2024 and the early months of 2025. Then, BookNet’s resident standards expert, Tom Richardson, and CEO, Lauren Stewart, have a forward-looking conversation about what’s next.
Link to recording, transcript, and accompanying resource: https://bnctechforum.ca/sessions/standardsgoals-for-2025-standards-certification-roundup/
Presented by BookNet Canada on May 6, 2025 with support from the Department of Canadian Heritage.
Artificial Intelligence is providing benefits in many areas of work within the heritage sector, from image analysis, to ideas generation, and new research tools. However, it is more critical than ever for people, with analogue intelligence, to ensure the integrity and ethical use of AI. Including real people can improve the use of AI by identifying potential biases, cross-checking results, refining workflows, and providing contextual relevance to AI-driven results.
News about the impact of AI often paints a rosy picture. In practice, there are many potential pitfalls. This presentation discusses these issues and looks at the role of analogue intelligence and analogue interfaces in providing the best results to our audiences. How do we deal with factually incorrect results? How do we get content generated that better reflects the diversity of our communities? What roles are there for physical, in-person experiences in the digital world?
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc
Most consumers believe they’re making informed decisions about their personal data—adjusting privacy settings, blocking trackers, and opting out where they can. However, our new research reveals that while awareness is high, taking meaningful action is still lacking. On the corporate side, many organizations report strong policies for managing third-party data and consumer consent yet fall short when it comes to consistency, accountability and transparency.
This session will explore the research findings from TrustArc’s Privacy Pulse Survey, examining consumer attitudes toward personal data collection and practical suggestions for corporate practices around purchasing third-party data.
Attendees will learn:
- Consumer awareness around data brokers and what consumers are doing to limit data collection
- How businesses assess third-party vendors and their consent management operations
- Where business preparedness needs improvement
- What these trends mean for the future of privacy governance and public trust
This discussion is essential for privacy, risk, and compliance professionals who want to ground their strategies in current data and prepare for what’s next in the privacy landscape.
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...Impelsys Inc.
Impelsys provided a robust testing solution, leveraging a risk-based and requirement-mapped approach to validate ICU Connect and CritiXpert. A well-defined test suite was developed to assess data communication, clinical data collection, transformation, and visualization across integrated devices.
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...Noah Loul
Artificial intelligence is changing how businesses operate. Companies are using AI agents to automate tasks, reduce time spent on repetitive work, and focus more on high-value activities. Noah Loul, an AI strategist and entrepreneur, has helped dozens of companies streamline their operations using smart automation. He believes AI agents aren't just tools—they're workers that take on repeatable tasks so your human team can focus on what matters. If you want to reduce time waste and increase output, AI agents are the next move.
AI and Data Privacy in 2025: Global TrendsInData Labs
In this infographic, we explore how businesses can implement effective governance frameworks to address AI data privacy. Understanding it is crucial for developing effective strategies that ensure compliance, safeguard customer trust, and leverage AI responsibly. Equip yourself with insights that can drive informed decision-making and position your organization for success in the future of data privacy.
This infographic contains:
-AI and data privacy: Key findings
-Statistics on AI data privacy in the today’s world
-Tips on how to overcome data privacy challenges
-Benefits of AI data security investments.
Keep up-to-date on how AI is reshaping privacy standards and what this entails for both individuals and organizations.
Big Data & Hadoop ecosystem essentials for managers
Manjeet Singh Nagi (https://in.linkedin.com/in/manjeetnagi)
Index
Chapter 1 – A Brief History of…?
Chapter 2 - NoSQL databases
Chapter 3 - The backbone I (Hadoop, HDFS, MapReduce)
Chapter 4 - The backbone II (MapReduce continued)
Chapter 5 – A quick view of the ecosystem around Hadoop
Chapter 6 - Hive
Chapter 7 - Pig
Chapter 8 - Hbase
Chapter 9 - Sqoop
Chapter 10 - Flume
Chapter 11 - Kafka
Chapter 12 - Oozie
Chapter 13 - Zookeeper
Chapter 14 - Solr
Chapter 15 - Giraph
Chapter 16 - Putting it all together
Chapter 17 – Hadoop ecosystem on Amazon
Chapter 18 – Machine Learning with Mahout
Preface
Started writing this ebook explaining Big Data to managers. Did not get time to complete it. Still uploading it for everyone to have a look.
Chapter 1 – A brief history of …
Any data set that cannot be processed on a single machine within a reasonable amount of time
is big data. The phrase "within a reasonable amount of time" is critical in determining whether the problem
at hand qualifies as a big data problem.
Theoretically, any amount of data can be processed on a single machine with a large amount of
storage and multiple processors (e.g. a mainframe). But if the machine takes a couple of days, or
even a day, the result may not be of much use to the business. If the business, customer, or
consumer of the data is OK getting the data processed in as much time as it takes on a single
machine (and there are valid scenarios for such requirements), you do not need to solve the
problem as a big data problem. But in the world today, a large amount of data is coming to
companies. Quicker analysis or processing of this data can help them get quicker insights
and make quicker decisions.
Theoretically, big data has three attributes: volume, velocity, and variety. Let's understand
each first.
In a big data problem the amount of data that needs to be processed would typically be
huge (volume). It might run into terabytes, petabytes, etc.
It would typically be coming in at a high speed (real time in some cases); that's velocity.
And it would come in a lot of variety. Variety could mean that it comes from different
sources, each of which could send the data in a different format. Even within data
from the same source, the format could vary over a period of time. Even within data from
the same source at a given time, the data may not have structure.
Having said that, why is it that companies are getting such a huge amount of
data, at such velocity and in so much variety?
The following developments over the past couple of decades led us to big data.
Digitization of organizations – Over the past three decades or so, organizations have
become more and more digitized. Every activity done by organizations has become digitized.
Every interface of the organization, be it with consumers, partnering vendors, or government
agencies, has become digitized. All this is creating a lot of data. But all this would not have
generated data (in the volume, velocity, and variety) needed to qualify as big data unless the
developments mentioned in the following paragraphs had taken place.
Web 2.0 – Web 2.0 introduced technologies which made billions of people not just
consumers of websites but content generators. Blogging and social websites are
examples of Web 2.0. Even on sites not typically classified as social or blogging sites, there
are features which enable billions of people to generate content, e.g. sharing articles
from news websites, commenting on specific content on a website, etc.
Web 2.0 is often a hotly debated term. It is not as if a new version of the web or any related
technology was released. But the web in the last decade of the last century was about the flow of
information from website owners to billions of web users. Slowly the web evolved to enable
billions of users to generate content. The content on the web today is much more
democratic. It is by the people and for the people.
Mobile devices – With the advent of mobile devices, users are performing many more
activities and spending more hours on the web than earlier. Add to that the fact that mobile
devices capture much more contextual information about the user (location, for example) than
desktops did earlier. This contextual information, if processed and analyzed, can enable
organizations to understand their consumers much better and provide much more meaningful
services and/or products. Also, the very fact that users are spending a lot of time on their mobile
devices means the amount of information being generated is much more than what
was generated when users were using desktops.
More digitization of the organizations – With more and more retail, social interaction,
and information consumption moving to the web, organizations need to track, literally, every click of a
user to understand her better. As opposed to a brick-and-mortar store, where a user can be
observed physically and assisted by in-store assistants, on an e-tail website the only way to
observe a user is to observe and analyze every click made by the user on the web. E-tail offers
an advantage over brick-and-mortar shops in the sense that user activity can be saved for
analysis later, which is not possible in a brick-and-mortar shop. This analysis needs a lot of
data, in the form of logs, to be analyzed.
Typical pattern of big data problem solution
As we all know from experience, in a typical application the time taken to process the
data is orders of magnitude smaller than the time taken for IO of the data from the data
repository (disk or database). Also, the time taken to read (IO) the data over the network
(say, from network storage or a database on another server) is many times larger than the time
taken to read the data locally from disk.
So typically when a big data problem is solved,
1. The data is distributed across multiple machines (called nodes). Transferring a
petabyte of data to a single machine would take much more time than dividing this
data into smaller chunks and transferring it to 100 smaller machines in
parallel. The IO is now done on 100 nodes in parallel, which reduces the IO time
significantly.
2. Now that the data is distributed across multiple nodes, the
code/application/binary/jar etc. is copied to all the nodes. This is unconventional
compared to a typical application, where data is brought from multiple sources to a
single machine where the application resides and is processed on that single machine.
In big data solutions it is the application that moves closer to the data.
3. Finally, the output from all the nodes is brought to a smaller number of nodes (many
times only 1 node) for final processing or summarization.
So, as you can see, the solution for a big data problem is about distributed storage (#1
above) and distributed processing (#2 and #3). The evolution of solutions for the big data
problem also happened in approximately the same manner: first, many solutions around
distributed storage arrived, and then around distributed processing.
How did it all start?
Commercial relational databases had ruled the roost when it came to persistent storage since
the 70s/80s. These relational databases had their own advantages, which made them so popular.
But they had certain limitations which did not come to the fore till the late 90s.
In the late 90s and the early part of this century, companies had more and more data to store.
The option available with relational databases was to buy bigger and bigger machines,
which were really costly.
Also, in order to keep their websites available 100% of the time (you do not expect Google to
be down, do you?), companies needed to scale out (add more active hot backups). This
made relational databases costly not only from a hardware perspective (relational databases in
companies always ran on high-grade servers) but also from a licensing cost perspective. The licensing cost
for a relational database was directly proportional to the number of machines it was going
to run on. And e-tailers and search giants had data that needed 100s and 1000s of machines
to store with fault tolerance.
Also, relational databases were designed to store data in a fixed format. They were
designed to lock in the schema at the time of database design. But companies were getting
data which was unstructured to a large extent (imagine logs). They could have formatted
this data and stored it in a structured format in a relational database, but that eliminated
any possibility of later using data discarded during the formatting stage, whose significance
was realized only afterwards. Companies were looking for persistent data storage where the schema was
locked in not at the time of database design but at the time of database read.
To summarize, organizations were running into the following limitations of relational
database storage:
Licensing cost prohibited the scaling out that was needed to store large data sets.
The cost of the higher-grade servers needed was prohibitive for creating fault
tolerance in the storage.
Relational databases were designed to lock in the schema at the time of database design.
As companies started coming up against these limitations, many of them started designing
databases of their own and releasing them publicly as open source databases.
These databases were collectively called NoSQL databases. All these databases had the following
attributes (in addition to the fact that they were open source):
They were designed to run on clusters made of commodity hardware. Unlike relational
databases, they did not need high-end servers.
They were inherently designed to run on clusters. So as the size of data increased, an
organization could just add more commodity hardware and scale out rather than buying
costly servers.
Fault tolerance was inherent in their design. Any data on one node of the cluster was backed
up on another node (the number of backups was configurable, not only at the database level but at a
much more granular level). This low-cost fault tolerance made them much more resilient on
commodity hardware than relational databases on enterprise servers.
They were designed for unstructured data. So you could just load the data in whatever format
you got it. You need not even know what information comes in the data. It was up to the
application to know what to expect in the data.
NoSQL databases also challenged the very foundation of relational databases: that
relational database updates were ACID (Atomic, Consistent, Isolated and Durable).
They questioned whether every business scenario really needed the database to be ACID
compliant. We will get into much more detail on this in Chapter 2.
While many authors do not talk about NoSQL databases when they talk about big data
technologies, NoSQL databases brought to the fore distributed storage for big datasets
as we know it today.
Sometime in the early part of this century, Google published two papers.
One of the papers was about their distributed file storage system. This was not the first
distributed file storage system in the world, but it had many architectural and design aspects
to solve a very specific problem Google had at hand. It was designed for:
Fault tolerance using commodity hardware. The data is distributed across a
cluster of commodity machines (instead of high-end servers). Since the machines are
commodity-grade, there is a high risk of failure. The distributed file system takes care of
backing up the data from each node on other nodes and recovering it in case a machine
fails.
Scenarios where files written once to the distributed file system are read multiple
times.
Random reads (reading a specific record from the file) are not required or are an
exception.
Files are required to be read sequentially in big chunks rather than one record at a
time. These big chunks are also read in a sequential manner rather than from random
places in the file.
Random writes (updating some particular record) are not needed. So you do not have a
scenario to update a record in the file.
Updates to the file are about adding/appending more data, and that too in huge chunks
rather than one record at a time.
Scenarios where a modest number of huge files need to be stored rather than a huge
number of modest/small files.
Clients (of the distributed file system) which want to process a bulk of data
faster (throughput) rather than a small amount of data quickly (latency).
The other paper from Google was about a framework they developed for processing their
data, called MapReduce. In this framework the user specifies a Map function that
transforms the data and a Reduce function that summarizes the output of the Map function.
The MapReduce framework takes on the onus of:
1. distributing the data to be processed across many nodes
2. distributing the Map and Reduce functions to all the nodes so that the code is closer to the
data and hence IO is reduced (refer to the typical pattern of big data problem solution we
discussed earlier in this section)
3. scheduling the Map and Reduce functions to run on all the nodes
4. recovering from a failed machine – the framework takes care of
restoring the data from a backup on another node and restarting the Map or Reduce
function there if a machine fails.
The MapReduce framework was designed to run simple functions on a huge amount of data.
It lets programmers write the Map and Reduce functions while it takes care of distributing the
data and code, scheduling the run, and recovering from failures.
I do not want you to get bogged down by the term MapReduce. It is similar to typical
processing of data in other applications. Here is more detail on what Map and Reduce functions
are, to make you more comfortable before we move forward.
A Map function accepts a record in the form of a key-value pair, does processing or
formatting on the record, and generates an output in the form of another key-value pair. I do
not want you to think of the "value" in the key-value pair as a single field. A "value" in the key-
value pair could be a complete record with a lot of fields or information in it. E.g. the key could be an
employee ID and the value could be all the details of that employee, or the key could be a transaction ID
and the value all the details of the transaction. It is up to the Map function to decide
what processing or formatting it wants to do on which field in the value.
Similarly, the Reduce function reads the key-value output from all the Map functions running on
different nodes and summarizes it to generate the final output.
A very simple example: say you have a huge file which has the details of the
employees of many organizations from around the world. What you want to achieve is to
calculate the average salary for each designation (assume there are standard designations). Your Map
function will read the part of the input file provided to it. The input key would be the
designation and the value would be the rest of the information about that employee. Your
Map function will read each record, and for each input record it will generate an output with the
key as the designation and the value as the salary from that record. It sounds simple, doesn't it? What is
important is that the Map function is parallelizable. You can divide your input records across as
many processing nodes as you have and run the Map function in parallel on all those nodes. The
Map function does not depend on getting information from another record on another node
while processing a specific record.
The Reduce function in our example will read records from all the nodes where the Map function ran.
Its input will be the key-value output from the Map functions. It will compute an average salary for each
designation present in the data.
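To make this concrete, here is a minimal sketch in plain Python (no Hadoop involved) of what such Map and Reduce functions could look like; the record layout and function names are invented for illustration.

```python
from collections import defaultdict

def map_record(record):
    # record is assumed to be a dict with at least 'designation' and 'salary';
    # emit (designation, salary) as the intermediate key-value pair
    return record["designation"], record["salary"]

def reduce_salaries(pairs):
    # group the intermediate values by key and average them
    totals = defaultdict(lambda: [0.0, 0])  # designation -> [sum, count]
    for designation, salary in pairs:
        totals[designation][0] += salary
        totals[designation][1] += 1
    return {d: s / c for d, (s, c) in totals.items()}

employees = [
    {"designation": "Engineer", "salary": 100},
    {"designation": "Engineer", "salary": 120},
    {"designation": "Manager", "salary": 150},
]
print(reduce_salaries(map(map_record, employees)))
# {'Engineer': 110.0, 'Manager': 150.0}
```

In a real cluster the framework, not your code, would split the input, run the map on many nodes, and feed the grouped output to the reduce.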
This led to the development of Hadoop, an open source product which delivers capabilities
similar to the ones shared by Google in their two papers. Hadoop has two components:
HDFS (Hadoop Distributed File System) – This is similar to Google's distributed file
system (as described above). As the name suggests, HDFS is a distributed, fault-tolerant file
system. It enables storing large files across a cluster of commodity machines.
MapReduce – MapReduce is a framework to process data in the form of key-value pairs by
distributing the key-value pairs across a cluster of machines. It runs in two steps. The first step
is called Map, where input in the form of key-value pairs is processed to generate intermediate
key-value pairs. The intermediate key-value pairs go through a Reduce step which summarizes
them to generate the final output.
Hadoop was quickly adopted across organizations. This eventually led to the development of
a lot of other products which extended the functionality of Hadoop further, e.g. Flume,
Sqoop, Pig, Hive, etc. We will cover each of these open source products developed
around Hadoop in subsequent chapters, in enough detail to be able to design, at a high
level, a solution to a big data business problem.
Chapter 2 - NoSQL Database
How NoSQL databases are so scalable
When we say an application or a database is scalable, we generally mean horizontal
scalability. The capacity (to process more data or take more load) of a scalable solution can
easily be increased by adding more machines to the cluster. On the other hand, the capacity
of a not-so-scalable solution either cannot be increased at all or can be increased only by
replacing the existing machine/server with a bigger (costlier) one (vertical scalability).
The way relational and NoSQL databases store data is very different. This difference
makes NoSQL databases very scalable and cluster oriented. Let's understand this with an
example.
Let's take the example of a professional networking website. Users maintain information
about their educational institutes (the schools and colleges they graduated from) in this application.
Let's also assume that the typical access pattern from the application is such that every
time the application accesses user information, it accesses her school/college information as
well.
A typical relational database design to store this information would be to save users and
educational institutes in two different tables and maintain the relationship between the two
using foreign keys (or using a third table to maintain the start and end dates of the relationship).
Typical design in a relational database:
User table: User ID, User Name, User DoB, …
Education Institute table: Education Institute ID, Institute Name, Institute City, …
Relationship table: User ID (foreign key), Education Institute ID (foreign key), Start Date, End Date
Nowhere in this design have we told the database the typical access pattern of the data from
the application, i.e. we have not told the database that every time the application accesses user
information it will access her school/college information as well.
Now let's look at the database design of a NoSQL database. A NoSQL database would typically
be designed in such a way that a user's school/college information is embedded within
the user information itself and stored at the same physical node in the cluster. So the user
information would look like:
User {Name, DoB, Current Designation, Educational Institutes [(Institute1, Start Date, End
Date), (Institute2, Start Date, End Date), …]}
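To make the embedded design concrete, here is a small sketch of such a user record expressed as a Python dictionary; the field values are invented and no particular NoSQL product is implied.

```python
# One self-contained user record; everything the application typically
# reads together lives in a single document on a single node.
user_doc = {
    "name": "ABC",
    "dob": "01/01/1990",
    "current_designation": "Engineer",
    "educational_institutes": [
        {"institute": "Institute1", "start_date": "01/06/2008", "end_date": "30/05/2012"},
        {"institute": "Institute2", "start_date": "01/08/2012", "end_date": "30/05/2014"},
    ],
}

# The list can hold zero or more institutes without any schema change.
for edu in user_doc["educational_institutes"]:
    print(edu["institute"], edu["start_date"], edu["end_date"])
```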
Note that the number of educational institutes for a user can vary from zero to any number
your application wants to allow. Relational databases are generally not conducive to storing
such open-ended, list-like information. If you try to embed the educational institutes within the
user information in a relational database, it would also become highly denormalized and
inefficient.
The NoSQL database would store the user and her educational institute information at the
same node in a cluster. If the educational institute information were not embedded in the user
information but instead maintained separately with a relation between the two (as in a relational
database), the two could be saved on different nodes. In that case, every time user information
was accessed, the application would have to connect to another node as well to get the
information on the educational institute (due to the typical access pattern of our application
described above). This would increase the IO and slow down the application.
This way of storing data makes NoSQL databases very cluster oriented and scalable. As
you get more users, you can add more nodes to the cluster and spread the data across the
cluster. When the application needs the data for a user, the database will get the data from the
node it is on.
You cannot scale a relational database in the same manner. The users and educational
institutes are maintained as separate tables. If you spread user information across the nodes
of a cluster, what should you do about the educational institutes? Typically many people
would have gone to the same institute. A relational database would maintain this many-to-one
relationship using a foreign key. But you cannot spread educational institutes across nodes,
because a user on node 1 would have gone to an institute on node 2 (and doing that increases the IO).
Please note that NoSQL makes a very strong assumption about how the data will typically be
accessed. If there are multiple ways in which the data will typically be accessed, then NoSQL
databases would probably not be a good option. In our example, what if the application also
needs to generate reports by counting users by their educational institute? In
that case the application would have to scan through all the users across all the nodes in the
database to get the output, which would be very inefficient. In such a scenario a relational
database would be a good option, or you could use NoSQL for general queries and create a
materialized view for storing counts of users by educational institute.
I hope now you can imagine how NoSQL databases store data. They spread the data
across the nodes in the cluster but ensure that data that is typically accessed together
stays on the same node (the latter needs to be ensured by good design). As the data in
the application grows, one can add more nodes to the database to scale the solution
horizontally.
Please note that, conceptually, it is not as if a NoSQL database stores one table on one node
and another table on another node. The scalability comes from the fact that it can distribute
each row of a table to a different node. Imagine (pardon the very simplistic example)
you have a pack of Pringles, a pack of chocolates, and 10 dishes (plates to serve
them). Unless you open the packs of Pringles and chocolates you cannot use the 10 dishes;
you will be able to use only 2 dishes. So the guests who want the Pringles and chocolates
would put load on those two dishes. But if you open the packs you can spread the
Pringles and chocolates across all 10 dishes. Some dishes could have only Pringles, some
only chocolates, and some a combination of both. Some would have more quantity, and you can
keep those near areas which have more guests. Others can have less quantity and can be kept near
guests who would not be consuming as much. That's scalability.
Atomicity
Leaving the technical terms aside, atomicity means that related database updates
should be done in a manner such that either all are done or none is done. In a relational
database, atomicity is ensured by the concept of a transaction. If updates to multiple
tables need to be done such that either all are done or none is done, the relational
database wraps these updates in a transaction and makes the updates. If there are issues after
updating a few tables and the rest of the tables cannot be updated, the database rolls
back the transaction, i.e. the updates already made to the tables as part of the transaction
are rolled back.
Let's take an example. Say there is a simple application that records the transactions done on
a bank account and maintains the final balance. There are two tables in the database: one
stores the transactions, the other stores the final balance. If a user executes a transaction on an
account, the application needs to make two updates. First, it needs to insert the transaction into
the transaction table. Second, it needs to update the final balance in the final balance table.
The application will indicate to the relational database that these two updates constitute one
transaction. The database will update the tables in any order but ensure that either both
updates are done or neither. If, after updating the transaction table, it runs into issues and
is not able to update the final balance table, it will roll back the updates made to the first
table and inform the application, which must have code to handle such exceptions.
NoSQL databases manage atomicity a little differently. They are atomic to an extent,
but not thoroughly.
Let's continue with the same account update example we used for understanding
atomicity in relational databases. If it were a NoSQL database, there are two ways in which
it could be designed. The final balance could be embedded in the table which lists the
transactions in an account, or the final balance could be a separate table with a relationship
between the transactions and final balance tables.
Design 1
{Bank Account ID, Final Balance, Transactions [(Transaction1 ID, Date, Credit/Debit Ind,
Amount, Other transaction details), (Transaction2 ID, Date, Credit/Debit Ind, Amount,
Other transaction details), …]}
Design 2
Table 1 (Bank Account ID, Final Balance)
Table 2 (Bank Account ID, Transactions [(Transaction1 ID, Date, Credit/Debit Ind, Amount,
Other transaction details), (Transaction2 ID, Date, Credit/Debit Ind, Amount, Other
transaction details), …])
In Design 1 the final balance is embedded within the same table which has the list of the
transactions. So the updates to the transactions and the final balance will either both be done
or neither; it cannot happen that one of them is done and the other is not. Atomicity is
ensured as much as in a relational database.
In Design 2, the final balance of an account could be stored on a node different from the
list of transactions for that account. The NoSQL database would not be able to ensure
atomicity across nodes. The application will have to ensure that either both updates are
made or none. NoSQL databases ensure that the updates made to one table are
either all done or none, but they cannot ensure this across tables.
So atomicity in a NoSQL database is ensured by design: while designing the database, if we
keep the data on which we want atomic updates together, atomicity is ensured because the
update is part of a single update. Anything more than this needs to be ensured by the
application.
Consistency
NoSQL databases were designed to run on clusters of commodity-grade machines.
Commodity-grade machines have a higher chance of failure, so any data saved on a node
must be backed up on other nodes. NoSQL databases generally store a copy of each piece of data on 3
different nodes (this is called the replication factor and is configurable). This adds complexity:
if your data is replicated across 3 nodes, you need to ensure they are in sync.
If you don't, they will go out of sync and different users accessing data from different nodes will
read different versions of the data. But if your application has to wait until all the nodes are
updated before confirming to the user that the update has been made, the
application becomes slower. In a relational database, updates on only one node would be needed
(or two in the case of a hot backup). But in NoSQL there would be 3 IOs needed (or more,
if your database is configured that way), which would make the application less responsive.
So NoSQL databases use the concept of a quorum. When updates to the database are made,
the NoSQL database does not wait for the update to reach all the nodes; it waits only for a
majority to be updated. So if the replication factor is 3, the NoSQL database waits for the
update to be confirmed by only 2 nodes (the quorum for updates). The third node becomes consistent
later. This concept is called eventual consistency, as the different nodes eventually become
consistent. What if one of these nodes fails before the updates are made to the third node? The
NoSQL database takes the latest update from the remaining two nodes on which the
data is saved and replicates it on another node.
What about the quorum while reading the data? The NoSQL database does not read the
data from all the nodes and give the result to the user; that would make the application
slower. The number of nodes from which it reads the data (the quorum for reads) should be
1 more than the number of nodes which were not part of the quorum for updates. So if your data
was replicated across 3 nodes, 2 nodes were part of the update quorum and 1 was not, the
NoSQL database would read the data from 1 + (number of nodes which were
not part of the update quorum), i.e. 1 + 1 = 2 nodes.
The general rule is:
Qu + Qr > Rf
Qu > Rf / 2
where Qu is the quorum for updates, Qr is the quorum for reads, and Rf is the replication factor.
Please note that any operation that needs more nodes to participate in a quorum will become
slower than the complementary operation (read is complementary to write and vice versa)
in the equation. But the complementary operation becomes faster due to the above
equation. So if your replication factor is 7 (Rf) and you configure the quorum for updates as
6 (Qu), then you need to read data from only 2 (Qr) nodes (refer to the above equation). The
reads will be much faster than the updates. So based on the requirements of your application
you can configure all 3 parameters of the above equation in the database. If you need
updates to be faster, a smaller update quorum (but still a majority) would be good. If you need
reads to be faster instead, you need a higher quorum for updates and a lower quorum for
reads. Some databases allow you to configure the values of the 3 parameters in the
above equation not only at the database level but also at the transaction level. So while
executing a transaction you can indicate to the database whether you want the transaction to
be confirmed by a majority or by a smaller number of nodes.
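As a quick illustration, here is a small Python helper (not tied to any particular database) that checks whether a chosen configuration satisfies the two inequalities above:

```python
def quorum_config_is_valid(rf, qu, qr):
    """Check the quorum rule: Qu + Qr > Rf and Qu > Rf / 2."""
    return (qu + qr > rf) and (qu > rf / 2)

# Replication factor 3, update quorum 2, read quorum 2 (the example in the text)
print(quorum_config_is_valid(rf=3, qu=2, qr=2))  # True

# Replication factor 7 with a large update quorum allows a small read quorum
print(quorum_config_is_valid(rf=7, qu=6, qr=2))  # True

# An update quorum that is not a majority breaks the rule
print(quorum_config_is_valid(rf=3, qu=1, qr=3))  # False
```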
Types of NoSQL databases
NoSQL databases are generally classified into four categories based on the information is
stored and accessed –
Key-value store
This is one of the easiest NoSQL database categories to comprehend. These are typical key-
hash stores. You can store and fetch any value for a given key. The database does not care
what is inside the value; you can store XML, JSON, or anything you want as a value. You can
even store different formats for different keys in the same table (called a bucket here). The onus
of making sense of the value read from the database lies with your application. This also means
the database cannot do any validation on the value, it cannot create an index on the value, and
you cannot fetch data based on any information within the value. All access is done only by the
key, which makes it very fast.
Key-value stores are typically used to store session information, shopping cart information, and
user profiles, all of which require fast access.
A table in a key-value data store is generally called a bucket, so a database can have multiple
buckets. The buckets are used to categorize keys and store them separately. For example, if you have
three different values for a key, you can merge the values into one value and store it (Design 1).
In that case the onus of reading the value and splitting it into 3 different values lies with
your application. Or you can have 3 buckets in the database and store the 3 values separately
(Design 2). The first design involves less IO and hence is faster. The second design involves more IO and
hence is slower, but the design is less complex.
Design 1
Database → Bucket 1 → Key1: (Value1, Value2, Value3)
Design 2
Database → Bucket 1: Key1 → Value1; Bucket 2: Key1 → Value2; Bucket 3: Key1 → Value3
Atomic updates are ensured for a given key-value pair. But if the application needs atomic
updates across buckets in a database, then the application will have to ensure that itself.
Examples of these databases are Redis, Memcached DB, Berkeley DB, and Amazon's DynamoDB.
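As a quick, hypothetical illustration of the key-value access pattern, here is a sketch using Redis (mentioned above) through the third-party redis-py client; the key name and session fields are invented:

```python
import json
import redis

# Assumes a Redis server is running locally on the default port.
r = redis.Redis(host="localhost", port=6379)

# The value is opaque to the database: here we store a JSON-encoded session.
session = {"user_id": 42, "cart": ["book", "pen"], "logged_in": True}
r.set("session:42", json.dumps(session))

# All access is by key; the application interprets the value.
restored = json.loads(r.get("session:42"))
print(restored["cart"])  # ['book', 'pen']
```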
Document store
These are similar to a key-value store, but access is not restricted to a key, and the value stored
(called a document in this case) is not opaque. So you can fetch data based on
different fields within the document. You can very well save your key within the document and
fetch based on that field. Indexes can be created on fields within the document.
The schema of the information within the document can vary across documents saved in
different rows. Tables are called collections.
Database / collection:
Row 1 / Document 1: {Name: ABC, LastName: XYZ, DoB: DD/MM/YYYY}
Row 2 / Document 2: {Name: DEF, LastName: HKJ, Place: Mumbai}
Please note the schema of the documents in different rows is different.
MongoDB and CouchDB are famous examples of this category.
These databases are generally used for event logging by enterprise applications and as a
datastore for document management systems.
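As a hypothetical sketch of the document-store model, using MongoDB (mentioned above) through the pymongo client; the database, collection, and field names are invented:

```python
from pymongo import MongoClient

# Assumes a MongoDB server is running locally on the default port.
client = MongoClient("mongodb://localhost:27017")
users = client["appdb"]["users"]  # collections play the role of tables

# Documents in the same collection may have different schemas.
users.insert_one({"Name": "ABC", "LastName": "XYZ", "DoB": "01/01/1990"})
users.insert_one({"Name": "DEF", "LastName": "HKJ", "Place": "Mumbai"})

# Unlike a pure key-value store, we can query by any field in the document.
print(users.find_one({"Place": "Mumbai"})["Name"])  # DEF
```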
Atomicity is maintained at the level of a single document, just as a key-value store maintains
atomicity at the level of a single key-value pair.
Column family
These databases are more difficult to understand than key-value and document stores, but they
are the most interesting ones. They are also the most relevant from a Hadoop perspective,
because Hbase, a data store on top of HDFS, belongs to this category.
These databases are accessed by a key to read multiple values/columns. Different columns are
grouped together into column families. Columns that are likely to be accessed together are generally
grouped into a column family and are stored together.
Try not to imagine the columns here as being like the columns in a relational database. The
columns in a relational database have the same name across all rows; if a row does not have a
value for a column, it is saved as null. But in a column-family database,
if a row does not have a column, it simply does not have that column at all.
The figure below shows a good example of a column-family data store:
Row 1 (Key – User1): ColumnFamily UserDtls {Name: Manjeet, MiddleName: Singh, LastName: Nagi}; ColumnFamily Education {Institute: AMC University, StartDate: 30/06/2012, EndDate: 30/06/2016}
Row 2 (Key – User2): ColumnFamily UserDtls {Name: XYZ, LastName: ABC}; ColumnFamily Education {Institute: AMC University, StartDate: 30/06/2012, EndDate: 30/06/2016}
So in the example above, the database is accessed using the key (UserID). The columns
(FirstName, MiddleName, and LastName) have been grouped together into a column family
(UserDtls), as they will be accessed together, and the columns (Institute, StartDate, EndDate)
have been grouped as another column family (Education). Please note that the columns in
the first row in column family 'UserDtls' are different from those in the second row.
Indexes can be created on different columns. While adding a new column family to a
database requires a database restart, an application can very easily add new columns within
a column family. Atomicity of updates is maintained at the level of a column family for a given key.
Since different column families for a given key can be stored on different nodes, atomicity of
updates cannot be maintained across updates to different column families for a given row.
Cassandra and Hbase are examples of these databases.
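As a rough, hypothetical sketch of the column-family model, using HBase through its Thrift interface and the third-party happybase client; the table name, row keys, and the assumption that a 'users' table already exists with UserDtls and Education column families are all invented for illustration:

```python
import happybase

# Assumes an HBase Thrift server is reachable on localhost.
connection = happybase.Connection("localhost")
table = connection.table("users")

# Columns are addressed as b"family:qualifier"; rows need not share columns.
table.put(b"User1", {
    b"UserDtls:FirstName": b"Manjeet",
    b"UserDtls:MiddleName": b"Singh",
    b"UserDtls:LastName": b"Nagi",
    b"Education:Institute": b"AMC University",
})
table.put(b"User2", {
    b"UserDtls:FirstName": b"XYZ",  # no MiddleName column at all for this row
    b"UserDtls:LastName": b"ABC",
})

# Reads are by row key; only the columns that actually exist are returned.
print(table.row(b"User2"))
```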
Before we move on to the next category of NoSQL database, I want to reiterate that
the term column should not be visualized as a column in a relational database. In
a relational database the columns are similar to the columns in an Excel sheet, in the sense that
- all the rows have all the columns defined on the table
- if a row does not have a value for a column, its value is saved as null
- the name of the column is not saved with each row
In a column-family table the columns should be imagined like the attributes in an XML document. Here
- all the rows need not have all the columns
- if a row does not have a value for a column, it does not save the value as null; it simply
does not have that column at all
- the name of the column is saved with each row
Graph
This category is the most difficult to comprehend. These databases store entities and the
relationships between them. Conceptually, entities are nodes in a graph and relationships are
depicted as directional edges in the graph. Edges can have additional attributes which
describe further properties of the relationship. Neo4j is a good example of this category.
The figure below depicts the kinds of relationships that are generally stored in such
databases.
[Figure: an example graph with entity nodes Person1, Person2, Person3, Book1, Movie1, BigData, and Org1, connected by edges such as ReportsTo, isFriendsWith, Knows, Likes, and WorksIn (the WorksIn edge carries StartDate and EndDate attributes). In the legend, a node depicts an entity, edge labels show attributes of the relationship between entities, and edges can depict one-way or two-way relationships.]
Something important to understand is how this category differs from the other NoSQL
databases in terms of scalability. A key-value data store ensures scalability by spreading the
key-value pairs across all the nodes of a cluster. It can do so because it understands (by virtue of
the database design) that it is the key-value combination that will mostly be accessed together.
A document store achieves this by spreading documents across the nodes of the cluster.
Similarly, a column-family data store achieves this by spreading key-column-family
combinations across the cluster.
But a graph data store cannot do this, as the nodes in a relationship are linked to each other.
So graph databases cannot spread the data across nodes in a peer-to-peer manner. They
achieve scalability by using a master-slave configuration of the cluster. This can be achieved in
many ways:
1. Read operations are directed to slave nodes, while write operations are directed to the
master node. Once the master is updated, a confirmation is provided to the user about the
database update; slave nodes are updated after this. Adding more and more slave nodes
makes reads more scalable. If writes need to be made more scalable, then the data
needs to be sharded across multiple masters, and the logic to do so is very specific to
the domain.
2. Writes are directed to slaves as well, but they provide confirmation to the user only after
the master has been updated. This makes writes scalable as well, without really sharding
the data.
As must be clear by now, graph databases are used more for networking problems (social or
professional networking being one such problem).
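As a hypothetical sketch of the graph model, using Neo4j (mentioned above) through its official Python driver; the connection details, labels, and relationship types are invented:

```python
from neo4j import GraphDatabase

# Assumes a local Neo4j instance with these credentials.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Entities become nodes; the relationship carries its own attributes.
    session.run(
        "MERGE (p:Person {name: $person}) "
        "MERGE (o:Org {name: $org}) "
        "MERGE (p)-[:WORKS_IN {startDate: $start, endDate: $end}]->(o)",
        person="Person1", org="Org1", start="30/06/2012", end="30/06/2016",
    )

    # Traversal queries follow edges rather than joining tables.
    result = session.run(
        "MATCH (p:Person)-[w:WORKS_IN]->(o:Org) RETURN p.name, o.name, w.startDate"
    )
    for record in result:
        print(record["p.name"], record["o.name"], record["w.startDate"])

driver.close()
```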
As must be clear from the name "NoSQL", none of these databases use SQL for database
access. Each of them has its own syntax for database operations. We have not gone into those
languages, as the objective of this book is not to get to the level of code. Having said that, the
languages for each of these databases are not very difficult to grasp.
So why did we spend so much time on NoSQL databases when the book is primarily
about the Hadoop ecosystem? Because one of the open source products in the Hadoop ecosystem, Hbase, is a
column-family store built on top of HDFS. We will cover Hbase in detail in
Chapter 8.
Chapter 3 - The backbone (Hadoop, HDFS, MapReduce)
Hadoop is an open source software framework/product that implements a distributed file
system (HDFS) and processes your map-reduce (MapReduce) solutions.
I do not want you to get bogged down by the term MapReduce. It is similar to typical
processing of data in other applications. If you recall set theory and functions from
our old school days, you will realize that we learnt about Map back in school.
A Map is a function that processes input data to produce output data. E.g.
f(x) = x²
The above is a map function. It processes any number x and produces its square. Another
way to look at the above function is using set theory (again something we learnt in
school): there are two sets, A and B, and the function, or map, f(x) maps each
number in Set A = {1, 2, 3, 4} to its square in Set B = {1, 4, 9, 16}.
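As an aside, plain Python's built-in map expresses exactly this idea:

```python
set_a = [1, 2, 3, 4]
set_b = list(map(lambda x: x * x, set_a))  # apply f(x) = x^2 to every element
print(set_b)  # [1, 4, 9, 16]
```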
In our enterprise applications the functions, or maps, are more complicated, but they are still
functions/maps. E.g. let's say we have a mutual fund transaction processing batch system
which receives transactions from sales agents and processes them. The first program or
script in the transaction processing system would typically do some formatting on a
transaction, validate it, and persist the transaction in the data store. So our first program is a
function, as depicted below:
f(Input Transaction) = Formatted, Validated, Persisted Transaction
Or we can imagine our program as a map, as shown below.
Set A is our set of input transactions; that is, Set A is our input file. f(x) is our program, which
maps an input transaction Tn to Tnfvp, a formatted, validated transaction persisted
in the data store.
A Reduce function is just a program which reads a group of records and produces a summary
of those records. Extending the same example, there could be a program at the end
of the transaction processing system which sums all the transactions and produces a sum total
of the amount of transactions processed that day (Fig 1 below). Or it could produce a sum
total of transactions separately for each mutual fund product (Fig 2 below).
[Figure: the map f(x) maps Set A = {T1, T2, T3, T4, T5} (input transactions) to Set B = {T1fvp, T2fvp, T3fvp, T4fvp} (formatted, validated, persisted transactions).]
[Fig 1: a Reduce reads T1fvp…T4fvp and produces the sum of the amount of transactions processed.]
[Fig 2: a Reduce reads T1fvp…T4fvp and produces separate sums of the amount of transactions processed for Product A and for Product B.]
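A minimal sketch in plain Python of the Fig 2 style of Reduce (the transaction records and field names are invented), grouping processed transactions by product and summing the amounts:

```python
from collections import defaultdict

processed = [
    {"id": "T1fvp", "product": "Product A", "amount": 100.0},
    {"id": "T2fvp", "product": "Product B", "amount": 250.0},
    {"id": "T3fvp", "product": "Product A", "amount": 75.0},
    {"id": "T4fvp", "product": "Product B", "amount": 25.0},
]

totals = defaultdict(float)
for txn in processed:
    totals[txn["product"]] += txn["amount"]  # reduce: sum amounts per product

print(dict(totals))  # {'Product A': 175.0, 'Product B': 275.0}
```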
Hadoop MapReduce is best suited to cases where the processing of a data record does not depend,
especially during the Map function, on other data records in the input. So if your Map program
(in the example) needs information from some of the other transactions when processing a
transaction Tn, then Hadoop MapReduce is not the best solution for it. The reason is that
Hadoop MapReduce distributes the transactions across the nodes of the cluster and sends the
Map function (your program in this case) to all these nodes to process the transactions. If
your program needs information from other transactions while processing a transaction Tn,
there will be network overhead to get the details of transactions on the other nodes. This
network IO could slow down the Map.
In the case of Reduce, the program may need input from other transactions (say, if it is summing
all the transactions), but Reduce is generally processed on very few nodes. Even when
Reduce is processed on more than one node (still very few nodes compared to
Map), the data is divided amongst the nodes in such a manner that the Reduce on one node will not
need information from data on another node executing Reduce. In the example above, if you
want to sum the transactions by product, you could send the transactions for Product A to
node 1 and those for Product B to node 2 and run Reduce on both nodes. Reduce will
calculate the sum for each product separately on each node.
Hadoop consists of two components – HDFS and MapReduce. We will understand each of
these in more detail in the sections below.
HDFS
HDFS (Hadoop Distributed File System) is, as its name suggests, an open source distributed
file system to store huge amounts of data. It splits the files that need to be stored into small
blocks and stores those blocks on different nodes of a cluster, while letting the users
(applications, software, frameworks which use HDFS for storage) still view the file as a
single, unified, un-split file. So the distribution of the file across the nodes of the cluster
is not visible to the user.
At this stage it is important to reiterate that HDFS is suitable only for certain scenarios.
These are:
Scenarios where files written once to the distributed file system are read multiple
times.
Random reads (reading a specific record from the file) are not required or are an
exception.
Files are required to be read sequentially in big chunks rather than one record at a
time. These big chunks are also read in a sequential manner rather than from random
places in the file.
Random writes (updating some particular record) are not needed. So you do not have a
scenario to update a record in the file.
Updates to the file are about adding/appending more data, and that too in huge chunks
rather than one record at a time.
Scenarios where a modest number of huge files need to be stored rather than a huge
number of modest/small files.
Clients (of the distributed file system) which want to process a bulk of data
faster (throughput) rather than a small amount of data quickly (latency).
HDFS works on a master-slave architecture. The master node (generally 1) has the Namenode and
SecondaryNode daemons (or processes) running on it. All the other nodes in the HDFS cluster
are slave nodes and have the DataNode process/daemon running on them. The blocks of
any data file are actually saved on the slave machines where the DataNodes are running; the master
node only has metadata about each block of these files.
[Figure: HDFS architecture: a Master Node running the Namenode and SecondaryNode daemons, an HDFS client, and multiple Slave Nodes each running a DataNode daemon.]
Namenode, SecondaryNode and DataNode
Namenode and SecondaryNode are the processes/daemons that run on the master node of
the cluster. The Namenode stores metadata about the files stored on HDFS; it stores information
about each block of each file. It does not read or write blocks of files on the DataNodes. During
a write operation it only tells the HDFS client the nodes where the blocks of a file can be
stored. Similarly, during a read operation it only tells the HDFS client the
DataNodes where the blocks of each file are stored. It is the HDFS client that stores or reads
the blocks of the files by connecting to each DataNode.
The metadata is stored in a file named fsimage on disk. When the Namenode is started,
the metadata is loaded into memory. After this, all metadata updates (about new files
added, old files updated or deleted) are kept in memory. This is risky for the obvious
reason that if the Namenode goes down, all the updates since the last restart would be lost. So
the Namenode also stores the updates in a local file named edits. This eliminates the risk only
to some extent: if the Namenode goes down and needs to be restarted, it will have to merge the
edits file into the fsimage file, which slows down the restart of the Namenode. This risk is further
brought down by adding a SecondaryNode. The SecondaryNode daemon/process merges the
edits file on the primary node with the fsimage on the primary node and replaces the existing
fsimage file with the new merged file.
Challenges or limitations of the HDFS architecture
Since the Namenode stores all the metadata, and the whole cluster becomes useless if it goes bad,
the Namenode is a single point of failure. Hence the physical machine on which the Namenode and
the SecondaryNode daemons run should be of a robust standard, not of the same
specification as the machines on which the DataNodes run (which can be commodity
machines). For the same reason, the Namenode should also be backed up frequently to
ensure the metadata can be restored in case the Namenode cannot be restarted after a
failure.
[Diagram: HDFS architecture - a MasterNode running the Namenode and SecondaryNode daemons, several slave nodes each running a DataNode, and the HDFS client talking to the MasterNode for metadata and to the DataNodes for data.]
Also, as we know, the Namenode loads all the metadata from the fsimage file into memory at
start-up and operates on this in-memory data during operations. The metadata for each block of a file
takes about 200 bytes. This adds a limitation on the usage of HDFS. Storing a huge file
broken into blocks works fine on HDFS. But storing too many small files (smaller than
the block size on HDFS) creates a metadata overload which clogs the memory on the
Namenode. This is primarily the reason HDFS is not a suitable distributed storage for
smaller files.
As you would have observed by now, the Namenode can become a bottleneck, as all the read and
write operations on the cluster need to access the Namenode for the metadata they
need. This bottleneck problem was addressed in later versions of Hadoop (Hadoop
2.0/YARN).
Each block of a file is saved on the slave nodes running the daemon/process called
DataNode. DataNodes also send regular messages (called heartbeats) to the Namenode. These
heartbeats inform the Namenode that a specific DataNode is up and running.
Replication of data
Since the DataNodes run on commodity machines and the chances of these machines
going down are high, each block of a file is replicated on 3 (the default, which can be changed) different
DataNodes. The first replica is stored on a node chosen at random. The second replica is stored on a
DataNode which is on a different rack. This protects against a rack failure. The third replica is
saved on a different machine on the same rack as the second replica. The chances of multiple racks going down
are low, so saving the third replica on a different node on the same rack as the second
replica does not increase the risk of failure. Saving the third replica on a machine on a
third rack would increase the network IO and make the read and write operations slower, as
different copies of the replicas are accessed during the read and write operations. Please
note the number of replicas can be configured at the HDFS level as well as for each file.
Increasing the number of replicas makes HDFS operations slower as the IO increases.
Typical read-write operations in HDFS
When a file needs to be written to HDFS, users/applications interface with the HDFS client. The
client starts receiving the file. Once the received data reaches the size of a block, the client
works with the Namenode to find out on which DataNodes each block of the file can be saved.
Once it gets this information it sends the block of the file to the first DataNode, which starts
writing the block to its disk and at the same time starts sending it to the second DataNode where
its replica needs to be saved. The second DataNode starts writing it to its disk and starts sending it
to the third DataNode. On completion of the write, DataNode 3 confirms to DataNode 2, which confirms to
DataNode 1, which eventually confirms to the HDFS client, which in turn confirms to the Namenode. Once the
Namenode gets the confirmation it persists the metadata information and makes the file visible
on HDFS. This process is repeated for each block of the file and the complete file is saved in this
manner. A checksum on each block is calculated and saved in HDFS to validate the integrity of
each block when the file needs to be read.
A similar process is followed at the time of a read. When a file needs to be read, the HDFS client
gets the DataNode information from the Namenode for each block and reads it from the
DataNode. The checksum is calculated again and matched with the checksum saved at the time
of the write to validate integrity. If the read fails from a DataNode (the node is down, or the checksum
fails), the block is read from a node holding a replica.
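To make the role of the HDFS client concrete, here is a minimal Java sketch using the Hadoop FileSystem API. It is an illustration only: the Namenode URI, file path and record contents are made-up examples, and error handling is omitted. The client never talks to a DataNode directly in this code; the FileSystem API asks the Namenode for block locations and then streams the data for us.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // hypothetical Namenode address
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/transactions/part-0001.csv"); // hypothetical path

        // Write: blocks are placed on DataNodes chosen by the Namenode; replication happens behind the scenes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("TXN001,2016-03-01,120.50,FUND-A\n");
        }

        // Read: the client learns from the Namenode which DataNodes hold each block and reads from them.
        try (FSDataInputStream in = fs.open(file)) {
            byte[] buffer = new byte[4096];
            int bytesRead = in.read(buffer);
            System.out.println(new String(buffer, 0, bytesRead, "UTF-8"));
        }
        fs.close();
    }
}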
In the above read-write operation we assumed a replication factor of 3. This factor can be
configured at the HDFS level or at the file level. Even after a file has been written to HDFS, its
replication factor can be changed. If we reduce the replication factor of a file, HDFS deletes some of the
block replicas to bring the file down to the new replication factor.
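As a small illustrative sketch (not the only way to do it), the replication factor of an already-written file can be changed through the same FileSystem API; the path below is hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChangeReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/transactions/part-0001.csv"); // hypothetical path
        // Ask HDFS to keep only 2 replicas of this file; surplus block replicas are deleted in the background.
        boolean accepted = fs.setReplication(file, (short) 2);
        System.out.println("Replication change accepted: " + accepted);
    }
}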
When a file is deleted by the HDFS client, only the metadata information is updated to mark it as
a deleted file. This makes deletion faster. The actual deletion of the file happens later.
All the DataNodes send messages, called heartbeats, to the Namenode every 3 seconds. If the
Namenode does not receive a message from a DataNode it assumes the DataNode has failed. Since the
Namenode maintains the metadata of the file blocks saved on each DataNode, and
also of the other nodes on which they are replicated, it recreates those blocks on other nodes which
are running and updates its metadata information.
MapReduce
MapReduce is an open source framework to execute your Map (and Reduce) programs on a
cluster of machines. MapReduce copies your Map program (provided by you) to each node
on which a block of the input file is stored and runs it on that node to process the block of input
data. Once all the nodes in the cluster have run their Map programs, MapReduce copies
the output from all the nodes to a smaller set of nodes, where it copies the Reduce
program (again provided by you) and runs the Reduce program on each of this smaller set
of nodes to process and summarize the output of the Map step. Though this is a simplified
view of MapReduce, this is what it does. As we progress through this chapter and the next we will see
a more complete and detailed view of MapReduce.
Just like HDFS, MapReduce also works in a master-slave configuration. The master machine has
a daemon, named JobTracker, running on it. All the other machines in the cluster are slave
machines and have a daemon, named TaskTracker, running on them.
JobTracker and TaskTracker
The JobTracker is responsible for coordinating with all the TaskTrackers on the slave nodes where
the Map and Reduce programs are run. It checks with the Namenode (of HDFS) where the
blocks of the input files are kept. It sends the Map and Reduce programs to those nodes. It asks the
TaskTracker on each of the slave nodes to run the Map and Reduce programs. It keeps
receiving heartbeats from the TaskTrackers to check that they are fine. If a TaskTracker does not
send its heartbeat, the JobTracker assumes it has failed and reschedules the Map/Reduce
program running on that node on another node which has a replica of that data.
Just like the Namenode in HDFS, if the JobTracker goes down, the whole cluster
running MapReduce becomes useless. So the JobTracker must be run on a machine with
better specifications than those of the machines running the TaskTrackers.
Chapter 4 - The backbone II (MapReduce continued)
Sample MapReduce solution
Let’s look at a very simple MapReduce solution. Let’s say you have billions of successful sale
transactions of all the mutual fund products of all the mutual fund companies in the USA since
1975. You need to sum the transactions by year of sale. Your input file has a record for each
transaction. Each record has the transaction date, the transaction amount and other transaction
details. While this problem can be solved by processing the transactions on a single machine,
the chances of it overwhelming even a high-end machine are very high. Even if it completes
successfully it would take a lot of time. You can solve this problem much more easily by
distributing the transactions over a cluster and processing them in parallel.
You need to write a Map program which will read a transaction and emit Year of sale and
transaction amount to the output file. You need to write a Reduce program which will take
multiple records (for a given year) with Year of sale and transaction amount as input and
generate an output where the transaction amounts are summed, emitting Year of sale and
summed transaction amount as output. So,
Map program: input is a full transaction record (TransactionID, TransactionDate, TransactionAmount, Mutual Fund Product, ...); output is (Year of Transaction, TransactionAmount).
Reduce program: input is multiple records for a given year, (Year of Transaction, TransactionAmount1), (Year of Transaction, TransactionAmount2), (Year of Transaction, TransactionAmount3), ...; output is (Year of Transaction, Sum of Transaction Amounts).
For the sake of simplicity, our Reduce program assumes that all the transactions it receives
belong to the same year. It just needs to sum all the transactions and emit the sum along
with the year from any of the transactions.
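To make this concrete, here is a minimal, illustrative Java sketch of such a Map and Reduce pair using the org.apache.hadoop.mapreduce API. The class names, field positions and date format are assumptions for the example, not taken from a real transaction file, and the Reducer shown sums per key rather than relying on the one-year-per-reducer assumption.
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Assumes each input line looks like: TransactionID,TransactionDate,TransactionAmount,Product,...
// with the date formatted as yyyy-mm-dd.
public class SalesByYearMapper
        extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(",");
        String year = fields[1].substring(0, 4);         // year of sale from the transaction date
        double amount = Double.parseDouble(fields[2]);   // transaction amount
        context.write(new Text(year), new DoubleWritable(amount));
    }
}

class YearSumReducer
        extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text year, Iterable<DoubleWritable> amounts, Context context)
            throws IOException, InterruptedException {
        double sum = 0;
        for (DoubleWritable amount : amounts) {
            sum += amount.get();                         // add up all amounts for this year
        }
        context.write(year, new DoubleWritable(sum));    // emit (Year of sale, summed amount)
    }
}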
Once your programs are ready you will provide the following inputs to Hadoop
1. Name of the Map Program
2. Name of the Reduce Program
3. Name of the Partition program. If we skip this then Hadoop uses the default Partition
class available to it. We will learn about it later in the chapter.
4. Name of the Combiner program. If we skip this then Hadoop skips the Combiner
step. We will learn about it later in the chapter.
5. Jar file (in case your programs are in Java) and the path from which to pick it up
6. Path to your input file with billions of transactions
7. Number of reducers you want to run. We will specify 1 reducer for each year since
1975, so a total of 42 reducers. This will ensure each reducer receives transactions of
only 1 year. (A sketch of a driver program that supplies these inputs follows this list.)
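A minimal, illustrative driver wiring these inputs together, assuming the hypothetical SalesByYearMapper, YearSumReducer and YearPartitioner classes used elsewhere in this chapter and the standard org.apache.hadoop.mapreduce API (input and output paths come from the command line):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SalesByYearDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sum-transactions-by-year");
        job.setJarByClass(SalesByYearDriver.class);           // 5. the jar that holds our classes

        job.setMapperClass(SalesByYearMapper.class);          // 1. the Map program
        job.setReducerClass(YearSumReducer.class);            // 2. the Reduce program
        // job.setPartitionerClass(YearPartitioner.class);    // 3. optional custom Partition program
        // job.setCombinerClass(YearSumReducer.class);        // 4. optional Combiner program

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0])); // 6. input file with the transactions
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setNumReduceTasks(42);                            // 7. one reducer per year since 1975

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}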
Hadoop will take the input file, split it into multiple blocks and store these on multiple
nodes of the cluster (as described in Typical read-write operations in HDFS).
JobTracker will then copy your Jar (which has the Map and Reduce programs) to each of the
nodes which has a block of input file (it will get this information from Namenode of HDFS).
Hadoop will then run the following steps to execute your Map and Reduce programs. Please
note in the diagram below which phase runs on Map node and which on Reduce node.
[Diagram: the MapReduce pipeline - Map, Partition and Combine run on each Map node; Shuffle, Sort and Reduce run on the Reduce nodes.]
Map
This phase will run your Map program on each node which has a block of your data file (not
on the replicas). The output of this phase will be a file on each node with Year of sale as key
and transaction amount as value. The output file on each node may have records for multiple
years from 1975 to 2016.
Partition
In this phase MapReduce will take the output of the Map on each node and partition it into as
many files as there are Reducers (42 in our case). It does this by partitioning the output file
of each Map by key. So the output file of each Map step will be partitioned into (at most) 42 files,
each of which will have the transactions of one year on that node. Partitioning the output file of a
Map by the key is the default Partition behavior. It can be customized to partition by some
other criteria, as we will see in the next chapter. If we do not provide any Partition class
to Hadoop, it will use the default class available to partition the Map output by the
key in the Map output.
[Diagram: the output file from each Map (Map 1, Map 2, Map 3) is split by key into separate files, one per Reducer.]
Partition comes into action only if the number of Reducers is going to be greater than 1. If only 1
Reducer is going to be used, there is no need for partitioning, as all the records from all the
Maps need to go to only one Reducer.
Partition ensures that records for a specific key go to the same reducer from all the
nodes. Reducer nodes will connect to the nodes on which Maps are running and collect the
files generated only for them (based on the file name). But it does not ensure all the reducers get
an equal load. Roughly, it divides the keys equally between reducers. If some key has more
records in the Map output than other keys, then the reducer that is assigned that key will take
more time to complete. There are ways to ensure the load is equally divided between the
reducers. We will see how this is done later in this chapter.
The default Partition program does the partitioning by calculating an index for the key of
each record to be written to the output:
Index = Hash of Key % Number of reducers to run
A hash is simply a function that maps a given input to a number. A given key will always generate the
same output when run through the hash function, so all records with that key land in the same partition.
(Different keys can occasionally hash to the same value; that only means they end up at the same reducer.)
% is the simple modulo function from mathematics: A % B gives the remainder left when A is
divided by B. For example, if the hash of a key is 100 and there are 42 reducers, the index is 100 % 42 = 16.
Different index values are assigned to different reducers. Based on the index value calculated
for a key, all the records with that key are written to the output file for the reducer which has
that index value assigned to it. Different keys may go to a single reducer, but a given key will
not go to multiple reducers.
We can override all this default behavior of the Partition program by extending
the default class and customizing the partitioning method. For example, in our case we can override
the default behavior by partitioning simply by the key (which is the year of transaction)
instead of the default behavior of calculating an index. A sketch of such a custom partitioner follows.
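The sketch below assumes the Map output key is the year of sale as Text and the value is the transaction amount as DoubleWritable; the class name is illustrative.
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sends each year since 1975 to its own reducer; numPartitions is expected to be 42.
public class YearPartitioner extends Partitioner<Text, DoubleWritable> {
    @Override
    public int getPartition(Text year, DoubleWritable amount, int numPartitions) {
        int y = Integer.parseInt(year.toString());
        return (y - 1975) % numPartitions; // 1975 -> reducer 0, 1976 -> reducer 1, ...
    }
}
// The default behavior it replaces is roughly: (key.hashCode() & Integer.MAX_VALUE) % numPartitions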
Combiner
This is an optional step. Please note that there are millions of transactions on each Map
node. Our Map program does not remove any transaction from further processing. So the
output of each map will also have millions of transactions (though each with two fields, year
of sale and transaction amount). So there are billions of records spread across the
partitioned outputs of the Mappers across multiple nodes. Sending these records to 42 Reduce
nodes will cause a lot of network IO and slow down the overall processing. This is where a
Combiner can help.
Since the Reducer is going to sum all the transactions it receives, with the assumption that all
the transactions it receives belong to the same year, we can run the same summation on
each partitioned output of each Mapper. So thousands of records in each partitioned output
of a Mapper will be summed into one record. A Combiner will sum the transactions in each
partitioned output of the Partition step. It will take all the records in one partition, sum the
transaction values and emit Year of sale as key and sum of transaction amount as value. So
for each partitioned output (which has thousands of records), the Combiner will generate
only one record. This reduces the amount of data that needs to be transmitted over the network.
If you think about the behavior of the Combiner, it is like running the Reducer on the Map node before
transmitting the data to the Reduce node.
[Diagram: on each Map node the Combiner sums the values within each partitioned output file, producing one combined record per key per Reducer, e.g. (Key1, Sum(value1, value2)).]
The diagram above shows how the Combiner works. It does not show the scale by which it
reduces the number of transactions which need to be transmitted to the Reducer nodes. Imagine
thousands of transactions for a key in a file: the Combiner generates only one
transaction which summarizes all of them and needs to be transmitted to the
reducer node, so the amount of data to be transmitted reduces significantly.
As I said earlier, the Combiner step is optional and we have to tell Hadoop to run the Combiner.
Whether a Combiner can be used in your solution is very specific to the problem you are
trying to solve in MapReduce. If some processing can be done on the Map output locally
to reduce the amount of data before transmitting it to the Reducer nodes, then you should
think about running Combiners. We also need to write a Combiner program and add it to the jar
which we provide to Hadoop. We also inform Hadoop that a Combiner needs to be run. This
can be done by providing the Combiner class to Hadoop just like we provide the Map and/or
Reduce class to Hadoop, as sketched below.
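In our example the Combiner performs the same summation as the Reducer, so (assuming, as here, that the Reducer's output types match its input types) the same hypothetical class can be registered for both. Continuing the illustrative driver sketch from earlier:
// Fragment of the hypothetical SalesByYearDriver shown earlier (job is the org.apache.hadoop.mapreduce.Job):
job.setReducerClass(YearSumReducer.class);
job.setCombinerClass(YearSumReducer.class); // reuse the Reducer as the Combiner on the Map side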
Shuffle
Meanwhile, MapReduce would have identified 42 nodes that need to run the Reduce
program and assigned a Key (Year of sale) to each of them. The TaskTracker on each of these
nodes will keep scanning the nodes on which the Maps are running, and as soon as it finds an
output file generated for its processing (based on the name of the file) it will copy the file to its
node. Once a Reducer node gets all the files for its processing, MapReduce will go to the next
step.
In the diagrams that follow we assume that there are only 2 nodes for the Reduce phase and that
MapReduce assigned Key1 to the Reducer on Node 4 and Key2 and Key3 to the Reducer on Node 5. We could
have assumed 3 nodes for the Reduce phase as well and assigned one key to each node executing
the Reduce phase. But keeping only two nodes for the Reduce phase and assigning two keys (Key2
and Key3) to the Reducer on Node 5 will help you understand the Sort phase better.
Sort
Each Reduce node would have received files from multiple Map nodes. So in this step
MapReduce will merge all the files into one and sort all the input records to a Reducer
by key (Year of transaction in this case).
[Diagram: Shuffle - the combined output files are copied from the Map nodes (Node 1, 2, 3) to the Reduce nodes; Node 4 receives the Key1 records and Node 5 receives the Key2 and Key3 records.]
Please note the Sort phase is run by default; the Reduce phase must get its data sorted by keys. We
can override the default behavior of sorting by key by extending the default class. We can
override the default class to sort the input to Reduce by keys as well as values (or a part of the
value) if our scenario expects that. A sketch of such a custom sort comparator follows.
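As a minimal, illustrative sketch of overriding the sort order (the class name is hypothetical and the example simply reverses the default ascending ordering of the Text keys):
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class DescendingYearComparator extends WritableComparator {
    public DescendingYearComparator() {
        super(Text.class, true); // true => instantiate keys so compare() receives real Text objects
    }

    @Override
    @SuppressWarnings({"rawtypes", "unchecked"})
    public int compare(WritableComparable a, WritableComparable b) {
        return -a.compareTo(b); // flip the natural (ascending) ordering of the keys
    }
}
// Wired into the job with: job.setSortComparatorClass(DescendingYearComparator.class);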
Reduce
Reducer will sum all the transactions in a file to generate {Year of Sale, Sum of Transaction}
as output.
[Diagram: the sorted input on each Reduce node - Node 4 holds the Key1 records and Node 5 holds the Key2 and Key3 records, merged and sorted by key.]
Please note that if you are sending multiple keys to your Reduce phase then your Reduce
program should be able to handle that; the diagram below assumes this. But in
the example we have been going through in this chapter, we assumed each instance of the
Reducer will get only 1 key.
[Diagram: Reduce - each Reduce node sums the sorted values for each key and emits one record per key, e.g. Node 4 emits (Key1, Sum(Sum(value1, value2), value7)).]
Chapter 5 – A quick view of the ecosystem around Hadoop
By now we have understood the capabilities of Hadoop quite well. HDFS (Hadoop
Distributed File System) offers a distributed storage for huge amount of data using a cluster
of commodity hardware. The distribution of the data on the cluster is transparent to the end
users or the applications interacting with HDFS. The data is also replicated within the cluster
to provide failover in case a machine in the cluster goes kaput.
MapReduce sits on top of HDFS and provides capability to process MapReduce programs on
data stored on the HDFS cluster.
Over a period of time a lot of open source products have cropped up which either enhance the
capability of Hadoop further or overcome the limitations of the Hadoop framework.
These new products can be grouped into the following four categories:
1. Ingestion – While we have huge storage available in HDFS, transferring huge amounts
of data from the sources available within enterprises can be daunting. Products like
Sqoop, Flume and Kafka offer the capability to move data from our enterprise sources
into HDFS and vice versa. While Sqoop is used for importing data from SQL data
sources within the enterprise, Kafka and Flume are used to import data from non-SQL
data sources (logs etc.). Kafka and Flume have some finer differences between
them and we will see those as we move forward.
2. Processing – While MapReduce offers capability to process data stored on the HDFS
cluster, in order to use MapReduce one must know coding. The coding required to
develop MapReduce programs is quite complicated. Many times you need your
business users to be able to process the data stored on HDFS. Even for technology
teams, developing MapReduce programs in Java or any other language could be
inefficient. So frameworks or products were required which could ease the task of
processing data stored on HDFS. Pig and Hive are products which make it easier to
process data stored on HDFS. Hive offers a language, HQL, very similar to SQL,
using which we can query the data in HDFS. Pig offers an easy-to-learn
language called Pig Latin, using which ETL (extract, transform, load) style
procedural programs can be developed to process the data on HDFS. Both HQL
queries and Pig Latin programs eventually get converted into MapReduce programs
at the back end and get executed. Thus Pig and Hive offer a higher level of abstraction
compared to the Java program that one has to write to develop a
MapReduce program directly.
3. Real-time systems – MapReduce is designed for high-throughput processing rather
than low-latency processing. It can process huge amounts of data but it has some
kick-start time. It is not designed to process data quickly and turn around. The initial
start time needed for Hadoop to identify the nodes for Map and Reduce, transfer the
code to these nodes and kick-start the processing makes MapReduce unsuitable for real-
time processing where you need the response to your query/program quickly. Hbase
offers such a capability. Hbase basically uses the distributed storage offered by HDFS
to offer key-value datastore services (refer to Chapter 2 – NoSQL Database to
recall what a key-value store is). So it is a key-value type of NoSQL database using
HDFS for storing the keys and the values (a tiny illustrative sketch follows this list).
4. Coordination – There are two products in this category that are used when designing
solutions for big data processing using Hadoop. Oozie is a workflow scheduler to manage
Hadoop jobs. Zookeeper is used for coordination amongst the different products in the
Hadoop ecosystem.
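To give a flavor of the key-value style of access Hbase offers, here is a minimal, illustrative Java sketch using the HBase client API. The table name ("transactions") and column family ("details") are assumptions for the example, and the table is assumed to already exist.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HbaseKeyValueExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("transactions"))) {
            // Write a value against a row key ...
            Put put = new Put(Bytes.toBytes("TXN-0001"));
            put.addColumn(Bytes.toBytes("details"), Bytes.toBytes("amount"), Bytes.toBytes("120.50"));
            table.put(put);
            // ... and read it back by key, with low latency.
            Result result = table.get(new Get(Bytes.toBytes("TXN-0001")));
            byte[] amount = result.getValue(Bytes.toBytes("details"), Bytes.toBytes("amount"));
            System.out.println("amount = " + Bytes.toString(amount));
        }
    }
}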
So keeping these products in mind, the ecosystem that developed around Hadoop looks like this:
[Diagram: the Hadoop ecosystem - HDFS and MapReduce at the core; Hive (SQL-like processing), Pig (ETL-like procedural processing) and Hbase (key-value store using HDFS) for processing; Sqoop (ingest from SQL data sources), Flume and Kafka (ingest from non-SQL data sources) for ingestion; Oozie (Hadoop job scheduling) and Zookeeper (coordination amongst products).]
The subsequent chapters will each pick up one product from the ecosystem and explain
it in detail. Since we have already understood MapReduce, which is for processing of data,
we will take up the processing category (Hive, Pig, Hbase) first. Amongst this category, we
will take up Hive first. Understanding Hive is easy as the programming is done using HQL,
which is very similar to SQL, which most of us understand well. Next we will take up Pig,
which again is easy to understand as the programming language Pig Latin is very easy. Hbase
is more difficult to understand compared to Hive and Pig, so we will take it up last in this
category.
Next we will take up the Ingestion category of products. We will take up Sqoop first, the
reason again being that this product is related to the SQL world, to which we can all relate.
Next we will move to Flume as it originated before Kafka. Once we understand Flume we can
identify its limitations and see how Kafka overcomes those.
Finally, we will move to Oozie and Zookeeper, as understanding the other products in detail will
help us appreciate these two products better.
Chapter 6 – Hive
Why Hive?
If we look back at the example of transaction processing we took up in chapter four, we are
essentially doing the following to the transactions:
1. Select certain fields from each transaction.
2. Group the transactions by year(by sending them to different Reducers)
3. Sum the transaction amount for each group
If you are even remotely aware of the SQL world, the equivalent in SQL is something like
Select Transaction.Year, SUM (Transaction.Amount)
From Transaction
Group By Transaction.Year
In case we wanted to filter out some transactions from processing, we could have added the
filter in the Map (just an if condition). Let’s assume we want only those transactions to be
processed which have ‘Purchase’ in a field named ‘type’. In the Map program that you
develop for the processing you would add an if condition to process only those transactions
which have the value ‘Purchase’ in the field named ‘type’. The equivalent SQL would be
Select Transaction.Year, SUM (Transaction.Amount)
From Transaction
Where Transaction.type=’Purchase’
Group By Transaction.Year
Let’s also consider a scenario where the transaction has another field named “ProductCode”
which has a numeric code for the financial product on which the transaction was done. We also
have a file which has a mapping between the “ProductCode” and the “ProductName”. If we need
the field “ProductName” in the final output from the Reducer, and also want to sum the
transactions by Year and ProductName instead of only Year of transaction, the MapReduce
processing would be modified as below:
Map:
1. Select transaction with ‘Purchase’ code in the ‘type’ field of transaction for further
processing in Map
2. Output year, product code and amount for each transaction with ‘Purchase’ in the
transaction type field.
Partition:
1. Partition transactions by year so that transactions for each year go to a different
Reducer.
Combiner:
1. Sum the transactions on each Partition by Year and ProductCode.
Shuffle:
36. Big Data & Hadoop ecosystem essentials for managers
(Manjeet SinghNagi - https://in.linkedin.com/in/manjeetnagi)
Big Data & Hadoop ecosystem essentials for managers
(Manjeet SinghNagi - https://in.linkedin.com/in/manjeetnagi)
Each Reducer picks its files from the Map nodes.
Sort:
1. Sort the transactions by Year and Product Code
Reducer:
1. Load the file which has ProductCode-ProductName mapping into the memory.
2. Sum the input transactions by Year and Product Code. This time this step will sum
the transactions coming from different Map nodes (in the Combiner the same
processing summed only the transactions on each individual node).
3. Just before writing a Sum read the ProductCode-ProductName mapping from
memory (loaded earlier in the Reducer) to resolve ProductCode in the output record
to ProductName.
4. Write the Sum of transactions by year and product name to the output.
The SQL equivalent of the above processing would be
Select Transaction.Year, Product.ProductName, SUM (Transaction.Amount)
From Transaction, Product
Where Transaction.type=’Purchase’
And Transaction.ProductCode=Product.ProductCode
Group By Transaction.Year, Product.ProductName
By now you would have noticed that it takes only a few lines of SQL code to do the processing
that we are trying to do in MapReduce. When it comes to writing Java programs for MapReduce:
1. The number of lines of code is large.
2. There are many libraries that need to be imported.
3. You need to be aware of which out-of-the-box class file to extend for our specific
requirement.
4. There are variables to be defined, set and reset, and all the other complications
involved in any programming.
5. There are steps for building the jar.
When you have so much raw data residing on HDFS, is there no easier way to
process the data? Is there no way a business person, or a person with a limited technology skill
set, can process and analyze the data? Is there a tool/framework which can
1. take queries in a form similar to the SQL written above,
2. do the laborious work of developing the Map, Reduce, Partition and Combiner classes,
3. schedule as many Maps and Reducers as needed, and
4. produce the end result for the user?
That is what Hive does. It does all the 4 points written above and much more. Welcome to
the world of Hive! Hive is a tool operating at a higher level than Hadoop. It takes away the
difficult task of writing MapReduce programs. It develops those programs based on the
instructions given to it in the form of HQL (Hive Query Language), which is similar to SQL.
Thus it brings the power of Hadoop within the reach of people who are not programmers but
know what logic needs to be applied to the data to analyze and process it. HQL, like SQL, is
much easier to pick up than Java or any other programming language. If one already knows
SQL then the learning curve is much gentler.
Please note that since Hive is only a layer above Hadoop it inherits the limitations of Hadoop:
1. Hive does not support row-level updates, inserts and deletes.
Hive architecture
The following diagram shows the architecture of Hive.
[Diagram: Hive architecture - the command line interface, web interface and Thriftserver (JDBC/ODBC) sit on top; inside Hive are the Driver and the Metastore; underneath is Hadoop (MapReduce and HDFS).]
Hive sits on top of Hadoop, thus taking away all the complications of writing Map and
Reduce programs to process data. There are three ways to access Hive:
CLI: This is a typical command line interface where a user can write a few queries to load,
read and process data.
HWI: Hive Web Interface is an interface on the web serving the same objective as CLI
Thriftserver: It exposes Hive functionality to other applications that access Hive via JDBC or
ODBC drivers.
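As a small, illustrative sketch of the Thriftserver/JDBC route (host, port, credentials, table and column names below are made-up examples; HiveServer2 conventionally listens on port 10000):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver"); // the Hive JDBC driver
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://hive-server-host:10000/default", "user", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT sale_year, SUM(amount) FROM transactions GROUP BY sale_year")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getDouble(2));
            }
        }
    }
}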
Metastore: While all the data accessed by Hive is saved on HDFS, the data about databases
and tables is stored in Metastore. We will learn about what kind of data is stored in the
Metastore in the subsequent sections.
Creating tables and types of tables
Hive is a layer sitting above Hadoop. It extends the functionality of Hadoop by letting the
user provide inputs in the form of Hive Query Language (HQL) rather than low-level
programs. In this and the subsequent sections we will take a few examples of HQL to
understand what they do for the user and what they do in Hadoop at the back end. While we
do this, we will avoid trying to understand every option or variation
possible with an HQL command. The intent is to explain the core functionality of the
product without really getting into a lot of code.
Our logical flow would be to first understand the HQLs for defining databases and table.
Then we will move on to understanding HQLs for loading data into these tables. Finally, we
will understand the HQLs for processing the data. All along, we will also understand what
these HQLs eventually do on Hadoop.
Let’s assume that in addition to the transaction file (with all the details of mutual fund
transactions) we also have another file which provides the mapping between the mutual fund
product ID (the financial product on which the transaction was executed) and the mutual fund
name (the name of the product).
In order to load and process the data available in these two files we will first create the
database and the tables to store the data in these files.
Create database if not exists transproc
The above command will create a directory in HDFS with the name transproc (the ‘if not exists’
clause tells Hive not to create the database if it already exists). A database is
just an umbrella folder to contain and organize all the tables. An entry will also be made in
the Metastore about this new database.
Once your database is created you can create tables within the database with
commands very similar to the one we used for creating the database:
Create table if not exists transproc.transactions (transid STRING, transamount
FLOAT, …)
Create table if not exists transproc.prodinfo (prodid string, prodname string)
The above commands will create two subdirectories within the transproc directory and also
register the two new tables in the Metastore.
Internal and External tables
There are two types of tables in Hive.
If Hive keeps the ownership of data stored in a table then the table is called internal or
managed table. In case of an internal/managed table, when a table is dropped the data from
HDFS as well as the reference to the table in the Metastore is deleted.
If Hive does not keep the ownership of the data stored in a table then the table is called an
external table. In case of an external table, when a table is dropped only the reference to the
table from the Metastore is deleted but the data from the HDFS is not deleted. So the table
stops existing for Hive but the data is still retained in Hadoop.
External tables are used to exchange data between multiple applications. E.g. in our case of
mutual fund transaction processing it may be the case that the product data (product id to
product name mapping) is not owned by the department which has the responsibility of
processing the transactions (a typical scenario). In such a case, the product department would
make the product information available in a flat file in some HDFS location. The transaction
processing application would define an external table on top of this data. When the
transaction processing is done it could delete the external table. But that would not delete
the product data in the flat file. That data might be referenced by other applications as well.
If we do not mention in our HQL command if the table is internal or external, Hive would
assume it to be internal.
The command to create an external table is
Create external table if not exists transproc.prodinfo (prodid string, prodname string) row
format delimited fields terminated by ‘,’ location ‘location of the external file’
The ‘row format delimited fields terminated by’ clause informs Hive that the fields in the
external file are separated by ‘,’, and the ‘location’ clause provides the location of the external file.
Internal partitioned tables
Let’s look back at the query that creates the transaction table,
Create table if not exists transproc.transactions (transid STRING, transamount
FLOAT, ProductID STRING, SubproductID STRING, …)
Assume that SubproductID indicates a variety of a Product. So a Product can have different
varieties and each can be indicated by the sub product ID.
Now let’s assume that we know the access pattern for this table. By access pattern I mean we
know that when the data is accessed it will be mostly accessed for a specific Product ID
and/or Sub Product ID. Let’s say we also know that the data would generally not be accessed
for many or all the products ID at the same time.
Now the above HQL command for creating a table would create one single directory for the
table. All the data for the table would be in one directory. Every time the table is accessed
Hive (and HDFS in the back-end) would have to find the data for that particular product
and/or subproduct id to fetch it. The directory structure created in HDFS by the above
command would be
../transproc/transactions
Instead, if we know that typically data would be accessed using product ID and/or
subproduct ID we can segregate the data within the directory into separate subdirectories
for product ID and subproduct ID. This is called partitioning.
The command to partition the table is:
Create table if not exists transproc.transactions (transid STRING, transamount FLOAT, …)
Partitioned by (ProductID STRING, SubproductID STRING)
The above command will create a directory like
../transproc/transactions
As and when data is added to this table, separate subdirectories will be created within the
transactions directory in HDFS for each ProductID and SubproductID combination.
Load data local inpath ‘path from where data needs to be picked up’
Into table transactions
Partition (ProductID=’A’, SubproductID=’1’)
The above command will create a subdirectory like
../transproc/transactions/ProductID=’A’/SubproductID=’1’
Anytime data is loaded into this table the command to load the data would have to specify
partition information and the data will be loaded into the directory structure for that
partition.
Please also note that the table schema no longer contains the columns which are part of the
partition. There is no need to save ProductID and SubproductID in the table itself, as this
information can be derived from the path of the partition.
If data has to be read for a specific Product ID and Subproduct ID combination the HQL
command would be
Select * from transproc.transactions where ProductID=’A’ and SubproductID=’1’
This command will make Hive read only the specific subdirectory we created earlier.
Partitioning improves the performance of Hive as it only has to read a specific subdirectory to
fetch the data.
If the command above is modified like the one given below, Hive will read all the
subdirectories within the subdirectory ../transproc/transactions/ProductID=’A’
Select * from transproc.transactions where ProductID=’A’
If the typical access pattern is not to access the data for specific Product ID and Subproduct
ID combination then it is not a good idea to create partitions. If you create partitions by
Product ID and Subproduct ID but end up writing queries that read data across multiple
Product IDs and Subproduct IDs, Hive will have to scan multiple subdirectories and this will
impact the performance of Hive.
External partitioned tables
Just like the internal tables, the external tables can be partitioned. Since the data is not
managed by Hive, it assumes that the data at the external location is segregated as per the
partition keys.
Create external table if not exists transproc.prodinfo (subprodid string, subprodname
string) partitioned by (prodid string) row format delimited fields terminated by ‘,’
Please note we do not declare the location of the data for a partitioned external table as we
would in the case of a non-partitioned external table. That needs to be done separately, using an
alter command for each partition.
Loading Data
You can load data into tables from a file. If we need to load data into our transaction table
the HQL command would be
Load data local inpath “path of the file here”
Overwrite into table transproc.transactions
Partition (ProductID=’A’, SubproductID=’1’)
Please note the table should have been defined with partitions on Product ID and
Subproduct ID. The Overwrite clause will overwrite (as is obvious from the name) the existing
data present in the partition. Hive will create a directory in HDFS for this ProductID and
SubproductID combination if it does not already exist. If the table is not partitioned you can
skip the partition clause.
You can even read data from one table and insert it into another table. E.g. if we assume the
transaction records were present in another table, where they were loaded for initial clean-up
by business, we can write a query like the one below to load the data into our transactions table
From PreProdTransactions
Insert Overwrite table transactions
Partition (ProductID=’A’, SubproductID=’1’)
Select * where PreProdTransactions.ProductID=’A’ and PreProdTransactions.SubproductID=’1’
Insert Overwrite table transactions
Partition (ProductID=’A’, SubproductID=’2’)
Select * where PreProdTransactions.ProductID=’A’ and PreProdTransactions.SubproductID=’2’
Insert Overwrite table transactions
Partition (ProductID=’B’, SubproductID=’1’)
Select * where PreProdTransactions.ProductID=’B’ and PreProdTransactions.SubproductID=’1’
Insert Overwrite table transactions
Partition (ProductID=’B’, SubproductID=’2’)
Select * where PreProdTransactions.ProductID=’B’ and PreProdTransactions.SubproductID=’2’
The above query will scan the PreProdTransactions table once and then create the partitions
of the transactions table based on the Partition clauses.
A more concise way of writing the above query is
From PreProdTransactions
Insert Overwrite table transactions
Partition (ProductID, SubproductID)
Select …, PreProdTransactions.ProductID, PreProdTransactions.SubproductID
In this case Hive itself will analyze the data present in the PreProdTransactions table and
create as many partitions in the transactions table as there are unique combinations of
ProductID and SubproductID in the PreProdTransactions table.
Reading Data from Hive