Akash Decap456 Introduction to Big Data
B.TECH [CSE]
SEMESTER-4
Title: INTRODUCTION TO BIG DATA
Unit 1: Introduction to Big Data
Objectives
After studying this unit, you will be able to:
• understand what Big Data is
• understand applications of Big Data
• learn the tools used in Big Data
• know the challenges in Big Data
Introduction
The quantity of data created by humans is quickly increasing every year as a result of the introduction of new technologies, gadgets, and communication channels such as social networking sites. Big data is a group of enormous datasets that cannot be handled with typical computing methods. It is no longer a single technique or tool; rather, it has evolved into a comprehensive subject involving a variety of tools, techniques, and frameworks. Data itself refers to quantities, letters, or symbols on which a computer performs operations and which can be stored and communicated as electrical signals and recorded on magnetic, optical, or mechanical media.
Big data is a term that defines the massive amount of organized and unstructured data that a
company encounters on a daily basis.
Note
• It may be studied for insights that lead to improved business choices and strategic
movements.
• It is a collection of organized, semi-structured, and unstructured data that may be mined
for information and utilized in machine learning, predictive modelling, and other
advanced analytics initiatives.
Volume
The term 'Big Data' refers to a massive amount of information. The term "volume" refers to a large
amount of data. The magnitude of data plays a critical role in determining its worth. When the
amount of data is extremely vast, it is referred to as 'Big Data.'
This means that the volume of data determines whether or not a set of data may be classified as Big
Data. As a result, while dealing with Big Data, it is vital to consider a certain 'Volume.'
Example:
In 2016, worldwide mobile traffic was estimated at 6.2 exabytes (6.2 billion GB) per month. Furthermore, by 2020 the world was expected to hold about 40,000 exabytes of data.
Velocity
The term "velocity" refers to the speed at which data is generated and collected. In Big Data, velocity means data arrives at a high rate from machines, networks, social media, mobile phones, and other sources, creating a large and continuous influx. This influences the data's potential, that is, how quickly data is created and must be processed in order to satisfy needs. Data sampling can assist in dealing with velocity-related issues, as in the reservoir-sampling sketch that follows. For instance, Google receives more than 3.5 billion queries every day, and the number of Facebook users grows at a rate of around 22% every year.
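One common sampling technique for high-velocity streams is reservoir sampling, which keeps a fixed-size uniform random sample of an unbounded stream without ever storing the whole stream. The sketch below is a minimal, illustrative Python version; the simulated event stream and sample size are invented for the example and are not from the original text.

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir with the first k items
        else:
            j = random.randint(0, i)    # pick a slot in [0, i], inclusive
            if j < k:
                reservoir[j] = item     # keep this item with probability k / (i + 1)
    return reservoir

# Example: sample 5 "events" from a simulated high-velocity stream of 1,000,000 records.
events = (f"event-{n}" for n in range(1_000_000))
print(reservoir_sample(events, 5))
```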
Variety
Structured data is simply data that has been organized. It usually refers to data whose length and format have been defined.
Semi-structured data is data that is only partially organized. It does not follow the traditional structure of a database; log files are a typical example of this sort of data.
Unstructured data is data that has not been organized. It usually refers to data that does not fit cleanly into the standard row-and-column structure of a relational database. Texts, pictures, and videos are examples of unstructured data, which cannot be stored in the form of rows and columns. A small illustration of the three categories follows.
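To make the three categories concrete, the short sketch below shows one toy record of each kind in Python. The field names and values are invented purely for illustration.

```python
import json

# Structured: fixed length and format, fits a relational table (rows and columns).
structured_row = ("101", "Asha", "2024-01-15", 4999.00)   # e.g. one row of an orders table

# Semi-structured: self-describing but not bound to a rigid schema, e.g. a JSON log entry.
semi_structured = json.loads(
    '{"timestamp": "2024-01-15T10:32:00", "level": "ERROR", "msg": "disk full", "extra": {"host": "node-7"}}'
)

# Unstructured: free text (or images, video, audio) with no row/column layout.
unstructured = "Customer wrote: the delivery was late but the support team resolved it quickly."

print(structured_row[1], semi_structured["level"], len(unstructured.split()))
```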
3. Improved customer service: traditional customer feedback systems are being replaced by new systems designed with Big Data technologies.
4. Improved customer service: in these new systems, Big Data and natural language processing technologies are used to read and evaluate consumer responses.
5. Early identification of risk to the product/services, if any
6. Better operational efficiency
Big Data technologies can be used for creating a staging area or landing zone for new data before
identifying what data should be moved to the data warehouse. In addition, such integration of Big
Data technologies and data warehouse helps an organization to offload infrequently accessed data.
Using Big Data Analytics to Solve Advertisers' Problems and Offer Marketing Insights
All company activities are shaped by big data analytics. It allows businesses to meet client
expectations. Big data analytics aids in the modification of a company's product range. It
guarantees that marketing initiatives are effective.
There are three major types of business applications:
1. Monitoring and tracking applications
2. Analysis and insight applications
3. New product development
• Asset Tracking
The US Department of Defense is encouraging industry to devise a tiny RFID chip that could prevent the counterfeiting of electronic parts that end up in avionics or circuit boards for other devices. Airplanes are among the heaviest users of sensors, which track every aspect of the performance of every part of the plane. The data can be displayed on the dashboard as well as stored for later detailed analysis. Working with communicating devices, these sensors can produce a torrent of data. Theft by shoppers and employees is a major source of lost revenue for retailers. All valuable items in a store can be assigned RFID tags, and the gates of the store can be equipped with RF readers. This can help secure the products and reduce leakage (theft) from the store.
• Supply chain monitoring
All containers on ships communicate their status and location using RFID tags. Thus retailers and their suppliers can gain real-time visibility into inventory throughout the global supply chain. Retailers know exactly where items are in the warehouse and can bring them into the store at the right time. This is particularly relevant for seasonal items that must be sold on time, or else be sold at a discount. With item-level RFID tags, retailers also gain full visibility of each item and can serve their customers better.
Political campaigns
The Obama presidential campaign built a mechanism to obtain small campaign contributions from millions of supporters. It created personal profiles of millions of supporters, recording what they had done and could do for the campaign. Data was used to determine undecided voters who could be converted to the campaign's side, and the phone numbers of these undecided voters were provided to volunteers. The results of the calls were recorded in real time using interactive web applications. Obama himself used his Twitter account to communicate his message directly to his millions of followers. After the elections, Obama converted his list of tens of millions of supporters into an advocacy machine that would provide grassroots support for the president's initiatives. Since then, almost all campaigns have used big data.
Senator Bernie Sanders used the same big data playbook to build an effective national political machine powered entirely by small donors. Election analyst Nate Silver created sophisticated predictive models using inputs from many political polls and surveys, beating the pundits in successfully predicting the winner of the US elections. Nate was, however, unsuccessful in predicting Donald Trump's rise and ultimate victory, and that shows the limits of big data.
Personal health
Medical knowledge and technology are growing by leaps and bounds. IBM's Watson system is a big data analytics engine that ingests and digests all the medical information in the world and then applies it intelligently to an individual situation. Watson can provide a detailed and accurate medical diagnosis using current symptoms, patient history, medical history, environmental trends, and other parameters. Similar products might be offered as an app to licensed doctors, and even individuals, to improve productivity and accuracy in health care.
Recommendation service
E-commerce has been a fast-growing industry over the last couple of decades. A variety of products are sold and shared over the internet. Web users' browsing and purchase history on e-commerce sites is used to learn about their preferences and needs, and to advertise relevant product and pricing offers in real time. Amazon uses a personalized recommendation engine to suggest additional products to consumers based on the affinities of various products. Netflix also uses a recommendation engine to suggest entertainment options to its users. Big data is valuable across all industries.
These are the three major types of data sources of big data: people-to-people communications, people-to-machine communications, and machine-to-machine communications. Each type has many sources of data. There are also three types of applications: the monitoring type, the analysis type, and new product development. They have an impact on the efficiency, effectiveness, and even the disruption of industries.
Apache Hadoop
The Apache Hadoop software library is a big data framework. It enables massive data sets to be processed across clusters of computers in a distributed manner. It is one of the most powerful big data technologies, with the ability to scale from a single server to thousands of machines.
Features
• When utilising an HTTP proxy server, authentication is improved.
• Specification for the Hadoop Compatible Filesystem effort. Extended attributes for POSIX-style filesystems are supported.
• It offers a robust ecosystem of big data technologies and tools that is well suited to meet the analytical needs of developers.
• It brings flexibility in data processing and allows for faster data processing.
HPCC
HPCC is a big data tool developed by LexisNexis Risk Solutions. It delivers a single platform, a single architecture, and a single programming language for data processing.
Features
• It is one of the most efficient big data tools, accomplishing big data tasks with far less code.
• It is one of the big data processing tools that offers high redundancy and availability.
• It can be used for complex data processing on a Thor cluster. Its graphical IDE simplifies development, testing, and debugging. It automatically optimizes code for parallel processing.
• It provides enhanced scalability and performance. ECL code compiles into optimized C++ and can also be extended using C++ libraries.
Apache Storm
Storm is a free, open-source big data computation system. It is one of the best big data tools, offering a distributed, real-time, fault-tolerant processing system with real-time computation capabilities.
Features
• It is one of the best tools on the big data tools list, benchmarked at processing one million 100-byte messages per second per node.
• It uses parallel calculations that run across a cluster of machines.
• It will automatically restart in case a node dies; the worker will be restarted on another node. Storm guarantees that each unit of data will be processed at least once or exactly once.
• Once deployed, Storm is surely one of the easiest tools for big data analysis.
Qubole
Qubole Data is an autonomous big data management platform. It is an open-source big data tool that is self-managed and self-optimizing, allowing the data team to focus on business outcomes.
Features
• Single platform for every use case
• It is open-source big data software with engines optimized for the cloud.
• Comprehensive security, governance, and compliance
• Provides actionable alerts, insights, and recommendations to optimize reliability, performance, and costs.
• Automatically enacts policies to avoid performing repetitive manual actions
Apache Cassandra
The Apache Cassandra database is widely used today to provide an effective management of large
amounts of data.
Features
• Support for replication across multiple data centers, providing lower latency for users
• Data is automatically replicated to multiple nodes for fault tolerance
• It is one of the best big data tools and is most suitable for applications that cannot afford to lose data, even when an entire data center is down
• Support contracts and services for Cassandra are available from third parties
Statwing
Statwing is an easy-to-use statistical tool. It was built by and for big data analysts. Its modern
interface chooses statistical tests automatically.
Features
• It is big data software that can explore any data in seconds. Statwing helps to clean data, explore relationships, and create charts in minutes.
• It allows creating histograms, scatterplots, heatmaps, and bar charts that export to Excel or PowerPoint. It also translates results into plain English, so that analysts unfamiliar with statistical analysis can understand the results.
CouchDB
CouchDB stores data in JSON documents that can be accessed via the web or queried using JavaScript. It offers distributed scaling with fault-tolerant storage. It allows data access as defined by the Couch Replication Protocol.
Features
• CouchDB is a single-node database that works like any other database
• It is one of the big data processing tools that allows running a single logical database
server on any number of servers.
• It makes use of the ubiquitous HTTP protocol and JSON data format. It offers easy replication of a database across multiple server instances and an easy interface for document insertion, update, retrieval, and deletion.
• The JSON-based document format is translatable across different languages.
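Because CouchDB exposes its documents over plain HTTP and JSON, it can be exercised from any language with an HTTP client. The hedged sketch below assumes a local CouchDB instance listening on the default port 5984 with admin credentials "admin"/"password" (both assumptions, not from the text); it creates a database, inserts a JSON document, and reads it back.

```python
import requests  # third-party HTTP client: pip install requests

BASE = "http://localhost:5984"   # assumed local CouchDB instance
AUTH = ("admin", "password")     # assumed credentials

# Create a database (a 412 response simply means it already exists).
requests.put(f"{BASE}/demo", auth=AUTH)

# Insert a JSON document with an explicit id.
doc = {"type": "sensor_reading", "sensor": "temp-01", "value": 21.7}
requests.put(f"{BASE}/demo/reading-1", json=doc, auth=AUTH)

# Read the document back over the same HTTP interface.
resp = requests.get(f"{BASE}/demo/reading-1", auth=AUTH)
print(resp.json())   # includes the _id and _rev fields added by CouchDB
```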
Pentaho
Pentaho provides big data tools to extract, prepare and blend data. It offers visualizations and
analytics that change the way to run any business. This Big data tool allows turning big data into
big insights.
Features:
• Data access and integration for effective data visualization. It is big data software that empowers users to architect big data at the source and stream it for accurate analytics. It can seamlessly switch or combine data processing with in-cluster execution to get maximum processing power, and allows checking data with easy access to analytics, including charts, visualizations, and reporting.
• Supports a wide spectrum of big data sources by offering unique capabilities
Apache Flink
Apache Flink is one of the best open-source data analytics tools for stream processing of big data. It provides distributed, high-performing, always-available, and accurate data streaming applications.
Features:
• Provides results that are accurate, even for out-of-order or late-arriving data
• It is stateful and fault-tolerant and can recover from failures.
• It is a big data analytics software which can perform at a large scale, running on
thousands of nodes
• Has good throughput and latency characteristics
• This big data tool supports stream processing and windowing with event-time semantics. It supports flexible windowing based on time, count, or sessions, as well as data-driven windows.
• It supports a wide range of connectors to third-party systems for data sources and sinks
Cloudera
Cloudera is a fast, easy, and highly secure modern big data platform. It allows anyone to get any data across any environment within a single, scalable platform.
Features:
• High-performance big data analytics software
• It offers provision for multi-cloud
• Deploy and manage Cloudera Enterprise across AWS, Microsoft Azure, and Google Cloud Platform. Spin up and terminate clusters, and only pay for what is needed, when it is needed.
• Developing and training data models
• Reporting, exploring, and self-servicing business intelligence
• Delivering real-time insights for monitoring and detection
• Conducting accurate model scoring and serving
OpenRefine
OpenRefine is a powerful big data tool. It is big data analytics software that helps to work with messy data, cleaning it and transforming it from one format into another. It also allows extending the data with web services and external data.
Features:
• The OpenRefine tool helps you explore large data sets with ease. It can be used to link and extend your dataset with various web services, and to import data in various formats.
• Explore datasets in a matter of seconds
• Apply basic and advanced cell transformations
• Allows dealing with cells that contain multiple values
• Create instantaneous links between datasets. Use named-entity extraction on text fields to
automatically identify topics. Perform advanced data operations with the help of Refine
Expression Language
RapidMiner
RapidMiner is one of the best open-source data analytics tools. It is used for data preparation, machine learning, and model deployment. It offers a suite of products to build new data mining processes and set up predictive analysis.
Features
• Allow multiple data management methods
• GUI or batch processing
• Integrates with in-house databases
• Interactive, shareable dashboards
• Big Data predictive analytics
• Remote analysis processing
• Data filtering, merging, joining and aggregating
• Build, train and validate predictive models
• Store streaming data to numerous databases
• Reports and triggered notifications
DataCleaner
DataCleaner is a data quality analysis application and a solution platform. It has a strong data profiling engine. It is extensible and thereby adds data cleansing, transformation, matching, and merging.
Features:
• Interactive and explorative data profiling
• Fuzzy duplicate record detection.
• Data transformation and standardization
• Data validation and reporting
• Use of reference data to cleanse data
• Master the data ingestion pipeline in a Hadoop data lake. Ensure that rules about the data are correct before users spend their time on processing. Find the outliers and other devilish details to either exclude or fix the incorrect data.
Kaggle
Kaggle is the world's largest big data community. It helps organizations and researchers to post their data and statistics. It is the best place to analyze data seamlessly.
Features:
• The best place to discover and seamlessly analyze open data
• A search box to find open datasets
• Contribute to the open data movement and connect with other data enthusiasts
Apache Hive
Hive is an open-source big data software tool. It allows programmers to analyze large data sets on Hadoop. It helps with querying and managing large datasets very quickly.
Features:
• It supports a SQL-like query language for interaction and data modeling
Solution
In order to handle these large data sets, companies are opting for modern techniques, such as
compression, tiering, and deduplication. Compression is used for reducing the number of bits
in the data, thus reducing its overall size. Deduplication is the process of removing duplicate
and unwanted data from a data set. Data tiering allows companies to store data in different
storage tiers. It ensures that the data is residing in the most appropriate storage space. Data
tiers can be public cloud, private cloud, and flash storage, depending on the data size and
importance. Companies are also opting for Big Data tools, such as Hadoop, NoSQL and other
technologies. This leads us to the third Big Data problem.
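As a tiny illustration of the deduplication idea mentioned above, the sketch below drops exact duplicate records by hashing each record's canonical form. Real deduplication systems work at block or file level and at far larger scale, so treat this purely as a conceptual example with invented records.

```python
import hashlib
import json

def deduplicate(records):
    """Remove exact duplicate records, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for rec in records:
        # Canonicalize the record so field order does not affect the fingerprint.
        fingerprint = hashlib.sha256(json.dumps(rec, sort_keys=True).encode()).hexdigest()
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(rec)
    return unique

data = [
    {"id": 1, "city": "Pune"},
    {"city": "Pune", "id": 1},   # same content, different field order -> duplicate
    {"id": 2, "city": "Delhi"},
]
print(deduplicate(data))         # two unique records remain
```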
Solution
The best way to go about it is to seek professional help. You can either hire experienced professionals who know much more about these tools, or go for Big Data consulting. Here, consultants will recommend the best tools based on your company's scenario. Based on their advice, you can work out a strategy and then select the best tool for you.
engineers who are experienced in working with the tools and making sense out of huge data sets. Companies face a shortage of Big Data professionals. This is because data-handling tools have evolved rapidly, but in most cases the professionals have not. Actionable steps need to be taken in order to bridge this gap.
Solution
Companies are investing more money in the recruitment of skilled professionals. They also have to offer training programs to the existing staff to get the most out of them. Another important step taken by organizations is the purchase of data analytics solutions that are powered by artificial intelligence and machine learning. These tools can be run by professionals who are not data science experts but have basic knowledge. This step helps companies save a lot of money on recruitment.
Securing data
• Securing these huge sets of data is one of the daunting challenges of Big Data. Often
companies are so busy in understanding, storing and analyzing their data sets that they
push data security for later stages. But, this is not a smart move as unprotected data
repositories can become breeding grounds for malicious hackers.
• Companies can lose up to $3.7 million for a stolen record or a data breach.
• Solution
• Companies are recruiting more cybersecurity professionals to protect their data. Other
steps taken for securing data include:
• Data encryption
• Data segregation
• Identity and access control
• Implementation of endpoint security
• Real-time security monitoring
Solution
Companies have to solve their data integration problems by purchasing the right tools. Some
of the best data integration tools are mentioned below:
• IBM InfoSphere
• Xplenty
• Informatica PowerCenter
• CloverDX
• Microsoft SQL
• QlikView
• Oracle Data Service Integrator
In order to put Big Data to the best use, companies have to start doing things differently. This means hiring better staff, changing the management, and reviewing existing business policies and the technologies being used. To enhance decision-making, they can hire a Chief Data Officer – a step that has been taken by many Fortune 500 companies.
Summary
• Big data refers to massive, difficult-to-manage data quantities – both organised and unstructured – that inundate
enterprises on a daily basis. Big data may be evaluated for insights that help people make better judgments and feel
more confident about making key business decisions.
• These are the most basic Big Data applications. They assist in enhancing company efficiency in almost every industry.
• These are the big data apps of the future. They have the potential to alter businesses and boost corporate
effectiveness. Big data may be organised and analysed to uncover patterns and insights that can be used to boost
corporate performance.
• These are brand-new concepts that didn't exist before. These applications have the potential to disrupt whole
industries and generate new income streams for businesses.
• Apache Hadoop is a set of open-source software tools for solving issues involving large volumes of data and
processing utilising a network of many computers. It uses the MapReduce programming concept to create a
software framework for distributed storage and processing of massive data.
• Apache Cassandra is a distributed, wide-column store, NoSQL database management system that is designed to
handle massive volumes of data across many commodity servers while maintaining high availability and avoiding
single points of failure.
• Cloudera, Inc. is a Santa Clara, California-based start-up that offers a subscription-based enterprise data cloud.
Cloudera's platform, which is based on open-source technology, leverages analytics and machine learning to extract
insights from data through a secure connection.
• RapidMiner is a data science software platform built by the same-named firm that offers a unified environment for
data preparation, machine learning, deep learning, text mining, and predictive analytics.
• Kaggle, a Google LLC subsidiary, is an online community of data scientists and machine learning experts.
• LexisNexis Risk Solutions created HPCC, often known as DAS, an open source data-intensive computing system
platform. The HPCC platform is based on a software architecture that runs on commodity computing clusters and
provides high-performance, data-parallel processing for big data applications.
Keywords
Big Data: Big data refers to massive, difficult-to-manage data quantities – both organised and unstructured – that
inundate enterprises on a daily basis. But it's not simply the type or quantity of data that matters; it's also what businesses
do with it. Big data may be evaluated for insights that help people make better judgments and feel more confident about
making key business decisions.
Volume: Transactions, smart (IoT) devices, industrial equipment, videos, photos, audio, social media, and other sources
are all used to collect data. Previously, keeping all of that data would have been too expensive; now, cheaper storage
options such as data lakes, Hadoop, and the cloud have alleviated the strain.
Velocity: Data floods into organisations at an unprecedented rate as the Internet of Things grows, and it must be handled quickly. The need to cope with these floods of data in near-real time is being driven by RFID tags, sensors, and smart meters.
Variety: From organised, quantitative data in traditional databases to unstructured text documents, emails, movies,
audios, stock ticker data, and financial transactions, data comes in a variety of formats.
Variability: Data flows are unpredictable, changing often and altering substantially, in addition to rising velocities and
variety of data. It's difficult, but companies must recognise when something is hot on social media and how to manage
high data loads on a daily, seasonal, and event-triggered basis.
Veracity: Veracity refers to the quality of data. It is tough to link, match, cleanse, and convert data across systems since the data originates from so many diverse places. Relationships, hierarchies, and numerous data links must all be connected and correlated by businesses. If they are not, their data will rapidly get out of hand.
Self Assessment
Q1: What are the fundamental elements of BIG DATA?
A. HDFS
B. YARN
C. MapReduce
D. All of these
Q2: What distinguishes BIG DATA Analytics from other types of analytics?
A. Open-Source
B. Scalability
C. Data Recovery
D. All of these
A. Volume
B. Veracity
C. Both a and b
D. Vivid
A. Hadoop is an excellent platform for extracting and analyzing tiny amounts of data.
B. Hadoop uses HDFS to store data and enables data compression and decompression.
C. To solve graph and machine learning problems, the giraph framework is less useful than a MapReduce framework.
D. None of the mentioned
A. Bare metal
B. Cross-Platform
C. Unix-Like
D. None of the mentioned
Q6: The Hadoop list includes the HBase database, the Apache Mahout ___________ System, and matrix operations.
A. Pattern recognition
B. HPCC
C. Machine Learning
D. SPSS
Q7: The element of MapReduce is in charge of processing one or more data chunks and providing output results.
A. MapTask
B. Mapper
C. Task execution
D. All of the mentioned
Q8: Although the Hadoop framework is implemented in Java, MapReduce applications need not be written in
____________
A. Java
B. C
C. C#
D. None of the mentioned
Q9: Input key/value pairs are mapped to a collection of intermediate key/value pairs using _____.
A. Mapper
B. Reducer
C. Both
D. None of the mentioned
Q10: The number of maps is usually driven by the total size of ____________
A. inputs
B. outputs
C. tasks
D. None of the mentioned
Q11: The _____________ software library is a big data framework. It allows distributed processing of large data sets
across clusters of computers.
A. Apple Programming
B. R Programming
C. Apache Hadoop
D. All of above
Q12: Which big data tool was developed by LexisNexis Risk Solution?
A. SPCC System
B. HPCC System
C. TOCC System
D. None of above
Q13: Which big data tool offers a distributed, real-time, fault-tolerant processing system with real-time computation capabilities?
A. Storm
B. HPCC
C. Qubole
D. Cassandra
Q15: ______ stores data in JSON documents that can be accessed via the web or queried using JavaScript.
A. CouchDB
B. Storm
C. Hive
D. None of above
6. C 7. C 8. A 9. A 10. A
Review Questions
1. Explain five effective characteristics of BIG DATA.
2. Write down applications of BIG DATA.
Further Readings
• Maheshwari, Anil. Big Data. McGraw-Hill Education, 2019.
• Mayer-Schonberger, Viktor; Cukier, Kenneth (2013). Big Data: A Revolution That Will Transform How We Live,
Work, and Think. Houghton Mifflin Harcourt.
• McKinsey Global Institute Report (2011). Big Data: The Next Frontier For Innovation,
Competition, and Productivity. Mckinsey.com
• Marz, Nathan, and James Warren (2015). Big Data: Principles and Best Practices of Scalable Realtime Data
Systems. Manning Publications.
• Sandy Ryza, Uri Laserson, et al. (2014). Advanced Analytics with Spark. O'Reilly.
• White, Tom (2014). Mastering Hadoop. O'Reilly.
Web Links
1. Apache Hadoop resources: https://hadoop.apache.org/docs/r2.7.2/
2. Apache HDFS: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
3. Hadoop API site: http://hadoop.apache.org/docs/current/api/
4. NoSQL databases: http://nosql-database.org/
5. Apache Spark: http://spark.apache.org/docs/latest/
6. Tutorials on Big Data technologies: https://www.tutorialspoint.com/
Objectives
• differentiate between file system (FS) and distributed file system (DFS)
• understand scalable computing over the internet.
• understand programming models for Big Data.
Introduction
The first storage mechanism used by computers to store data was punch cards. Each group of related punch cards (punch cards related to the same program) was stored in a file, and files were stored in file cabinets. This is very similar to what we do nowadays to archive papers in government institutions that still use paperwork on a daily basis. This is where the term "file system" (FS) comes from. Computer systems have evolved, but the concept remains the same.
A file extension indicates the type of information stored in a file; for example, the EXE extension refers to executable files and TXT refers to text files. The file management system is used by the operating system to access the files and folders stored on a computer or on any external storage device.
• Access transparency: Both local and remote files should be accessible in the same manner. The file system should automatically locate the accessed file and send it to the client's side.
• Naming transparency: There should not be any hint of the file's location in its name. Once a name is given to a file, it should not be changed when the file is transferred from one node to another.
• Replication transparency: If a file is copied onto multiple nodes, both the copies of the file and their locations should be hidden from one node to another.
User mobility: The system automatically brings the user's home directory to the node where the user logs in.
Performance: Performance is based on the average amount of time needed to satisfy client requests. This time covers the CPU time + the time taken to access secondary storage + the network access time. It is advisable that the performance of a distributed file system be comparable to that of a centralized file system.
Simplicity and ease of use: The user interface of a file system should be simple and the number of commands should be small.
High availability: A distributed file system should be able to continue operating in the case of partial failures such as a link failure, a node failure, or a storage drive crash. A highly reliable and adaptable distributed file system should have multiple independent file servers controlling multiple independent storage devices.
Data replication is a good way to achieve fault tolerance and high concurrency, but it is very hard to maintain under frequent changes: if someone changes a data block on one cluster, those changes need to be propagated to all replicas of that block.
• High concurrency: make the same piece of data available to be processed by multiple clients at the same time. This is done by using the computation power of each node to process data blocks in parallel.
On the HTC side, peer-to-peer (P2P) networks are formed for distributed file sharing and content delivery applications. A P2P system is built over many client machines, and peer machines are globally distributed in nature. P2P, cloud computing, and web service platforms are more focused on HTC applications than on HPC applications. Clustering and P2P technologies lead to the development of computational grids or data grids.
High-Performance Computing
• For many years, HPC systems have emphasized raw speed performance. The speed of HPC systems increased from Gflops in the early 1990s to Pflops by 2010. This improvement was driven mainly by demands from the scientific, engineering, and manufacturing communities. For example, the Top 500 most powerful computer systems in the world are ranked by floating-point speed in Linpack benchmark results. However, the number of supercomputer users is limited to less than 10% of all computer users. Today, the majority of computer users use desktop computers or large servers when they conduct Internet searches and market-driven computing tasks.
High Throughput Computing
• The development of market-oriented high-end computing systems is undergoing a strategic change from an HPC paradigm to an HTC paradigm. The HTC paradigm pays more attention to high-flux computing, whose main application is Internet searches and web services used by millions or more users simultaneously. The performance goal thus shifts to high throughput, i.e., the number of tasks completed per unit of time. HTC technology needs not only to improve batch processing speed but also to address the acute problems of cost, energy savings, security, and reliability at many data and enterprise computing centers.
Three New Computing Paradigms
• Advances in virtualization make it possible to see the growth of Internet clouds as a new computing paradigm. The maturity of radio-frequency identification (RFID), the Global Positioning System (GPS), and sensor technologies has triggered the development of the Internet of Things (IoT).
What is a MapReduce?
• Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce task takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce task is always performed after the map job.
The Algorithm
• Generally, the MapReduce paradigm is based on sending the computation to where the data resides. A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.
• Map stage − The map or mapper’s job is to process the input data. Generally, the input data is in the form of file
or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line
by line. The mapper processes the data and creates several small chunks of data.
• Reduce stage − This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer’s job is to
process the data that comes from the mapper. After processing, it produces a new set of output, which will be
stored in the HDFS.
• During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster.
Figure 11 MapReduce
• The framework manages all the details of data-passing such as issuing tasks, verifying task completion, and
copying data around the cluster between the nodes.
• Most of the computing takes place on nodes with data on local disks that reduces the network traffic. After
completion of the given tasks, the cluster collects and reduces the data to form an appropriate result, and sends it
back to the Hadoop server.
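The classic way to see these stages together is a word count. The sketch below simulates the map, shuffle, and reduce stages in plain Python on a single machine; a real Hadoop job would express the same two functions as Mapper and Reducer classes and let the framework handle the shuffle and the data movement across the cluster.

```python
from collections import defaultdict

def map_phase(line):
    """Map stage: turn one input line into (word, 1) key/value pairs."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(mapped_pairs):
    """Shuffle stage: group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce stage: combine the grouped values for one key into a single output."""
    return key, sum(values)

lines = ["Big data is big", "data drives decisions"]
mapped = [pair for line in lines for pair in map_phase(line)]
shuffled = shuffle_phase(mapped)
results = dict(reduce_phase(k, v) for k, v in shuffled.items())
print(results)   # {'big': 2, 'data': 2, 'is': 1, 'drives': 1, 'decisions': 1}
```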
Advantages of MapReduce
• It is easy to scale data processing over multiple computing nodes.
• Under the MapReduce model, the data processing primitives are called mappers and reducers.
In a directed graph, or digraph, each edge is associated with a direction from a start vertex to an end vertex. If we traverse along the direction of the edges and find that no closed loops are formed along any path, we say that there are no directed cycles; the graph formed is a directed acyclic graph (DAG). A DAG can always be topologically ordered, i.e., for each edge in the graph, the start vertex of the edge occurs earlier in the sequence than the ending vertex of the edge. Topological sorting for a DAG is a linear ordering of vertices such that for every directed edge u → v, vertex u comes before v in the ordering. Topological sorting for a graph is not possible if the graph is not a DAG. For example, a topological sorting of the example graph is "5 4 2 3 1 0". There can be more than one topological sorting for a graph; another topological sorting of the same graph is "4 5 2 3 1 0". The first vertex in a topological sorting is always a vertex with in-degree 0 (a vertex with no incoming edges). A small sketch of topological sorting follows.
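A compact way to compute such an ordering is Kahn's algorithm, which repeatedly removes vertices of in-degree 0. The sketch below uses the edge list commonly associated with the "5 4 2 3 1 0" example above (5→2, 5→0, 4→0, 4→1, 2→3, 3→1); since the original figure is not reproduced here, that edge list is an assumption.

```python
from collections import deque

def topological_sort(vertices, edges):
    """Kahn's algorithm: return vertices in topological order, or [] if a cycle exists."""
    adj = {v: [] for v in vertices}
    indegree = {v: 0 for v in vertices}
    for u, v in edges:
        adj[u].append(v)
        indegree[v] += 1

    queue = deque(v for v in vertices if indegree[v] == 0)   # start from in-degree-0 vertices
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    return order if len(order) == len(vertices) else []      # [] signals the graph is not a DAG

edges = [(5, 2), (5, 0), (4, 0), (4, 1), (2, 3), (3, 1)]
print(topological_sort([0, 1, 2, 3, 4, 5], edges))   # [4, 5, 2, 0, 3, 1] — one valid ordering; orderings are not unique
```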
Application Areas
Some of the main application areas of DAG are −
• Routing in computer networks
• Job scheduling
• Data processing
• Genealogy
• Citation graphs
2.5 Five Reasons You Need a Step-by-Step Approach to Workflow Orchestration for Big
Data
Is your organization struggling to keep up with the demands of Big Data and under pressure to prove quick results? If so,
you’re not alone. According to analysts, up to 60% of Big Data projects are failing because they can’t scale at the enterprise
level. Fortunately, taking a step-by-step approach to application workflow orchestration can help you succeed. It begins
with assessing the various technologies for supporting multiple Big Data projects that relate to these four steps:
• Ingesting data
Improve reliability
It's important to run Big Data workflows successfully in order to minimize service interruptions. Using a patchwork of tools and processes makes it hard to identify issues and understand root causes, putting SLAs at risk. If you can manage your entire Big Data workflow from A to Z, then if something goes wrong in the process you'll see it immediately and know where it happened and what happened. Using the same solution to orchestrate your entire set of processes and manage them from a single pane of glass simplifies managing your services and assuring that they run successfully.
Looking ahead
Taking a step-by-step approach to application workflow orchestration simplifies the complexity of your Big Data
workflows. It avoids automation silos and helps assure you meet SLAs and deliver insights to business users on
time. Discover how Control-M provides all of the capabilities to enable your organization to follow this approach
and how it easily integrates with your existing technologies to support Big Data projects.
Summary
• A file system is a programme that controls how and where data is saved, retrieved, and managed on a storage disc,
usually a hard disc drive (HDD). It's a logical disc component that maintains a disk's internal activities as they
relate to a computer while remaining invisible to the user.
• A distributed file system (DFS) or network file system is a type of file system that allows many hosts to share files
over a computer network. Multiple users on multiple machines can share data and storage resources as a result of
this.
• Local and remote file access techniques should be indistinguishable from one another.
• Users who have access to similar communication services at multiple locations are said to be mobile. For example,
a user can use a smartphone and access his email account from any computer to check or compose emails. The
travel of a communication device with or without a user is referred to as device portability.
• Big data refers to massive, difficult-to-manage data quantities – both organised and unstructured – that inundate
enterprises on a daily basis. Big data may be evaluated for insights that help people make better judgments and
feel more confident about making key business decisions.
• These are the most basic Big Data applications. They assist in enhancing company efficiency in almost every industry.
• These are the big data apps of the future. They have the potential to alter businesses and boost corporate
effectiveness. Big data may be organised and analysed to uncover patterns and insights that can be used to boost
corporate performance.
• The process of replicating a double-stranded DNA molecule into two identical DNA molecules is known as DNA
replication. Because every time a cell splits, the two new daughter cells must have the same genetic information, or
DNA, as the parent cell, replication is required.
• The capacity of a system to increase or decrease in performance and cost in response to changes in application and
system processing demands is known as scalability. When considering hardware and software, businesses that are
rapidly expanding should pay special attention to scalability.
• In a Hadoop cluster, MapReduce is a programming paradigm that permits tremendous scalability across hundreds
or thousands of computers. MapReduce, as the processing component, lies at the heart of Apache Hadoop. The
reduction job is always carried out after the map job, as the term MapReduce implies.
Keywords
MapReduce: MapReduce is a framework that allows us to create applications that reliably process enormous volumes of
data in parallel on vast clusters of commodity hardware.
Map Stage: The map's or mapper's job is to process the data that is given to it. In most cases, the input data is stored in
the Hadoop file system as a file or directory (HDFS). Line by line, the input file is supplied to the mapper function. The
mapper divides the data into little bits and processes it.
Reduce Stage: This stage is the result of combining the Shuffle and Reduce stages. The Reducer's job is to take the data
from the mapper and process it. It generates a new set of outputs after processing, which will be stored in the HDFS.
Data Node: Data is supplied in advance, before any processing takes place at this node.
Directed Cyclic Graph: A directed cycle graph is a cycle graph with all edges pointing in the same direction.
Message Passing:Message passing is a way for invoking activity (i.e., running a programme) on a computer in computer
science. The calling programme delivers a message to a process (which could be an actor or an object), and that process
and its supporting infrastructure choose and run relevant code.
Bulk Synchronous Parallel: Bulk Synchronous Parallel (BSP) is a parallel computing programming model and processing
framework. The computation is broken down into a series of supersteps. A group of processes running the same code
executes concurrently in each superstep and generates messages that are delivered to other processes.
Replication: The process of replicating a double-stranded DNA molecule into two identical DNA molecules is known as
DNA replication. Because every time a cell splits, the two new daughter cells must have the same genetic information, or
DNA, as the parent cell, replication is required.
Review Questions
Q1: The EXE extension stands for _________________
A. executable files
B. extension files
C. extended files
D. None of above
Q3: Data replication is a good way to achieve ________ and high concurrency; but it’s very hard to maintain frequent
changes.
A. fault tolerance
B. detection tolerance
C. both
D. none of above
A. RFID
B. Sensor technologies
C. GPS
D. All of the above
A. Filename
B. File identifier
C. File extension
D. None of the mentioned.
Q6: What computer technology is used to describe services and applications that run on a dispersed network using
virtualized resources?
A. Distributed Computing
B. Cloud Computing
C. Soft Computing
D. Parallel Computing
Q7: Which one of the following options can be considered as the Cloud?
A. Hadoop
B. Intranet
C. Web Applications
D. All of the mentioned
Q9: The name of three stages in which MapReduce program executes are:
Q10: A directed acyclic graph (DAG) refers to a directed graph which has no ______ cycles.
A. Infinite
B. Directed
C. Direction
D. Noe of above
A. Map
B. Reduce
C. Both
D. None of above
A. Genealogy
B. Citation graphs
C. Job Scheduling
D. All of above
Q14: Map takes a set of data and converts it into another set of data, where individual elements are broken down into
________
A. Tables.
B. Tuples (key/value pairs).
C. Reduce stage.
D. None of above.
6. B 7. A 8. D 9. A 10. B
Review Questions
1. Differentiate between file system and distributed file system.
2. Write down features of Distributed file system.
3. Write down popular models of BIG DATA.
4. Write down challenges of BIG DATA.
5. Write note on the following.
a. The Age of Internet Computing
b. High Throughput Computing
6. What are the advantages and disadvantages of distributed file system?
Further Readings
• Maheshwari, Anil. Big Data. McGraw-Hill Education, 2019.
• Mayer-Schonberger, Viktor; Cukier, Kenneth (2013). Big Data: A Revolution That Will Transform How We
Live, Work, and Think. Houghton Mifflin Harcourt.
• McKinsey Global Institute Report (2011). Big Data: The Next Frontier for Innovation,
Competition, and Productivity. Mckinsey.com
• Marz, Nathan, and James Warren (2015). Big Data: Principles and Best Practices of Scalable Realtime Data Systems. Manning Publications.
• Sandy Ryza, Uri Laserson, et al. (2014). Advanced Analytics with Spark. O'Reilly.
• White, Tom (2014). Mastering Hadoop. O'Reilly.
Web Links
1. Apache Hadoop resources: https://hadoop.apache.org/docs/r2.7.2/
2. Apache HDFS: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
3. Hadoop API site: http://hadoop.apache.org/docs/current/api/
4. NoSQL databases: http://nosql-database.org/
5. Apache Spark: http://spark.apache.org/docs/latest/
6. Tutorials on Big Data technologies: https://www.tutorialspoint.com/
Unit 03: Data Models
Objectives
• Understand what a data mart is
• Understand data formats
• Understand data models
• Differentiate between a data warehouse and a data mart
• Understand what a data stream is
A computer programmer typically uses a wide variety of tools to store and work with data in the programs they build. They may use simple variables (single values), arrays (multiple values), hashes (key-value pairs), or even custom objects built in the syntax of the language they are using.
• Another program may have to communicate with this program in a similar way, and the programs may not even be written in the same language, as is often the case with traditional client-server communications, as shown in Figure 1.
• This is all perfectly standard within the confines of the software being written. However, sometimes a more abstract, portable format is required. For instance, a non-programmer may need to move data in and out of these programs.
• For example, many third-party user interfaces (UIs) are used to interface with public cloud providers. This is made possible (in a simplified fashion) thanks to standard data formats. The moral of the story is that we need a standard format that allows a diverse set of software to communicate with each other, and that humans can interface with; one such format is sketched below.
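JSON is one of the most widely used of these standard formats. The sketch below shows how two programs (possibly written in different languages) could exchange the same record by serializing it to JSON text and parsing it back; the record fields are invented for illustration.

```python
import json

# Program A: build a native in-memory object and serialize it to a portable JSON string.
order = {"order_id": 42, "customer": "Priya", "items": ["keyboard", "mouse"], "total": 1899.50}
wire_format = json.dumps(order)   # plain text, readable by humans and by other languages
print(wire_format)

# Program B (could be JavaScript, Java, Go, ...): parse the text back into its own structures.
received = json.loads(wire_format)
print(received["customer"], len(received["items"]))
```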
Figure 3 Data streaming
streaming data they receive, they can get real-time insights to understand exactly what is happening at any given point in time. This enables better decision-making and allows them to provide customers with better and more personalized services. Nearly every company uses, or can use, streaming data.
Predictive Maintenance
When companies can identify maintenance issues prior to breakdowns or system failure, they will save time, money, and
other potentially catastrophic effects on the business. Any company that has equipment of any kind that has sensors or
cameras—again, that’s most equipment these days— will create streaming data. From monitoring the performance of
trucks and airplanes to predicting issues with complex manufacturing equipment, real-time data and analytics are becoming critical to modern enterprises today.
Healthcare
Just like in a manufacturing environment, wearables, and healthcare equipment such as glucometers, connected scales,
heart rate and blood pressure monitors have sensors that monitor a patient’s vitals and essential body functions. This
equipment is also crucial for effective remote
patient monitoring that supports clinicians who don't have the bandwidth to be everywhere all the time. It's literally a matter of life or death. Immediate insights can improve patient outcomes and experiences.
Retail
Real-time data streaming from IoT sensors and video are driving a modern retail renaissance.
Brick-and-mortar retail stores can engage customers in the moment thanks to streaming data. Location-based marketing,
trend insights, and improvements to operational efficiency, such as product movement or product freshness, are all
possible with real-time insights. Understanding what a consumer wants when they want it “in the moment” is not only
valuable in retail. Any company that is able to understand and respond immediately to what its customer wants in
micromoments will have a better chance of being successful, whether it's to deliver something a consumer wants to learn,
discover, watch or buy.
Social media
With cries of “fake news” and instances of social media bullying continuing to rise, the need for real-time monitoring of
posts to quickly take action on offensive and “fake news” is more important than ever. Under mounting pressure, social
media platforms are creating tools to be able to process the huge volume of data created quickly and efficiently to be able
to take action as immediately as possible, especially to prevent bullying.
Finance
On the trading floor, it's easy to see how understanding and acting on information in real time is vital, but streaming data also helps the financial functions of any company by processing transactional information, identifying fraudulent actions, and more. For example, MasterCard is using data and analytics to help financial organizations quickly and easily identify fraudulent merchants to reduce risk. Similarly, by gaining the ability to process real-time data, Rabobank is able to detect warning signals at extremely early stages when clients may go into default.
KPIs
Leaders can make decisions based on real-time KPIs such as financial, customer, or operational performance data. Previously, this analysis was reactive and looked back at past performance. Today, real-time data can be compared with historical information to give leaders a perspective on the business that informs real-time decisions. As you can see, streaming data is increasingly important to most companies in most industries. Successful companies are integrating streaming analytics to move their data analytics from a reactive to a more proactive, real-time approach. The best ones will be thinking about integrating their real-time data with predictive models and scenario analysis to gain strategic foresight. However, in order to harness fast and streaming data, organizations today need an end-to-end data management and analytics platform that can collect, process, manage, and analyze data in real time to drive insights and enable machine learning to implement some of the most compelling use cases. Most importantly, they need to be able to do this with the robust security, governance, data protection, and management capabilities that enterprises require.
3.10 Data Lake
A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You
can store your data as-is, without having to first structure the data, and run different types of analytics—from dashboards
and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions as shown in
Figure 7.
Data movement
Data Lakes allow you to import any amount of data that can come in real-time. Data is collected from multiple sources,
and moved into the data lake in its original format. This process allows you to scale to data of any size, while saving time
of defining data structures, schema, and transformations.
Machine Learning
Data Lakes will allow organizations to generate different types of insights including reporting on historical data, and
doing machine learning where models are built to forecast likely outcomes, and suggest a range of prescribed actions to
achieve the optimal result.
The distributed, wireless, and battery-powered nature of sensor networks forces data management to take sensor failure, network latency, and loss into account. On the other hand, there will be a lot of redundant (or, in statistical terms, highly correlated) data to counter these negative features. A couple of remarks sketch the situation.
• Sensors come and sensors go. They can fail because their battery runs out, and start up again when it is replaced.
They can be disconnected, moved and connected at a different place. They can be replaced altogether by a newer
model. They can have wireless connections which do not work all the time.
• Sensors do not produce clean data. Averages have to be taken, noise filters have to be applied, and environmental influences (e.g. echoes) have to be accounted for.
• The same sensor may be used for different purposes. Different algorithms are applied on the raw data depending on
what you want to know, e.g. using a microphone for speaker identification, speaker positioning or estimation of the
environmental noise level.
• The data rate and latency may differ greatly between sensors/algorithms, and over time: In some cases, it may be
parameterizable (i.e. a sensor or algorithm can be configured to produce output at several rates). In some cases, the
term “data rate” might not even apply at all (e.g. RFID readers which produce a reading (or a burst of readings)
whenever a tag is detected).
• They might only produce data “on demand” because of the cost associated with it. This cost may be power, but it
may also be money if the sensor belongs to another party (think of weather or traffic sensors).
• Applications come and go. They can be turned on and off at will; they are duplicated for each new user; they are
upgraded. They are disconnected at one place and connected at another, and might be interested in what
happened in the meantime.
• They might want to know what kind of sensors are around, and adapt their information demands to this.
• They might be totally decoupled from sensors, and just want to know e.g. which person is at a certain desk.
• They might have (static or dynamic) requirements about the rate at which data is delivered to them. This rate
may vary greatly from application to application.
• They might demand a ‘memory’ from the environment to discover details of specific events in the past.
• They might be interested in trends or summaries rather than in specifics.
Weather data
Many satellites provide real-time weather data streaming in order to capture critical signals for the weather. This
information is used to forecast the weather.
Summary
• A data mart is a structure / access pattern used to get client-facing data in data warehouse setups. A data mart is a
subset of a data warehouse that is often focused on a single business line or team.
• A data mart is a subset of a data warehouse that is focused on a certain business line, department, or topic area. Data
marts make specialised data available to a designated group of users, allowing them to rapidly obtain key insights
without having to sift through a whole data warehouse.
• A dependent data mart enables data from several organisations to be sourced from a single Data Warehouse. It is an
example of a data mart that provides the benefit of centralization. You must set up one or more physical data marts as dependent data marts if you need to create them.
• Without the usage of a central data warehouse, an independent data mart is formed. This type of Data Mart is best
suited for smaller groups inside a company.
• A data lake is a system or repository that stores data in its original/raw form, which is often object blobs or files.
• Data that is continually created by several sources is referred to as streaming data. Without having access to all of the
data, such data should be handled sequentially utilising stream processing techniques.
Keywords
Predictive maintenance: Predictive maintenance is the application of data-driven, proactive maintenance approaches to
examine equipment status and anticipate when repair should be conducted.
Dependent data marts: A dependent data mart is established from an enterprise data warehouse. It is a top-down technique that starts with keeping all company data in one central location and then extracting a clearly defined portion of the data for analysis as needed.
Independent data marts: An independent data mart is a stand-alone system that concentrates on a single topic area or
business activity without the usage of a data warehouse. Data is retrieved from internal or external data sources (or both),
processed, and then deposited into a data mart repository, where it is kept until it is required for business analytics.
Hybrid data marts: Data from an existing data warehouse and other operational source systems is combined in a hybrid
data mart. It combines the speed and end-user emphasis of a top-down strategy with the benefits of the bottom-up
method's enterprise-level integration.
Maintenance: In industrial, commercial, and residential settings, maintenance includes functional inspections, servicing, repairing, or replacing essential devices, equipment, machinery, building structures, and supporting utilities.
Data Lake:A data lake is a storage repository that stores a large amount of raw data in its original format until it is
required for analytics applications. A data lake differs from a standard data warehouse in that it stores data in flat
architecture, mostly in files or object storage, rather than hierarchical dimensions and tables.
Data warehouse: A data warehouse is a huge collection of corporate data used to aid decision-making inside a company. The data warehouse idea has been around since the 1980s, when it was created to aid in the transfer of data from being used to power operations to being used to feed decision support systems that disclose business insight.
Self Assessment
Q1: Which of the following is a type of data mart?
A. Dependent datamart
B. Independent datamart
C. Hybrid datamart
D. All of the above
Q5: __________ is built by drawing data from central data warehouse that already exists.
A. Dependent datamart
B. Independent datamart
C. Hybrid datamart
D. All of the above
Q6: ___________ is built by drawing from operational or external sources of data or both.
A. Dependent datamart
B. Independent datamart
C. Hybrid datamart
D. All of the above
Q7: A ________ data mart combines input from sources other than the central data warehouse.
A. Dependent datamart
B. Independent datamart
C. Hybrid datamart
D. All of the above
Q8: Big data streaming is a process in which big data is quickly processed in order to extract _________ insights from it.
A. real-time
B. streaming data
C. both a and b
D. None of above
Q9: Dynamic data that is generated continuously from a variety of sources is considered _____________
A. real-time
B. streaming data
C. both a and b
D. None of above
Q10: ____________ is using data and analytics to help financial organizations quickly and easily identify fraudulent merchants to reduce risk.
A. Debit Card
B. Credit Card
C. MasterCard
D. None of the above
Q12: The increasing availability of cheap, small, low-power sensor hardware has led to the prediction that ___________ will arise in the near future.
A. Small-environment
B. Supply side
C. both a and b
D. None of above
Q13: The ________ of a smart environment consists of a myriad of sensors that produce data at possibly very high rates in real time.
A. streaming data
B. supply side
C. both a and b
D. None of above
A. streaming data
B. supply side
C. PocketLab
D. None of the above
Q15: ____________________ can fail because their battery runs out, and start up again when it is replaced
Review Questions
1. Differentiate between a data mart and a data warehouse.
2. Write down the tips for creating effective Big Data models.
3. Explain different types of data mart.
4. Write down the advantages and disadvantages of a data mart.
5. What do you understand by data streaming? Explain use cases for real-time and streaming data.
Further Readings
Maheshwari, Anil. Big Data. McGraw-Hill Education, 2019.
Mayer-Schonberger, Viktor; Cukier, Kenneth (2013). Big Data: A Revolution That Will Transform How We Live, Work,
and Think . Houghton Mifflin Harcourt.
McKinsey Global Institute Report (2011). Big Data: The Next Frontier For Innovation, Competition, and Productivity.
Mckinsey.com
Marz, Nathan, and James Warren (2015). Big Data: Principles and Best Practices of Scalable Realtime Data Systems. Manning Publications.
Ryza, Sandy; Laserson, Uri; et al. (2014). Advanced Analytics with Spark. O'Reilly.
White, Tom (2014). Mastering Hadoop. O'Reilly.
Web Links
1. Apache Hadoop resources: https://hadoop.apache.org/docs/r2.7.2/
2. Apache HDFS: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
3. Hadoop API site: http://hadoop.apache.org/docs/current/api/
4. NoSQL databases: http://nosql-database.org/
5. Apache Spark: http://spark.apache.org/docs/latest/
6. Tutorials on Big Data technologies: https://www.tutorialspoint.com/
Unit 04: NOSQL Data Management
Objectives
• identify key differences between NOSQL and relational databases
• appreciate the architecture and types of NOSQL databases
• describe the major types of NOSQL databases and their features
• learn distributed data models
• learn about the Hadoop partitioner
Introduction
• A NOSQL database is a clever way of cost-effectively organizing large amounts of heterogeneous data for efficient
access and updates. An ideal NOSQL database is completely aligned with the nature of the problems being
solved, and is superfast in accomplishing that task. This is achieved by relaxing many of the integrity and
redundancy constraints of storing data in relational databases. Data is thus stored in many innovative formats
closely aligned with business need. The diverse NOSQL databases will ultimately collectively evolve into a
holistic set of efficient and elegant knowledge stored at the heart of a cosmic computer.
NOSQL databases are next-generation databases that are non-relational in their design. The name NOSQL is meant to
differentiate it from antiquated, 'pre-relational' databases. Today, almost every organization that must gather customer
feedback and sentiments to improve their business, uses a NOSQL database. NOSQL is useful when an enterprise needs
to access, analyze, and utilize massive amounts of either structured or unstructured data that’s stored remotely in virtual
servers across the globe.
• The constraints of a relational database are relaxed in many ways. For example, relational databases require that
any data element could be randomly accessed and its value could be updated in that same physical location.
However, the simple physics of storage says that it is simpler and faster to read or write sequential blocks of
data on a disk. Therefore, NOSQL database files are written once and almost never updated in place. If a new
version of a part of the data becomes available, it would be appended to the respective files. The system would
have the intelligence to link the appended data to the original
file. These NOSQL databases differ from each other in many ways. First, NOSQL databases do not support relational schema or the SQL language. The term NOSQL stands mostly for "Not only SQL".
• Second, their transaction processing capabilities are fast but weak, and they do not support the ACID
(Atomicity, Consistency, Isolation, Durability) properties associated with transaction processing using relational
databases. Instead, they support BASE properties (Basically Available, Soft State, and Eventually Consistent).
NOSQL databases are thus approximately accurate at any point in time, and will be eventually consistent.
Third, these databases are also distributed and horizontally scalable to manage web-scale databases using
Hadoop clusters of storage. Thus, they work well with the write-once and readmany storage mechanism of
Hadoop clusters. Table 6.1 lists comparative features of RDBMS and NOSQL.
Columnar Databases:
These are database structures that include only the relevant columns of the dataset, along with the key-identifying
information. These are useful in speeding up some oft-sought queries from very large data sets. Suppose there is an
extremely large data warehouse of web log access data, which is rolled up by the number of web accesses per hour. This needs to be queried or summarized often, involving only some of the data fields from the database. Thus the query could be sped up by organizing the database in a columnar format. This is useful for content management systems, blogging platforms, maintaining counters, expiring usage, and heavy write volumes such as log aggregation. Column-family databases work well for systems when the query patterns have stabilized.
HBase and Cassandra are the two of the more popular Columnar database offerings. HBase was developed at Yahoo, and
comes as part of the Hadoop ecosystem. Cassandra was originally developed at Facebook to serve its exponentially
growing user base, which is now close to 2 billion people. It was open sourced in 2008.
Key-Value Databases: Amazon's Dynamo is a highly available key-value storage system that has properties of both databases and distributed hash tables. Amazon DynamoDB is a fully managed
NOSQL database service that provides fast and predictable performance with seamless scalability.
DynamoDB automatically spreads the data and traffic for your tables over enough servers to handle your throughput and
storage requirements, while maintaining consistent and fast performance.
Document Database
These databases store an entire document of any size, as a single value for a key element. Suppose one is storing a 10GB
video movie file as a single object. An index could store the identifying information about the movie, and the address of
the starting block. The system could handle the rest of the storage details. This storage format would be called a document store format. Document databases are generally useful for content management systems, blogging platforms, web analytics, real-time analytics, and e-commerce applications. Document databases would not be useful for systems that need
complex transactions spanning multiple operations or queries against varying aggregate structures.
MongoDB is an open-source document database that provides high performance, high availability, and automatic scaling.
A record in MongoDB is a document, which is a data structure composed of field and value pairs. The values of fields
may include other documents, arrays, and arrays of documents.
Aggregates make it easier for the database to manage data storage over clusters, since the unit of data now could reside
on any machine and when retrieved from the database gets all the related data along with it. Aggregate-oriented
databases work best when most data interaction is done with the same aggregate; for example, when there is a need to get an order and all its details, it is better to store the order as an aggregate object, but dealing with these aggregates to get item details on all the orders is not elegant.
We can use this scenario to model the data using a relational data store as well as NOSQL data stores and talk about their pros and cons. For the relational model, we start with the data model shown in this figure. As we are good relational soldiers, everything is properly normalized, so that no data is repeated in multiple tables. We also have referential integrity.
A realistic order system would naturally be more involved than this, but this is the benefit of the rarefied air of a book. Let's see how this model looks when we think in more aggregate-oriented terms:
//Customer
{
  "id": 1,
  "name": "Fabio",
  "billingAddress": [
    { "city": "Bari" }
  ]
}
//Orders
{
  "id": 99,
  "customerId": 1,
  "orderItems": [
    {
      "productId": 27,
      "price": 34,
      "productName": "NoSQL Distilled"
    }
  ],
  "shippingAddress": [
    { "city": "Bari" }
  ],
  "orderPayment": [
    {
      "ccinfo": "100-432423-545-134",
      "txnId": "afdfsdfsd",
      "billingAddress": [ { "city": "Chicago" } ]
    }
  ]
}
The important thing to notice here isn't the particular way we have drawn the aggregate boundary so much as the fact that you have to think about accessing that data, and make that part of your thinking when developing the application data model. Indeed, we could draw aggregate boundaries differently, putting all the orders for a customer into the customer aggregate.
Figure 14: Embed all the objects for customer and the customer’s order
Like most things in modelling, there's no universal answer for how to draw your aggregate boundaries. It depends entirely on how you tend to manipulate your data. If you tend to access a customer together with all of that customer's orders at once, then you would prefer a single aggregate. However, if you tend to focus on accessing a single order at a time, then you should prefer having separate aggregates for each order. Naturally, this is very context-specific; some applications will prefer one or the other, even within a single system, which is exactly why many people prefer aggregate ignorance.
Running on a Cluster
Running on a cluster gives several advantages in computation power and data distribution. However, it requires minimizing the number of nodes that must be queried when gathering data. By explicitly including aggregates, we give the database important information about which data should be stored together.
NOSQL databases are capable of storing and processing big data which is characterized by various properties such as
volume, variety and velocity. Such databases are used in a variety of user applications that need large volume of data
which is highly available and efficiently accessible. But they do not enforce or require strong data consistency nor do they
support transactions. For example, social media such as Twitter and Facebook [5] generate terabytes of daily data which
is beyond the processing capabilities of relational databases. Such applications need high performance but may not need
strong consistency. Different vendors design and implement NOSQL databases differently. Indeed, there are different
types of NOSQL databases such as document databases, key-value databases, column stores and graph databases. But
their common objective is to use data replication in order to ensure high efficiency, availability and scalability of data.
Features of NOSQL
• Non-relational
• Never follow the relational model
• Never provide tables with flat fixed-column records
• Work with self-contained aggregates or BLOBs
• Doesn't require object-relational mapping and data normalization
• No complex features like query languages, query planners, referential integrity joins,
ACID
• Schema-free
NOSQL databases are either schema-free or have relaxed schemas. They do not require any definition of the schema of the data and allow heterogeneous structures of data in the same domain.
• Simple API
They offer easy-to-use interfaces for storing and querying data. APIs allow low-level data manipulation and selection methods. Text-based protocols are mostly used, typically HTTP REST with JSON. There is mostly no standards-based NOSQL query language. These are web-enabled databases running as internet-facing services.
• Distributed
Column-Based
Column-oriented databases work on columns and are based on the BigTable paper by Google. Every column is treated separately. Values of single-column databases are stored contiguously. They deliver high performance on aggregation queries like SUM, COUNT, AVG, MIN, etc., as the data is readily available in a column. Column-based NOSQL databases are widely used to manage data warehouses, business intelligence, CRM, and library card catalogs. HBase, Cassandra, and Hypertable are examples of column-based databases. A column of a distributed data store is a NOSQL object of the lowest level in a keyspace. It is a tuple (a key-value pair) consisting of three elements:
Unique name: Used to reference the column
Value: The content of the column. It can have different types, like AsciiType, LongType,
TimeUUIDType, UTF8Type among others.
Timestamp: The system timestamp used to determine the valid content.
{
street: {name: "street", value: "1234 x street", timestamp: 123456789},
city: {name: "city", value: "sanfrancisco", timestamp: 123456789},
zip: {name: "zip", value: "94107", timestamp: 123456789},
}
Document-Oriented
Whereas a relational table organizes data into rows and columns, a document database stores records in a structure similar to JSON. For a relational database, you have to know in advance what columns you have and so on. For a document database, however, you store data as a JSON-like object and do not need to define its structure up front, which makes it flexible. The document type is mostly used for CMS systems, blogging platforms, real-time analytics and e-commerce applications. It should not be used for complex transactions which require multiple operations or queries against varying aggregate structures. Amazon SimpleDB, CouchDB, MongoDB, Riak, and Lotus Notes are popular document-oriented DBMS systems.
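To make this concrete, the following is a minimal sketch (not the course's own example) of storing and retrieving an order aggregate with the MongoDB Java driver (mongodb-driver-sync). The connection string, database name and collection name are hypothetical placeholders.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

import java.util.List;

public class OrderStoreExample {
    public static void main(String[] args) {
        // Hypothetical local MongoDB instance and database/collection names.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("shop");
            MongoCollection<Document> orders = db.getCollection("orders");

            // The whole order aggregate is stored as one flexible document;
            // no table schema has to be declared up front.
            Document order = new Document("id", 99)
                    .append("customerId", 1)
                    .append("orderItems", List.of(
                            new Document("productId", 27)
                                    .append("price", 34)
                                    .append("productName", "NoSQL Distilled")));
            orders.insertOne(order);

            // Retrieve the aggregate back by a field value.
            Document found = orders.find(new Document("customerId", 1)).first();
            System.out.println(found.toJson());
        }
    }
}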
Graph-Based
A graph-type database stores entities as well as the relations amongst those entities. The entity is stored as a node, with the relationship as edges. An edge gives a relationship between nodes. Every node and edge has a unique identifier. Compared to a relational database where tables are loosely connected, a graph database is multi-relational in nature. Traversing relationships is fast, as they are already captured in the DB and there is no need to calculate them. Graph-based databases are mostly used for social networks, logistics, and spatial data.
Neo4J, Infinite Graph, OrientDB, and FlockDB are some popular graph-based databases. A graph
database is a database that uses graph structures for semantic queries with nodes, edges, and
properties to represent and store data. A graph database is any storage system that provides index-
free adjacency. This means that every element contains a direct pointer to its adjacent elements and
no index lookups are necessary. General graph databases that can store any graph are distinct from
specialized graph databases such as triplestores and network databases.
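As a small illustration, the sketch below creates two Person nodes and a FRIEND_OF edge and then traverses it, assuming the Neo4j Java driver (org.neo4j.driver); the server address, credentials, labels and property names are hypothetical.

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Result;
import org.neo4j.driver.Session;
import org.neo4j.driver.Values;

public class FriendGraphExample {
    public static void main(String[] args) {
        // Hypothetical local Neo4j instance and credentials.
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {

            // Entities become nodes; the relationship becomes an edge.
            session.run("MERGE (a:Person {name: $a}) "
                      + "MERGE (b:Person {name: $b}) "
                      + "MERGE (a)-[:FRIEND_OF]->(b)",
                      Values.parameters("a", "Alice", "b", "Bob"));

            // Traversal follows stored edges directly (index-free adjacency).
            Result result = session.run(
                    "MATCH (:Person {name: $a})-[:FRIEND_OF]->(f) RETURN f.name AS friend",
                    Values.parameters("a", "Alice"));
            while (result.hasNext()) {
                System.out.println(result.next().get("friend").asString());
            }
        }
    }
}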
Depending on your distribution model, you can get a data store that will give you the ability to
handle larger quantities of data, the ability to process a greater read or write traffic, or more
availability in the face of network slowdowns or breakages. These are often important benefits, but
they come at a cost. Running over a cluster introduces complexity—so it’s not something to do
unless the benefits are compelling.
Broadly, there are two paths to data distribution: replication and sharding.
There are two styles of distributing data: sharding and replication. Replication takes the same data and copies it over multiple nodes. Sharding puts different data on different nodes.
Replication and sharding are orthogonal techniques: you can use either or both of them.
Replication comes in two forms: master-slave and peer-to-peer. We will now discuss these
techniques starting at the simplest and working up to the more complex: first single-server, then
master-slave replication, then sharding, and finally peer-to-peer replication.
Single Server
The first and the simplest distribution option is the one we would most often recommend—no
distribution at all. Run the database on a single machine that handles all the reads and writes to the
data store. We prefer this option because it eliminates all the complexities that the other options
introduce; it’s easy for operations people to manage and easy for application developers to reason
about.
Although a lot of NOSQL databases are designed around the idea of running on a cluster, it can make sense to use NOSQL with a single-server distribution model if the data model of the NOSQL store is more suited to the application. Graph databases are the obvious category here; these work best in a single-server configuration.
Sharding
Often, a busy data store is busy because different people are accessing different parts of the dataset.
In these circumstances we can support horizontal scalability by putting different parts of the data
onto different servers—a technique that’s called sharding.
With sharding, you can take advantage of all the compute resources across your cluster for every query. Because the individual shards are smaller than the logical table as a whole, each machine has to scan fewer rows when responding to a query. For example, if we have ten servers, each one only has to handle 10% of the load. Of course, the ideal case is a pretty rare beast. In order to get close to it, we have to ensure that data that's accessed together is clumped together on the same node and that these clumps are arranged on the nodes to provide the best data access.
How to Clump the Data up so that One User Mostly gets her Data from a Single Server?
• The first part of this question is how to clump the data up so that one user mostly gets her
data from a single server. This is where aggregate orientation comes in really handy. The
whole point of aggregates is that we design them to combine data that’s commonly
accessed together—so aggregates leap out as an obvious unit of distribution. When it
comes to arranging the data on the nodes, there are several factors that can help improve
performance. If you know that most accesses of certain aggregates are based on a physical
location, you can place the data close to where it’s being accessed. If you have orders for
someone who lives in Boston, you can place that data in your eastern US data centre.
Another factor is trying to keep the load even. This means that you should try to arrange
aggregates so they are evenly distributed across the nodes which all get equal amounts of
the load. This may vary over time, for example if some data tends to be accessed on certain
days of the week—so there may be domain-specific rules you'd like to use. In some cases,
it’s useful to put aggregates together if you think they may be read in sequence. The
Bigtable paper [Chang etc.] described keeping its rows in lexicographic order and sorting
web addresses based on reversed domain names (e.g., com.martinfowler). This way data
for multiple pages could be accessed together to improve processing efficiency.
• Historically most people have done sharding as part of application logic. You might put
all customers with surnames starting from A to D on one shard and E to G on another.
This complicates the programming model, as application code needs to ensure that
queries are distributed across the various shards.
• Furthermore, rebalancing the sharding means changing the application code and
migrating the data. Many NOSQL databases offer auto-sharding, where the database takes
on the responsibility of allocating data to shards and ensuring that data access goes to the
right shard. This can make it much easier to use sharding in an application.
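The following toy sketch illustrates the application-logic style of sharding described above: aggregates are routed to a shard by hashing their key. It is only a sketch with hypothetical node names; real deployments (or databases with auto-sharding) also handle rebalancing, replication and failures.

import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/**
 * Toy illustration of application-level sharding: each customer aggregate is
 * routed to one shard based on a hash of its key.
 */
public class ShardRouter {
    private final List<String> shards;

    public ShardRouter(List<String> shards) {
        this.shards = shards;
    }

    /** Pick the shard that owns a given aggregate key. */
    public String shardFor(String aggregateKey) {
        int bucket = (aggregateKey.hashCode() & Integer.MAX_VALUE) % shards.size();
        return shards.get(bucket);
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(List.of("node-a", "node-b", "node-c"));
        Map<String, String> placement = new TreeMap<>();
        for (String customer : List.of("customer:1", "customer:2", "customer:42")) {
            placement.put(customer, router.shardFor(customer));
        }
        placement.forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}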
Master-Slave Replication
With master-slave distribution, you replicate data across multiple nodes. One node is designated as
the master, or primary. This master is the authoritative source for the data and is usually
responsible for processing any updates to that data. The other nodes are slaves, or secondaries. A
replication process synchronizes the slaves with the master (see Figure 4.2). Master-slave replication
is most helpful for scaling when you have a read-intensive dataset. You can scale horizontally to
handle more read requests by adding more slave nodes and ensuring that all read requests are
routed to the slaves. You are still, however, limited by the ability of the master to process updates
and its ability to pass those updates on. Consequently, it isn’t such a good scheme for datasets with
heavy write traffic, although offloading the read traffic will help a bit with handling the write load. A second advantage of master-slave replication is read resilience: should the master fail, the slaves can still handle read requests. Again, this is useful if most of your data access is reads. The failure of the master does eliminate the ability to handle writes until either the master is restored or a new master is appointed. However, having slaves as replicas of the master does speed up recovery after a failure of the master, since a slave can be appointed as the new master very quickly.
Peer-to-Peer Replication
Master-slave replication helps with read scalability but doesn’t help with scalability of writes. It
provides resilience against failure of a slave, but not of a master. Essentially, the master is still a
bottleneck and a single point of failure. Peer-to-peer replication (see Figure 4.3) attacks these
problems by not having a master. All the replicas have equal weight, they can all accept writes, and
the loss of any of them doesn’t prevent access to the data store. The prospect here looks mighty
fine. With a peer-to-peer replication cluster, you can ride over node failures without losing access
to data. Furthermore, you can easily add nodes to improve your performance. There’s much to like
here—but there are complications. The biggest complication is, again, consistency. When you can
write to two different places, you run the risk that two people will attempt to update the same
record at the same time—a write-write conflict. Inconsistencies on read lead to problems, but at least they are relatively transient. Inconsistent writes are forever.
Replication and sharding are strategies that can be combined. If we use both master-slave replication and sharding (see
Figure 4.4), this means that we have multiple masters, but each data item only has a single master. Depending on your
configuration, you may choose a node to be a master for some data and slaves for others, or you may dedicate nodes for
master or slave duties.Using peer-to-peer replication and sharding is a common strategy for column-family databases. In
a scenario like this you might have tens or hundreds of nodes in a cluster with data sharded over them. A good starting
point for peer-to-peer replication is to have a replication factor of 3, so each shard is present on three nodes. Should a
node fail, then the shards on that node will be built on the other nodes (see Figure 4.5).
Hadoop Partitioner
• After the map phase, the framework sends the map output to the reduce task. Reduce processes the user-defined reduce function on the map outputs. Before the reduce phase, partitioning of the map output takes place on the basis of the key. Hadoop partitioning specifies that all the values for each key are grouped together and that all the values of a single key go to the same reducer. This allows even distribution of the map output over the reducers. The Partitioner in a MapReduce job redirects the mapper output to the reducers by determining which reducer handles a particular key.
• The Partitioner makes sure that the same key always goes to the same reducer.
Hadoop Default Partitioner
• The HashPartitioner is the default Partitioner. It computes a hash value for the key and assigns the partition based on this result.
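As an illustration, here is a minimal custom Partitioner sketch for the MapReduce Java API, assuming Text keys and IntWritable values; the class name and the A-M split rule are hypothetical choices made for the example. The comment also shows how the default HashPartitioner assigns partitions.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/**
 * Minimal custom Partitioner: keys beginning with A-M go to reducer 0,
 * everything else to reducer 1. The default HashPartitioner instead returns
 * (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks.
 */
public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks <= 1 || key.getLength() == 0) {
            return 0;  // single reducer or empty key: nothing to split
        }
        char first = Character.toUpperCase(key.toString().charAt(0));
        return (first >= 'A' && first <= 'M') ? 0 : 1;
    }
}

// In the job driver (sketch): job.setPartitionerClass(AlphabetPartitioner.class);
//                             job.setNumReduceTasks(2);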
Summary
• Users may manage preset data relationships across various databases using standard relational databases.
Microsoft SQL Server, Oracle Database, MySQL, and IBM DB2 are all examples of typical relational databases.
• Non-tabular databases (sometimes known as "not simply SQL") store data differently from relational tables.
NOSQL databases are classified according to their data model. Document, key-value, wide-column, and graph are
the most common kinds. They have adaptable schemas and can handle big volumes of data and heavy user loads
with ease.
• The software that allows users to update, query, and administer a relational database is known as an RDBMS, or
relational database management system. The primary programming language for accessing databases is Structured
Query Language (SQL).
• ACID (Atomicity, Consistency, Isolation, Durability): when analysing databases and application architectures, database professionals check for the ACID properties.
• The capacity of a system to increase or decrease in performance and cost in response to changes in application and
system processing demands is known as scalability. When considering hardware and software, businesses that are
rapidly expanding should pay special attention to scalability.
• In a Hadoop cluster, MapReduce is a programming paradigm that permits tremendous scalability across hundreds
or thousands of computers. MapReduce, as the processing component, lies at the heart of Apache Hadoop. The
reduction job is always carried out after the map job, as the term MapReduce implies.
• A column-oriented database management system, often known as a columnar database management system, is a
database management system that stores data tables by column rather than row. In the realm of relational DBMS,
the practical application of a column store vs a row store differs little.
• A graph database is a database that represents and stores data using graph structures for semantic searches, such
as nodes, edges, and attributes. The graph is an important notion in the system.
Keywords
Relational Database: A relational database is a collection of data elements that are linked together by pre-defined
connections. These elements are laid down in a tabular format with columns and rows. Tables store data about the things
that will be represented in the database. A field keeps the actual value of an attribute, while each column in a table carries a
specific type of data.
NOSQL Database:Rather than relational tables, NOSQL databases store data as documents. As a result, we categorise them
as "not simply SQL" and divide them into several flexible data models. Pure document databases, key-value stores, wide-
column databases, and graph databases are examples of NOSQL databases. NOSQL databases are designed from the
bottom up to store and handle large volumes of data at scale, and they are increasingly used by modern enterprises.
RDBMS: RDBMS is an acronym for Relational DataBase Management System. It's an application that lets us build, remove, and update relational databases. A relational database is a database system that stores and retrieves data in the form of rows and columns in a tabular format. It is a subset of DBMS that was created in the 1970s by E. F. Codd.
Key/Value Store:A key-value store, sometimes known as a key-value database, is a simple database that employs an
associative array (think of a map or dictionary) as its basic data model, with each key corresponding to one and only one
item in a collection. A key-value pair is the name for this type of connection.
Introduction to Big Data
Columnar Database:A column-oriented database management system, often known as a columnar database management
system, is a database management system that stores data tables by column rather than row. In the realm of relational
DBMS, the practical application of a column store vs a row store differs little.
Graph Database:A graph database is a single-purpose, specialised platform for constructing and managing graphs. Graphs
are made up of nodes, edges, and attributes, which are all utilised to represent and store data in a way that relational
databases can't.
Aggregate Models:An aggregate is a group of data with which we interact as a whole. The boundaries for ACID operations
with the database are formed by these units of data or aggregates. Key-value, Document, and Column-family databases are
all examples of aggregate-oriented databases.
Self Assessment
1. A NOSQL database is defined as which of the following?
A. SQLServer
B. MongoDB
C. Cassandra
D. None of the mentioned
2. NOSQL databases is used mainly for handling large volumes of ________ data.
A. Unstructured
B. Structured
C. Semi-structured
D. All of the mentioned
3. NOSQL is useful when an enterprise needs to ________ massive amounts of either structured or unstructured data.
A. Access
B. Analyze
C. Utilize
D. All of the above
8. In which year did Carlo Strozzi use the term NOSQL for his lightweight, open-source relational database?
A. 1998
B. 2000
C. 2004
D. None of the above
7. C 8. A 9. D 10. A
Review Questions
1. Explain types of NOSQL.
2. Write down features of NOSQL.
3. Write down about data models.
4. Difference between RDBMS vs NOSQL.
5. What is the major purpose of using a NOSQL database?
6. What are the advantages and disadvantages of NOSQL?
Further Readings
• Maheshwari, Anil. Big Data. McGraw-Hill Education, 2019.
• Mayer-Schonberger, Viktor; Cukier, Kenneth (2013). Big Data: A Revolution That Will Transform How We
Live, Work, and Think . Houghton Mifflin Harcourt.
• McKinsey Global Institute Report (2011). Big Data: The Next Frontier For Innovation,
Competition, and Productivity. Mckinsey.com
• Marz, Nathan, and James Warren (2015). Big Data: Principles and Best Practices of Scalable Realtime Data Systems. Manning Publications.
• Ryza, Sandy; Laserson, Uri; et al. (2014). Advanced Analytics with Spark. O'Reilly.
• White, Tom (2014). Mastering Hadoop. O'Reilly.
Web Links
1. Apache Hadoop resources: https://hadoop.apache.org/docs/r2.7.2/
2. Apache HDFS: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
3. Hadoop API site: http://hadoop.apache.org/docs/current/api/
4. NOSQL databases: http://nosql-database.org/
5. Apache Spark: http://spark.apache.org/docs/latest/
6. Tutorials on Big Data technologies: https://www.tutorialspoint.com/
Unit 05: Introduction to Hadoop
Objectives
• Learn introduction about Hadoop.
• Learn benefits of Hadoop for big data
• Learn Open-Source Software Related to Hadoop
• Learn what is big data
• Learn why big data in the cloud makes perfect sense
• Learn Big opportunities, big challenges
Introduction
Hadoop is a framework that allows us to store and process large datasets in a parallel and distributed fashion. There are two major problems in dealing with BIG DATA:
• Storage
• Processing
Storage problem resolved by
• HDFS
All the big data that we dump gets distributed over different machines. These machines are interconnected.
Processing problem resolved by
• MapReduce
Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel with others. In short, Hadoop is used to develop applications that can perform complete statistical analysis on huge amounts of data. Hadoop is an Apache open-source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. The Hadoop framework application works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
By using a distributed file system called an HDFS (Hadoop Distributed File System), the data is split into chunks and
saved across clusters of commodity servers. As these commodity servers are built with simple hardware configurations,
these are economical and easily scalable as the data grows.
HDFS is the pillar of Hadoop that maintains the distributed file system. It makes it possible to store and replicate data
across multiple servers.HDFS has a NameNode and DataNode. DataNodes are the commodity servers where the data is
actually stored. The NameNode, on the other hand, contains metadata with information on the data stored in the different
nodes. The application only interacts with the NameNode, which communicates with data nodes as required.
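As a small illustration of the client's interaction with the NameNode and DataNodes, the sketch below writes and reads a file through the Hadoop FileSystem Java API; the NameNode address and file path are hypothetical, and in practice the address normally comes from core-site.xml rather than being hard-coded.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address.
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/hello.txt");

            // Write: the client asks the NameNode for metadata, while the
            // bytes themselves are streamed to DataNodes.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read the file back.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                System.out.println(in.readLine());
            }
        }
    }
}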
• Speed: Hadoop stores and retrieves data faster.
Hadoop uses the MapReduce functional programming model to perform parallel processing across data sets. So, when a query is sent to the database, instead of handling data sequentially, tasks are split and concurrently run across distributed servers. Finally, the output of all tasks is collated and sent back to the application, drastically improving the processing speed.
Why is Hadoop Important?
Ability to store and process huge amounts of any kind of data, quickly.
With data volumes and varieties constantly increasing, especially from social media and the Internet of Things (IoT), that's
a key consideration.
• Computing Power
Hadoop's distributed computing model processes big data fast. The more computing nodes you use, the more processing
power you have.
• Fault Tolerance.
Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically
redirected to other nodes to make sure the distributed computing does not fail.
Multiple copies of all data are stored automatically.
• Flexibility.
Unlike traditional relational databases, you don’t have to preprocess data before storing it. You can store as much data as
you want and decide how to use it later. That includes unstructured data like text, images and videos.
• Low Cost.
The open-source framework is free and uses commodity hardware to store large quantities of data.
• Scalability.
You can easily grow your system to handle more data simply by adding nodes. Little administration is required.
Low cost — As Hadoop is an open-source framework, with no license to be procured, the costs are significantly lower
compared to relational database systems. The use of inexpensive commodity hardware also works in its favor to keep the
solution economical.
Speed — Hadoop’s distributed file system, concurrent processing, and the MapReduce model enable running complex
queries in a matter of seconds.
Data diversity — HDFS has the capability to store different data formats such as unstructured (e.g. videos), semi-
structured (e.g. XML files), and structured.
While storing data, it is not required to validate against a predefined schema. Rather, the data can be dumped in any
format. Later, when retrieved, data is parsed and fitted into any schema as needed. This gives the flexibility to derive
different insights using the same data.
Apache Spark
It's a platform that handles all the processing-intensive tasks like batch processing, interactive or iterative real-time processing, graph conversions, and visualization. It consumes in-memory resources and is hence faster than the prior in terms of optimization. Spark is best suited for real-time data whereas Hadoop is best suited for structured data or batch processing; hence both are used in most companies interchangeably.
PIG
Pig was basically developed by Yahoo. It works on the Pig Latin language, which is a query-based language similar to SQL. It is a platform for structuring the data flow, and for processing and analyzing huge data sets. Pig does the work of executing
commands and in the background, all the activities of MapReduce are taken care of. After the processing, pig stores the
result in HDFS.Pig Latin language is specially designed for this framework which runs on Pig Runtime. Just the way Java
runs on the JVM. Pig helps to achieve ease of programming and optimization and hence is a major segment of the
Hadoop Ecosystem.
HIVE
With the help of SQL methodology and interface, HIVE performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language). It is highly scalable, as it allows both real-time processing and batch processing. Also, all the SQL datatypes are supported by Hive, thus making query processing easier. Similar to the query processing frameworks, HIVE comes with two components: JDBC Drivers and the HIVE Command Line. JDBC, along with ODBC drivers, works on establishing the data storage permissions and connection, whereas the HIVE Command Line helps in the processing of queries.
Hbase
It's a NOSQL database which supports all kinds of data and is thus capable of handling anything in a Hadoop database. It provides the capabilities of Google's BigTable, and is thus able to work on Big Data sets effectively. At times when we need to search or retrieve the occurrences of something small in a huge database, the request must be processed within a short span of time. At such times, HBase comes in handy, as it gives us a tolerant way of storing limited data.
Mahout
Mahout allows machine learnability to a system or application. Machine Learning, as the name suggests, helps the
system to develop itself based on some patterns, user/environmental interaction or on the basis of algorithms.It provides
various libraries or functionalities such as collaborative filtering, clustering, and classification which are nothing but
concepts of Machine learning. It allows invoking algorithms as per our need with the help of its own libraries.
Solr, Lucene
These are the two services that perform the task of searching and indexing with the help of some java libraries, especially
Lucene is based on Java which allows spell check mechanism, as well. However, Lucene is driven by Solr.
Zookeeper
There was a huge issue of management of coordination and synchronization among the resources or the components of
Hadoop which resulted in inconsistency, often. Zookeeper overcame all the problems by performing synchronization,
inter-component based communication, grouping, and maintenance.
Oozie
Oozie simply performs the task of a scheduler, thus scheduling jobs and binding them together as a single unit. There are two kinds of jobs, i.e., Oozie workflow and Oozie coordinator jobs. Oozie workflow jobs are those that need to be executed in a sequentially ordered manner, whereas Oozie coordinator jobs are those that are triggered when some data or external stimulus is given to it.
1. Lucene is an open-source Java-based search library. It is very popular and a fast search library. It is used in Java-based applications to add document search capability to any kind of application in a very simple and efficient way. Lucene is a simple yet powerful Java-based search library. It can be used in any application to add search capability to it. Lucene is an open-source project. It is scalable. This high-performance library is used to index and search virtually any kind of text. The Lucene library provides the core operations which are required by any search application: indexing and searching.
How Search Application works?
Build Query
When a user requests to search for a text, the application should create a query object based on that text, which may be
used to query the index database for relevant information.
Search Query
The index database is then examined using a query object to obtain the necessary information and content documents.
Render Results
Once the result has been obtained, the programme must select how to provide the information to the user through the
user interface. How much information should be displayed?
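The sketch below walks through those steps (index a document, build a query, search, render results) with the Lucene core API; the index directory path and field names are hypothetical.

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneSketch {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory index = FSDirectory.open(Paths.get("lucene-index"));  // hypothetical path

        // Indexing: add a document with a searchable "content" field.
        try (IndexWriter writer = new IndexWriter(index, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("content", "Lucene is a Java based search library", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Build query -> search query -> render results.
        try (DirectoryReader reader = DirectoryReader.open(index)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("content", analyzer).parse("search library");
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("content"));
            }
        }
    }
}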
2. Eclipse
Eclipse is a Java IDE that is one of the 3 biggest and most popular IDE’s in the world. It was written mostly in Java but it
can also be used to develop applications in other programming languages apart from Java using plug-ins. Some of the
features of Eclipse are as follows:
• PDE (Plugin Development Environment) is available in Eclipse for Java programmers that want to create specific functionalities in their applications. Eclipse flaunts powerful tools for the various processes in application development such as charting, modeling, reporting, testing, etc., so that Java developers can develop the application as fast as possible. Eclipse can also be used to create various mathematical documents with LaTeX using the TeXlipse plug-in, as well as packages for the Mathematica software. Eclipse can be used on platforms like Linux, macOS, Solaris and Windows.
HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an open-source project and is
horizontally scalable.HBase is a data model that is similar to Google’s big table designed to provide quick random access
to huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop File System (HDFS).It is a
part of the Hadoop ecosystem that provides random real-time read/write access to data in the Hadoop File System.One
can store the data in HDFS either directly or through HBase. Data consumer reads/accesses the data in HDFS randomly
using HBase. HBase sits on top of the Hadoop File System and provides read and write access.
Storage Mechanism in HBase
HBase is a column-oriented database and the tables in it are sorted by row. The table schema defines only column families, which are the key-value pairs. A table has multiple column families, and each column family can have any number of columns. Subsequent column values are stored contiguously on the disk. Each cell value of the table has a timestamp. In short, in an HBase, a table is a collection of rows, a row is a collection of column families, a column family is a collection of columns, and a column is a collection of key-value pairs.
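A minimal sketch of this storage model through the HBase Java client API is shown below; it assumes a table named "customer" with a column family "personal" has already been created, and that cluster settings (ZooKeeper quorum etc.) are available in hbase-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWrite {
    public static void main(String[] args) throws Exception {
        // Cluster settings are read from hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("customer"))) {  // hypothetical table

            // Write one cell: row key -> column family "personal", qualifier "name".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("Fabio"));
            table.put(put);

            // Random real-time read of the same row.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] name = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}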
HDFS or HBASE
The data storage strategies used to store data are the Hadoop Distributed File System (HDFS) or HBase.
JAQL (JSON Query Language) is a query language designed for JSON. JSON has found wide use in Web and mobile applications, including large-scale big data and enterprise data
warehouse applications. JAQL can run in local mode on individual systems and in cluster mode, in the latter
case supporting Hadoop applications. It automatically generates MapReduce jobs and parallel queries on
Hadoop systems. JAQL was created by workers at IBM Research Labs in 2008 and released to open source. While
it continues to be hosted as a project on Google Code, where a downloadable version is available under an
Apache 2.0 license, the major development activity around JAQL has remained centered at IBM. The company
offers the query language as part of the tools suite associated with InfoSphere BigInsights, its Hadoop platform. Working together with a workflow orchestrator, JAQL is used in BigInsights to exchange data between storage,
processing and analytics jobs. It also provides links to external data and services, including relational databases
and machine learning data.
Pig Latin, the language of Apache Pig, provides various operators using which programmers can develop their own functions for
reading, writing, and processing data.To analyze data using Apache Pig, programmers need to write scripts
using Pig Latin language. All these scripts are internally converted to Map and Reduce tasks. Apache Pig has a
component known as Pig Engine that accepts the Pig Latin scripts as input and converts those scripts into
MapReduce jobs.
• Pig Latin
Using Pig Latin, programmers can perform MapReduce tasks easily without having to type complex codes in
Java.
• Multi-query Approach
Apache Pig uses multi-query approach, thereby reducing the length of codes. For example, an operation that
would require you to type 200 lines of code (LoC) in Java can be easily done by typing as few as just 10 LoC in
Apache Pig. Ultimately Apache Pig reduces the development time by almost 16 times.
• SQL-like Language
Pig Latin is SQL-like language and it is easy to learn Apache Pig when you are familiar with SQL.
• Built-in Operators
Apache Pig provides many built-in operators to support data operations like joins, filters, ordering, etc. In
addition, it also provides nested data types like tuples, bags, and maps that are missing from MapReduce.
Features of Pig
Rich set of operators
− It provides many operators to perform operations like join, sort, filter, etc.
Ease of Programming
Pig Latin is similar to SQL and it is easy to write a Pig script if you are good at SQL.
Optimization Opportunities
The tasks in Apache Pig optimize their execution automatically, so the programmers need to focus only
on semantics of the language.
Extensibility
Using the existing operators, users can develop their own functions to read, process, and write data.
User-defined Functions
Pig provides the facility to create User-defined Functions in other programming languages such as Java
and invoke or embed them in Pig Scripts.
Handles all Kinds of Data
Apache Pig analyzes all kinds of data, both structured as well as unstructured. It stores the results in
HDFS.
7. ZooKeeper
Zookeeper is the easiest way for effective configuration management. It has two main benefits. First, it can be
accessed from anywhere as it is stored centrally. This also reduces the issue with data integrity. Second, dynamic
configuration management can be done as configuration data is stored centrally. This allows adjusting the
system settings without restarting the system. Thus creating “znode” and storing configuration data is a handy
way for configuration management.
This is a simplified version of how we are going to set up Zookeeper. Zookeeper stores data in a tree of ZNodes, similar to the Linux file system structure; a ZNode may contain other ZNodes or may have a value. App1 and App2 share data from the / and /config znodes; however, db.host, db.username and db.password are specific to App1. Zookeeper is one of the best centralized services for maintaining configuration, and it is widely used by many other solutions like Apache Hadoop, Kafka, and SolrCloud.
8. Avro [Data Serialization System]
Apache Avro is a language-neutral data serialization system. It was developed by Doug Cutting, the father of
Hadoop. Since Hadoop writable classes lack language portability, Avro becomes quite helpful, as it deals with
data formats that can be processed by multiple languages. Avro is a preferred tool to serialize data in
Hadoop.Avro has a schema-based system. A language-independent schema is associated with its read and write
operations. Avro serializes the data which has a built-in schema. Avro serializes the data into a compact binary
format, which can be deserialized by any application.Avro uses JSON format to declare the data structures.
Presently, it supports languages such as Java, C, C++, C#, Python, and Ruby.
Features of AVRO
• language-neutral
• processed by many languages
• compressible and splittable.
• rich data structures
• Avro schemas defined in JSON
• self-describing file named Avro Data File
• Remote Procedure Calls (RPCs).
Avro is a language-neutral data serialization system. It can be processed by many languages (currently C, C++, C#, Java, Python, and Ruby). Avro creates a binary structured format that is both compressible and splittable; hence it can be efficiently used as the input to Hadoop MapReduce jobs. Avro provides rich data structures. For example, you can create a record that contains an array, an enumerated type, and a sub-record. These datatypes can be created in any language, can be processed in Hadoop, and the results can be fed to a third language. Avro schemas, defined in JSON, facilitate implementation in the languages that already have JSON libraries. Avro creates a self-describing file named Avro Data File, in which it stores data along with its schema in the metadata section. Avro is also used in Remote Procedure Calls (RPCs). During RPC, the client and server exchange schemas in the connection handshake.
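The following sketch serializes and deserializes a record with Avro's generic Java API, using a JSON-declared schema; the record fields and file name are hypothetical.

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroRoundTrip {
    public static void main(String[] args) throws Exception {
        // Schema declared in JSON, as Avro requires.
        String schemaJson = "{"
                + "\"type\": \"record\", \"name\": \"Customer\","
                + "\"fields\": ["
                + "  {\"name\": \"id\",   \"type\": \"int\"},"
                + "  {\"name\": \"name\", \"type\": \"string\"}"
                + "]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        // Serialize: the schema is embedded in the self-describing data file.
        File file = new File("customers.avro");
        try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<>(schema))) {
            writer.create(schema, file);
            GenericRecord record = new GenericData.Record(schema);
            record.put("id", 1);
            record.put("name", "Fabio");
            writer.append(record);
        }

        // Deserialize: any Avro-aware application (in any language) can read it back.
        try (DataFileReader<GenericRecord> reader =
                     new DataFileReader<>(file, new GenericDatumReader<>())) {
            for (GenericRecord rec : reader) {
                System.out.println(rec.get("id") + " " + rec.get("name"));
            }
        }
    }
}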
An example UIM application might ingest plain text and identify entities, such as persons, places, organizations;
or relations, such as works-for or located-at.UIMA enables applications to be decomposed into components, for
example "language identification" => "language specific segmentation" => "sentence boundary detection" =>
"entity detection (person/place names etc.)". Each component implements interfaces defined by the framework
and provides self-describing metadata via XML descriptor files. The framework manages these components and
the data flow between them. Components are written in Java or C++; the data that flows between components is
designed for efficient mapping between these languages.
Presto (or PrestoDB) is an open source, distributed SQL query engine, designed from the ground up for fast
analytic queries against data of any size. It supports both nonrelational sources, such as the Hadoop Distributed
File System (HDFS), Amazon S3, Cassandra, MongoDB, and HBase, and relational data sources such as MySQL,
PostgreSQL, Amazon Redshift, Microsoft SQL Server, and Teradata. Presto can query data where it is stored,
without needing to move data into a separate analytics system. Query execution runs in parallel over a pure
memory-based architecture, with most results returning in seconds. You’ll find it used by many well-known
companies like Facebook, Airbnb, Netflix, Atlassian, and Nasdaq.Presto is an open source, distributed SQL
query engine designed for fast, interactive queries on data in HDFS, and others. Unlike Hadoop/HDFS, it does
not have its own storage system. Thus, Presto is complementary to Hadoop, with organizations adopting both to
solve a broader business challenge. Presto can be installed with any implementation of Hadoop, and is packaged
in the Amazon EMR Hadoop distribution.
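Since Presto speaks SQL over JDBC, a query can be issued from Java as sketched below; the coordinator address, catalog, schema and the web_logs table are all hypothetical, and the Presto JDBC driver jar is assumed to be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

public class PrestoQueryExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical coordinator address, catalog and schema.
        String url = "jdbc:presto://coordinator-host:8080/hive/default";
        Properties props = new Properties();
        props.setProperty("user", "analyst");

        try (Connection conn = DriverManager.getConnection(url, props);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, count(*) AS hits FROM web_logs GROUP BY page LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + " -> " + rs.getLong("hits"));
            }
        }
    }
}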
The Concept of Big Data and What it Encompasses can be Better Understood with Four Vs:
• Volume
The amount of data accumulated by private companies, public agencies, and other organizations on a daily basis
is extremely large. This makes volume the defining characteristic for big data.
• Velocity
It’s a given that data can and will pile up really fast. But what matters is the speed with which you can process
and examine this data so that it becomes useful information.
• Variety
The types of data that get collected can be very diverse. Structured data contained in databases, and
unstructured data such as tweets, emails, images, videos, and more, need to be consumed and processed all the
same.
• Veracity
Because of its scale and diversity, big data can contain a lot of noise. Veracity thus refers to the certainty of
the data and how your big data tools and analysis strategies can separate the poor quality data from those that
really matter to your business.
• Technology leaders also name a fifth V – value. But this one isn’t inherent within the huge amounts of raw data.
Instead, the true value of big data can only be realized when the right information is captured and analyzed to
gain actionable insights. To get a better idea of how big big data is, let's review some statistics:
• Over 1 billion Google searches are made and 294 billion emails are sent every day.
• Every minute, 65,972 Instagram photos are posted, 448,800 tweets are composed, and 500 hours-worth of
YouTube videos are uploaded.
• By 2020, the number of smartphone users could reach 6.1 billion. And taking Internet of Things (IoT) into
account, there could be 26 billion connected devices by then. For sure, big data is really big.
Why Should Big Data and its Exponential Growth Matter to your Business?
For one, an Accenture study (PDF) reveals that 79 percent of corporate executives surveyed believe that ‘companies that
do not embrace big data will lose their competitive position and may even face extinction’. Furthermore, an
overwhelming 83 percent have taken on big data projects with the aim of outperforming others in their respective
industries.Big data projects can impact almost any aspect of an organization. But as this survey by New Vantage Partners
(PDF) shows, where it delivers most value to enterprises is in reducing costs (49.2%) and driving innovation (44.3%).
5.8 Overview of Big Data
Due to the advent of new technologies, devices, and communication means like social networking
sites, the amount of data produced by mankind is growing rapidly every year. The amount of data
produced by us from the beginning of time till 2003 was 5 billion gigabytes. If you pile up that data
in the form of disks, it may fill an entire football field. The same amount was created every two
days in 2011, and every ten minutes in 2013. This rate is still growing enormously. Though all
this information produced is meaningful and can be useful when processed, much of it is being neglected.
What is Big Data?
Big data is a collection of large datasets that cannot be processed using traditional computing
techniques. It is not a single technique or a tool; rather, it has become a complete subject, which
involves various tools, techniques and frameworks.
What Comes Under Big Data?
• Black Box Data − It is a component of helicopters, airplanes, jets, etc. It captures voices
of the flight crew, recordings of microphones and earphones, and the performance
information of the aircraft.
• Social Media Data − Social media such as Facebook and Twitter hold information and
the views posted by millions of people across the globe.
• Stock Exchange Data − The stock exchange data holds information about the ‘buy’ and
‘sell’ decisions made on a share of different companies made by the customers.
• Power Grid Data − The power grid data holds information consumed by a particular
node with respect to a base station.
• Transport Data − Transport data includes model, capacity, distance and availability of a
vehicle.
• Search Engine Data − Search engines retrieve lots of data from different databases.
Thus, Big Data includes huge volume, high velocity, and extensible variety of data
• Structured Data − Relational data.
• Semi Structured Data − XML data.
• Unstructured Data − Word, PDF, Text, Media Logs.
Two Classes of Technology
• Operational Big Data
• Analytical Big Data
Operational vs. Analytical Big Data:
Latency – Operational: 1 ms - 100 ms; Analytical: 1 min - 100 min
Concurrency – Operational: 1,000 - 100,000; Analytical: 1 - 10
Access Pattern – Operational: Writes and Reads; Analytical: Reads
Solution
Workshops and seminars on big data should be offered at firms for everyone. All staff that handle
data on a regular basis and are involved in Big Data projects should receive basic training. All
levels of the company must have a fundamental awareness of data ideas.
Solution
Companies are using current approaches like compression, tiering, and deduplication to handle
these massive data collections. Compression reduces the number of bits in data, resulting in a
smaller total size. The process of deleting duplicate and unnecessary data from a data set is known
as deduplication.
Companies can store data in separate storage levels via data tiering. It guarantees that the data is
stored in the best possible location. Depending on the size and relevance of the data, data tiers
might include public cloud, private cloud, and flash storage.
Companies are also turning to Big Data technologies like Hadoop, NOSQL, and others.
This brings us to the third issue with Big Data.
Companies are facing a scarcity of Big Data experts. This is because data processing tools have
advanced rapidly, but most experts have not kept pace. Concrete efforts must be made to close the gap.
Solution
Companies are devoting greater resources to the recruitment of talented workers. They must also
provide training programmes for current employees in order to get the most out of them.
Another key move made by businesses is the procurement of artificial intelligence/machine
learning-powered data analytics solutions. These tools may be used by professionals who aren't
data scientists but have a rudimentary understanding of the subject. This stage allows businesses to
save a significant amount of money on recruitment.
Securing Data
One of the most difficult aspects of Big Data is securing these massive data collections. Companies
are frequently so preoccupied with comprehending, preserving, and analyzing their data sets that
data security is pushed to the back burner. Unprotected data stores, on the other hand, may
become breeding grounds for malevolent hackers.
A stolen record or a data breach may cost a company up to $3.7 million.
Solution
To secure their data, businesses are hiring more cybersecurity workers. Other measures taken to protect data include:
• Encrypting data
• Separation of data
• Control of identity and access
• Endpoint security implementation
• Security monitoring in real time
• Use of Big Data security technologies such as IBM Guardium
Summary
• Apache Hadoop is a set of open-source software tools for solving issues involving large
volumes of data and processing utilising a network of many computers. It's a MapReduce
programming model-based software framework for distributed storage and processing of
massive data.
• Big data refers to massive, difficult-to-manage data quantities – both organised and
unstructured – that inundate enterprises on a daily basis. Big data may be evaluated for
insights that help people make better judgments and feel more confident about making
key business decisions.
• HDFS, or Hadoop Distributed File System, is a distributed file system that runs on
commodity hardware. It has a lot in common with other distributed file systems.
However, there are considerable distinctions between it and other distributed file systems.
HDFS is meant to run on low-cost hardware and is extremely fault-tolerant. HDFS is a file
system that allows high-throughput access to application data and is well-suited to
applications with huge data collections. To provide streaming access to file system data,
HDFS relaxes a few POSIX criteria.
• In a Hadoop cluster, MapReduce is a programming paradigm that permits tremendous
scalability over hundreds or thousands of computers. MapReduce, as the processing
component, lies at the heart of Apache Hadoop
• Hadoop Ecosystem is a platform or a suite that offers a variety of services to address big
data issues. It consists of Apache projects as well as a variety of commercial tools and
solutions. HDFS, MapReduce, YARN, and Hadoop Common are the four core
components of Hadoop.
• Apache Pig is a high-level framework for developing Hadoop-based apps. Pig Latin is the
name of the platform's language. Pig's Hadoop tasks may be run in MapReduce, Apache
Tez, or Apache Spark.
• Eclipse is a robust Java programming environment. Because Hadoop and MapReduce
programming is done in Java, we should use a feature-rich Integrated Development
Environment (IDE).
• Jaql is one of the languages used to abstract the intricacies of Hadoop's MapReduce
programming architecture. It's a functional language with a weakly typed syntax and lazy
evaluation.
Keywords
Hadoop: Hadoop is an open-source software framework for storing and processing data on
commodity hardware clusters. It has a lot of storage for any sort of data, a lot of processing power,
and it can perform almost unlimited concurrent processes or jobs.
BigData: Big Data is a massive collection of data that continues to increase dramatically over time.
It is a data set that is so huge and complicated that no typical data management technologies can
effectively store or process it. Big data is similar to regular data, except it is much larger.
HDFS: Hadoop File System was built on distributed file system architecture. It runs on standard
hardware. HDFS, unlike other distributed systems, is extremely fault-tolerant and built with
low-cost hardware in mind.
Name Node: The name node is a piece of commodity hardware that houses the GNU/Linux
operating system as well as the name node software. It's a piece of software that can run on standard
hardware.
Data Node: The data node is a commodity computer with the GNU/Linux operating system and
data node software installed. In a cluster, there will be a data node for each node (common
hardware/system).
HBase: HBase is a Hadoop-based open-source database with sorted map data. It's horizontally
scalable and column-oriented.
JAQL: Any software package that is used in connection with databases for searching, processing,
or even generating JavaScript Object Notation (JSON)-based documents is known as JSON query
language (JAQL).
Self Assessment
Q1: A parallel computer system can do a lot of things.
A. Decentralized computing
B. Parallel computing
C. Centralized computing
D. All of these
A. Parallel computation
A. Two
B. Three
C. Four
D. Five
A. Twitter
B. Facebook
C. Google
D. Yahoo
A. Open-Source tool
B. Commercial tool
C. House tool
D. Vendor tool
A. Pig
B. HBase
C. Hive
D. All of above
A. Search
B. Reporting
C. Both
D. None of above
Q10: _________ is a Java IDE that is one of the 3 biggest and most popular IDEs in the world.
A. Paint
B. Notebook
C. Eclipse
D. All of above
Q11: The concept of big data and what it encompasses can be better understood with four Vs.
Those are:
A. Volume
B. Velocity
C. Veracity
D. All of above
Q12: _________ refers to the certainty of the data and how your big data tools and analysis
strategies can separate the poor-quality data from those that really matter to your business.
A. Volume
B. Velocity
C. Veracity
D. All of above
A. Terra
B. Mega
C. Giga
D. Peta
A. Structured Data
B. Unstructured Data
C. Semi-structured Data
D. All of the above
Q15: ________ is a component of helicopter, airplanes, and jets, etc. It captures voices of the flight crew, recordings of
microphones and earphones, and the performance information of the aircraft.
Review Questions
1. Differentiate between a data mart and a data warehouse.
2. Write down the tips for creating effective big data models.
3. Explain different types of data mart.
4. Write down advantages and disadvantages of data mart.
5. What do you understand by data streaming? Explain Use Cases for Real-Time and Streaming Data.
Further Readings
• Maheshwari, Anil. Big Data. McGraw-Hill Education, 2019.
• Mayer-Schonberger, Viktor; Cukier, Kenneth (2013). Big Data: A Revolution That Will Transform How We
Live, Work, and Think . Houghton Mifflin Harcourt.
• McKinsey Global Institute Report (2011). Big Data: The Next Frontier For Innovation,
Competition, and Productivity. Mckinsey.com
• Marz, Nathan, and James Warren (2015). Big Data: Principles and Best Practices of Scalable Realtime Data
Systems. Manning Publications.
• Sandy Ryza, Uri Laserson et al. (2014). Advanced Analytics with Spark. O'Reilly.
• White, Tom (2014). Mastering Hadoop. O'Reilly.
Web Links
1. Apache Hadoop resources: https://hadoop.apache.org/docs/r2.7.2/
2. Apache HDFS: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
3. Hadoop API site: http://hadoop.apache.org/docs/current/api/
4. NOSQL databases: http://nosql-database.org/
5. Apache Spark: http://spark.apache.org/docs/latest/
6. Tutorials on Big Data technologies: https://www.tutorialspoint.com/
Objectives
• Learn Hadoop installation step by step
• Learn HDFS
• Learn about HDFS Architecture
• Learn Goals of HDFS
• Learn basic commands in HDFS
Introduction
Hadoop is primarily supported by the Linux operating system and its features. If you're using Windows, you can use
Cloudera VMware, which comes with Hadoop preconfigured, or Oracle VirtualBox, or VMware Workstation. In this
chapter, we will learn how to install Hadoop on VMware Workstation 12 by installing CentOS on a virtual machine.
Prerequisites
Virtualization software: You can use any of VirtualBox, VMware, or Cloudera to host the operating system.
Operating System:
On Linux-based operating systems, Hadoop may be installed. Ubuntu and CentOS are two of the most popular operating
systems among them. We'll be using CentOS for this course.
Java
On your computer, you must install the Java 8 package.
Hadoop
The Hadoop 2.7.3 package is required.
• You can download the VMWare workstation by using the below link
https://customerconnect.vmware.com/en/downloads/info/slug/desktop_end_user_computing/
vmware_workstation_pro/15_0
• Open the .exe file after it has been downloaded and change the path to the desired location.
• Follow the installation instructions to the letter.
Select Create a New Virtual Machine from the drop-down menu as shown in Figure 2.
1. Browse to the location of the CentOS file you downloaded, as seen in the image above. It is
important to note that it must be a disc image file.
2. Click on Next
3. Choose the name of your machine.
4. Then, click Next.
Figure 4: Options
8. You may see three options in the image above: I Finished Installing, Change Disc, and Help.
You don't have to touch any of them until your CentOS installation is complete.
9. Your system is currently being tested and prepared for installation as shown in Figure 5.
10. When the checking percentage hits 100%, you will be brought to the following screen:
11. You may select your preferred language here. English is the default language, and that is what
I have chosen. Then, click on continue.
Step 4:
The login screen will look like this:
Figure 8: Login
• The Java 8 Package may be downloaded by clicking here. This file should be saved in your
home directory.
• Using the following command, extract the Java tar file:
tar -xvf jdk-8u101-linux-i586.tar.gz
Step 8: Make an entry in the sudoers file for the Hadoop user. visudo edits the sudoers file, which is used by
the sudo command. To change which users and groups are allowed to run sudo, run visudo:
$ visudo
We want the hadoop3 user to be allowed to run any command anywhere.
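As an illustration, the sudoers entry for this could look like the line below. This is only a sketch based on the hadoop3 user name used in the text, not the book's exact screenshot:
hadoop3 ALL=(ALL) ALL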
Step 9: Set up key-based SSH for the Hadoop account. ssh-keygen is a tool for creating new authentication key
pairs for SSH. Such key pairs are used for automating logins, single sign-on, and for authenticating
hosts. This is the key you need to copy onto your remote device to get successful SSH
authentication. It generates a public/private RSA key pair. Remember to set the correct permissions
on the file system using the chmod command. To authenticate to a remote server, we should now copy
the public key into that server's authorized_keys file.
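A minimal sketch of this key-based SSH setup for a single-node installation is shown below; the use of localhost is an assumption:
ssh-keygen -t rsa                                  # generate the public/private RSA key pair (accept the defaults)
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys    # copy the public key into authorized_keys
chmod 0600 ~/.ssh/authorized_keys                  # set the correct permissions using chmod
ssh localhost                                      # verify that password-less login now works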
On the terminal, use the following command to extract the Hadoop file: tar -xvf hadoop-3.2.2.tar.gz
Extracting hadoop file as shown in figure below:
Step 13: Editing and configuring Hadoop. You must first set the path in the ~/.bashrc file. The path can be
set from the root user by editing the ~/.bashrc file. You should check your Java
configuration before editing ~/.bashrc: update-alternatives --config java
To get to this, hit the Insert key on your keyboard, and then start typing the following code to set a
Java path:
fi
#HADOOP VARIABLES START
export JAVA_HOME= (path you copied in the previous step)
export HADOOP_HOME=/home/(your username)/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
#HADOOP VARIABLES END
After writing the code, press Esc on your keyboard and type the command: :wq!
This will save and exit you from the vi editor. The path has been set now as it can be seen in the
image below:
Step 14: Using the vi editor, open hadoop-env.sh. To inform Hadoop which Java installation to use, replace the
JAVA_HOME entry with the Java path. You will be presented with the following window:
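For example, the JAVA_HOME line in hadoop-env.sh would end up looking something like the line below; the JDK directory is an assumption and should match the path extracted in the earlier step:
export JAVA_HOME=/home/hadoop3/jdk1.8.0_101   # assumed path; use the JDK directory you actually extracted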
Step 15: There are multiple XML files that need to be modified now, and you must specify the
property and path for each one. All configuration files are shown in the image below:
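Since the screenshots are not reproduced here, the sketch below shows how one of these files (core-site.xml) could be written from the shell. The hostname, port, and file location are assumptions, not the book's exact values:
cat > $HADOOP_HOME/etc/hadoop/core-site.xml <<'EOF'
<!-- minimal core-site.xml: tells Hadoop where the default file system (the NameNode) lives -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF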
o Exit from this window by pressing Esc and then typing the command: :wq!
Step 16: Create the namenode, datanode, and secondary directories using the command below:
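The command itself is not reproduced in the text; a sketch is given below. The parent directory name follows the Hadoop_datawarehouse folder mentioned in the next step and is an assumption:
mkdir -p ~/Hadoop_datawarehouse/namenode ~/Hadoop_datawarehouse/datanode ~/Hadoop_datawarehouse/secondary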
Step 18: To check the permissions of all the files that come under Hadoop_datawarehouse, the following
command will be executed:
All the files in this folder have had their permissions changed.
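A plausible sketch of that check, assuming the folder sits in the Hadoop user's home directory, is:
ls -l ~/Hadoop_datawarehouse   # every entry should now be owned by the Hadoop user rather than root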
Step 22: Let's go to the Hadoop directory and run the command as shown below to format the name node.
hadoop namenode -format
So, we will get a message that namenode has been successfully formatted.
Step 19: Start all the services, i.e. the Hadoop daemons. To start the services, we will go to the sbin
folder, where all the service scripts can be seen.
start-dfs.sh
Step 20: Checking Hadoop
You must now verify whether the Hadoop installation was completed successfully on your
machine.
Go to the directory where you extracted the Hadoop tar file, right-click on the bin, and select Open
in Terminal from the menu.
Now type the command ls. If you get a window like the one below, that means Hadoop has been
successfully installed!
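A further quick check, not shown in the text but commonly used, is the JDK's jps tool, which lists the running Java daemons:
jps   # should list NameNode, DataNode and SecondaryNameNode once the DFS services are up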
6.1 HDFS
Hadoop File System was developed using a distributed file system design. It runs on commodity
hardware. Unlike other distributed systems, HDFS is highly fault-tolerant and designed using
low-cost hardware. HDFS holds a very large amount of data and provides easy access. To store
such huge data, the files are stored across multiple machines. These files are stored redundantly to
rescue the system from possible data losses in case of failure. HDFS also makes applications
available for parallel processing.
Features of HDFS
• It is suitable for distributed storage and processing.
• Hadoop provides a command interface to interact with HDFS.
• The built-in servers of the namenode and datanode help users to easily check the status of
the cluster.
• Streaming access to file system data.
• HDFS provides file permissions and authentication.
DataNodes also perform block creation, deletion, and replication upon instruction from the
NameNode.
• HDFS follows the master-slave architecture
Namenode
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in
the file system, and tracks where across the cluster the file data is kept. It does not store the data of
these files itself. Client applications talk to the NameNode whenever they wish to locate a file, or
when they want to add/copy/move/delete a file. The NameNode responds to successful requests
by returning a list of relevant DataNode servers where the data lives. The NameNode is a single
point of failure for the HDFS cluster: HDFS is not currently a high-availability system, and when
the NameNode goes down, the file system goes offline. There is an optional Secondary NameNode,
which can be hosted on a separate machine. It only creates checkpoints of the namespace by merging
the edits file into the fsimage file and does not provide any real redundancy. Hadoop 0.21+ has a
Backup NameNode that is part of a plan to have a highly available name service, but it needs active
contributions from the people who want it (i.e. you) to make it highly available. The NameNode
works as the master in a Hadoop cluster. Below are the main functions performed by the NameNode:
1. Stores metadata of the actual data.
2. Manages the file system namespace.
3. Regulates client access requests for the actual file data.
4. Assigns work to the slaves (DataNodes).
5. Executes file system namespace operations like opening/closing files and renaming files and directories.
As the NameNode keeps metadata in memory for fast retrieval, a huge amount of memory is required
for its operation. It should be hosted on reliable hardware.
Data node
The DataNode works as a slave in the Hadoop cluster. The main functions performed by the DataNode
are storing the actual data blocks, serving read and write requests from clients, and performing block
creation, deletion, and replication upon instruction from the NameNode.
Block
Generally, the user data is stored in the files of HDFS. A file in the file system is divided into
one or more segments and stored in individual data nodes. These file segments are called blocks.
In other words, the minimum amount of data that HDFS can read or write is called a block.
The default block size is 64 MB (in early Hadoop versions; it is 128 MB in Hadoop 2.x and later),
but it can be increased as per the need by changing the HDFS configuration.
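As a sketch of how this works in practice, a client can also override the block size for a single file when writing it; the file names and the 256 MB value below are assumptions:
hadoop fs -D dfs.blocksize=268435456 -put ~/bigfile.dat abc/   # write this file with 256 MB blocks instead of the default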
Goal of HDFS
• Fault detection and recovery − Since HDFS includes a large number of commodity
hardware, failure of components is frequent. Therefore HDFS should have mechanisms for
quick and automatic fault detection and recovery.
• Huge datasets − HDFS should have hundreds of nodes per cluster to manage the
applications having huge datasets.
• ls
Using the ls command, we can check the directories in HDFS and list the HDFS contents.
Hadoop Prefix
Every command in Hadoop has a prefix:
hadoop fs
Or
hdfs dfs
List the Contents that are in Hadoop
hadoop fs -ls
Display the Contents of a Directory
• Syntax
$hadoop fs -ls directoryname
Create a Directory in HDFS
• Syntax
hadoop fs -mkdir abc
Or
hdfs dfs -mkdir abc
put (Copy single src, or multiple srcs, from the local file system to the destination file system).
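A sketch of the put command, using the abc directory and f1.txt file names from the surrounding text, would be:
hadoop fs -put ~/f1.txt abc/   # copy the local file f1.txt into the abc directory in HDFS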
Create a File
To check whether f1.txt is available or not, and to check the contents of the file:
$hadoop fs -cat abc/f1.txt
Sending a File from HDFS to the Local File System
Syntax
hadoop fs -copyToLocal abc/f1.txt ~/Desktop/
Verify whether it is available in the Desktop location or not:
cd Desktop/
ls
Verify whether it's the same file:
cat f1.txt
Copy Command
• Sending a file from one HDFS directory to another HDFS directory
• Syntax
hadoop fs -mkdir abc1
hadoop fs -cp abc/f1.txt abc1/
-cp abc/f1.txt - Source
abc1/ - Destination
Verify
$hadoop fs -cat abc1/f1.txt
Move a File from one HDFS Directory to Another HDFS Directory
Firstly, create another directory, abc2:
hadoop fs -mkdir abc2
hadoop fs -mv abc/f1.txt abc2/
Check the Content with the cat Command
hadoop fs -cat abc2/f1.txt
Which Directory is Taking More Space?
hadoop fs -du abc1
To Check Whether a Given File has Some Content or is an Empty File
$hadoop fs -test -z destination
To print the result:
echo $?
0 means it is a zero-content (empty) file.
1 means it is a non-zero-content file.
Checksum Command: to Verify the Integrity of a File (whether the file has been modified or not)
• Syntax
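The syntax itself is not reproduced in the text; a sketch, using the file from the earlier examples, is:
hadoop fs -checksum abc/f1.txt   # prints the file's checksum; compare two runs to see whether the file has changed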
Summary
• Apache Hadoop is a Java-based open-source software framework for managing data
processing and storage in large data applications. Hadoop works by breaking down huge
data sets and analytical jobs into smaller workloads that can be handled in parallel across
nodes in a computing cluster. Hadoop can handle both organised and unstructured data,
and it can scale up from a single server to thousands of servers with ease.
• Java is an object-oriented programming language with a high level of abstraction and as
few implementation dependencies as feasible.
• Ssh-keygen is a utility that allows you to generate fresh SSH authentication key pairs.
This type of key pair is used to automate logins, provide single sign-on, and authenticate
hosts.
• The GNOME desktop environment's official text editor is gedit. Gedit is a strong
general-purpose text editor that aims for simplicity and ease of use. It can produce and
modify a wide range of text files.
• A bashrc file is a shell script file that Linux uses to load modules and aliases into your
profile when it boots up. In your /home directory, you'll find your bashrc file. With a text
editor like nano, you can make modifications to it. Adding parts to your bashrc file, such
as modules to load when you sign in or aliases for commonly used commands, can help
you save time in your workflows.
• The Hadoop daemon receives information from the core-site.xml file about where
NameNode is located in the cluster. It provides Hadoop Core configuration parameters,
such as I/O settings shared by HDFS and MapReduce.
• The configuration parameters for the HDFS daemons (the NameNode, Secondary NameNode,
and DataNodes) are included in the hdfs-site.xml file, which provides default block
replication and permission checking on HDFS. When a file is created, the number of
replications can also be selected.
Keywords
Hadoop: Hadoop is an open-source software framework for storing and processing data on
commodity hardware clusters. It has a lot of storage for any sort of data, a lot of processing power,
and it can perform almost unlimited concurrent processes or jobs.
Java is a platform as well as a programming language. Java is a high-level programming language
that is also robust, object-oriented, and secure.
Daemon: Daemon stands for process. Hadoop Daemons are a collection of Hadoop processes.
Because Hadoop is a Java platform, all of these processes are Java processes.
NameNode is a component of the Master System. Namenode's main function is to manage all of
the MetaData. The list of files saved in HDFS is known as metadata (Hadoop Distributed File
System). In a Hadoop cluster, data is stored in the form of blocks, as we all know.
HDFS: Hadoop File System was built on a distributed file system architecture. It runs on standard
hardware. HDFS, unlike other distributed systems, is extremely fault-tolerant and built with
low-cost hardware in mind.
Data node: The data node is a commodity computer with the GNU/Linux operating system and
data node software installed. In a cluster, there will be a data node for each node (common
hardware/system).
Map-red: It is one of the most significant configuration files for Hadoop's runtime environment
settings. It includes MapReduce's setup options. By setting the mapreduce.framework.name
variable in this file, we specify the framework name for MapReduce.
Data dependability refers to the completeness and accuracy of data, and it is a critical basis for
establishing data confidence within an organisation. One of the key goals of data integrity
programmes, which are also used to maintain data security, data quality, and regulatory
compliance, is to ensure data dependability.
Fault tolerance: Because it replicates data across several DataNodes, HDFS is fault-tolerant. A block
of data is duplicated on three DataNodes by default. Different DataNodes are used to hold the data
blocks. Data can still be obtained from other DataNodes if one node fails.
HBase: HBase is a Hadoop-based open-source database with sorted map data. It's horizontally
scalable and column-oriented.
Blocks: Large files are broken into little segments known as blocks in Hadoop HDFS. The physical
representation of data is called a block. Except for the final block, which might be the same size or
less, all HDFS blocks are the same size. Hadoop divides files into 128 MB blocks before storing
them in the Hadoop file system.
Self Assessment
1. _________ is the main prerequisite for Hadoop.
A. Java
B. HTML
C. C#
D. None of above
6. Hadoop cluster operate in three supported modes. Those modes are __________
A. Local/Standalone mode
B. Psuedo Distributed mode
C. Fully Distributed mode
D. All of above
11. When a computer is designated as a datanode, the disc space available to it is reduced.
A. Can be used only for HDFS storage
B. Can be used for both HDFS and non-HDFs storage
C. Cannot be accessed by non-hadoop commands
D. Cannot store text files.
15. When the Primary Name Node fails, the ___________ Name Node is utilized.
A. Data
B. Primary
C. Secondary
D. None of above
6. D 7. B 8. A 9. A 10. D
Review Questions
Further Readings
• Maheshwari, Anil. Big Data. McGraw-Hill Education, 2019.
• Mayer-Schonberger, Viktor; Cukier, Kenneth (2013). Big Data: A Revolution That Will Transform How We
Live, Work, and Think . Houghton Mifflin Harcourt.
• McKinsey Global Institute Report (2011). Big Data: The Next Frontier For
Innovation, Competition, and Productivity. Mckinsey.com
• Marz, Nathan, and James Warren (2015). Big Data: Principles and Best Practices of Scalable Realtime Data
Systems. Manning Publications.
• Sandy Ryza, Uri Laserson et al. (2014). Advanced Analytics with Spark. O'Reilly.
• White, Tom (2014). Mastering Hadoop. O'Reilly.
Web Links
1. Apache Hadoop resources: https://hadoop.apache.org/docs/r2.7.2/
Unit 07: Hadoop Architecture
Objectives
• Learn what Hadoop is
• Understand the Hadoop core components
• Learn how HDFS works
• Understand what a Hadoop cluster is
• Learn the architecture of a Hadoop cluster
• Learn the HDFS architecture and Hadoop features
Introduction
Apache Hadoop is an open-source software framework that stores data in a distributed manner and process that data in
parallel. Hadoop provides the world’s most reliable storage layer – HDFS, a batch processing engine – MapReduce and a
resource management layer – YARN.
Hadoop consists of three core components and layers. They are:
• HDFS – Hadoop Distributed File System
• MapReduce
• YARN – Yet Another Resource Negotiator
• HDFS – Hadoop Distributed File System provides the storage for Hadoop. As the name suggests, it stores
the data in a distributed manner. A file gets divided into a number of blocks which spread across the cluster
of commodity hardware. This, however, is transparent to the user working on HDFS; to them, it seems like
storing all the data on a single machine. These smaller units are the blocks in HDFS. The size of each of these
blocks is 128 MB by default; you can easily change it according to requirements. So, if you had a file of size 512 MB,
it would be divided into 4 blocks storing 128 MB each. If, however, you had a file of size 524 MB, then it would be
divided into 5 blocks. 4 of these would store 128 MB each, amounting to 512 MB, and the 5th would store the
remaining 12 MB. That's right! This last block won't take up the complete 128 MB on the disk.
Well, the amount of data we generally deal with in Hadoop is usually in the order of petabytes or
higher. Therefore, if we created blocks of small size, we would end up with a colossal number of blocks. This would mean
we would have to deal with equally large metadata regarding the location of the blocks, which would just create a lot of
overhead. And we don't really want that! The file itself would also be too large to store on any single disk alone. Therefore, it
is prudent to spread it across different machines on the cluster. This also enables a proper spread of the workload and
prevents the choking of a single machine by taking advantage of parallelism.
HDFS operates in a master-slave architecture; this means that there is one master node and several slave nodes in the
cluster. The master node is the Namenode.
• Namenode is the master node that runs on a separate node in the cluster. It manages the filesystem namespace, which
is the filesystem tree or hierarchy of the files and directories, and stores information like owners of files, file
permissions, etc. for all the files. It is also aware of the locations of all the blocks of a file and their size.
Namenode in HDFS
All this information is maintained persistently over the local disk in the form of two files: Fsimage and Edit Log.
• Fsimage stores the information about the files and directories in the filesystem. For files, it stores the replication
level, modification and access times, access permissions, blocks the file is made up of, and their sizes. For
directories, it stores the modification time and permissions.
• Edit Log on the other hand keeps track of all the write operations that the client performs. This is regularly
updated to the in-memory metadata to serve the read requests.
Whenever a client wants to write information to HDFS or read information from HDFS, it connects with the Namenode.
The Namenode returns the location of the blocks to the client and the operation is carried out. Yes, that's right, the
Namenode does not store the blocks; for that, we have separate nodes. Datanodes are the worker nodes. They are
inexpensive commodity hardware that can be easily added to the cluster. Datanodes are responsible for storing,
retrieving, replicating, and deleting blocks when asked by the Namenode. They periodically send heartbeats to the
Namenode so that it is aware of their health. With that, a DataNode also sends a list of blocks that are stored on it so that
the Namenode can maintain the mapping of blocks to Datanodes in its memory. But in addition to these two types of
nodes in the cluster, there is also another node called the Secondary Namenode.
Suppose we need to restart the Namenode, which can happen in case of a failure. This would mean that we have to copy
the Fsimage from disk to memory. Also, we would also have to copy the latest copy of Edit Log to Fsimage to keep track
of all the transactions. But if we restart the node after a long time, then the Edit log could have grown in size. This would
mean that it would take a lot of
time to apply the transactions from the Edit log. And during this time, the filesystem would be offline. Therefore, to solve
this problem, we bring in the Secondary Namenode. The Secondary Namenode is another node present in the cluster whose
main task is to regularly merge the Edit log with the Fsimage and produce checkpoints of the primary's in-memory file
system metadata. This is also referred to as Checkpointing.
Figure 6: MapReduce
The input dataset is first split into chunks of data. In this example, the input has three lines of text with three separate
entries: "bus car train," "ship ship train," "bus ship car." The dataset is then split into three chunks, based on these
lines, and processed in parallel. In the map phase, each word is emitted as a key with a value of 1; in this case, we get
pairs for bus, car, ship, and train. These key-value pairs are then shuffled and sorted together based on their keys. In
the reduce phase, the aggregation takes place, and the final output (the count of each word) is obtained.
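The same flow can be sketched with ordinary shell tools on the three input lines above; this is only an illustration of the map/shuffle/reduce idea, not Hadoop itself:
printf 'bus car train\nship ship train\nbus ship car\n' |
  tr ' ' '\n' |   # map: emit one word (key) per line
  sort |          # shuffle and sort: bring identical keys together
  uniq -c         # reduce: aggregate by counting each key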
Hadoop YARN
Hadoop YARN stands for Yet Another Resource Negotiator. It is the resource management unit of Hadoop and is
available as a component of Hadoop version 2. Hadoop YARN acts like an OS to Hadoop. It is a file system that is built
on top of HDFS.It is responsible for managing cluster resources to make sure you don't overload one machine. It performs
job scheduling to make sure that the jobs are scheduled in the right place.
Suppose a client machine wants to run a query or fetch some code for data analysis. This job request goes to the resource
manager (Hadoop YARN), which is responsible for resource allocation and management. In the node section, each of the
nodes has its own node manager. These node managers manage the nodes and monitor resource usage on the node. The
containers contain a collection of physical resources, which could be RAM, CPU, or hard drives. Whenever a job request
comes in, the application master requests the container from the node manager. Once the node manager gets the resources, it goes
back to the Resource Manager.
Hadoop Daemons
The Hadoop daemons are the processes that run in the background. These 4 daemons must run for Hadoop to be functional.
The Hadoop daemons are the NameNode, the DataNode, the ResourceManager, and the NodeManager.
7. The reduce function summarizes the output of the mapper and generates the output. The output of the reducer is
stored on HDFS.
8. For multiple reduce functions, the user specifies the number of reducers. When there are multiple reduce tasks,
the map tasks partition their output, creating one partition for each reduce task.
How HDFS Works?
YARN is the resource management layer in Hadoop. It schedules the tasks in the Hadoop cluster and assigns resources to
the applications running in the cluster. It is responsible for providing the computational resources needed for executing
the applications. There are two YARN daemons running in the Hadoop cluster for serving YARN core services. They are:
a. ResourceManager: It is the master daemon of YARN. It runs on the master node of the cluster to manage the
resources across the cluster. The ResourceManager has two major components, the Scheduler and the
ApplicationManager.
• The scheduler allocates resources to the various applications running in the cluster.
• The ApplicationManager takes up the job submitted by the client, negotiates the container for
executing the application-specific ApplicationMaster, and restarts the
ApplicationMaster container on failure.
b. NodeManager: The NodeManager is the slave daemon of YARN. It runs on all the slave nodes in the cluster. It
is responsible for launching and managing the containers on nodes. Containers execute the application-
specific processes with a constrained set of resources such as memory, CPU, and so on. When the NodeManager
starts, it announces itself to the ResourceManager. It periodically sends a heartbeat to the
ResourceManager. It offers resources to the cluster.
c. ApplicationMaster: The per-application ApplicationMaster negotiates containers from the scheduler and
tracks container status and monitors container progress. A client submits an application to the
ResourceManager. The ResourceManager contacts the NodeManager, which launches and monitors the
compute containers on nodes in the cluster. The container executes the ApplicationMaster.
The MapReduce task and the ApplicationMaster run in containers which are scheduled by the ResourceManager
and managed by the NodeManagers.
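Two standard YARN commands let you see these daemons at work; this is a sketch and assumes the single-node cluster set up earlier is running:
yarn node -list          # NodeManagers currently registered with the ResourceManager
yarn application -list   # applications the ResourceManager currently knows about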
A typical Hadoop cluster consists of a number of commodity machines that act as slaves and a high-end machine which acts as the master. The
master and slaves implement distributed computing over distributed data storage. The cluster runs open-source software for providing distributed
functionality.
Architecture of Hadoop
It is a machine with a good configuration of memory and CPU. There are two daemons running on the master and they
are Name Node and Resource Manager.
i. Functions of Name Node
Manages the file system namespace, regulates access to files by clients, and stores metadata of the actual data, for example:
file path, number of blocks, block id, the location of blocks, etc.
It executes file system namespace operations like opening, closing, and renaming files and directories. The Name Node
stores the metadata in memory for fast retrieval; hence, we should configure it on a high-end machine.
There are two daemons running on the slave machines and they are DataNode and NodeManager.
Failover is a process in which the system transfers control to a secondary system in the event of a failure.
• Graceful Failover – In this type of failover, the administrator manually initiates it. We use
graceful failover in case of routine system maintenance. There is a need to manually
transfer control to the standby NameNode; it does not happen automatically.
• Automatic Failover – In automatic failover, the system automatically transfers control
to the standby NameNode without manual intervention. Without this automatic
failover, if the NameNode goes down then the entire system goes down.
Hence the feature of Hadoop high availability comes only with this automatic failover; it
acts as your insurance policy against a single point of failure.
Figure 14: Hadoop HDFS Features – Distributed Storage, Blocks, Replication, High Availability, Data Reliability, Fault Tolerance, Scalability (vertical and horizontal scaling), and high-throughput access to application data.
Distributed Storage
HDFS stores data in a distributed manner. It divides the data into small pieces and stores
them on different DataNodes in the cluster. In this manner, the Hadoop Distributed File
System provides a way for MapReduce to process a subset of a large data set, broken into
blocks, in parallel on several nodes. MapReduce is the heart of Hadoop, but HDFS is the
one that provides it all these capabilities.
Blocks
HDFS splits huge files into small chunks known as blocks. A block is the smallest unit of data in a
filesystem. We (client and admin) do not have any control over the block, such as the block location;
the NameNode decides all such things. The HDFS default block size is 128 MB. We can increase or decrease
the block size as per our need. This is unlike the OS filesystem, where the block size is 4 KB. If the
data size is less than the block size of HDFS, then the block size will be equal to the data size. For
example, if the file size is 129 MB, then 2 blocks will be created for it. One block will be of the default
size of 128 MB, and the other will be 1 MB only and not 128 MB, as that would waste space (here the block
size is equal to the data size). Hadoop is intelligent enough not to waste the remaining 127 MB, so it
allocates a 1 MB block for the 1 MB of data. The major advantage of storing data in such a block size is
that it saves disk seek time; another advantage is in processing, as a mapper processes
1 block at a time, so 1 mapper processes a large amount of data at a time.
Replication
Hadoop HDFS creates duplicate copies of each block. This is known as replication. All blocks are
replicated and stored on different DataNodes across the cluster. It tries to put at least 1 replica in a
different rack.
High Availability
Replication of data blocks and storing them on multiple nodes across the cluster provides high
availability of data. As seen earlier in this Hadoop HDFS tutorial, the default replication factor is 3,
and we can change it to the required values according to the requirement by editing the
configuration files (hdfs-site.xml).
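Besides editing hdfs-site.xml, the replication factor of an existing file can be changed from the command line; the sketch below reuses the file name from the earlier HDFS examples:
hadoop fs -setrep -w 2 abc/f1.txt   # set this file's replication factor to 2 and wait for it to complete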
Data Reliability
As we have seen in high availability in this HDFS tutorial, data is replicated in HDFS, and it is stored
reliably as well. Due to replication, blocks are highly available even if some node crashes or some
hardware fails. If a DataNode fails, then the block is accessible from another DataNode containing a
replica of the block. Also, if a rack goes down, the block is still available on a different rack.
This is how data is stored reliably in HDFS, providing fault tolerance and high availability.
Fault Tolerance
HDFS provides a fault-tolerant storage layer for Hadoop and other components in the
ecosystem. HDFS works with commodity hardware (systems with average configurations) that has
a high chance of crashing at any time. Thus, to make the entire system highly fault-tolerant,
HDFS replicates and stores data in different places.
Scalability
Scalability means expanding or contracting the cluster. We can scale Hadoop HDFS in 2 ways.
1. Vertical Scaling: We can add more disks on the nodes of the cluster. For doing this, we need
to edit the configuration files and make corresponding entries for the newly added disks. This
requires some downtime, though it is very short, so people generally prefer the second
way of scaling, which is horizontal scaling.
2. Horizontal Scaling: Another option of scalability is of adding more nodes to the cluster on
the fly without any downtime. This is known as horizontal scaling. We can add as many
nodes as we want in the cluster on the fly in real-time without any downtime. This is a
unique feature provided by Hadoop.
Summary
• Apache Hadoop is a Java-based open-source software framework for managing data
processing and storage in large data applications. Hadoop works by breaking down huge
data sets and analytical jobs into smaller workloads that can be handled in parallel across
nodes in a computing cluster.
• Hadoop applications use the Hadoop Distributed File System (HDFS) as their primary data
storage system. HDFS is a distributed file system that uses a NameNode and DataNode
architecture to allow high-performance data access across highly scalable Hadoop clusters.
• YARN is one of Apache Hadoop's main components, and it's in charge of assigning system
resources to the many applications operating in a Hadoop cluster, as well as scheduling jobs
to run on different cluster nodes.
• MapReduce is well suited to iterative computations with massive amounts of data that
require parallel processing. Rather than a method, it depicts a data flow. MapReduce may be
used to process a graph in parallel. The map, shuffle, and reduce stages of graph algorithms
all follow the same pattern.
• An HDFS file system is built around the NameNode. It maintains the directory tree of all
files in the file system and records where the file data is stored across the cluster. In
response to successful queries, the NameNode returns a list of relevant DataNode servers
where the data is stored.
• Hadoop YARN stands for Yet Another Resource Negotiator (YARN). There is a requirement
to manage resources at both a global and a node level in a Hadoop cluster.
• In a Hadoop cluster, MapReduce is a programming paradigm that permits tremendous
scalability over hundreds or thousands of computers. MapReduce, as the processing
component, lies at the heart of Apache Hadoop
• Hadoop Ecosystem is a platform or a suite that offers a variety of services to address big
data issues. It consists of Apache projects as well as a variety of commercial tools and
solutions. HDFS, MapReduce, YARN, and Hadoop Common are the four core components
of Hadoop.
Keywords
• Hadoop: Hadoop is an open-source software framework for storing and processing data on
commodity hardware clusters. It has a lot of storage for any sort of data, a lot of processing
power, and it can perform almost unlimited concurrent processes or jobs.
• Failover:If the primary system fails or is taken down for maintenance, failover is a backup
operational mode that immediately switches to a standby database, server, or network.
Failover technology smoothly sends requests from a downed or failing system to a backup
system that replicates the operating system environment.
• HDFS: Hadoop File System was built on a distributed file system architecture. It runs on
standard hardware. HDFS, unlike other distributed systems, is extremely fault-tolerant and
built with low-cost hardware in mind.
• Name node: The name node is a piece of commodity hardware that houses the GNU/Linux
operating system as well as name node software. It's a piece of software that can run on
standard hardware.
• Data node: The data node is a commodity computer with the GNU/Linux operating system
and data node software installed. In a cluster, there will be a data node for each node (common
hardware/system).
• Data dependability refers to the completeness and accuracy of data, and it is a critical basis for
establishing data confidence within an organisation. One of the key goals of data integrity
programmes, which are also used to maintain data security, data quality, and regulatory
compliance, is to ensure data dependability.
• Fault tolerance: Because it replicates data across several DataNodes, HDFS is fault-tolerant. A
block of data is duplicated on three DataNodes by default. Different DataNodes are used to
hold the data blocks. Data can still be obtained from other DataNodes if one node fails.
• HBase: HBase is a Hadoop-based open-source database with sorted map data. It's horizontally
scalable and column-oriented.
• Blocks: Large files are broken into little segments known as blocks in Hadoop HDFS. The
physical representation of data is called a block. Except for the final block, which might be the
same size or less, all HDFS blocks are the same size. Hadoop divides files into 128 MB blocks
before storing them in the Hadoop file system.
Self Assessment
1. Filesystems that manage the storage across a network of machines are called
_________________
A. Distributed file systems
B. Distributed field systems
C. Distributed file switch
D. None of above
3. HDFS operates in a master-slave architecture, this means that there are one master node and
several slave nodes in the cluster. The master node is the __________.
A. Datanode
B. Namenode
C. Both
D. All of the above
4. In which of the following files information is maintained persistently over the local disk.
A. Fsimage and Edit log
B. Edit log and Fedit
C. Fsimage and Fedit
D. All of the above
10. Slave computers have two daemons operating, and they are
A. Nodemanager and edgenode
B. Edgenode and datanode
C. Factnode and datanode
D. Datanode and node manager
11. Hadoop manages the jobs by breaking them down into _____________.
A. Smaller chats.
B. Smaller chunks.
C. Sink chunks.
D. None of the above
12. Failover is a process in which the system transfers control to a secondary system in an event
of failure.
A. Graceful Failover
B. Failover
C. Automatic failover
13. HDFS splits huge files into small chunks known as ________
A. File
B. Blocks
C. Both
D. None of the above
15. Each block in Hadoop HDFS is duplicated twice. This is referred to as ___________.
A. Job tracker
B. Replication
C. Both
D. None of the above
6. A 7. C 8. C 9. D 10. D
11. B 12. B 13. B 14. C 15. B
Review Questions
1. Explain architecture of Hadoop.
2. Explain all Hadoop HDFS features.
3. Write down HDFS components.
4. Difference between yarn and MapReduce.
5. Write note on
A. Data reliability
B. Replication
C. Fault tolerance
Further Readings
• Maheshwari, Anil. Big Data. McGraw-Hill Education, 2019.
• Mayer-Schonberger, Viktor; Cukier, Kenneth (2013). Big Data: A Revolution That Will Transform How We
Live, Work, and Think . Houghton Mifflin Harcourt.
• McKinsey Global Institute Report (2011). Big Data: The Next Frontier For
Innovation, Competition, and Productivity. Mckinsey.com
• Marz, Nathan, and James Warren (2015). Big Data: Principles and Best Practices of Scalable Realtime Data
Systems. Manning Publications.
• Sandy Ryza, Uri Laserson et al. (2014). Advanced Analytics with Spark. O'Reilly.
• White, Tom (2014). Mastering Hadoop. O'Reilly.
Web Links
1. Apache Hadoop resources: https://hadoop.apache.org/docs/r2.7.2/
2. Apache HDFS: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
3. Hadoop API site: http://hadoop.apache.org/docs/current/api/
4. NOSQL databases: http://nosql-database.org/
5. Apache Spark: http://spark.apache.org/docs/latest/
6. Tutorials on Big Data technologies: https://www.tutorialspoint.com/
8.2 Hadoop-Streaming
8.3 Installing Java
8.4 Creating User Account
8.5 Installing Hadoop
Summary
Keywords
Self Assessment
Answers for Self Assessment
Review Questions
Further Readings
Objectives
• Learn Hadoop-MapReduce
• Learn Hadoop – Streaming
• Learn the setup of a Hadoop multi-node cluster in a distributed environment
• Learn how to create a system user account
Introduction
Hadoop's MapReduce architecture is used to process massive amounts of data in parallel on enormous clusters of
hardware in a secure way. It allows an application to store data in a distributed format and process large datasets across
groups of computers using simple programming models. In other words, MapReduce is a programming model for
processing large amounts of data distributed across a number of clusters, using the following steps:
• Input splits
• Map
• Shuffle
• Reduce
Terminology
Payload: The Map and Reduce functions are implemented by payload applications, which are at the heart of the work.
Mapper: The input key/value pairs are mapped to a collection of intermediate key/value pairs by the Mapper.
NameNode: The node that administers the Hadoop Distributed File System (HDFS) is known as the NameNode.
DataNode: A node to which data is delivered in advance of any processing.
MasterNode: The JobTracker operates on the MasterNode, which takes work requests from clients.
SlaveNode: This is the node where the Map and Reduce programmes are performed.
JobTracker: A programme that schedules jobs and assigns them to the Task Tracker.
Task Tracker: Keeps track of the task and updates the JobTracker on its progress.
Job: A job is a programme that runs a Mapper and Reducer on a dataset.
Task: A task is the execution of a Mapper or Reducer on a slice of data.
Task Attempt: A specific instance of a task execution attempt on a SlaveNode.
Advantages of MapReduce
Scalable: Hadoop is highly scalable thanks to MapReduce, which allows big data sets to be stored in distributed form
across numerous servers. Because the data is distributed over different servers, processing may run in parallel.
Cost-effective solution: MapReduce is a very cost-effective option for organisations that need to store and process large
amounts of data in a very cost-effective way, which is a current business requirement.
Flexibility: Hadoop is incredibly adaptable when it comes to multiple data sources and even different types of data, such
as structured and unstructured data, thanks to MapReduce. As a result, it gives you a lot of flexibility when it comes to
accessing and processing structured and unstructured data.
Fast: Because Hadoop stores data in a distributed file system, which stores data on a cluster's local discs, and MapReduce
tasks are often run on the same servers, data processing is faster because there is no need to retrieve data from
other servers.
Parallel processing: Because Hadoop stores data in a distributed file system and runs a MapReduce algorithm, it separates
jobs into map and reduce tasks that may run in parallel. Furthermore, due to the simultaneous execution, the overall run
time is reduced.
8.2 Hadoop-Streaming
It is a Hadoop distribution feature that lets developers and programmers construct Map-Reduce programmes in a
variety of programming languages such as Ruby, Perl, Python, C++, and others. Any language that can read from
standard input (STDIN), such as keyboard input, and write to standard output (STDOUT) can be used. Although the
Hadoop framework is written entirely in Java, Hadoop applications do not have to be written in the Java programming
language. Hadoop Streaming is a functionality that has been available since Hadoop version 0.14.1.
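A sketch of a streaming job that uses ordinary Unix tools as the mapper and reducer is shown below; the streaming jar location and the HDFS input/output paths are assumptions:
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input abc/f1.txt \
  -output streaming_out \
  -mapper /bin/cat \
  -reducer /usr/bin/wc
# the mapper passes each line through unchanged; the reducer reports line, word and character counts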
On the terminal, use the following command to extract the Hadoop file: tar -xvf hadoop-
3.2.2.tar.gz Extracting hadoop file as shown in figure below:
Step 13: Editing and configuring Hadoop. You must first set the path in the ~/.bashrc file. The path can be
set from the root user by editing the ~/.bashrc file. You should check your Java configuration before editing ~/.bashrc:
update-alternatives --config java
You'll now be able to see all of the Java versions installed on the computer. Because I only have one version of Java, which
is the most recent, it is displayed below:
To get to this, hit the Insert key on your keyboard, and then start typing the following code to set a Java path:
fi
#HADOOP VARIABLES START
export JAVA_HOME= (path you copied in the previous step)
export HADOOP_HOME=/home/(your username)/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
#HADOOP VARIABLES END
After writing the code, press Esc on your keyboard and type the command: :wq! This will save and exit you from the vi
editor. The path has been set now, as can be seen in the image below:
Step 14: Using the vi editor, open hadoop-env.sh. To inform Hadoop which Java installation to use, replace the JAVA_HOME entry
with the Java path. You will be presented with the following window:
Step 15: There are multiple XML files that need to be modified now, and you must specify the property and path for each
one. All configuration files are shown in the image below:
o Exit from this window by pressing Esc and then typing the command: :wq!
Step 16: Create the namenode, datanode, and secondary directories using the command below:
Step 17: As we can see in the above image, permissions have been given only to the root. So, next, we will use the chown
command as shown in the image below to change ownership to hadoop2.
Step 18: To check the permissions of all the files that come under Hadoop_datawarehouse, the following command will be
executed:
All the files in this folder have had their permissions changed.
Step 22: Let's go to the Hadoop directory and run the command as shown below to format the name node.
hadoop namenode -format
So, we will get a message that namenode has been successfully formatted.
Step 20: Start all the services (the Hadoop daemons). To start the services, we will go to the sbin folder, where all the service scripts can be seen, and run:
start-dfs.sh
Summary
• Hadoop MapReduce is a programming paradigm used by Apache Hadoop to provide tremendous scalability
across hundreds or thousands of Hadoop clusters running on cheap hardware. On a Hadoop cluster, the
MapReduce paradigm uses a distributed algorithm to process huge unstructured data sets.
• The fundamental components of Apache Hadoop, which are incorporated into CDH and supported by a
Cloudera Enterprise subscription, allow you to store and handle a limitless amount of data of any sort on a single
platform.
• YARN is an open source resource management framework for Hadoop that allows you to go beyond batch
processing and expose your data to a variety of workloads such as interactive SQL, sophisticated modelling, and
real-time streaming.
• Hadoop's shuffle phase passes map output from a Mapper to a Reducer in MapReduce.
• In MapReduce, the sort phase is responsible for combining and sorting map outputs. The mapper's data is
aggregated by key, distributed across reducers, then sorted by key. All values associated with the same key are
obtained by each reducer.
• The JobTracker is a Hadoop service that distributes MapReduce tasks to specified nodes in the cluster, preferably
those that hold the data or are in the same rack. Jobs are submitted to the Job Tracker by client apps. The
JobTracker sends the work to the TaskTracker nodes that have been selected.
• Hadoop streaming is a feature included in the Hadoop distribution. You may use this programme to construct
and run Map/Reduce tasks using any executable or script as the mapper and/or reducer.
• Ssh-keygen is a utility that allows you to generate fresh SSH authentication key pairs. This type of key pair is used to automate logins, provide single sign-on, and authenticate hosts.
• Hadoop is a Java-based Apache open source platform that allows big datasets to be processed across clusters of
computers using simple programming techniques. The Hadoop framework application runs in a clustered
computing environment that allows for distributed storage and computation.
Keywords
Hadoop: Hadoop is an open-source software framework for storing and processing data on commodity hardware
clusters. It has a lot of storage for any sort of data, a lot of processing power, and it can perform almost unlimited
concurrent processes or jobs.
Job Tracker: The job tracker is a master daemon that manages all of the jobs. The data itself is stored on multiple data nodes, but it is the task tracker's responsibility to keep track of the tasks running on those nodes.
Daemon: A daemon is simply a process. Hadoop Daemons are the collection of processes that Hadoop runs; because Hadoop is a Java platform, all of these processes are Java processes.
Resource Manager:The Resource Manager in YARN is basically a scheduler. In essence, it is confined to dividing the
system's available resources among competing applications. It optimises cluster utilisation (keeping all resources
in use at all times) while taking into account different limitations such as capacity guarantees, fairness, and service level
agreements (SLAs). The Resource Manager contains a pluggable scheduler that permits other algorithms, such as
capacity, to be utilised as needed to accommodate varied policy restrictions. The "yarn" user is used by the daemon.
Application Master: The Application Master is a framework-specific library that is in charge of negotiating resources
with the Resource Manager and working with the Node Manager(s) to execute and monitor Containers and their resource
usage. It is in charge of negotiating suitable resource Containers with the Resource Manager and keeping track of their
progress. The Resource Manager monitors the Application Master, which operates as a single Container.
Job history service:This is a daemon that keeps track of jobs that have been finished. It's best to run it as a separate
daemon. Because it maintains task history information, running this daemon uses a lot of HDFS space. The "mapred" user
runs this daemon.
Container: A Container is a resource allocation that occurs as a result of a Resource Request being granted by the
Resource Manager. A Container allows a programme to access a certain amount of resources (memory, CPU, etc.) on a
certain host. To make use of Container resources, the Application Master must take the Container and offer it to the Node
Manager in charge of the host to which the Container has been assigned. To guarantee that Application Master(s) cannot
fake allocations in the cluster, the Container allocation is checked in secure mode.
Node Manager: The Node Manager is YARN's per-node agent. It runs on every worker node in the cluster and is responsible for launching Containers on that node, monitoring their resource usage (memory, CPU, etc.), and reporting that usage and the node's health back to the Resource Manager.
NameNode is a component of the Master System. Namenode's main function is to manage all of the MetaData. The list of
files saved in HDFS is known as metadata (Hadoop Distributed File System). In a Hadoop cluster, data is stored in the
form of blocks, as we all know.
HDFS: Hadoop File System was built on a distributed file system architecture. It runs on standard hardware. HDFS,
unlike other distributed systems, is extremely fault-tolerant and built with low-cost hardware in mind.
Data node: The data node is a commodity computer with the GNU/Linux operating system and data node software
installed. In a cluster, there will be a data node for each node (common hardware/system).
Map-red: It is one of the most significant configuration files for Hadoop's runtime environment settings. It contains MapReduce's configuration options. By setting the mapreduce.framework.name property in this file, we can tell Hadoop which MapReduce framework to use.
Self Assessment
1. Which of the following are major pre-requisites for MapReduce programming.
A. The application must lend itself to parallel programming
B. The data for the applications can be expressed in key-value pairs
C. Both
D. None of above
3. Input key/value pairs are mapped to a collection of intermediate key/value pairs using __________ .
A. Mapper
B. Reducer
C. Both Mapper and Reducer
D. None of the mentioned
4. The master is a ________, and each cluster has only one NameNode.
A. Data Node
B. NameNode
C. Data block
D. Replication
9. Commands to create a system user account on both master and slave systems
A. useradd hadoop
B. adduser hadoop
C. userid hadoop
D. addid hadoop
10. Hadoop Streaming uses standard ____ streams as the interface between Hadoop and user program.
A. Unix
B. Linux
C. C++
D. None of above
6. A 7. B 8. D 9. A 10. A
Review Questions
1. Differentiate between the job tracker and the task tracker.
2. Write down the steps to install Hadoop.
3. Write down the HDFS components.
4. What do you understand by resource manager?
5. What is the function of the ~/.bashrc file?
Further Readings
• Maheshwari, Anil. Big Data. McGraw-Hill Education, 2019.
• Mayer-Schonberger, Viktor; Cukier, Kenneth (2013). Big Data: A Revolution That Will Transform How We
Live, Work, and Think . Houghton Mifflin Harcourt.
• McKinsey Global Institute Report (2011). Big Data: The Next Frontier For Innovation, Competition, and
Productivity. Mckinsey.com
• Marz, Nathan, and James Warren (2015). Big Data: Principles and Best Practices of Scalable Realtime Data Systems. Manning Publications.
• Sandy Ryza, Uri Laserson et al. (2014). Advanced Analytics with Spark. O'Reilly.
• White, Tom (2014). Mastering Hadoop. O'Reilly.
Web Links
1. Apache Hadoop resources: https://hadoop.apache.org/docs/r2.7.2/
2. Apache HDFS: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
3. Hadoop API site: http://hadoop.apache.org/docs/current/api/
4. NoSQL databases: http://nosql-database.org/
5. Apache Spark: http://spark.apache.org/docs/latest/
6. Tutorials on Big Data technologies: https://www.tutorialspoint.com/
Objectives
• Starting HDFS
• Creating User Account
• Configuring Key Based Login
• Configuring Hadoop on Master Server
• Understand how to add a new Data Node in the Hadoop cluster
• Learn adding a user and SSH access
Introduction
Starting HDFS
Format the configured HDFS file system, then open the namenode (HDFS server) and execute the following command:
hadoop namenode -format
The Namenode is the node in the Hadoop Distributed File System which keeps track of all the data stored in the Datanodes. The Namenode holds metadata about the data stored on the Datanodes, including information about where that data is located. So, when you run the hadoop namenode -format command, all this information is deleted from the namenode, which means that the system no longer knows where the data is stored, and all data stored in the Hadoop File System is therefore lost. Formatting the namenode deletes the information from the namenode directory. The NameNode directory is specified in the hdfs-site.xml file by the dfs.namenode.name.dir property.
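For illustration, the property mentioned above might look like this in hdfs-site.xml; the directory value shown here is only a hypothetical example and will differ in your own setup:
<property>
  <!-- local filesystem directory where the NameNode keeps its metadata -->
  <name>dfs.namenode.name.dir</name>
  <value>file:///home/hadoop2/hadoop_data/namenode</value>
</property>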
Formatting the file system means initializing the directory specified by the dfs.name.dir variable. After you have logged in as the dedicated user for Hadoop (in my case it is hduser) that you must have created during installation, go to the installation folder of Hadoop (in my case it is /usr/local/hadoop) and run:
start-dfs.sh
Inside the Hadoop directory, there is a folder 'sbin' containing several scripts such as start-all.sh, stop-all.sh, start-dfs.sh, stop-dfs.sh, hadoop-daemons.sh, yarn-daemons.sh, etc.
Executing these scripts helps us start and/or stop the daemons in various ways. start-all.sh and stop-all.sh are used to start and stop the Hadoop daemons all at once; issuing them on the master machine will start/stop the daemons on all the nodes of a cluster. These commands are now deprecated. start-dfs.sh, stop-dfs.sh and start-yarn.sh, stop-yarn.sh do the same, but start/stop the HDFS and YARN daemons separately on all the nodes from the master machine. It is advisable to use these commands instead of start-all.sh and stop-all.sh. To start individual daemons on an individual machine manually, go to the particular node and issue hadoop-daemon.sh namenode/datanode or yarn-daemon.sh resourcemanager.
Multi-Node Cluster
A multi-node cluster in Hadoop contains two or more DataNodes in a distributed Hadoop environment. This is practically used in organizations to store and analyze their petabytes and exabytes of data. Here, we are taking two machines – master and slave; on both machines, a DataNode will be running.
Installing Java
Syntax of the java version command:
$ java -version
192.168.1.145 hadoop-slave-1
192.168.56.1 hadoop-slave-2
The wget command is a command line utility for downloading files from the Internet. It supports
downloading multiple files, downloading in the background, resuming downloads, limiting the
bandwidth used for downloads and viewing headers
# wget http://apache.mesi.com.ar/hadoop/common/hadoop-1.2.1/hadoop-1.2.0.tar.gz
The tar command is used to create and extract compressed archives which represent a file or collection of files. A tar file, commonly known as a "tarball" (often gzip- or bzip-compressed), has an extension ending with .tar or .tar.gz.
# tar -xzf hadoop-1.2.0.tar.gz
mv stands for move. mv is used to move one or more files or directories from one place to another
in a file system
# mv hadoop-1.2.0 hadoop
# chown -R hadoop /opt/hadoop
Configuring Hadoop
The Hadoop server must be configured in core-site.xml, which should be edited wherever required:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://hadoop-master:9000/</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>
The core-site.xml file informs Hadoop daemon where NameNode runs in the cluster. It
contains the configuration settings for Hadoop Core such as I/O settings that are common
to HDFS and MapReduce.
• The hdfs-site.xml file contains the configuration settings for the HDFS daemons: the NameNode, the Secondary NameNode, and the DataNodes. Here, we can configure hdfs-site.xml to specify default block replication and permission checking on HDFS. The actual number of replications can also be specified when the file is created; the default is used if replication is not specified at creation time.
<configuration>
<property>
<name>dfs.data.dir</name>
<value>/opt/hadoop/hadoop/dfs/name/data</value>
<final>true</final>
</property>
<property>
<name>dfs.name.dir</name>
<value>/opt/hadoop/hadoop/dfs/name</value>
<final>true</final>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
The Hadoop server must be configured in hdfs-site.xml as shown above and edited wherever required.
The Hadoop server must also be configured in mapred-site.xml and edited wherever required:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>hadoop-master:9001</value>
</property>
</configuration>
The mapred-site.xml file contains the configuration settings for MapReduce daemons; the job
tracker and the task-trackers.
hadoop-env.sh
JAVA_HOME, HADOOP_CONF_DIR, and HADOOP_OPTS should be edited as follows:
export JAVA_HOME=/opt/jdk1.7.0_17
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
start-dfs.sh, stop-dfs.sh and start-yarn.sh, stop-yarn.sh: same as above, but these start/stop the HDFS and YARN daemons separately on all the nodes from the master machine. It is advisable to use these commands now over start-all.sh and stop-all.sh. hadoop-daemon.sh namenode/datanode and yarn-daemon.sh resourcemanager: used to start individual daemons on an individual machine manually; you need to go to the particular node and issue these commands.
In a multi-node Hadoop cluster, all the essential daemons are up and running on different machines/hosts. A multi-node Hadoop cluster setup has a master-slave architecture, wherein one machine acts as a master that runs the NameNode daemon while the other machines act as slave or worker nodes that run the other Hadoop daemons. Usually, in a multi-node Hadoop cluster, cheaper machines (commodity computers) run the TaskTracker and DataNode daemons, while the other services run on more powerful servers. For a multi-node Hadoop cluster, machines or computers can be present in any location, irrespective of the location of the physical server.
Networking
Add new nodes to an existing Hadoop cluster with some appropriate network configuration.
Assume the following network configuration.
Examine the output of the jps command on slave2.in. After some time has passed, you will notice
that the DataNode process has been automatically terminated.
Summary
• Hadoop applications use the Hadoop Distributed File System (HDFS) as their primary
data storage system. HDFS is a distributed file system that uses a NameNode and
DataNode architecture to allow high-performance data access across highly scalable
Hadoop clusters.
• Hadoop is an Apache Foundation open source framework for processing huge volumes of
heterogeneous data sets in a distributed way across clusters of commodity computers and
hardware using a simplified programming style. Hadoop is a dependable distributed
storage and analysis solution.
• By duplicating data over numerous nodes, it provides extremely stable and distributed
storage that assures reliability even on commodity hardware. When data is submitted to
HDFS, it is automatically divided into numerous blocks (adjustable parameter) and
stored/replicated across many data nodes, unlike a traditional file system. As a result,
high availability and fault tolerance are ensured.
Keywords
Hadoop: Hadoop is an open-source software framework for storing and processing data on
commodity hardware clusters. It has a lot of storage for any sort of data, a lot of processing power,
and it can perform almost unlimited concurrent processes or jobs.
Job Tracker:The job tracker is a master daemon that runs on the same node as the data nodes and
manages all of the jobs. This data will be stored on multiple data nodes, but it is the task tracker's
responsibility to keep track of it.
Failover:If the primary system fails or is taken down for maintenance, failover is a backup
operational mode that immediately switches to a standby database, server, or network. Failover
technology smoothly sends requests from a downed or failing system to a backup system that
replicates the operating system environment.
HDFS: Hadoop File System was built on a distributed file system architecture. It runs on standard
hardware. HDFS, unlike other distributed systems, is extremely fault-tolerant and built with
low-cost hardware in mind.
Name node: The name node is a piece of commodity hardware that houses the GNU/Linux
operating system as well as name node software. It's a piece of software that can run on standard
hardware.
Resource Manager:The Resource Manager in YARN is basically a scheduler. In essence, it is
confined to dividing the system's available resources among competing applications. It optimises cluster utilisation (keeping all resources in use at all times) while taking into account
different limitations such as capacity guarantees, fairness, and service level agreements (SLAs). The
Resource Manager contains a pluggable scheduler that permits other algorithms, such as capacity,
to be utilised as needed to accommodate varied policy restrictions. The "yarn" user is used by the
daemon.
Application Master: The Application Master is a framework-specific library that is in charge of
negotiating resources with the Resource Manager and working with the Node Manager(s) to
execute and monitor Containers and their resource usage. It is in charge of negotiating suitable
resource Containers with the Resource Manager and keeping track of their progress. The Resource
Manager monitors the Application Master, which operates as a single Container.
HBase: HBase is a Hadoop-based open-source database with sorted map data. It's horizontally
scalable and column-oriented.
Blocks:Large files were broken into little segments known as Blocks in Hadoop HDFS. The physical
representation of data is called a block. Except for the final block, which might be the same size or
less, all HDFS blocks are the same size. Hadoop divides files into 128 MB blocks before storing
them in the Hadoop file system.
Key/Value Store:A key-value store, sometimes known as a key-value database, is a simple database
that employs an associative array (think of a map or dictionary) as its basic data model, with each
key corresponding to one and only one item in a collection. A key-value pair is the name for this
type of connection.
Review Questions
1. Select the command to format the configured HDFS file system
A. hadoop namenode -format
B. hadoop -format namenode
C. hadoop name -format
D. hadoop node -format
4. Select the command that starts the Hadoop map/reduce daemons, the jobtracker and tasktrackers.
A. start-mapred.sh
B. start-dfs.sh
C. start-env.sh
D. start-daemons.sh
7. The data nodes are used as ________ for blocks by all the namenodes
A. Common points
B. common storage
C. Both
D. None of above
11. Select the command to format the configured HDFS file system
A. hadoop namenode -format
B. hadoop -format namenode
C. hadoop name -format
D. hadoop node -format
12. Select the command that starts the Hadoop DFS daemons, the namenode and datanodes.
A. start-mapred.sh
B. start-dfs.sh
C. hadoop-env.sh
D. start-daemons.sh
14. Select the command that starts the Hadoop map/reduce daemons, the jobtracker and tasktrackers.
A. start-mapred.sh
B. start-dfs.sh
C. start-env.sh
D. start-daemons.sh
D. start-daemons.sh
16. Write down commands and explanation to insert and retrieve data into HDFS.
17. Explain HDFS operation to read and write the data.
18. Write down steps for learning adding user and ssh access.
19. Write down command to explain how to create user account.
20. Explain HDFS commands.
6. A 7. B 8. C 9. B 10. A
Further Readings
• Maheshwari, Anil. Big Data. McGraw-Hill Education, 2019.
• Mayer-Schonberger, Viktor; Cukier, Kenneth (2013). Big Data: A Revolution That Will Transform How We Live,
Work, and Think. Houghton Mifflin Harcourt.
• McKinsey Global Institute Report (2011). Big Data: The Next Frontier for Innovation,
Competition, and Productivity. Mckinsey.com
• Marz, Nathan, and James Warren (2015). Big Data: Principles and Best Practices of Scalable Realtime Data
Systems. Manning Publications.
• Sandy Ryza, Uri Laserson et al. (2014). Advanced Analytics with Spark. O'Reilly.
• White, Tom (2014). Mastering Hadoop. O'Reilly.
Web Links
1. Apache Hadoop resources: https://hadoop.apache.org/docs/r2.7.2/
2. Apache HDFS: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
3. Hadoop API site: http://hadoop.apache.org/docs/current/api/
4. NoSQL databases: http://nosql-database.org/
5. Apache Spark: http://spark.apache.org/docs/latest/
6. Tutorials on Big Data technologies: https://www.tutorialspoint.com/
7. https://www.tutorialspoint.com/hadoop/hadoop_multi_node_cluster.htm
Objectives
• Learn what is unit testing
• Explore concepts of MRUnit
• Learn Developing and testing MapReduce jobs with MRUnit
• Learn Anatomy of a MapReduce Job Run
• explore and learn the concepts of Shuffle and Sort
Introduction
Unit testing is a testing technique in which individual modules are checked by the developer to see if there are any flaws.
It is concerned with the functional soundness of the standalone modules. The basic goal is to isolate each component of the
system in order to detect, analyse, and correct any flaws.
For example, consider a simple method that adds two numbers; such a method can be tested in isolation as a unit:
int sum(int a, int b) {
    int c = a + b;
    return c;
}
JUNIT
JUnit is a unit testing framework for the Java programming language. JUnit has been important in the development of
test-driven development, and is one of a family of unit testing frameworks collectively known as xUnit that originated
with JUnit. JUnit promotes the idea of "first testing then coding", which emphasizes on setting up the test data for a piece
of code that can be tested first and then implemented. This approach is like "test a little, code a little, test a little, code a
little." It increases the productivity of the programmer and the stability of program code, which in turn reduces the stress
on the programmer and the time spent on debugging.
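As a minimal sketch (assuming the sum() method above lives in a class called Calculator, which is a hypothetical name used only for this example), a JUnit 4 test for it could look like this:
import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class CalculatorTest {

    // Calculator is a hypothetical class wrapping the sum() method shown above
    @Test
    public void sumAddsTwoNumbers() {
        Calculator calculator = new Calculator();
        // the test data is set up first, then the expected result is asserted
        assertEquals(5, calculator.sum(2, 3));
    }
}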
• MapDriver: The MapDriver class is in charge of calling the Mapper's map() method.
• ReduceDriver: This is the driver class that calls the Reducer's reduce() method.
• MapReduceDriver: The combined MapReduce driver is in charge of first invoking the Mapper's map() function, then
performing an in-memory Shuffle phase. The Reducer's reduce() function is called at the end of this phase.
Each of the classes listed above provides methods for providing test inputs and anticipated outcomes. The setup() method
of the JUnit API is responsible for establishing fresh instances of the Mapper, Reducer, and relevant MRUnit drivers for
each test. To include MRUnit in a Hadoop MapReduce project, include it as a test dependency in the project POM file (of course, the project is a Maven project, and I'm sure you're not going to skip Maven for this type of project):
<dependency>
<groupId>org.apache.mrunit</groupId>
<artifactId>mrunit</artifactId>
<version>1.1.0</version>
<classifier>hadoop2</classifier>
</dependency>
reduceDriver.withInput(firstMapKey, firstMapValues)
.withInput(secondMapKey, secondMapValues)
.withOutput(firstMapKey, new IntWritable(2))
.withOutput(secondMapKey, new IntWritable(3))
.runTest();
}
and the overall MapReduce flow
@Test public void testWordCountMapReducer() throws IOException
{ mapReduceDriver.withInput(new LongWritable(1), new Text(firstTestKey))
.withInput(new LongWritable(2), new Text(firstTestKey))
.withInput(new LongWritable(3), new Text(secondTestKey))
.withOutput(new Text(firstTestKey), new IntWritable(2))
.withOutput(new Text(secondTestKey), new IntWritable(1))
.runTest();
}
The only thing that changes is the precise driver to use. Any MRUnit test case may be run in the same manner as JUnit test cases, so you can use them all together in your applications. After you have run the command mvn test, Maven will execute all available unit tests (both JUnit and MRUnit) for the given MapReduce application and report the execution results.
This example only showed the testing of a mapper. MRUnit also provides a ReduceDriver class that can be used
in the same way as MapDriver for testing reducers.
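For completeness, a mapper test might look like the sketch below. WordCountMapper is a hypothetical mapper assumed to emit (word, 1) for every token in a line; the MapDriver type parameters mirror the mapper's input and output key/value types:
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class WordCountMapperTest {

    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        // a fresh driver (and mapper instance) is created before every test
        mapDriver = MapDriver.newMapDriver(new WordCountMapper());
    }

    @Test
    public void testMapper() throws IOException {
        // one input record in, the expected intermediate key-value pairs out
        mapDriver.withInput(new LongWritable(1), new Text("big data"))
                 .withOutput(new Text("big"), new IntWritable(1))
                 .withOutput(new Text("data"), new IntWritable(1))
                 .runTest();
    }
}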
Step7: Open the Counters project in Eclipse, and set up a new remote debug configuration. Create a new breakpoint and
debug.
A MapReduce job that is configured to execute in local mode runs entirely in one JVM instance. Unlike the pseudo-
distributed mode, this mode makes it possible to hook up a remote debugger to debug a job.
Apache Pig also provides a local mode for development and testing. It uses the same LocalJobRunner class as a local
mode MapReduce job. It can be accessed by starting Pig with the following command:
pig -x local
A MapReduce job may be started with only one line of code: JobClient.runJob(conf).
The whole process is illustrated in the figure below. At the highest level, there are four independent entities: the client, the jobtracker, the tasktrackers, and the distributed filesystem.
The JobClient's runJob() method is a convenience method that creates a new JobClient instance and calls submitJob() on it. After submitting the job, runJob() polls the job's progress every second and, if it has changed since the last report, reports the progress to the console. When the job is finished, the job counters are displayed if it was successful; otherwise, the error that caused the job to fail is logged to the console.
The submitJob() function of JobClient implements the following job submission process:
• Requests a new job ID from the jobtracker (through getNewJobId() on JobTracker) (step 2).
• Checks the job's output specification. The task is not submitted if the output directory is not given or if it already
exists, for example, and an error is issued to the MapReduce application.
• Calculates the job's input splits. The job is not submitted, and an error is reported to the MapReduce program, if the splits cannot be computed, for example because the input paths do not exist.
• Copies the resources needed to perform the task to the jobtracker's filesystem in a directory named after the job
ID, including the job JAR file, the configuration file, and the computed input splits. The job JAR is duplicated
with a high replication factor (set by the mapred.submit.replication parameter, which defaults to 10) so that
tasktrackers may access many copies when running tasks for the job.
• By invoking submitJob() on JobTracker, tells the jobtracker that the job is ready to be executed.
Job Initialization
When the JobTracker gets a call to its submitJob() function, it places it in an internal queue, where it will be picked up and
initialised by the job scheduler. Initialization entails constructing an object to represent the job being executed, which
encapsulates its activities, as well as accounting information to track the state and progress of the tasks (step 5).
The job scheduler retrieves the input splits computed by the JobClient from the shared filesystem before creating the list of tasks to run (step 6). It then creates one map task for each split. The number of reduce tasks to create is determined by the mapred.reduce.tasks property in the JobConf, which is set by the setNumReduceTasks() method. Task IDs are assigned at this point.
Task Assignment
Tasktrackers use a simple loop to deliver heartbeat method calls to the jobtracker on a regular basis. Heartbeats not only
inform the jobtracker that a tasktracker is alive, but they also serve as a messaging channel. A tasktracker will signal if it is
ready to run a new task as part of the heartbeat, and if it is, the jobtracker will assign it a task, which it will send to the
tasktracker using the heartbeat return value (step 7).
Before the jobtracker can choose a task for the tasktracker, it must first choose a job from which to select that task. There are various scheduling methods, as detailed later in this chapter (see "Job Scheduling"), but the default one simply maintains a priority list of jobs. Having chosen a job, the jobtracker then selects a task for it.
For map tasks and reduce tasks, tasktrackers have a set number of slots: for example, a tasktracker may be able to execute
two map tasks and two reduce tasks at the same time. (The exact number is determined by the number of cores and memory available on the tasktracker; see "Memory".) Because the default scheduler fills empty map task slots before
reduce task slots, the jobtracker will choose a map task if the tasktracker has at least one empty map task slot; otherwise,
it will select a reduce task.
Because there are no data locality constraints, the jobtracker simply chooses the next reduce task from its list of yet-to-be-
run reduce jobs. For a map job, however, it considers the tasktracker's network position and selects a task with an input
split that is as close to the tasktracker as feasible. In the best case scenario, the job is data-local, meaning it runs on the
same node as the split. Alternatively, the job might be rack-local, meaning it's on the same rack as the split but not on the
same node. Some jobs are neither data-local nor rack-local, and their data is retrieved from a rack other than the one on
which they are operating. Looking at a job's counters will reveal the proportion of each sort of work.
Task Execution
After the tasktracker has been given a task to complete, the next step is for it to complete the task. First, it copies the job
JAR from the common filesystem to the tasktracker's filesystem to localise it. It also moves any files required by the
programme from the distributed cache to the local disc (see "Distributed Cache") (step 8). Second, it creates a task-specific
local working directory and un-jars the JAR's contents into it. Third, it creates a TaskRunner object to carry out the task.
TaskRunner creates a new Java Virtual Machine (step 9) to perform each task in (step 10), ensuring that any faults in the
user-defined map and reduce routines have no impact on the tasktracker (by causing it to crash or hang, for example).
However, the JVM can be reused across jobs (see "Task JVM Reuse").
The umbilical interface is used by the child process to interact with its parent. This manner, it keeps the parent updated
on the task's progress every few seconds until it's finished.
Job Completion
The status of a job is changed to "successful" when the jobtracker receives notification that the last task for the job has completed. The JobClient then learns that the job has completed successfully when it polls for status, so it prints a message to inform the user and returns from the runJob() method.
If the jobtracker is configured to do so, it will additionally send an HTTP job notification. Clients that want to receive callbacks can set this up using the job.end.notification.url property. Finally, the jobtracker cleans up its working state for the job, and instructs the tasktrackers to do the same (so intermediate output is deleted, for example).
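As a hedged sketch of the notification property just mentioned (the endpoint URL below is hypothetical, and the $jobId/$jobStatus placeholders follow the classic MapReduce configuration documentation), a client might set it like this:
import org.apache.hadoop.conf.Configuration;

public class NotificationExample {

    public static Configuration withJobNotification() {
        Configuration conf = new Configuration();
        // hypothetical endpoint; the framework substitutes $jobId and $jobStatus
        // when it issues the HTTP callback after the job finishes
        conf.set("job.end.notification.url",
                 "http://monitor.example.com/notify?jobid=$jobId&status=$jobStatus");
        return conf;
    }
}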
The framework takes care of scheduling tasks, monitoring them, and re-executing the failed tasks.
In Hadoop 2, YARN (Yet Another Resource Negotiator) was introduced. The basic idea behind the introduction of YARN is to split the functionalities of resource management and job scheduling/monitoring into separate daemons: the ResourceManager, the ApplicationMaster, and the NodeManager. The ResourceManager is the master daemon that arbitrates resources among all the applications in the system. The NodeManager is the slave daemon responsible for containers, monitoring their resource usage, and reporting the same to the ResourceManager or Schedulers. The ApplicationMaster negotiates resources from the ResourceManager and works with the NodeManager in order to execute and monitor the tasks. The ResourceManager has two main components: the Scheduler and the ApplicationsManager.
Schedulers
The Scheduler in the ResourceManager is a pure scheduler, responsible for allocating resources to the various running applications. It is not responsible for monitoring or tracking the status of an application, nor does it guarantee restarting of tasks that fail due to either hardware failure or application failure.
The scheduler performs scheduling based on the resource requirements of the applications. It has pluggable policies that are responsible for partitioning the cluster resources among the various queues, applications, etc. The FIFO Scheduler, the CapacityScheduler, and the FairScheduler are such pluggable policies, responsible for allocating resources to the applications.
The FIFO Scheduler is not suitable for shared clusters. If a large application arrives before a shorter one, the large application will use all the resources in the cluster, and the shorter application has to wait for its turn, which leads to starvation. It also does not take into account the balance of resource allocation between long applications and short applications.
Capacity Scheduler
• The CapacityScheduler allows multiple tenants to securely share a large Hadoop cluster. It is designed to run Hadoop applications in a shared, multi-tenant cluster while maximizing the throughput and the utilization of the cluster. It supports hierarchical queues to reflect the structure of the organizations or groups that utilize the cluster resources. A queue hierarchy contains three types of queues: root, parent, and leaf. The root queue represents the cluster itself, a parent queue represents an organization/group or sub-organization/sub-group, and the leaf queues accept application submissions. Also, when queues running below capacity demand free resources that are available on a queue which has completed its tasks, these resources are assigned to the applications on the queues running below capacity. This provides elasticity for the organization in a cost-effective manner. Apart from this, the CapacityScheduler provides a comprehensive set of limits to ensure that a single application, user, or queue cannot use a disproportionate amount of resources in the cluster.
A queue hierarchy contains three types of queues: root, parent, and leaf.
To ensure fairness and stability, it also provides limits on initialized and pending applications from a single user and queue. It maximizes the utilization of resources and throughput in the Hadoop cluster, and provides elasticity for groups or organizations in a cost-effective manner. It also gives capacity guarantees and safeguards to the organizations utilizing the cluster. It is, however, the most complex of these schedulers. A sketch of a queue configuration is shown below.
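As an illustrative sketch only (the queue names prod and dev and the percentages are hypothetical), CapacityScheduler queues are typically declared in capacity-scheduler.xml along these lines:
<!-- declare two leaf queues under the root queue -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>prod,dev</value>
</property>
<!-- each queue is guaranteed a share of the cluster; shares add up to 100 -->
<property>
  <name>yarn.scheduler.capacity.root.prod.capacity</name>
  <value>60</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.dev.capacity</name>
  <value>40</value>
</property>
Here root represents the cluster itself, while prod and dev are leaf queues that accept application submissions; their capacities must sum to 100 percent of the parent queue.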
Fair Scheduler
FairScheduler allows YARN applications to fairly share resources in large Hadoop clusters. With FairScheduler, there is no need to reserve a set amount of capacity, because it dynamically balances resources between all running applications. It assigns resources to applications in such a way that all applications get, on average, an equal amount of resources over time. The FairScheduler, by default, takes scheduling fairness decisions only on the basis of memory; we can configure it to schedule with both memory and CPU. When a single application is running, that application uses the entire cluster.
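A minimal sketch of a FairScheduler allocation file (the queue names and weights below are hypothetical) might look like this:
<?xml version="1.0"?>
<allocations>
  <!-- analytics receives, on average, twice the share of adhoc -->
  <queue name="analytics">
    <weight>2.0</weight>
  </queue>
  <queue name="adhoc">
    <weight>1.0</weight>
  </queue>
</allocations>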
Failure of JobTracker
• The final case can be jobtracker failure. It is the most serious failure in classic MapReduce, and not much can be done in this case, because the jobtracker is a single point of failure in MapReduce.
• It is therefore recommended to run it on better hardware so as to avoid this scenario as much as possible. We need to resubmit all the jobs that were in progress once the jobtracker is brought up again. In YARN, this situation is somewhat improved.
• Task Failure
Failure of a running task is similar to the classic case. Runtime exceptions and sudden exits of the JVM are propagated back to the application master, and the task attempt is marked as failed. Likewise, hanging tasks are noticed by the application master through the absence of a ping over the umbilical channel (the timeout is set by mapreduce.task.timeout), and again the task attempt is marked as failed. The configuration properties for determining when a task is considered to have failed are the same as in the classic case: a task is marked as failed after four attempts (set by mapreduce.map.maxattempts for map tasks and mapreduce.reduce.maxattempts for reduce tasks). A sketch of these properties is shown below.
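The properties named above would normally be set in mapred-site.xml; the values below simply restate the defaults mentioned in the text (the timeout is in milliseconds, i.e. ten minutes):
<property>
  <name>mapreduce.map.maxattempts</name>
  <value>4</value>
</property>
<property>
  <name>mapreduce.reduce.maxattempts</name>
  <value>4</value>
</property>
<property>
  <name>mapreduce.task.timeout</name>
  <value>600000</value>
</property>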
An application master sends periodic heartbeats to the resource manager, and in the event of
application master failure, the resource manager will detect the failure and start a new instance of
the master running in a new container (managed by a node manager)
In the case of the MapReduce application master, it can recover the state of the tasks that had
already been run by the (failed) application so they don't have to be rerun. By default, recovery is not enabled, so a failed application master will rerun all of its tasks, but you can turn recovery on by setting yarn.app.mapreduce.am.job.recovery.enable to true.
The client polls the application master for progress reports, so if its application master fails the
client needs to locate the new instance.
During job initialization the client asks the resource manager for the application master’s address,
and then caches it, so that it doesn't overload the resource manager with a request every time it needs to poll the application master.
If the application master fails, however, the client will experience a timeout when it issues a status
update, at which point the client will go back to the resource manager to ask for the new
application master’s address.
10.6 Shuffling
In Hadoop, the process by which the intermediate output from the mappers is transferred to the reducers is called shuffling. Each reducer gets one or more keys and their associated values, depending on the partitioning. The intermediate key-value pairs generated by the mapper are sorted automatically by key. What are shuffling and sorting in Hadoop MapReduce?
Before we start with Shuffle and Sort in MapReduce, let us revise the other phases of MapReduce
like Mapper, reducer in MapReduce, Combiner, partitioner in MapReduce and inputFormat in
MapReduce.
The Mapper task is the first phase of processing; it processes each input record (from the RecordReader) and generates an intermediate key-value pair. The Hadoop Mapper stores this intermediate output on the local disk. In this section, we look at what a MapReduce Mapper is, how key-value pairs are generated in Hadoop, what InputSplit and RecordReader are, and how the mapper works in Hadoop.
Reducer in MapReduce
Reducer in Hadoop MapReduce reduces a set of intermediate values which share a key to a
smaller set of values. In the MapReduce job execution flow, the Reducer takes a set of intermediate key-value pairs produced by the mapper as input, processes each of them, and generates the output. The output of the reducer is the final output, which is stored in HDFS. Usually, in the Hadoop Reducer, we perform aggregation or summation-style computation.
Combiner
A Combiner, also known as a semi-reducer, is an optional class that operates by accepting the inputs from the Map class and thereafter passing the output key-value pairs to the Reducer class. The main function of a Combiner is to summarize the map output records with the same key.
Partitioner in MapReduce
The Partitioner in MapReduce controls the partitioning of the key of the intermediate mapper
output. By a hash function, the key (or a subset of the key) is used to derive the partition. The total number of partitions depends on the number of reduce tasks.
InputFormat in MapReduce
Hadoop can process many different types of data formats, from flat text files to databases; the different formats available are explored in the next chapter.
Working of Shuffle and Sort
Shuffle and sort are core (and hard) steps of every MapReduce job; every MapReduce job goes through a shuffle and sort phase.
Map: the map processes the input key and value; the map output is then sorted and transferred to the reducer, and that transfer is known as the shuffle.
1) The input is processed by the map task.
2) The output is not directly written to disk but is written to a memory buffer.
3) The size of the buffer is decided by the property io.sort.mb.
4) The default size is 100 MB.
5) If the map has more output, it may fill up the buffer, in which case the map is paused for a while until the spill empties the buffer.
6) After the spill completes, the map output may again reach the threshold.
7) In that case, another spill is written, in round-robin fashion.
8) These spills are written to the directory specified by the property mapred.local.dir.
9) So there can be many spills before the last key-value pair has been written by the map task.
10) Each spill is partitioned and sorted by key, and is run through a combiner if a combiner function is configured for the job.
11) Once the map has finished processing all the records, all the spills are merged into one output file, which is partitioned and sorted.
12) If three or more spills are merged together, the combiner function is run again over the final output.
13) Remember that the combiner function can run many times without changing the final result.
14) The combiner function reduces the size of the output, which is an advantage, as less data will need to be transferred to the reducer machine.
15) If the map output is going to be very large, it is recommended to compress the map output to reduce the amount of data.
16) This can be done by setting the property mapred.compress.map.output to true.
17) The compression scheme can be specified by the property mapred.map.output.compression.codec.
18) After this comes the copy phase: there will be many map tasks running, and they may finish at different times.
19) As soon as they finish, they notify the jobtracker or the application master, which asks the reducer to copy the results to its local disk, and so the partitions are copied by the reducer over the network.
• After this comes the sort phase: the reducer merges the map outputs, which are then fed into the reducer to create the final result. The mechanism in the sort phase is a little more involved. The property which plays an important role here is the merge factor, io.sort.factor; its default value is 10, and it signifies how many files can be merged in one go. Suppose the reducer receives 30 files from different maps; these can be merged in rounds of at most ten files, creating intermediate merge files, and in the final round the result is fed directly into the reducer. Note that the files also need to be sorted by key. To increase disk I/O efficiency, the actual algorithm behaves slightly differently: it picks the first three files and merges them into one, then picks up the next two batches of ten; in the final round it takes the remaining seven files together with the three intermediate merge files, so that exactly ten files are merged and fed to the reducer. Doing it like this increases the I/O efficiency.
Objective
In Hadoop, the process by which the intermediate output from the mappers is transferred to the reducers is called shuffling. Each reducer gets one or more keys and their associated values, depending on the partitioning, and the intermediate key-value pairs generated by the mapper are sorted automatically by key. In this section, we discuss shuffling and sorting in Hadoop MapReduce in detail. The shuffle phase in Hadoop transfers the map output from the Mapper to a Reducer in MapReduce. The sort phase in MapReduce covers the merging and sorting of map outputs: data from the mapper are grouped by key, split among reducers, and sorted by key, so that every reducer obtains all values associated with the same key. The shuffle and sort phases in Hadoop occur simultaneously and are performed by the MapReduce framework.
Shuffling in MapReduce
The process of transferring data from the mappers to reducers is known as shuffling i.e. the process
by which the system performs the sort and transfers the map output to the reducer as input. So,
MapReduce shuffle phase is necessary for the reducers, otherwise, they would not have any input
(or input from every mapper). As shuffling can start even before the map phase has finished so this
saves some time and completes the tasks in lesser time.
Sorting in MapReduce
The keys generated by the mapper are automatically sorted by MapReduce Framework, i.e. Before
starting of reducer, all intermediate key-value pairs in MapReduce that are generated by mapper
get sorted by key and not by value. Values passed to each reducer are not sorted; they can be in
any order.Sorting in Hadoop helps reducer to easily distinguish when a new reduce task should
start. This saves time for the reducer. Reducer starts a new reduce task when the next key in the
sorted input data is different than the previous. Each reduce task takes key-value pairs as input
and generates key-value pair as output.Note that shuffling and sorting in Hadoop MapReduce is
not performed at all if you specify zero reducers (setNumReduceTasks(0)). Then, the MapReduce
job stops at the map phase, and the map phase does not include any kind of sorting (so even the
map phase is faster).
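To tie the phases above together, a driver might wire up the mapper, combiner, and reducer as sketched below. WordCountMapper and WordCountReducer are hypothetical classes assumed to implement the usual word-count logic, and the commented-out line shows how specifying zero reducers turns the job into a map-only job with no shuffle or sort:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);     // hypothetical mapper class
        job.setCombinerClass(WordCountReducer.class);  // combiner summarises map output locally
        job.setReducerClass(WordCountReducer.class);   // hypothetical reducer class

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // job.setNumReduceTasks(0);  // zero reducers: map-only job, shuffle and sort are skipped

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}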
Summary
• Hadoop's unit tests are all built to run on a single computer rather than on a Hadoop cluster; testing on a running cluster is addressed by projects such as Apache Bigtop. The unit tests work by creating miniDFS, MiniYARN, and MiniMR clusters as needed, all of which execute the code of the respective services.
• Hadoop has been using JUnit 4 for some time, yet it appears that many new tests are still being written for JUnit 3.
• Apache MRUnit is a Java package for unit testing Apache Hadoop MapReduce jobs. The example in this unit uses the Weather dataset, working with the year and temperature extracted from it; obviously, you can simply adapt the example to your own data.
• Hadoop includes a RecordReader that transforms input splits into key-value pairs using
TextInputFormat. In the mapping process, the key-value pairs are utilised as inputs. The
only data format that a mapper can read and understand is this one.
• Hadoop applications use the Hadoop Distributed File System (HDFS) as their primary data
storage system. HDFS is a distributed file system that uses a NameNode and DataNode
architecture to allow high-performance data access across highly scalable Hadoop clusters.
• Hadoop is an Apache Foundation open source framework for processing huge volumes of
heterogeneous data sets in a distributed way across clusters of commodity computers and
hardware using a simplified programming style. Hadoop is a dependable distributed
storage and analysis solution.
• If a task fails, Hadoop will identify the failure and reschedule a replacement on a healthy computer. It will only give up on the task if it fails four times (the default setting, which may be changed), at which point the whole job is marked as failed.
• Namenode stores information about all other nodes in the Hadoop Cluster, files in the
cluster, file component blocks and their positions in the cluster, and other information that is
necessary for the Hadoop Cluster's functioning.
• Job Tracker manages the sharing of information and outcomes by keeping track of the
specific tasks/jobs allocated to each of the nodes.
• The CapacityScheduler was created to allow huge clusters to be shared while ensuring that
each organisation has a minimum capacity guarantee. The key principle is that the Hadoop
Map-Reduce cluster's available resources are partitioned among various companies that
finance the cluster collectively based on computation demands.
Keywords
Apache MRUnit: Apache MRUnit TM is a Java package for unit testing Apache Hadoop map
reduce tasks. The post's example uses the Weather dataset, and it works with the year and
temperature retrieved from it. Obviously, you can simply adapt the example to your own data.
Hadoop: Hadoop is an open-source software framework for storing and processing data on
commodity hardware clusters. It has a lot of storage for any sort of data, a lot of processing power,
and it can perform almost unlimited concurrent processes or jobs.
Job Tracker:The job tracker is a master daemon that runs on the same node as the data nodes and
manages all of the jobs. This data will be stored on multiple data nodes, but it is the task tracker's
responsibility to keep track of it.
FIFO: As the name implies, FIFO stands for First In First Out, which means that the tasks or
applications that arrive first are served first. In Hadoop, this is the default Schedular. The jobs are
placed in a queue and completed in the order in which they were submitted.
Capacity Scheduler:The CapacityScheduler was created to allow huge clusters to be shared while
ensuring that each organisation has a minimum capacity guarantee. The key principle is that the
Hadoop Map-Reduce cluster's available resources are partitioned among various companies that
finance the cluster collectively based on computation demands.
Fair scheduling is a method of allocating resources to apps in such a way that each app receives an
equal proportion of resources over time. Hadoop NextGen can schedule a variety of resource kinds.
The Fair Scheduler's scheduling fairness judgments are based only on memory by default.
Task Failure: If a task fails, Hadoop will identify the failure and reschedule a replacement on a healthy computer. It will only give up on the task if it fails four times (the default setting, which may be changed), at which point the whole job is marked as failed.
Child JVM: The parent MRAppMaster's environment is passed down to the child task. The mapreduce.map.java.opts and mapreduce.reduce.java.opts configuration properties in the Job can be used to pass the child JVM extra options; for example, -Djava.library.path=<path> can be used to specify non-standard paths for the runtime linker to search for shared libraries. If the symbol @taskid@ is present in the mapreduce.map.java.opts or mapreduce.reduce.java.opts properties, it is interpolated with the taskid value of the MapReduce task.
Name node: The name node is a piece of commodity hardware that houses the GNU/Linux
operating system as well as name node software. It's a piece of software that can run on standard
hardware.
Shuffling and sorting:The process of transferring intermediate output from the mapper to the
reducer is known as shuffle. On the basis of reducers, a reducer receives one or more keys and their
associated values. The mapper's intermediated key – value is automatically ordered by key.
Self Assessment
1. Testing the entire system's end-to-end functioning is characterized as
A. Functional testing
B. Unit Testing
C. Stress Testing
D. Load Testing
2. What is testing?
A. Finding broken code.
B. Evaluating deliverable to find errors.
C. A stage of all projects.
D. All of the above.
6. _________ is a processing technique and a program model for distributed computing based
on java.
A. Composing
B. Decomposing
C. MapReduce
D. None of above
7. __________ a data processing application into mappers and reducers is sometimes nontrivial.
A. Composing
B. Decomposing
C. MapReduce
D. None of above
8. Which of the following methods causes the call to return only when the job is finished, returning its success or failure status, which can be used to determine whether further steps are to be run or not?
A. Waitforfinished()
B. waitForCompletion()
C. Both
D. None of the above
9. Which of the following specifies the environment variables that affect the JDK used by Hadoop
Daemon (bin/hadoop).
A. core-site.xml
B. hadoop-env.sh
C. hdfs-site.xml
D. mapred-site.xml
10. Which of the following is an important configuration file required for the runtime environment settings of a Hadoop cluster, and also informs Hadoop daemons where the NAMENODE runs in the cluster?
A. core-site.xml
B. hadoop-env.sh
C. hdfs-site.xml
D. mapred-site.xml
11. _________ is responsible for scheduling tasks, monitoring them, and re-executes the failed task.
A. Hadoop MapReduce
B. Yarn
C. Hive
D. Pig
13. In Hadoop, the process by which the intermediate output from mappers is transferred to the
reducer is called _________.
A. Shuffling
B. Sorting
C. Both
D. None of above
14. Which of the following’s tasks are the first phase of processing that processes each input
record and generates an intermediate key-value pair?
A. Reducer task
B. Mapper Task
C. Compress Task
D. None of above
6. C 7. B 8. B 9. B 10. A
Review Questions
1. Explain all unit testing techniques.
2. Explain the three core classes of MRUnit.
3. Explain developing and testing MapReduce jobs with MRUnit.
4. Diagrammatically explain shuffle and sort concepts.
5. Explain three kinds of failure in MapReduce.
Further Readings
• Maheshwari, Anil. Big Data. McGraw-Hill Education, 2019.
• Mayer-Schonberger, Viktor; Cukier, Kenneth (2013). Big Data: A Revolution That Will
Transform How We Live, Work, and Think . Houghton Mifflin Harcourt.
• McKinsey Global Institute Report (2011). Big Data: The Next Frontier For
Innovation, Competition, and Productivity. Mckinsey.com
• Marz, Nathan, and James Warren (2015). Big Data: Principles and Best Practices of
Scalable Realtime Data Systems. Manning Publications.
• Sandy Ryza, Uri Laserson et al. (2014). Advanced Analytics with Spark. O'Reilly.
• White, Tom (2014). Mastering Hadoop. O'Reilly.
Web Links
1. Apache Hadoop resources: https://hadoop.apache.org/docs/r2.7.2/
2. Apache HDFS: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
3. Hadoop API site: http://hadoop.apache.org/docs/current/api/
4. NoSQL databases: http://nosql-database.org/
5. Apache Spark: http://spark.apache.org/docs/latest/
Objectives
• explore the concepts of HIVE.
• understand architecture of HIVE.
• explore concepts and architecture of Apache pig
• understand Pig-Latin data types, applications and features of Pig
• learn operators in Apache pig.
• learn services offered by HIVE
• Learn fundamentals of Hbase
• Explore concepts of ZooKeeper
• understand IBM InfoSphere Streams
• learn a new paradigm for information processing, learn powerful, real-time analytic processing made simple
• Explore Enterprise integration and concepts of scale-out architecture.
Introduction
Apache Hive is an open-source data warehousing solution built on the Hadoop platform. Hive is a database that may be used to analyse and query huge datasets contained in Hadoop files, and it may be used to process both structured and semi-structured data. On the Big Data landscape, Apache Hive is one of the most used data warehouse components; its interface is mostly used to supplement the Hadoop file system. Hive was created by Facebook and is currently maintained by the Apache Software Foundation as Apache Hive. Netflix and Amazon are among the companies that utilise and develop it; Amazon, for example, uses it in Amazon Elastic MapReduce. Hive is a data warehouse infrastructure solution for Hadoop that allows you to handle structured data. It resides on top of Hadoop to summarize Big Data and make querying and analysis easy.
Hive is not:
• a language for real-time queries
• a tool for row-level updates
• User Interface: This is where the end user interacts with Hive in order for the data to be processed. There are various ways to interface with Hive, including the Web UI and the Hive CLI, which is included with the Hive package. We may also use the Thrift client, JDBC client, and ODBC client. Hive also offers services such as the Hive CLI, Beeline, and others.
• Hive Query Processing Engine: The query entered via the user interface is parsed by the Hive compiler. It uses information contained in the metastore to verify semantic and syntactic accuracy. Finally, it generates an execution plan in the form of a DAG (Directed Acyclic Graph), with each stage representing a MapReduce task, as shown in Figure 4.
Execution Engine: The execution engine is where the actual processing of the data starts. After the compiler has checked the syntax and performed its optimizations, the resulting execution plan is handed to the execution engine. Several execution engines can be used with Hive; MapReduce is one of them, but it is slower than the other engines, so we can switch the execution engine to Tez or Spark. To change the execution engine we can use one of the commands below:
set hive.execution.engine=spark;
set hive.execution.engine=tez;
set hive.execution.engine=mr;
Metastore:
The metastore can be used in two modes: remote mode and embedded mode (Figure 5: Two modes of metastore).
• Remote mode: In this mode the metastore is a Thrift service, which can be used in the case of non-Java applications.
• Embedded mode: In this case the client can directly interact with the metastore using JDBC.
HDFS/HBase Storage:
Hive does not store data itself. Hive only analyses data and loads it into tables; the data itself is kept on storage systems such as HDFS, HBase, or S3. Hive creates tables that point to the location of the data in any of the above storage systems, and the data is accessed from there.
• UI (User Interface): The user interface is for users to submit queries and other operations to the system.
• Driver: This is the component that receives the requests. This component supports the concept of session handles
and provides JDBC/ODBC-style execute and fetch APIs.
• Compiler: The component that parses the query does semantic analysis on the various query blocks and query
expressions, and then creates an execution plan using information from the Metastore for table and partition
metadata.
• Metastore: The component that holds all of the structural information for the warehouse's different tables and
partitions, including column and column type information, serializers and de-serializers for reading and writing
data, and the HDFS files where the data is kept.
• Execution Engine: The component responsible for carrying out the compiler's execution plan. The strategy is
organized as a DAG of phases. The execution engine coordinates the dependencies between these various plan
stages and performs them on the relevant system components.
• This metadata is used to type check the expressions in the query tree as well as to prune partitions based on query predicates. The plan generated by the compiler is a DAG of stages, with each stage being either a map/reduce job, a metadata operation or an operation on HDFS. For map/reduce stages, the plan contains map operator trees (operator trees that are executed on the mappers) and a reduce operator tree (for operations that need reducers).
• Step 6: The execution engine submits these stages to the appropriate components (steps 6, 6.1, 6.2 and 6.3). In each task (mapper/reducer) the deserializers associated with the table or intermediate outputs are used to read the rows from HDFS files, and these are passed through the associated operator tree. Once the output is generated, it is written to a temporary HDFS file through the serializers (this happens in the mapper in case the operation does not need a reduce). The temporary files are used to provide data to subsequent map/reduce stages of the plan. For DML operations the final temporary file is moved to the table's location.
• Step 7 & 8 & 9: For queries, the contents of the temporary file are read by the execution engine directly from HDFS
as part of the fetch call from the Driver.Once the output is generated, it is written to a temporary HDFS file though
the serializers (this happens in the mapper in case the operation does not need a reduce). The temporary files are
used to provide data to subsequent map/reduce stages of the plan. For DML operations the final temporary file is
moved to the table’s location.
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to
summarize Big Data, and makes querying and analyzing easy.Initially Hive was developed by Facebook, later the Apache
Software Foundation took it up and developed it further as an open source under the name Apache Hive. It is used by
different companies. For example, Amazon uses it in Amazon Elastic MapReduce.
In Pig, there is a language we use to analyze data in Hadoop. That is what we call Pig Latin. Also, it
is a high-level data processing language that offers a rich set of data types and operators to perform
several operations on the data.Moreover, in order to perform a particular task, programmers need
to write a Pig script using the Pig Latin language and execute them using any of the execution
mechanisms (Grunt Shell, UDFs, Embedded) using Pig. To produce the desired output, these
scripts will go through a series of transformations applied by the Pig Framework, after execution.
Further, Pig converts these scripts into a series of MapReduce jobs internally. Therefore, it makes
the programmer’s job easy. Here is the architecture of Apache Pig.
a. Parser: At first, all the Pig Scripts are handled by the Parser. Parser basically checks the
syntax of the script, does type checking, and other miscellaneous checks. Afterwards,
Parser’s output will be a DAG (directed acyclic graph) that represents the Pig Latin
statements as well as logical operators.The logical operators of the script are represented
as the nodes and the data flows are represented as edges in DAG (the logical plan)
Figure 9 Parser
b. Optimizer: Afterwards, the logical plan (DAG) is passed to the logical optimizer. It carries
out the logical optimizations further such as projection and push down.
c. Compiler: The compiler then compiles the optimized logical plan into a series of MapReduce jobs.
d. Execution Engine: Eventually, all the MapReduce jobs are submitted to Hadoop in a
sorted order. Ultimately, it produces the desired results while these MapReduce jobs are
executed on Hadoop.
Figure 10 Atom
a. Atom: An atom is any single value in Pig Latin, irrespective of its data type. Basically, we can use it as a string or a number, and it is stored as a string. The atomic values of Pig are int, long, float, double, chararray, and bytearray. A field is a piece of data or a simple atomic value in Pig. For example: ‘Shubham’ or ‘25’.
b. Bag: An unordered set of tuples is what we call a bag. To be more specific, a bag is a collection of (non-unique) tuples, and each tuple can have any number of fields (flexible schema). Generally, we represent a bag by ‘{}’. It is similar to a table in an RDBMS, but it is not necessary that every tuple contain the same number of fields, or that fields in the same position (column) have the same type.
Figure 11 Bag
(Table: Pig Latin data types, with columns Data Type, Description, and Example.)
3. Tuple: A tuple is basically a single row, written within round brackets. A bag uses curly braces and contains many tuples inside it. So a tuple is represented using round brackets and a bag is represented using curly brackets.
4. Map: A map is basically a set of key-value pairs, in which each key is separated from its value by a hash tag (#). The hash tag is the separator between a map's keys and values.
5. Null values: Values for all the above data types can be NULL. Apache Pig treats null values in a similar way as SQL does. A null can be an unknown value or a non-existent value. It is used as a placeholder for optional values. These nulls can occur naturally or can be the result of an operation.
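For illustration, the following hypothetical Pig Latin literals show how each of these types is written (the names and values are made up for this sketch, not taken from the text):
( 'Shubham', 25 )                      -- a tuple: one row, in round brackets
{ ('Shubham', 25), ('Pulkit', 35) }    -- a bag: an unordered collection of tuples, in curly brackets
[ 'name'#'Shubham', 'age'#25 ]         -- a map: key#value pairs, in square brackets
'Shubham'                              -- an atom: a simple chararray field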
Local Mode
In local mode, the source data is taken from a directory on your computer's local file system. The ‘pig -x local’ command can be used to specify local mode.
MapReduce Mode
You'll need access to a Hadoop cluster and an HDFS installation to run Pig in MapReduce mode.
The 'pig' command may be used to specify the MapReduce mode.
Relational Operators
Pig Latin's main tools for manipulating data are relational operators. Sorting, grouping, merging,
projecting, and filtering the data are all options. The basic relational operators are:-
LOAD
The LOAD operator is used to insert data into a Pig relation from a file system or HDFS storage.
FOREACH
Based on data columns, this operator creates data transformations. It's used to make changes to a
relation's fields. To interact with data columns, use the FOREACH-GENERATE method.
FILTER
This operator uses a condition to pick tuples from a relation.
JOIN
The JOIN operator is used to accomplish an inner, equijoin join of two or more relations based on
field values that are the same. An inner join is always performed by the JOIN operator. Null keys
are ignored by inner joins, therefore filtering them out before the join makes logical sense.
ORDER BY
Order By is a feature that allows you to sort a relation by one or more fields. Using the ASC and
DESC keywords, you may sort in ascending or descending order.
DISTINCT
In a relation, distinct eliminates duplicate tuples. Consider the following input file, which contains
amr,crap,8 and amr,myblog,10 twice. Duplicate items are deleted when distinct is applied to the
data in this file.
STORE
The term "store" refers to the process of saving results to a file system.
GROUP
The GROUP operator joins tuples that have the same group key (key field). If the group key
includes more than one field, the key field will be a tuple; otherwise, it will be the same type as the
group key. A GROUP operation produces a relation that has one tuple per group.
CROSS
To compute the cross product (Cartesian product) of two or more relations, use the CROSS
operator.
LIMIT
To restrict the number of output tuples, use the LIMIT operator. The output will include all tuples
in the relation if the provided number of output tuples is equal to or more than the number of
tuples in the relation.
SPLIT
The SPLIT operator divides a relation's contents into two or more relations depending on an expression, in accordance with the conditions indicated in the statement.
Diagnostic Operators
DUMP
The DUMP operator executes Pig Latin commands and displays the results on the screen.
DESCRIBE
To study the schema of a specific relation, use the DESCRIBE operator. When troubleshooting a
script, the DESCRIBE operator is very useful.
ILLUSTRATE
The ILLUSTRATE operator is used to see how data is changed using Pig Latin statements. When it
comes to debugging a script, the ILLUSTRATE command is your best friend. This command alone
could be enough to convince you to use Pig instead of something else.
EXPLAIN
The EXPLAIN operator prints the logical and physical plans.
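As a hedged illustration of how several of these operators fit together, the sketch below assumes a hypothetical tab-separated file 'student_data.txt' (columns: id, name, city) is available to Pig; the file name and schema are invented for this example and are not from the text:
-- load the hypothetical input file and declare its schema
students = LOAD 'student_data.txt' USING PigStorage('\t') AS (id:int, name:chararray, city:chararray);
-- keep only the students from one city and inspect the result
delhi = FILTER students BY city == 'Delhi';
DUMP delhi;
-- count students per city across the whole input
grouped = GROUP students BY city;
counts = FOREACH grouped GENERATE group AS city, COUNT(students) AS total;
ordered = ORDER counts BY total DESC;
DESCRIBE counts;
-- save the sorted counts back to storage
STORE ordered INTO 'city_counts' USING PigStorage(',');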
11.10 Introduction to HIVEQL
Apache Hive is an open-source data warehousing platform that may be used to execute distributed
processing and analysis. It was created by Facebook to make creating the Java MapReduce
application easier. The Hive Query language, which is a declarative language comparable to SQL,
is used by Apache Hive. Hive converts Hive queries into MapReduce programmes. It allows developers to process and analyse structured and semi-structured data by removing the need to write sophisticated Java MapReduce code, and it cuts down on the number of programmes you have to write. Hive queries are simple to create for someone who is familiar with SQL commands.
Hive makes it simple to conduct procedures such as analysis of huge datasets, ad-hoc queries, and data encapsulation. The major components of Apache Hive are: Hive Client, Hive Services, Processing and Resource Management, and Distributed Storage.
For running queries on the Hive, Hive supports applications written in any language,
including Python, Java, C++, Ruby, and others, using JDBC, ODBC, and Thrift drivers. As a
result, writing a hive client application in any language of one's choice is simple.
Beeline
Beeline is the command-line interface of HiveServer2, a newly launched Hive component. Recently, the Hive community introduced HiveServer2, an enhanced Hive server designed for multi-client concurrency and improved authentication that also provides better support for clients connecting through JDBC and ODBC.
Hive Server2
HiveServer2 is HiveServer1's successor. Clients can use HiveServer2 to run queries against
Hive. It enables numerous clients to send Hive requests and retrieve the results. Its primary
purpose is to provide the greatest possible support for open API clients such as JDBC and
ODBC.The Thrift-based Hive service is at the heart of HS2 and is in charge of handling Hive
queries (e.g., from Beeline). Thrift is a cross-platform RPC framework for creating services.
Server, Transport, Protocol, and Processor are the four levels that make up its stack.
Hive Driver
The Hive driver accepts the HiveQL statements entered into the command shell by the user. It generates the query's session handles and sends the query to the compiler. (There is also a JavaScript driver for connecting to Apache Hive via the Thrift API; it can connect with SASL authentication mechanisms such as LDAP, PLAIN and Kerberos using both HTTP and TCP transport.)
Hive Compiler
The query is parsed by the Hive compiler. It uses the metadata stored in the metastore to do
semantic analysis and type-checking on the various query blocks and query expressions, and then
generates an execution plan.Hive Compiler is a tool that allows you to compile data in Hive. The
DAG (Directed Acyclic Graph) is the execution plan generated by the compiler, with each step
consisting of a map/reduce job, an HDFS action, and a metadata operation.Optimizer separates the
job and performs transformation operations on the execution plan to increase efficiency and
scalability.
The compiler communicates the proposed query execution plan to the driver.
Optimizer
The optimizer performs transformation operations on the execution plan and splits tasks to improve efficiency and scalability. For example, setting hive.optimize.bucketmapjoin=true hints to Hive to do a bucket-level join during the map-stage join; it also reduces the scan cycles needed to find a particular key, because bucketing ensures that the key is present in a specific bucket. The optimizer transforms an execution plan in order to produce an optimal Directed Acyclic Graph (DAG). To improve performance and scalability, transformations such as converting a pipeline of joins into a single join, and job separation, such as applying a transformation on data before a reduce operation, are used. Optimization Techniques:
Let's go over each of the Hive optimization strategies for Hive Performance Tuning one by one:
b. Use a suitable file format (ORC): ORC (Optimized Row Columnar) files store data in a more optimized way than the other file formats. To be more specific, ORC reduces the size of the original data by up to 75%, and hence data processing speed also increases. Compared to the Text, Sequence and RC file formats, ORC shows better performance. Basically, it contains row data in groups, called stripes, along with a file footer. Therefore, when Hive is processing the data, the ORC format improves the performance.
c. Hive Partitioning: Without partitioning, Hive reads all the data in the directory and then applies the query filters to it. Since all of the data has to be read, this is slow as well as expensive. Users also frequently need to filter the data on specific column values, and to apply partitioning in Hive they need to understand the domain of the data on which they are doing analysis. With partitioning, all the entries for the various column values of the dataset are segregated and stored in their respective partitions. Hence, when we write a query to fetch values from the table, only the required partitions of the table are queried, which reduces the time taken by the query to yield the result.
d. Bucketing in Hive: Hive Optimization Techniques, let’s suppose a scenario. At times, there is a
huge dataset available. However, after partitioning on a particular field or fields, the partitioned
file size doesn’t match with the actual expectation and remains huge.Still, we want to manage
the partition results into different parts. Thus, to solve this issue of partitioning, Hive offers
Bucketing concept. Basically, that allows the user to divide table data sets into more
manageable parts.Hence, to maintain parts that are more manageable we can use Bucketing.
Through it, the user can set the size of the manageable parts or Buckets too.
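As a rough sketch of how partitioning and bucketing are declared in HiveQL, the table below is hypothetical (the table name and columns are invented for this example, not taken from the text):
-- hypothetical sales table: partitioned by date, bucketed by customer id
CREATE TABLE sales (
  id INT,
  customer_id INT,
  amount DOUBLE
)
PARTITIONED BY (sale_date STRING)
CLUSTERED BY (customer_id) INTO 8 BUCKETS
STORED AS ORC;

-- a query that filters on the partition column reads only the matching partitions
SELECT SUM(amount) FROM sales WHERE sale_date = '2021-01-15';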
e. Vectorization in Hive: To improve the performance of operations we use vectorized query execution. Here operations refer to scans, aggregations, filters, and joins. Vectorization performs them in batches of 1024 rows at once instead of a single row each time. This feature was introduced in Hive 0.13. It significantly improves query execution time, and is easily enabled with two parameter settings:
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
f. Cost-Based Optimization in Hive (CBO): Before submitting a query for final execution, Hive optimizes the query's logical and physical execution plan. Until CBO, these optimizations were not based on the cost of the query. CBO, a recent addition to Hive, performs further optimizations based on query cost, which results in potentially different decisions: how to order joins, which type of join to perform, the degree of parallelism, and others. To use CBO, set the following parameters at the beginning of your query:
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
Then, prepare the data for CBO by running Hive’s “analyze” command to collect various statistics on the tables for which we want to use CBO.
g. Hive Indexing: One of the best optimization techniques is indexing, which will definitely help to increase query performance. Basically, indexing creates a separate table, called an index table, which acts as a reference for the original table. As we know, a Hive table has many rows and columns. Without indexing, queries on only some columns still take a large amount of time, because they are executed against all of the columns present in the table. The major advantage of indexing is that a query on an indexed table does not need to scan all the rows: it checks the index first and then goes to the particular column and performs the operation. Hence, with indexes maintained, a Hive query can look into the indexes first and then perform the needed operations in a smaller amount of time.
h. Execution Engine: The execution engine uses Hadoop to execute the execution plan created by
the compiler in order of their dependencies following the compilation and optimization
processes.
i. MetaStore: The metastore holds serializer and deserializer metadata, which is essential for read/write operations, as well as the HDFS files where data is kept. In most cases, this metastore is a relational database. For searching and altering Hive metadata, the metastore provides a Thrift interface. The metastore can be configured in one of two ways:
Remote: In remote mode, the metastore is a Thrift service, which is suitable for non-Java applications.
Embedded: In embedded mode, the client can use JDBC to interface directly with the metastore.
j. HCatalog: HCatalog is Hadoop's table and storage management layer. It allows users to read and write data on the grid more simply using various data processing tools such as Pig, MapReduce, and others. It is based on the Hive metastore and exposes the Hive metastore's tabular data to other data processing tools. Users can have a relational view of data in the Hadoop distributed file system (HDFS) thanks to HCatalog's table abstraction, which means they don't have to worry about where or how their data is stored (RCFile format, text files, SequenceFiles, or ORC files). HCatalog supports reading and writing files in any format for which a SerDe (serializer-deserializer) can be written. By default, HCatalog supports the RCFile, CSV, JSON, SequenceFile, and ORC file formats. To use a custom format, you must provide the InputFormat, OutputFormat, and SerDe.
k. WebHCat: For HCatalog, WebHCat is the REST API. It's a Hive metadata operations HTTP
interface. It lets users perform Hadoop MapReduce (or YARN), Pig, and Hive jobs.Developers
use HTTP requests to access Hadoop MapReduce (or YARN), Pig, Hive, and HCatalog DDL
from within applications, as demonstrated in the diagram below. HDFS stores the data and
code utilised by this API. When HCatalog DDL commands are requested, they are immediately
executed. WebHCat (Templeton) servers queue MapReduce, Pig, and Hive jobs, which may be
monitored for progress or cancelled as needed. Pig, Hive, and MapReduce results are stored in
HDFS, and developers select where they should be stored.
11.11 HIVEQL
Hive Query Language (HiveQL) is a query language for Hive that allows you to process and
analyse structured data in a Metastore.A table's data is retrieved using the SELECT command.
The WHERE clause functions in the same way as a condition. It applies the condition to the
data and returns a finite result. The built-in operators and functions provide an expression that
meets the criteria.The SELECT query's syntax is as follows:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
228
Notes
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];
JOIN clause
JOIN is a clause that is used for combining specific fields from two tables by using values common
to each one. It is used to combine records from two or more tables in the database . Syntax:
join_table:
table_reference JOIN table_factor [join_condition]
| table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference
join_condition
| table_reference LEFT SEMI JOIN table_reference join_condition
| table_reference CROSS JOIN table_reference [join_condition]
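To make the syntax above concrete, here is a hedged sketch that assumes two hypothetical tables, customers(id, name, city) and orders(id, customer_id, amount), already exist in Hive; the table and column names are invented for this example:
-- total order amount per customer in one city, using SELECT, JOIN, WHERE, GROUP BY and ORDER BY
SELECT c.name, SUM(o.amount) AS total_amount
FROM customers c
JOIN orders o ON (c.id = o.customer_id)
WHERE c.city = 'Delhi'
GROUP BY c.name
ORDER BY total_amount DESC
LIMIT 10;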
HBase
Since 1970, RDBMS has been the go-to option for data storage and maintenance issues. Companies
discovered the value of processing big data after the introduction of big data and began to use
technologies such as Hadoop.Hadoop stores massive data in a distributed file system and
processes it using MapReduce. Hadoop excels at storing and processing large amounts of data in a
variety of formats, including random, semi-structured, and unstructured data.Tables in HBase are
divided into regions and served by region servers. Column families divide regions vertically into
"Stores." In HDFS, stores are saved as files. HBase's architecture is depicted below.
Architecture of HBase
HBase provides low-latency random reads and writes on top of HDFS. In HBase, tables are
dynamically distributed by the system whenever they become too large to handle (Auto Sharding).
The simplest and foundational unit of horizontal scalability in HBase is a Region. A continuous,
sorted set of rows that are stored together is referred to as a region (subset of table data). HBase
architecture has a single HBase master node (HMaster) and several slaves i.e. region servers. Each
region server (slave) serves a set of regions, and a region can be served only by a single region
server. Whenever a client sends a write request, HMaster receives the request and forwards it to the
corresponding region server. HBase can be run in a multiple master setup, wherein there is only
single active master at a time. HBase tables are partitioned into multiple regions with every region
storing multiple table’s rows.
MasterServer
• The master server assigns regions to region servers with the help of Apache ZooKeeper. It handles region load balancing across region servers: it moves regions to less crowded servers after unloading the congested ones, and by negotiating load balancing it keeps the cluster in good shape. It is also in charge of schema modifications and other metadata operations, such as the creation of tables and column families.
Regions
• Tables are broken up and distributed across region servers to form regions.
• The regions on a region server handle data-related actions and communicate with the client, handle read and write requests for all the regions beneath the server, and follow the region size thresholds to determine a region's size.
When we take a closer look at the region server, we can see that it contains regions and stores. A store holds both the MemStore and HFiles. The MemStore works similarly to cache memory: everything that is entered into HBase is initially saved here. The data is then transferred and saved as blocks in HFiles, and the MemStore is flushed.
HBase architecture has three important components: HMaster, Region Server, and ZooKeeper.
Region Server
HBase Tables are separated into Regions horizontally by row key range. Regions are the
fundamental building blocks of an HBase cluster, consisting of a distribution of tables and Column
families. The Region Server runs on an HDFS DataNode in the Hadoop cluster. Region Server
regions are responsible for a variety of tasks, including handling, administering, and performing
HBase operations on that set of regions. A region's default size is 256 MB.
The Region Server runs on an HDFS DataNode and consists of the following components: Block Cache, MemStore, Write Ahead Log (WAL), and HFile.
• Block Cache – The read cache is located here. The read cache stores the most frequently
read data, and when the block cache is full, recently accessed material is evicted.
• MemStore- This is the write cache, which keeps track of new data that hasn't yet been
written to disc. A MemStore exists for each column family in a region.
• Write Ahead Log (WAL) – This is a file that keeps temporary data that has not yet been persisted to permanent storage.
• HFile is the storage file that contains the rows on a disc as sorted key values.
11.12 ZOOKEEPER
Zookeeper is an open-source project that provides services such as maintaining configuration information, naming, and providing distributed synchronization. Zookeeper has ephemeral nodes representing different region servers. Master servers use these nodes to discover available servers. In addition to availability, the nodes are also used to track server failures or network partitions. Clients communicate with region servers via Zookeeper. In pseudo-distributed and standalone modes, HBase itself will take care of Zookeeper.
In HBase, Zookeeper acts like a coordinator. It offers features such as configuration information management, naming, distributed synchronization, and server failure notification, among others. Clients use Zookeeper to communicate with region servers.
Advantages of HBase: it can store large data sets, the database can be shared, it is cost-effective from gigabytes to petabytes, and it offers high availability through failover and replication. Limitations of HBase: no support for an SQL structure, no transaction support, data is sorted only on the key, and there can be memory issues on the cluster.
IBM InfoSphere Streams
As a key enabler for this new generation of analytic processing methods, IBM® InfoSphere®
Streams provides a state-of-the-art computing platform that can help companies transform
burgeoning data into actionable information and business insights. InfoSphere Streams is a critical
component of the IBM Big Data Platform and delivers a highly scalable, agile software
infrastructure to perform in-motion analytics on a wide variety of relational and non-relational
data types entering the system at unprecedented volumes and speeds—and from thousands of
real-time sources. With InfoSphere Streams, organizations can capture and act on key business data
just in time, all the time.
These data streams can come from both structured and unstructured data sources, and they can
contain a wide range of digital data, including:
• Text files, spreadsheets, images, video and audio recordings
• Email, chat and instant messages, web traffic, blogs and social networking sites
• Financial transactions, customer service records, telephone usage records, system and application logs
• Data from satellites, GPS tracking, smart devices and network traffic sensors
• InfoSphere Streams brings these disparate data kinds together on a computing platform that
allows for advanced data analysis while maintaining great speed and response times.
Getting Started also makes it simple to install, build, configure, and manage application instances
with just a few clicks. Visual programming with drag-and-drop helps to shorten the learning curve
and accelerate application development.
Scale-out architecture
InfoSphere Streams software helps organizations extend their current IT investments without a
massive infrastructure overhaul. It scales from a single server to a virtually unlimited number of
nodes to process data of any volume—from terabytes to zettabytes. InfoSphere Streams provides a
clustered runtime environment that can easily handle up to millions of events per second with
microsecond latency. Actionable results can be achieved with near-zero latency. For improved
speed, the Advanced Compiler combines parts of the application and can distribute parts of the application across many hardware nodes. Ethernet and InfiniBand are among the high-speed transports it supports. Existing applications can be flexibly extended with new apps that access the same data streams, allowing current investments to be used even further. A web-based management console makes it easy to configure and manage the runtime and applications,
including automatically placing features and deploying application components. Applications and
their individual elements can be monitored for status and performance metrics to help ensure the
company attains its service-level agreements.
InfoSphere Streams also supports Complex Event Processing (CEP), which uses patterns to detect composite events in streams of basic events, providing high performance and rich analytics. Existing applications can be simply moved to an InfoSphere Streams environment to benefit from greater scalability and the capacity to handle up to 10 times more events per second on the same hardware.
Summary
• Hive is a SQL-based database that allows users to read, write, and manage petabytes of data.
Hive is based on Apache Hadoop, an open-source system for storing and processing massive
information effectively.
• Pig is a high-level programming language for Apache Hadoop. Pig allows data analysts to
create complicated data transformations without needing to know Java.
• The IBM InfoSphere Information Server is a prominent data integration technology that makes
understanding, cleansing, monitoring, and transforming data easier.
• HDFS is a distributed file system that runs on commodity hardware and can handle massive
data collections. It is used to grow an Apache Hadoop cluster from a few nodes to hundreds (or
even thousands) of nodes. HDFS is one of Apache Hadoop's primary components, along with
MapReduce and YARN.
• Hadoop includes a RecordReader that transforms input splits into key-value pairs using TextInputFormat. The key-value pairs are used as inputs in the mapping process; this is the only data format that a mapper can read and understand.
• Hadoop applications use the Hadoop Distributed File Solution (HDFS) as their primary data
storage system. HDFS is a distributed file system that uses a NameNode and DataNode
architecture to allow high-performance data access across highly scalable Hadoop clusters.
• Hadoop is an Apache Foundation open source framework for processing huge volumes of
heterogeneous data sets in a distributed way across clusters of commodity computers and
hardware using a simplified programming style. Hadoop is a dependable distributed storage
and analysis solution.
• If a task fails, Hadoop will identify the failure and reschedule a replacement on a healthy computer. It will only give up if the task fails four times (a default setting that may be changed), at which point it terminates the whole job.
• Namenode stores information about all other nodes in the Hadoop Cluster, files in the cluster,
file component blocks and their positions in the cluster, and other information that is necessary
for the Hadoop Cluster's functioning.
• Job Tracker manages the sharing of information and outcomes by keeping track of the specific
tasks/jobs allocated to each of the nodes.
Keywords
Apache Hive: Hive is a data warehousing and ETL solution built on top of the Hadoop Distributed File System (HDFS). Hive makes it simple to carry out tasks such as encapsulation of data, querying on the fly, and large-scale data analysis.
Apache Pig: Pig is a high-level platform or tool for processing massive datasets. It provides a high level of abstraction over MapReduce computation.
Apache MRUnit: Apache MRUnit TM is a Java package for unit testing Apache Hadoop map
reduce tasks. The post's example uses the Weather dataset, and it works with the year and
temperature retrieved from it. Obviously, you can simply adapt the example to your own data.
Hadoop: Hadoop is an open-source software framework for storing and processing data on
commodity hardware clusters. It has a lot of storage for any sort of data, a lot of processing power,
and it can perform almost unlimited concurrent processes or jobs.
HBase: HBase is a data model that looks a lot like Google's Big Table. It is a Java-based open-source
distributed database created by the Apache Software Foundation. HBase is a critical component of
the Hadoop ecosystem. HDFS is the foundation for HBase.
HDFS: The Hadoop File System was built on a distributed file system architecture. It runs on standard hardware. HDFS, unlike other distributed systems, is extremely fault-tolerant and built with low-cost hardware in mind.
Meta-Store:Apache Hive metadata is stored in Meta-store, a central repository. It uses a relational
database to store Hive table and partition metadata (such as schema and location).
Name node: The name node is a piece of commodity hardware that houses the GNU/Linux
operating system as well as name node software. It's a piece of software that can run on standard
hardware.
Shuffling and sorting:The process of transferring intermediate output from the mapper to the
reducer is known as shuffle. On the basis of reducers, a reducer receives one or more keys and their
associated values. The mapper's intermediated key – value is automatically ordered by key.
Self Assessment
1. Which of the following is an open-source data warehouse system that has been built on top of
Hadoop?
A. Apache Hive
B. Apache Pig
C. Apache Hbase
D. None of the mentioned
4. ________ developed by Yahoo researchers executes Map Reduce jobs on extensive datasets
and provides an easy interface for developers to process the data efficiently.
A. Apache Hive
B. Apache pig
C. Both
D. None of the mentioned
6. Which of the following compiles the optimized logical plan into a series of MapReduce jobs?
A. Parser
B. Atom
C. Optimizer
D. Compiler
9. ________ operator is used to perform an inner, equijoin join of two or more relations based on
common field values.
A. COMBINE
B. COMBINATION
C. JOIN
D. JOINING
10. Apache Hive was created by ____________
A. Facebook
B. Twitter
C. Both
D. None of above
14. HBase Tables are separated into Regions ________ by row key range.
A. Vertically
B. Horizontally
C. Diagonally
D. None of the mentioned
15. Which of the following is not a component of the HDFS data node?
A. Block Cache
B. MemStore
C. HFile
D. None of the mentioned
Review Questions
1. Explain architecture of Pig.
2. Explain working of HIVE.
3. Elaborate on the classification of Apache Pig operators.
4. What do you understand by Infosphere streams?
Answers for Self Assessment: 6. D 7. B 8. A 9. C 10. A
Further Readings
Maheshwari, Anil. Big Data. McGraw-Hill Education, 2019.
Mayer-Schonberger, Viktor; Cukier, Kenneth (2013). Big Data: A Revolution That Will Transform How We Live, Work, and Think. Houghton Mifflin Harcourt.
McKinsey Global Institute Report (2011). Big Data: The Next Frontier For Innovation, Competition, and Productivity. Mckinsey.com
Marz, Nathan, and James Warren (2015). Big Data: Principles and Best Practices of Scalable Realtime Data Systems. Manning Publications.
Sandy Ryza, Uri Laserson et al. (2014). Advanced Analytics with Spark. O'Reilly.
White, Tom (2014). Mastering Hadoop. O'Reilly.
Web Links
1. Apache Hadoop resources: https://hadoop.apache.org/docs/r2.7.2/
2. Apache HDFS: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
3. Hadoop API site: http://hadoop.apache.org/docs/current/api/
4. NoSQL databases: http://nosql-database.org/
5. Apache Spark: http://spark.apache.org/docs/latest/
Unit 12: Predictive Analytics
Objectives
• learn simple linear regression and multiple linear regression.
• learn visual data analysis techniques.
• learn applications of business analytics.
Introduction
Simple linear regression is a statistical method that allows us to summarize and study relationships between two
continuous (quantitative) variables.
• One variable, denoted x, is regarded as the predictor, explanatory, or independent variable.
• The other variable, denoted y, is regarded as the response, outcome,
or dependent variable.
• Often, the objective is to predict the value of an output variable (or response) based on the value of an input (or
predictor) variable.
A line that begins in the lower left corner of the plot and ends in the upper right corner is called a positive relationship. In
a positive linear relationship, high scores on the X variable predict high scores on the Y variable. A line that begins in the
upper left corner of the plot and ends in the lower right corner (like the relationship shown above) is called a negative
relationship as shown in Figure 1. In a negative linear relationship, high scores on the X variable predict low scores on the
Y variable.
Linear Relationship
Linear regression is a linear model, e.g., a model that assumes a linear relationship between the input variables (x) and the
single output variable (y). More specifically, that y can be calculated from a linear combination of the input variables (x).
Scores scattered randomly around a straight line in the middle of the graph indicate no relationship between
variables.Sometimes a scatter plot will show a curvilinear relationship between two variables. If this happens, we need to
use special statistics developed for curvilinear relationships.Whereas some relationships are straightforward to
understand, explain, and detect statistically (i.e., linear relationships), curvilinear relationships are more complex because
the nature of the relationship is different at different levels of the variables. Curvilinear relationships can occur often in
communication research, given the complex, socially and contextually dependent phenomena that are the focus of such
research.
Linear regression models are not perfect. It tries to approximate the relationship between dependent and independent
variables in a straight line. Approximation leads to errors. Some errors can be reduced.Some errors are inherent in the
nature of the problem. These errors cannot be eliminated. They are called as an irreducible error, the noise term in the
true relationship that cannot fundamentally be reduced by any model. The equation of the line can be re-written as y = β0 + β1x + ε, where β0 and β1 are two unknown constants that represent the intercept and slope; they are the parameters of the model, and ε is the error term.
Example
• The following are the data provided to Fernando: make: make of the car; fuelType: type of fuel used by the car; nDoors: number of doors; engineSize: size of the engine of the car; price: the price of the car.
(Table columns: make, fuelType, nDoors, engineSize, price.)
First and foremost, Fernando wants to see if he can accurately estimate automobile prices based on engine size. The following questions are addressed in the first set of analyses:
• Is the price of the car related to the engine size? How strong is the relationship? Is the relationship linear? Can we predict/estimate the car price based on the engine size?
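A minimal sketch of such an analysis in Python with scikit-learn is shown below; the engine sizes and prices are made-up values, not Fernando's actual dataset:
import numpy as np
from sklearn.linear_model import LinearRegression

# hypothetical training data: engine size (litres) vs. price (in thousands)
engine_size = np.array([[1.0], [1.4], [1.6], [2.0], [2.5], [3.0]])
price = np.array([6.0, 7.5, 8.2, 10.1, 12.8, 15.0])

model = LinearRegression()
model.fit(engine_size, price)              # estimates beta0 (intercept) and beta1 (slope)

print(model.intercept_, model.coef_[0])    # fitted parameters
print(model.predict([[1.8]]))              # predicted price for a 1.8-litre engine
print(model.score(engine_size, price))     # R^2, a measure of how strong the linear fit is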
Multiple linear regression extends this idea to several independent variables and relies on the following assumptions:
• Homogeneity of variance (homoscedasticity): the size of the error in our prediction doesn’t change significantly across the values of the independent variable.
• Independence of observations: the observations in the dataset were collected using statistically valid methods, and there are no hidden relationships among variables. In multiple linear regression, it is possible that some of the independent variables are actually correlated with one another, so it is important to check this before developing the regression model. If two independent variables are too highly correlated (r2 > ~0.6), then only one of them should be used in the regression model.
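A quick, hedged way to perform this check in Python is to look at the pairwise correlations between the candidate predictors; the column names and values below are hypothetical:
import pandas as pd

# hypothetical predictors for a car-price model
df = pd.DataFrame({
    "engineSize": [1.0, 1.4, 1.6, 2.0, 2.5, 3.0],
    "horsepower": [70, 95, 105, 140, 180, 220],
    "nDoors":     [3, 5, 5, 5, 3, 3],
})

corr = df.corr() ** 2          # squared correlation (r^2) between each pair of predictors
print(corr)
# if r^2 between two predictors exceeds ~0.6, keep only one of them in the model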
The following are some of the most common applications of the Data Visualization technique:
• It is a strong approach for analysing data and producing presentable and understandable findings.
• It is a fundamental stage in the pre-processing section of the data mining process.
• It aids in data cleansing by detecting inaccurate data, corrupted or missing values
• It also aids in the construction and selection of variables, which involves deciding which variables to include and
exclude from the study.
• It also plays an important part in the Data Reduction process when merging the categories.
There are three different types of analysis for data visualization: univariate analysis, bivariate analysis, and multivariate analysis.
Univariate Analysis: We will use a single characteristic to examine virtually all of its features in a univariate analysis.
Example of univariate analysis are distribution plot, box and whisker plot and violin plot.
Bivariate Analysis: Bivariate analysis is when we compare data between exactly two variables. Examples of bivariate analysis are the line plot, bar plot and scatter plot.
Multivariate Analysis: We shall compare more than two variables in the multivariate analysis. For example, if a marketer
wishes to compare the popularity of four adverts on a website, click rates for both men and women may be monitored,
and associations between factors could be investigated.
It's similar to bivariate, however there are more dependent variables. The methods for analysing this data are determined
by the objectives to be met. Regression analysis, path analysis, factor analysis, and multivariate analysis of variance are
some of the approaches (MANOVA).
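A small sketch of univariate and bivariate plots with Matplotlib follows; the data values are invented purely for illustration:
import matplotlib.pyplot as plt

ages = [22, 25, 25, 27, 30, 31, 34, 35, 40, 41, 45, 50]        # one variable
income = [18, 21, 22, 25, 30, 31, 35, 36, 42, 44, 50, 58]      # a second variable

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# univariate analysis: distribution of a single characteristic
ax1.hist(ages, bins=5)
ax1.set_title("Univariate: age distribution")

# bivariate analysis: relationship between two characteristics
ax2.scatter(ages, income)
ax2.set_title("Bivariate: age vs. income")
ax2.set_xlabel("age")
ax2.set_ylabel("income")

plt.tight_layout()
plt.show()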
Customer Targeting: Customer targeting is the process of segmenting a customer base into groups of people that are similar in characteristics that matter to marketing, such as age, gender, hobbies, and spending patterns. It allows businesses to send personalised marketing messages to the people who are most likely to purchase their goods. Predictive analytics has been shown to be far more effective than traditional methods in identifying new consumers. The following are some examples of consumer targeting factors:
Socio-demographic factors: age, occupation, marital status, education, and so on.
Engagement factors: recency, frequency, and monetary value.
Past campaign factors: contact type, day, month, duration, and so on.
The benefits to the firm include (i) improved customer communication, (ii) significant cost savings in marketing, and (iii) a significant rise in profitability. A good example here is a banking institution's direct marketing initiatives, where the objective is to forecast which customers will sign up for a term deposit.
Churn Prevention
Churn prevention tries to anticipate which consumers will leave our firm, when they will leave, and why they will
go.Retaining a current client is significantly less expensive than gaining a new one. As a result, this occurrence has the
potential to be extremely expensive.Companies may create predictive models that enable preemptive action before
it's too late by using the power of large consumer data sets.
Sales Forecasting:
Sales forecasting examines past performance, seasonality, market-moving events, and other factors to produce a reasonable estimate of demand for a product or service. It may be used to forecast in the short, medium, or long term. By looking at all of these aspects, predictive analytics can forecast consumer reaction and shifting sentiments. The following are some examples of variables used in sales forecasting:
• Calendar data: season, hour, holidays, and so on.
• Weather data: temperature, humidity, rainfall, and so on.
• Company data: prices, promotions, and marketing efforts.
• Social data: economic and political factors that a country is experiencing.
Market Analysis
Market survey analysis helps businesses meet consumer needs, boost profits and lower attrition rates. The following are some examples of quality enhancement factors:
• Product qualities: components, presentation, and so on.
• Customer attributes: gender, age, and so on.
• Surveys of customer tastes and preferences.
After the firm has created the predictive model, it may use it to look for qualities that match customer preferences. For example, physicochemical tests (e.g., pH levels) can be used to predict wine quality, with the result based on sensory data (evaluations by wine experts).
Risk assessment
Risk assessment enables businesses to identify the potential for issues in their operations. Predictive analytics attempts to provide decision-making tools that can predict which activities are lucrative and which are not. Risk assessment is a broad phrase that may mean different things to different people; indeed, we may wish to assess the risk of a client, a firm, or another entity. In the instance of a client, the risk assessment can look at data such as socio-demographic characteristics: gender, age, education, marital status, and so on.
Financial modeling
Financial modelling is the process of converting a set of assumptions about market or agent behaviour into numerical forecasts. These prediction models are used to help businesses make investment and return decisions; predicting the stock market trend using internal and external data is one example. Predictive analytics may be applied to a variety of sectors and can help you improve your performance and predict future occurrences so you can respond accordingly. Neural Designer is a machine learning and data science tool that makes it simple to create prediction models.
Summary
• Regression analysis is a collection of statistical techniques used to estimate the associations
between a dependent variable and one or more independent variables in statistical
modelling.
• By fitting a linear equation to observed data, linear regression seeks to model the connection
between two variables.
• An independent variable is a stand-alone variable that is unaffected by the other variables you're attempting to measure. A person's age, for example, might be an independent variable: other aspects, such as what they eat, how often they go to school, and how much television they watch, will have no effect on their age.
• In an experiment, the dependent variable is the variable that is being measured or assessed.
The dependent variable in research looking at how tutoring affects test results, for example,
would be the participants' test scores, because that's what's being measured.
• Correlation is a statistical word that refers to the degree to which two variables move in
lockstep. When two variables move in the same direction, it is said that they have a positive
correlation. A negative correlation exists when they move in opposite directions.
Keywords
Linear Regression:Linear regression is the process of identifying a line that best matches the data
points on the plot so that we can use it to forecast output values for inputs that aren't included in
the data set we have, with the assumption that those outputs will fall on the line.
Independent Variable: The independent variable (IV) is a feature of a psychological experiment that is manipulated or modified by researchers rather than by other factors.
Dependent Variable: In an experiment, the dependent variable is the variable that is being
measured or assessed. The dependent variable in a research looking at how tutoring affects test
results, for example, would be the participants' test scores, because that's what's being measured.
Correlation: Correlation is a statistical word that refers to the degree to which two variables move
in lockstep. When two variables move in the same direction, it is said that they have a positive
correlation. A negative correlation exists when they move in opposite directions.
Data Visualization:A graphical depiction of information and data is referred to as data
visualisation. Data visualisation techniques make it easy to identify and comprehend trends,
outliers, and patterns in data by employing visual components like charts, graphs, and maps.
Bivariate Analysis: The phrase "bivariate analysis" refers to the study of two variables in order to
discover their correlations. In quality of life research, bivariate analyses are frequently reported.
Multivariate Analysis:MVA stands for multivariate analysis, which is a statistical process for
analysing data including many types of measurements or observations. It might also refer to
difficulties in which more than one dependent variable is investigated at the same time as other
variables.
Predictive Analysis:Predictive analytics is a form of advanced analytics that uses historical data,
statistical modelling, data mining techniques, and machine learning to create predictions about
future events. Predictive analytics is used by businesses to uncover trends in data in order to
identify dangers and opportunities.
Market Analysis: A market study is a proactive investigation of a product or service's market
demand. Market research examines all of the market elements that drive demand for a particular
product or service. Price, location, competition, substitutes, and overall economic activity are all
factors to consider.
HDFS: Hadoop File System was built on a distributed file system architecture. It runs on standard
hardware. HDFS, unlike other distributed systems, is extremely fault-tolerant and built with
lowcost hardware in mind.
Self Assessment
1. Linear regression is a ___________ machine learning algorithm.
A. Supervised
B. Unsupervised
C. Reinforcement
D. Clustering
2. In Linear Regression, which of the following strategies do we apply to determine the best fit line
for data?
A. Least Square Error
B. Maximum Likelihood
C. Logarithmic Loss
D. Both A and B
3. Simple linear regression is a statistical method that allows us to summarize and study relationships between two ________ (quantitative) variables.
A. Categorical
B. Continuous
C. Nominal
D. Ordinal
4. _______ measures the linear link between two variables, but it doesn't reveal more complicated correlations.
A. Correlation
B. Factorization
C. Regression
D. None of the mentioned
5. Which of the following plot will show a curvilinear relationship between two variables?
A. Scatter Plot
B. Curvilinear
C. Line
D. Bar Plot
12. Which of the following types of analysis used for Data Visualization?
A. Univariate Analysis
B. Bivariate Analysis
C. Multivariate Analysis
D. All of the above
14. Which of the following libraries should be used to make a chart in Python?
A. Visual data
B. Data visualization
C. Matplot
D. None of the above
Answers for Self Assessment: 6. A 7. D 8. B 9. C 10. D
Further Readings
• Maheshwari, Anil. Big Data. McGraw-Hill Education, 2019.
• Mayer-Schonberger, Viktor; Cukier, Kenneth (2013). Big Data: A Revolution That Will Transform How We Live, Work, and Think. Houghton Mifflin Harcourt.
• McKinsey Global Institute Report (2011). Big Data: The Next Frontier For Innovation, Competition, and Productivity. Mckinsey.com
• Marz, Nathan, and James Warren (2015). Big Data: Principles and Best Practices of Scalable Realtime Data Systems. Manning Publications.
• Sandy Ryza, Uri Laserson et al. (2014). Advanced Analytics with Spark. O'Reilly.
• White, Tom (2014). Mastering Hadoop. O'Reilly.
Web Links
1. Apache Hadoop resources: https://hadoop.apache.org/docs/r2.7.2/
2. Apache HDFS: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
3. Hadoop API site: http://hadoop.apache.org/docs/current/api/
4. NoSQL databases: http://nosql-database.org/
5. Apache Spark: http://spark.apache.org/docs/latest/
Objectives
• learn concepts of machine learning.
• learn four categories of machine learning.
Introduction
Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and
improve from experience without being explicitly programmed. Machine learning focuses on the development of
computer programs that can access data and use it to learn for themselves.Machine Learning is the most widely used
method for forecasting the future or categorizing data to assist humans in making important decisions. Machine Learning
algorithms are taught over examples or situations in which they learn from previous experiences and examine historical
data.When a result, as it trains over and over on the examples, it is able torecognise patterns and make predictions about
the future.The learning process starts with observations or data, such as examples, direct experience, or instruction, so that
we may seek for patterns in data and make better judgments in the future based on the examples we offer. The
fundamental goal is for computers to learn on their own, without the need for human involvement, and to adapt their
behaviour accordingly.But, using the classic algorithms of machine learning, text is considered as a sequence of keywords;
instead, an approach based on semantic analysis mimics the human ability to understand the meaning of a text.
Machine learning approaches are traditionally divided into four broad categories: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.
Supervised learning: Supervised machine learning algorithms use labelled examples to apply what they have learned in the past to fresh data and predict future events. The learning algorithm builds an inferred function from the analysis of a known training dataset and uses it to generate predictions for the output values. After enough training, the system can provide outputs for any new input. The learning algorithm can also compare its output with the correct, intended output and detect mistakes, allowing the model to be adjusted as needed.
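To make this workflow concrete, the following is a minimal sketch in Python using scikit-learn (an assumed, commonly available library); the dataset is synthetic and the choice of logistic regression is purely illustrative.

# Minimal supervised-learning sketch (assumes scikit-learn is installed).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic labelled examples: X holds the inputs, y the known outputs.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Hold back part of the data so predictions can be checked against known answers.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit the inferred function on the training examples.
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict outputs for unseen inputs and compare them with the correct labels.
predictions = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, predictions))

The same fit/predict pattern applies to the other supervised algorithms listed below; only the model class changes.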
Commonly used supervised learning algorithms include naïve Bayes, linear regression, logistic regression, k-nearest neighbours, and random forest; these, along with neural networks and support vector machines, are described below.
• Neural Network: Neural networks reflect the behaviour of the human brain, allowing computer programs to recognize patterns and solve common problems in the fields of AI, machine learning, and deep learning.
• Naïve Bayes: In statistics, naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong independence assumptions between the features. They are among the simplest Bayesian network models, but coupled with kernel density estimation, they can achieve higher accuracy levels.
• Linear Regression: Linear regression analysis is used to predict the value of a variable based on the value of another
variable. The variable you want to predict is called the dependent variable. The variable you are using to predict the
other variable's value is called the independent variable.
• Logistic Regression: Logistic regression is a statistical analysis method used to predict a data value based on prior observations of a data set. A logistic regression model predicts a dependent data variable by analyzing the relationship between one or more existing independent variables.
• Support Vector Machine: SVMs (support vector machines) are supervised machine learning techniques that may be used for both classification and regression, although they are most commonly employed in classification problems. SVMs were initially developed in the 1960s and were refined around 1990.
• KNN: The supervised machine learning method k-nearest neighbours (KNN) is a basic, easy-to-implement technique
that may be used to tackle both classification and regression issues.
• Random Forest: A random forest is a machine learning approach for solving classification and regression issues. It
makes use of ensemble learning, which is a technique for solving difficult problems by combining many classifiers. A
random forest method is made up of a large number of decision trees.
Supervised learning is commonly applied in the following areas:
• Image and object recognition: When applied to computer vision techniques and visual analysis, supervised learning algorithms may be used to find, isolate, and categorise objects from videos or images, making them usable.
• Predictive Analytics: The creation of predictive analytics systems to give deep insights into multiple business data
points is a common use case for supervised learning models. This enables businesses to predict certain outcomes
depending on a particular output variable, assisting business executives in justifying choices or pivoting for the
organization's advantage.
• Customer Sentiment Analysis: Organizations can extract and categorise significant bits of information from enormous
amounts of data using supervised machine learning algorithms with very little human interaction, including context,
emotion, and purpose. This may be quite beneficial in terms of obtaining a better knowledge of consumer interactions
and improving brand engagement initiatives.
• Spam Detection: Another example of a supervised learning model is spam detection. Using supervised classification algorithms, organizations can train models to recognise patterns or anomalies in new data, allowing them to efficiently separate spam from legitimate correspondence.
Supervised learning also has drawbacks:
• Datasets can have a higher likelihood of human error, resulting in algorithms learning incorrectly.
• Unlike unsupervised learning models, supervised learning cannot cluster or classify data on its own.
Unsupervised learning, by contrast, analyses and clusters unlabelled data. Two widely used unsupervised techniques are clustering and association rules; a minimal clustering sketch follows this list.
• Clustering: Clustering is a way of arranging items into clusters so that those with the most similarities stay in one group while those with little or no similarity stay in another. Cluster analysis identifies similarities among data items and classifies them according to the presence or absence of such commonalities.
• Association rule: An association rule is an unsupervised learning approach used to discover associations between variables in a large database. It identifies groups of items that appear in the dataset together. Association rules improve the effectiveness of marketing strategies: people who buy X (say, a loaf of bread) are more likely to also buy Y (butter or jam). Market Basket Analysis is a good example of an association rule application.
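The clustering sketch promised above again uses scikit-learn (assumed installed); the points are synthetic, whereas real use would cluster customer, sensor, or log records.

# Minimal k-means clustering sketch (assumes scikit-learn and NumPy are installed).
import numpy as np
from sklearn.cluster import KMeans

# Two loose groups of 2-D points.
points = np.array([
    [1.0, 1.1], [0.9, 1.3], [1.2, 0.8],   # group near (1, 1)
    [8.0, 8.2], [7.9, 7.7], [8.3, 8.1],   # group near (8, 8)
])

# Ask for two clusters; the algorithm groups points by similarity (distance).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print("cluster labels :", kmeans.labels_)
print("cluster centres:", kmeans.cluster_centers_)

No labels are supplied anywhere: the grouping emerges purely from similarities in the data, which is the defining property of unsupervised learning.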
Semi-supervised learning sits between supervised and unsupervised learning. A semi-supervised algorithm makes three assumptions about the data, as shown in Figure 7; a small sketch follows this list.
• Continuity Assumption: The method assumes that points which are closer together have a higher probability of sharing the same output label.
• Cluster Assumption: The data can be split into distinct clusters, and points in the same cluster have a higher chance of sharing the same output label.
• Manifold Assumption: The data lie roughly on a manifold of much lower dimension than the input space. This assumption allows distances and densities defined on the manifold to be used.
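The sketch referred to above: scikit-learn (assumed installed) marks unlabelled examples with -1, and a label-spreading model propagates the few known labels to nearby points, relying on the continuity and cluster assumptions just described.

# Minimal semi-supervised sketch (assumes scikit-learn and NumPy are installed).
import numpy as np
from sklearn.semi_supervised import LabelSpreading

X = np.array([[1.0], [1.1], [1.2], [8.0], [8.1], [8.2]])
# Only one point in each group is labelled; -1 marks the unlabelled examples.
y = np.array([0, -1, -1, 1, -1, -1])

model = LabelSpreading(kernel="knn", n_neighbors=2)
model.fit(X, y)

# The labels spread to neighbouring points; likely output: [0 0 0 1 1 1]
print(model.transduction_)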
Reinforcement learning: Reinforcement learning is about taking suitable actions to maximise reward in a particular situation; the agent must determine the best feasible action or path in a given scenario. Reinforcement learning differs from supervised learning in that, in supervised learning, the answer key is included in the training data and the model is trained with the correct answer, whereas in reinforcement learning there is no answer key and the reinforcement agent decides what to do to complete the task. In the absence of a training dataset, it is obliged to learn from its own experience. Some important terms used in reinforcement learning are listed below.
• Agent: An assumed entity which performs actions in an environment to gain some reward.
• Environment (e): A scenario that the agent has to face.
• Reward (R): An immediate return given to the agent when it performs a specific action or task.
• State (s): The current situation returned by the environment.
• Policy (π): The strategy applied by the agent to decide the next action based on the current state.
• Value (V): The expected long-term return with discount, as compared to the short-term reward.
• Value Function: Specifies the value of a state, that is, the total amount of reward an agent can expect to accumulate starting from that state.
• Model of the environment: Mimics the behaviour of the environment. It helps you to make inferences and determine how the environment will behave.
• Model-based methods: Methods for solving reinforcement learning problems that use a model of the environment.
• Q value or action value (Q): The Q value is quite similar to the value; the only difference is that it takes an additional parameter, the current action.
The agent responds by making an action, transitioning from one "state" to the next; your cat, for example, progresses from sitting to walking. An agent's reaction is an action, and a policy is a way of choosing an action given a state, in the hope of better outcomes. The agent may receive a reward or a penalty as a result of the transition.
Reinforcement learning can be implemented with value-based and model-based approaches.
• Model-based: In this reinforcement learning technique, you construct a virtual model for each environment, and the agent learns how to perform in that particular setting.
Two kinds of reinforcement learning methods are positive and negative reinforcement.
• Positive: Positive reinforcement is an event that occurs as a result of particular behaviour. It increases the strength and frequency of the behaviour and has a favourable influence on the agent's actions. This sort of reinforcement helps to maximise performance and maintain change for a longer period of time; however, too much reinforcement can lead to over-optimization of state, which can affect the outcome.
• Negative: Negative reinforcement is the strengthening of behaviour that occurs because a negative condition is avoided or stopped. It helps you to define the minimum level of performance. The disadvantage of this technique is that it provides only enough to meet the minimum behaviour requirements.
There are two important learning models in reinforcement learning: the Markov Decision Process and Q-learning.
• The mathematical approach for mapping a solution in reinforcement learning is known as a Markov Decision Process (MDP), as shown in Figure 12. An MDP is described by a set of actions A, a set of states S, a reward R, a policy π, and a value V.
• Q-learning is a value-based approach of supplying information to help an agent decide which action to perform. Let's look at an example to better understand this method: in a building, there are five rooms connected by doors, and each room is numbered from 0 to 4. The outside of the building can be thought of as one large area (5). From area 5, doors lead into the building through rooms 1 and 4.
Figure 13: Q-Learning
• Next, assign a reward value to each door: doors that lead directly to the goal are worth 100 points, while doors that are not directly connected to the target room carry no reward. Because doors are two-way, each room has two arrows, and each arrow in the figure represents an immediate reward value.
• In this representation, each room is a state, and the agent's movement from one room to another is an action. A state is drawn as a node, while the arrows show the actions. For example, an agent traverses from room 2 to area 5 as follows:
• Initial state = state 2, State 2 -> State 3, State 3 -> State (2, 1, 4), State 4 -> State (0, 5, 3)
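A rough Python sketch of this room example, assuming the conventional reward matrix for the exercise (-1 marks a missing door, 0 a door with no immediate reward, 100 a door leading straight to the goal area 5); the training loop and episode count are illustrative.

# Q-learning sketch for the five-room example (assumes NumPy is installed).
import numpy as np

R = np.array([  # rows = current state (room), columns = action (move to that room)
    [-1, -1, -1, -1,  0,  -1],   # room 0
    [-1, -1, -1,  0, -1, 100],   # room 1
    [-1, -1, -1,  0, -1,  -1],   # room 2
    [-1,  0,  0, -1,  0,  -1],   # room 3
    [ 0, -1, -1,  0, -1, 100],   # room 4
    [-1,  0, -1, -1,  0, 100],   # area 5 (goal)
])

gamma = 0.8                        # discount factor
Q = np.zeros_like(R, dtype=float)  # Q table, initially all zeros
rng = np.random.default_rng(0)

for _ in range(1000):              # training episodes
    state = rng.integers(0, 6)     # start in a random room
    while state != 5:              # walk until the goal area is reached
        actions = np.where(R[state] >= 0)[0]   # doors available from this room
        action = rng.choice(actions)           # explore randomly
        # Core update: Q(s, a) = R(s, a) + gamma * max over a' of Q(s', a')
        Q[state, action] = R[state, action] + gamma * Q[action].max()
        state = action

print(np.round(Q / Q.max() * 100))  # normalised Q table; high values mark the best paths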
Applications of Reinforcement Learning
Reinforcement learning helps you to create training systems that provide customised instruction and materials according to students' requirements. It is also used in applications such as aircraft control and robot motion control.
Summary
• Machine learning (ML) is the study of computer algorithms that may improve themselves over time by gaining
experience and using data. Machine learning algorithms create a model based on training data to make predictions
or judgments without having to be explicitly programmed to do so.
• The process of supplying input data as well as correct output data to the machine learning model is known as supervised learning. A supervised learning algorithm's goal is to discover a mapping function that translates the input variable (x) to the output variable (y).
• Unsupervised learning, also known as unsupervised machine learning, analyses and clusters unlabeled
information using machine learning techniques. Without the need for human interaction, these algorithms uncover
hidden patterns or data groupings.
• A learning problem with a small number of labelled examples and a large number of unlabeled examples is known as semi-supervised learning.
• Reinforcement learning (RL) is a branch of machine learning that studies how intelligent agents should operate in a
given environment to maximise the concept of cumulative reward. Reinforcement learning, along with supervised
and unsupervised learning, is one of the three main machine learning paradigms.
• In statistics, naive Bayes classifiers are a subset of "probabilistic classifiers" based on Bayes' theorem and strong
independence assumptions between features. They are one of the most basic Bayesian network models, but when
combined with kernel density estimation, they may attain greater levels of accuracy.
• Sentiment analysis is the systematic identification, extraction, quantification, and study of emotional states and
subjective information using natural language processing, text analysis, computational linguistics, and biometrics.
• Clustering is the process of splitting a population or set of data points into many groups so that data points in the
same group are more similar than data points in other groups. To put it another way, the goal is to separate groups
with similar characteristics and assign them to clusters.
• In psychology, association refers to a mental link formed by specific experiences between concepts, events, or
mental states. Behaviorism, associationism, psychoanalysis, social psychology, and structuralism are all schools of
thought in psychology that use associations.
Keywords
Machine Learning: Machine learning is a type of data analysis that automates the creation of analytical models. It's a field of artificial intelligence based on the premise that computers can learn from data, recognise patterns, and make judgments with little or no human input.
Linear Regression: Linear regression is the process of identifying a line that best matches the data points on the plot so
that we can use it to forecast output values for inputs that aren't included in the data set we have, with the assumption
that those outputs will fall on the line.
Supervised Learning: The machine learning job of learning a function that translates an input to an output based on
example input-output pairs is known as supervised learning. It uses labelled training data and a collection of training
examples to infer a function.
Unsupervised Learning: Unsupervised learning is a sort of algorithm that uses untagged data to discover patterns. The
objective is that the machine will be pushed to create a compact internal picture of its surroundings through imitation,
which is the fundamental method young infants learn, and will be able to generate inventive material as a result.
Semi-supervised Learning: Semi-supervised learning is a machine learning technique that involves training using a small
quantity of labelled data and a large amount of unlabelled data. Semi-supervised learning is the middle ground between unsupervised and supervised learning. It's a special case of weak supervision.
Reinforcement Learning: Reinforcement learning is a branch of machine learning that studies how intelligent agents
should operate in a given environment to maximise the concept of cumulative reward.
Naïve Bayes: The Bayes theorem provides the basis for the naïve Bayes algorithm, which is utilised in a broad range of classification problems.
Clustering: Clustering is the process of splitting a population or set of data points into many groups so that data points in
the same group are more similar than data points in other groups. To put it another way, the goal is to separate groups
with similar characteristics and assign them to clusters.
Association analysis: The challenge of uncovering intriguing correlations in vast datasets is known as association
analysis. There are two types of interesting relationships: frequent item sets and association rules. According to
association rules, two objects have a strong link.
Markov Decision Process: A Markov decision process (MDP) is a discrete-time stochastic control process in mathematics.
It gives a mathematical framework for modelling decision-making in settings where outcomes are partially random and
partly controlled by a decision maker.
Q-Learning: Q-learning is a model-free reinforcement learning technique for determining the value of a certain action in a given state. It doesn't require a model of the environment, and it can handle problems with stochastic transitions and rewards without the need for adaptations.
Predictive Analysis: Predictive analytics is a form of advanced analytics that uses historical data, statistical modelling,
data mining techniques, and machine learning to create predictions about future events. Predictive analytics is used by
businesses to uncover trends in data in order to identify dangers and opportunities.
Market Analysis: A market study is a proactive investigation of a product or service's market demand. Market research
examines all of the market elements that drive demand for a particular product or service. Price, location, competition,
substitutes, and overall economic activity are all factors to consider.
Self Assessment
1. Supervised machine learning algorithms can use _________ examples to apply what they've learned in the past to fresh data and predict future events.
A. Labelled
B. Unlabelled
C. Predicted
D. Unpredictable
3. ___________ reflect the behaviour of the human brain, allowing computers to recognize patterns and solve common problems.
A. Neural networks
B. Naïve Bayes
C. Linear Regression
D. All of the above
B. probabilistic central
C. probabilistic classifiers
D. None of above
B. Blockchain
C. Both a and b
D. None of the above
7. If we are considering feature to understand the taste of user that is example of ____________
A. Content based filtering
B. Collaborative filtering
C. Both
D. None of above
10. _____________ uses item features to recommend other items similar to what the user likes, based on their previous
actions or explicit feedback.
A. Content-based filtering
B. Collaborative filtering
C. Both
D. None of the above
A. Functions
B. Packages
C. Domains
D. Classes
12. Advanced users can edit R objects directly using ___________ Computer code.
A. C, C++
B. C++, Java
C. Java, C
D. Java
13. In the R programming language, which of the following is utilised for statistical analysis?
A. Studio
B. Heck
C. KStudio
D. RStudio
15. The R programming language resembles the __ programming language on the surface.
A. C
B. Java
C. C++
D. None of the above
Answers for Self Assessment
6. C 7. A 8. B 9. A 10. A
Review Questions
1) What is machine learning? Why is the machine learning trend emerging so fast?
2) Explain different types of machine learning algorithms.
3) Elaborate difference between classification and regression.
Further Readings
• Maheshwari, Anil (2019). Big Data. McGraw-Hill Education.
• Mayer-Schonberger, Viktor; Cukier, Kenneth (2013). Big Data: A Revolution That Will Transform How We Live, Work, and Think. Houghton Mifflin Harcourt.
• McKinsey Global Institute Report (2011). Big Data: The Next Frontier for Innovation, Competition, and Productivity. Mckinsey.com
• Marz, Nathan; Warren, James (2015). Big Data: Principles and Best Practices of Scalable Realtime Data Systems. Manning Publications.
• Ryza, Sandy; Laserson, Uri et al. (2014). Advanced Analytics with Spark. O'Reilly.
• White, Tom (2014). Mastering Hadoop. O'Reilly.
Web Links
1. Apache Hadoop resources: https://hadoop.apache.org/docs/r2.7.2/
2. Apache HDFS: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
3. Hadoop API site: http://hadoop.apache.org/docs/current/api/
4. NoSQL databases: http://nosql-database.org/
5. Apache Spark: http://spark.apache.org/docs/latest/
Summary
Keywords
Self Assessment
Answers for Self Assessment
Review Questions
Further Readings
Objectives
• explore concepts of Splunk
• learn features of Splunk
• understand interfaces, data ingestion and uploading data
• understand concepts of Datameer
• learn steps to install Splunk Enterprise on Windows
Introduction
Splunk is a software used to search and analyze machine data. This machine data can come from web applications,
sensors, devices or any data created by user. It serves the needs of IT infrastructure by analyzing the logs generated in
various processes but it can also analyze any structured or semi-structured data with proper data modelling. It has built-in
features to recognize the data types, field separators and optimize the search processes. It also provides data visualization
on the search results.
Prerequisites
• The reader should be familiar with a querying language like SQL.
• General knowledge of typical operations in using computer applications, such as storing and retrieving data and reading the logs generated by computer programs, will be highly useful.
• Splunk is a software which processes and brings out insight from machine data and other forms of big data. This machine data is generated by the CPU of a machine running a web server, IoT devices, logs from mobile apps, and similar sources.
• This data is not usually presented to end users and has no business meaning on its own; however, it is extremely important for understanding, monitoring and optimizing the performance of the machines.
• Splunk can read this unstructured, semi-structured or rarely structured data. After reading the data, it allows the user to search, tag, and create reports and dashboards on these data. With the advent of big data, Splunk can now also ingest big data from various sources, which may or may not be machine data, and run analytics on it.
• So, from a simple tool for log analysis, Splunk has come a long way to become a general analytical tool for unstructured machine data and various forms of big data.
Splunk is available in three product editions:
Splunk Enterprise: Used by organizations with a significant IT infrastructure and businesses that are heavily reliant on technology. It aids in the collection and analysis of data from websites, applications, devices, and sensors, among other sources.
Splunk Cloud: A cloud-hosted platform with the same functionality as the Enterprise edition. It is available directly from Splunk or via the AWS cloud platform.
Splunk Light: Allows users to search, report, and receive alerts on all log data in real time from a single location. In comparison to the other two editions, it offers fewer capabilities and features.
Features of SPLUNK
Features of SPLUNK are shown in Figure 2.
• Data Ingestion: Splunk accepts a wide range of data types, including JSON, XML, and unstructured machine data such as web and application logs. The user can model the unstructured data into a data structure as desired.
• Data Indexing: Splunk indexes the ingested data for quicker searching and querying under various conditions.
• Data Searching: In Splunk, searching entails utilising the indexed data to create metrics, forecast future trends, and spot patterns (a minimal search sketch follows this list).
• Using Alerts: When certain criteria are identified in the data being examined, Splunk alerts may be used to send emails or RSS feeds.
• Dashboards: Splunk Dashboards may display search results as charts, reports, and pivot tables, among other things.
• Data Model: Based on specialized domain knowledge, the indexed data can be modelled into one or more data sets. This makes it easy for end users to navigate and evaluate business cases without having to grasp the intricacies of Splunk's search processing language.
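The search sketch promised above uses Splunk's Python SDK (the splunk-sdk package, assumed installed) against an assumed local Splunk instance; host, port, and credentials are placeholders to be replaced with real values.

# Minimal Splunk search sketch (assumes the splunk-sdk package and a running Splunk instance).
import splunklib.client as client
import splunklib.results as results

service = client.connect(
    host="localhost", port=8089,            # management port, not the web UI port
    username="admin", password="changeme",  # placeholder credentials
)

# Run a blocking search over Splunk's own internal logs.
job = service.jobs.create(
    "search index=_internal | head 5",
    exec_mode="blocking",
)

# Read the events returned by the search.
for event in results.ResultsReader(job.results()):
    print(event)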
Administrator Link
The Administrator drop down menu allows you to customise and modify the administrator's information. Using the
interface below, we may reset the admin email ID and password.
We can also go to the preferences option from the administrator link to select the time zone and home application on
which the landing page will open once you log in. It now appears on the home page, as shown below in Figure 4.
Figure 4 Preferences
Settings Link
This is a link to a page that lists all of Splunk's key functionality. By selecting the lookup link, you may add the lookup
files and lookup definitions, for example.
• Search and Reporting Link: The search and reporting link takes us to the features where we can locate the data sets available for searching, along with the reports and alerts that have been produced for these searches, as the screenshot below demonstrates.
Splunk's Add Data function, which is part of the search and reporting interface, is where data gets ingested.
Figure 6: Events are stored in the index as a group of files that fall into two categories
When we click this option, we're sent to a screen where we can choose the source and format of the data we want to send to Splunk for analysis.
Gathering the Data
The data for analysis may be obtained from Splunk's official website. Save this file to your local disk and unzip it. When you open the folder, you'll see three files in various formats; they are log files created by some online applications. We can also get a different collection of data from Splunk, available on the official Splunk website.
Input Settings
We configure the host name from which the data is imported in this phase of the data ingestion
process. For the host name, there are several possibilities to pick from as shown in Figure 10.
Constant value
It's the full host name of the server where the source data is stored.
Regex on path
Use this option to obtain the host name with a regular expression: in the Regular expression field, type the regex for the host you wish to extract.
Segment in path
To extract the host name from a segment in your data source's path, enter the segment number in the Segment number box. For example, if the source path is /var/log/<host server name> and you want the host value to be the third segment (the host server name), enter "3."
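As a small illustration of what "segment in path" means (plain Python, not Splunk itself; the path and host name are hypothetical), taking segment 3 of a source path as the host value:

# Illustrative only: extracting the third path segment as the host name.
source_path = "/var/log/webserver01/access.log"      # hypothetical source path

segments = [s for s in source_path.split("/") if s]  # ['var', 'log', 'webserver01', 'access.log']
segment_number = 3                                    # as entered in the Segment number box
host = segments[segment_number - 1]

print(host)  # webserver01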
The next step is to select the index type that will be used to search the input data. The default index
approach is chosen. The summary index is used to construct a summary of the data and establish
an index on it, whereas the history index is used to store the search history. In the image below, it is
clearly represented.
Review Settings
After clicking the Next button, we see a summary of the settings we have chosen. We review it and choose Next to finish uploading the data, as shown in Figure 11.
When the load is complete, the screen below opens, indicating that the data was successfully
ingested and outlining the next steps we may do with the data.
Splunk's inbuilt data processing unit evaluates all incoming data and classifies it into several data
kinds and categories. Splunk, for example, can distinguish a log from an Apache web server and
construct suitable fields from the data read.
Splunk's source type identification capability does this by utilising its built-in source types,
sometimes known as "pretrained" source types.
The user does not have to manually classify the data or assign any data types to the fields of the
incoming data, making analysis easy.
When we select the Search & Reporting app, we are greeted with a search box from which we can begin our log data search (Figure 16). We input the host name in the format indicated below and then click the search icon in the upper right corner. This returns a result that highlights the search term.
Field Summary
By clicking on the name of the field, you may get more specific stats about that field. It displays all
of the field's different values, as well as their counts and percentages as shown in Figure 22.
14.7 Datameer
Datameer acts as a job compiler or code generator, like Hive. This means that every function, filter or join the user designs in the spreadsheet is translated into native Tez code. Tez is good at splitting workloads into smaller pieces. To do so, Datameer compiles a job for a Hadoop cluster, where it is sent to be executed. After the job is compiled and sent to the cluster, Datameer does not control job execution and can only receive the telemetry metrics provided by the cluster's services. The job runs with whatever scheduling settings apply and uses the resources granted by the scheduler. All users working with Datameer's Excel-like user interface (UI) are generating a Java program for distributed computing on the cluster backend. This high level of abstraction is one of the key features that makes Datameer such an outstanding technology. However, this approach does mean that business users need to keep in mind the types of problems every programmer deals with, i.e., data types, memory, and disk usage. This separates analytics work into two stages: first, the design/edit time, and second, the execution/runtime of a data link, import job, or workbook. The two stages are located on different parts of your distributed computing system (cluster).
DESIGN/EDIT TIME
The first stage is served on the Datameer application server, running the Datameer service Java Virtual Machine (JVM), started and executed under the Datameer service account user. Depending on your configuration and whether (Secure) Impersonation is configured or not, calls are made from <datameerServiceAccountUser> @ <datameerHost> or <loggedinUser> @ <datameerHost>.
EXECUTION/RUN TIME
The second stage is served on random DataNodes (DN) in the cluster. The DN runs the container JVM, started by the ApplicationMaster (AM) and executed under the YARN service account user. Depending on the configuration and whether (Secure) Impersonation is configured or not, calls are made from <yarnServiceAccountUser>@<dataNode> or <impersonatedUser>@<dataNode>.
Datameer provides a broad set of transformation functions, including:
• Data cleansing – functions for removing bad records, replacing invalid or blank values, and de-duplicating data;
• Data blending – join and union functions to blend disparate datasets into a common, normalized view;
• Advanced transformations – pivoting, encoding, date and time conversion, working with lists, and parsing functions;
• Data grouping and organization – more sophisticated ways to group, aggregate, and slice-and-dice data, including pivot tables, sessionization, custom binning, time windows, statistical grouping, and algorithmic grouping;
• Data science-specific – one-hot, date/time, and binned encoding functions for data science models.
Datameer can provide a universal tool for all your data transformation needs, whether data engineering, analytics engineering, or analyst or data scientist data preparation, and facilitates cataloging and collaboration across all these functions.
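These transformations are defined in Datameer's spreadsheet UI rather than in code, but for readers who think in code, the following rough pandas analogy (pandas assumed installed; not Datameer's own API) shows the same ideas of cleansing, blending, and grouping on a tiny hypothetical dataset.

# Rough pandas analogy for cleansing, blending and grouping (not Datameer itself).
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, None],
    "amount":      [10.0, 10.0, 25.0, None, 5.0],
})
customers = pd.DataFrame({"customer_id": [1, 2], "region": ["north", "south"]})

# Data cleansing: drop bad records, fill blank values, de-duplicate.
clean = (orders.dropna(subset=["customer_id"])
               .fillna({"amount": 0.0})
               .drop_duplicates())

# Data blending: join the cleansed orders with customer attributes.
blended = clean.merge(customers, on="customer_id", how="left")

# Grouping and aggregation: total spend per region.
summary = blended.groupby("region", as_index=False)["amount"].sum()
print(summary)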
By default, the installer:
• configures Splunk Enterprise to run as the Local System user;
• asks you to set a password for the Splunk administrator, which must be done before the installation can proceed;
• creates a shortcut to the software on the Start Menu.
• If you want to change any of these default installation settings, click Customize Options
and proceed with the instructions in "Customize Options" in this topic.
• Otherwise, click Next. You will be prompted for a password for the Splunk admin user.
After you supply a password, installation begins and you can continue with the "Complete
the install" instructions.
Customize options during the installation
Several settings can be customised during the installation process. The installer displays the "Install Splunk Enterprise to" screen when you choose to modify settings. Splunk Enterprise is installed by default under Program Files\Splunk on the system disk. The Splunk Enterprise installation directory is referred to as $SPLUNK_HOME or %SPLUNK_HOME% throughout this documentation set. Splunk Enterprise installs and operates the splunkd and splunkweb Windows services. Splunk Enterprise operations are handled by the splunkd service, whereas the splunkweb service is installed only to run in legacy mode. The user you choose on the "Choose the user Splunk Enterprise should run as" screen installs and runs these services. Splunk Enterprise may be launched as the Local System user or as another user.
• When the installer asks you the user that you want to install Splunk Enterprise as, you must
specify the user name in domain\username format. The user must be a valid user in your
security context, and must be an active member of an Active Directory domain.
• Splunk Enterprise must run under either the Local System account or a valid user account
with a valid password and local administrator privileges. Failure to include the domain name
with the user will cause the installation to fail.
• Click Change… to specify a different location to install Splunk Enterprise, or click Next to
accept the default value. The installer displays the "Choose the user Splunk Enterprise should
run as" panel.
• Select a user type and click Next.
• If you selected the Local System user, proceed to Step 5. Otherwise, the installer displays the
Logon Information: specify a username and password panel.
• Enter the Windows credentials that Splunk Enterprise uses to run on the machine and click
Next.
• These credentials are different from the Splunk administrator credentials that you create in
the next step.
• Create credentials for the Splunk administrator user by entering a username and password
that meets the minimum eligibility requirements as shown in the panel and click Next.
• You must perform this action as the installation cannot proceed without your completing it.
If you do not enter a username, the installer creates the admin user during the installation
process.
• The installer displays the installation summary panel.
• Click "Install" to proceed with the installation.
• Complete the installation
• The installer runs, installs the software, and displays the Installation Complete panel.
• If you specified the wrong user during the installation procedure, you will see two pop-up
error windows explaining this. If this occurs, Splunk Enterprise installs itself as the Local
System user by default. Splunk Enterprise does not start automatically in this situation.
• You can proceed through the final panel of the installation, but uncheck the "Launch browser with Splunk" checkbox to prevent your browser from launching. Then, use these instructions to switch to the correct user before starting Splunk.
• (Optional) Check the boxes to Launch browser with Splunk and Create Start Menu Shortcut.
Click Finish. The installation completes, Splunk Enterprise starts and launches in a supported
browser if you checked the appropriate box.
Summary
• Splunk is a tool for tracking and searching large amounts of data. It indexes and correlates
data in a searchable container and allows for the generation of alerts, reports, and
visualisations.
• Splunk enterprise's goal is to help you figure out what's going on in your company and
take action swiftly.
• Splunk cloud is a versatile, secure, and cost-effective data platform service that allows you
to search, analyse, visualise, and act on your data.
• Splunk Light allows you to collect and correlate data from almost any source, format, or location. Data flowing from packaged and client applications, app servers, web servers, databases, network wire data, virtual machines, operating systems, and sensors are just a few of the possibilities.
• The process of acquiring and importing data for immediate use or storage in a database is known as data ingestion. To ingest something means "to take in or absorb something." Data can be ingested in batches or streamed in real time.
• Indexing is a technique for improving database speed by reducing the number of disc
accesses necessary when a query is run. It's a data structure strategy for finding and
accessing data in a database rapidly. A few database columns are used to generate
indexes.
• Panel-based displays are known as dashboards. Modules such as search boxes, fields, charts, tables, and lists can be included in the panels. Reports are frequently linked to dashboard panels. You may add a search visualisation or a report to a new or existing dashboard after you build it.
• The structure of your data is defined by a Splunk data model, which is a hierarchy of
datasets. Your data model should represent the data's basic structure as well as the Pivot
reports that your end users demand.
Keywords
Splunk: Splunk is a search and analysis tool for machine data. Machine data might originate from online applications, sensors, devices, or any data that the user has developed. It supports IT infrastructure by analysing logs created during various operations, but it may also evaluate any organised or semi-structured data with correct data modelling.
Splunk Interface: Splunk's web interface includes all of the tools you'll need to search, report, and
analyse the data you've ingested. The same web interface allows administrators to manage users
and their responsibilities. It also includes connections for data intake as well as Splunk's built-in
applications.
Datameer: Datameer bills itself as an all-in-one analytics solution. According to Datameer, it helps ingest data (into Hadoop), cleanse and prepare data that has been ingested or is being ingested, query data using Hive/Tez/Spark, and provide visualisation for the queried data.
Data Ingestion: Splunk accepts a wide range of data types, including JSON, XML, and
unstructured machine data such as web and application logs. The user can model the unstructured
data into a data structure as desired.
Data Indexing: Splunk indexes the imported data for quicker searching and querying under
various situations.
Data Searching: In Splunk, searching entails utilising the indexed data to create metrics, forecast future trends, and spot patterns.
Using Alerts: When certain criteria are identified in the data being examined, Splunk alerts may be used to send emails or RSS feeds.
Dashboards: Splunk Dashboards may display search results as charts, reports, and pivot tables, among other things.
Data Model: Based on specialized domain knowledge, the indexed data can be modelled into one or more data sets. This makes it easy for end users to navigate and evaluate business cases without having to grasp the intricacies of Splunk's search processing language.
Hadoop: Hadoop is an open-source software framework for storing and processing data on
commodity hardware clusters. It has a lot of storage for any sort of data, a lot of processing power,
and it can perform almost unlimited concurrent processes or jobs.
Application Master: The Application Master is a framework-specific library that is in charge of
negotiating resources with the Resource Manager and working with the Node Manager(s) to
execute and monitor Containers and their resource usage. It is in charge of negotiating suitable
resource Containers with the Resource Manager and keeping track of their progress. The Resource
Manager monitors the Application Master, which operates as a single Container.
NameNode: The NameNode is a component of the master system. Its main function is to manage all of the metadata, that is, the list of files saved in HDFS (the Hadoop Distributed File System). In a Hadoop cluster, data is stored in the form of blocks, as we all know.
Self Assessment
1. Splunk is a software used to _______________ machine data.
A. search and attention
B. search and analyze
C. surfing and analyze
D. none of the mentioned
2. The Administrator drop down menu allows you to customize and modify the
____________ information
A. Administrator's
B. Reporting
C. Customer
D. User
3. The link to _______________ brings us to the features where we can locate the data sets that are
accessible for searching the reports and alerts that have been produced for these searches.
A. search and records
B. short and reporting
C. search and reporting
D. None of the above
4. Data ingestion in Splunk happens through the ______ feature which is part of the search and
reporting app.
A. Add data
B. Upload data
C. Ingest data
D. None of the above
8. Which of the following problems does every programmer deal with, and business users need to keep in mind?
A. Datatype
B. Memory
C. Disk usage
D. All of the above
10. Select the options that can be used to install splunk enterprise on windows
A. GUI interface
B. Command Line Interface
C. Both
D. None of the above
11. Select the parameter(s) that prevent users from installing Splunk.
A. Unsupported OS
B. Windows Server 2003
C. Both
D. None of the above
12. The MAX_PATH path restriction in the Windows API is ___ characters long.
A. 250
B. 260
C. 270
D. None of the above
13. Which feature of Splunk is used to search the entire data set that is ingested?
A. Search & Reporting
B. Refining search results
C. Using fields in search
D. Sharing the search result.
14. Which of the following formats are available for exports?
A. CSV
B. XML
C. JSON
D. All of the above
15. Which of the following are the components of the SPLUNK search processing language (SPL)?
A. Search terms
B. Commands
C. Functions
D. All of the above
Answers for Self Assessment
6. B 7. A 8. D 9. C 10. C
Review Questions
1) Write down the steps for installing Splunk Enterprise on Windows.
2) What is data preparation and Datameer?
3) Write down the functions of the Search and Reporting app.
4) What are the different types of Splunk dashboards and also write down components of
Splunk architecture?
5) What are the benefits of feeding data into a Splunk instance through Splunk Forwarders?
Further Readings
• Maheshwari, Anil (2019). Big Data. McGraw-Hill Education.
• Mayer-Schonberger, Viktor; Cukier, Kenneth (2013). Big Data: A Revolution That Will Transform How We Live, Work, and Think. Houghton Mifflin Harcourt.
• McKinsey Global Institute Report (2011). Big Data: The Next Frontier for Innovation, Competition, and Productivity. Mckinsey.com
• Marz, Nathan; Warren, James (2015). Big Data: Principles and Best Practices of Scalable Realtime Data Systems. Manning Publications.
• Ryza, Sandy; Laserson, Uri et al. (2014). Advanced Analytics with Spark. O'Reilly.
• White, Tom (2014). Mastering Hadoop. O'Reilly.
Web Links
1. Apache Hadoop resources: https://hadoop.apache.org/docs/r2.7.2/
2. Apache HDFS: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
3. Hadoop API site: http://hadoop.apache.org/docs/current/api/
4. NOSQL databases: http://nosql-database.org/
5. Apache Spark: http://spark.apache.org/docs/latest/
6. Tutorials on Big Data technologies: https://www.tutorialspoint.com/