Big Data & Hadoop ecosystem essentials for managers
Manjeet Singh Nagi
(Manjeet Singh Nagi - https://in.linkedin.com/in/manjeetnagi)
Index
Chapter 1 – A brief history of …
Chapter 2 - NoSQL databases
Chapter 3 - The backbone I (Hadoop, HDFS, MapReduce)
Chapter 4 - The backbone II (MapReduce continued)
Chapter 5 – A quick view of the ecosystem around Hadoop
Chapter 6 - Hive
Chapter 7 - Pig
Chapter 8 - Hbase
Chapter 9 - Sqoop
Chapter 10 - Flume
Chapter 11 - Kafka
Chapter 12 - Oozie
Chapter 13 - Zookeeper
Chapter 14 - Solr
Chapter 15 - Giraph
Chapter 16 - Putting it all together
Chapter 17 – Hadoop ecosystem on Amazon
Chapter 18 – Machine Learning with Mahout
Preface
Started writing this ebook explaining Big Data to managers. Did not get time to complete it. Still, I am uploading it for everyone to have a look.
Chapter 1 – A brief history of …
Any data set that cannot be processed on a single machine within a reasonable amount of time is big data. The phrase "within a reasonable amount of time" is critical in determining whether the problem at hand qualifies as a big data problem.
Theoretically, any amount of data can be processed on a single machine with a large amount of storage and multiple processors (e.g. a mainframe). But if that machine takes a couple of days, or even a day, the result may not be of much use to the business. If the business, customer or consumer of the data is OK with the data being processed in as much time as it takes on a single machine (and there are valid scenarios for such requirements), you do not need to treat the problem as a big data problem. But in the world today, a large amount of data is coming to companies, and quicker analysis or processing of this data can help them get quicker insights and make quicker decisions.
Theoretically, big data has three attributes: volume, velocity and variety. Let's understand each first.
In a big data problem the amount of data that needs to be processed is typically huge (volume). It might run into terabytes or petabytes.
The data typically comes in at high speed, in real time in some cases (that's velocity).
And it comes in a lot of variety. Variety could mean the data comes from different sources, each of which sends it in a different format. Even within the data from the same source the format could vary over a period of time. Even within the data from the same source at a given time, the data may not have a structure.
Having said that, why are companies getting such a huge amount of data, at such velocity and in so much variety?
The following developments over the past couple of decades led us to big data.
Digitization of organizations – Over the past three decades or so organizations have become more and more digitized. Every activity done by organizations has become digitized. Every interface of the organization, be it with consumers, partnering vendors or government agencies, has become digitized. All this creates a lot of data. But all this would not have generated data (in the volume, velocity and variety) needed to qualify as big data unless the developments mentioned in the following paragraphs had taken place.
Web 2.0 – Web 2.0 introduced technologies which made billions of people not just consumers of websites but content generators. Blogging and social websites are examples of Web 2.0. Even on sites not typically classified as social or blogging sites there are features which enable billions of people to generate content, e.g. sharing articles from news websites, commenting on specific content on a website, etc.
Web 2.0 is a hotly debated term. It is not as if a new version of the web or any related technology was released. But the web in the last decade of the last century was about the flow of information from website owners to billions of web users. Slowly the web evolved to enable billions of users to generate the content. The content on the web today is much more democratic. It is by the people and for the people.
Mobile devices – With the advent of mobile devices, users are performing many more activities and spending more hours on the web than earlier. Add to that the fact that mobile
devices capture much more contextual information about the user (location, for example) than desktops did earlier. This contextual information, if processed and analyzed, can enable organizations to understand their consumers much better and provide much more meaningful services and/or products. Also, the very fact that users are spending a lot of time on their mobile devices means the amount of information being generated is much more than what was generated when users were using desktops.
More digitization of the organizations – With more and more retail, social interactions and information consumption moving to the web, organizations need to capture, literally, every click of a user to understand her better. As opposed to a brick-and-mortar store, where a user can be observed physically and assisted by in-store assistants, in an e-tail website the only way to observe a user is to observe and analyze every click she makes on the web. E-tail offers an advantage over brick-and-mortar shops in the sense that the user activity can be saved for analysis later, which is not possible in a brick-and-mortar shop. This analysis needs a lot of data, in the form of logs, to be analyzed.
Typical pattern of big data problem solution
As we all know from experience, in a typical application the time taken to process the data is orders of magnitude smaller than the time taken for IO of the data from the data repository (disk or database). Also, the time taken to read (IO) the data over the network (say from network storage or a database on another server) is many times larger than the time taken to read the data locally from disk.
So typically when a big data problem is solved,
1. The data is distributed across multiple machines (called nodes). Transferring a petabyte of data to a single machine would take much more time than dividing this data into smaller chunks and transferring the chunks to 100 smaller machines in parallel. The IO is now done on 100 nodes in parallel, which reduces the IO time significantly (a rough arithmetic sketch follows below).
2. Now that the data is distributed across multiple nodes, the code/application/binary/jar etc. is copied to all the nodes. This is unconventional compared to a typical application, where data is brought from multiple sources to a single machine where the application resides and is processed on that single machine. In big data solutions it is the application that moves closer to the data.
3. Finally, the output from all the nodes is brought to a smaller number of nodes (many
times only 1 node) for final processing or summarization.
So, as you can see, the solution for a big data problem is about distributed storage (#1 above) and distributed processing (#2 and #3). The evolution of the solutions for big data problems also happened in approximately the same manner: first, many solutions around distributed storage arrived, and then around distributed processing.
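A rough back-of-envelope calculation makes the IO argument concrete. The numbers below (1 petabyte of data, roughly 100 MB/s of disk throughput per machine, a perfectly even split) are illustrative assumptions, not benchmarks:

```python
# Illustrative arithmetic only: assumed throughput, ideal even split, IO time only.
DATA_BYTES = 10**15                  # 1 petabyte
DISK_THROUGHPUT = 100 * 10**6        # ~100 MB/s per machine (assumption)

single_machine_secs = DATA_BYTES / DISK_THROUGHPUT
print(f"1 machine  : {single_machine_secs / 86400:.1f} days just to read the data")

nodes = 100
print(f"{nodes} machines: {single_machine_secs / nodes / 3600:.1f} hours to read the same data in parallel")
```

On these assumptions a single machine spends well over 100 days just reading the data, while 100 machines reading their own chunks in parallel finish the IO in a little over a day.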
How did it all start?
Commercial relational databases have ruled the roost when it comes to persistent storage since the 70s/80s. These relational databases had their own advantages which made them so popular. But they had certain limitations which did not come to the fore till the late 90s.
In the late 90s and the early part of this century companies had more and more data to store. The option available with relational databases was to buy bigger and bigger machines, which are really costly.
Also, in order to keep their websites available 100% of the time (you do not expect Google to be down, do you?), companies needed to scale out (add more active hot backups). This made relational databases costly not only from a hardware perspective (relational databases in companies always ran on high-grade servers) but also from a licensing perspective. The licensing cost for a relational database was directly proportional to the number of machines it was going to run on. E-tailers and search giants had data that needed 100s and 1000s of machines to store with fault tolerance.
Also, relational databases were designed to store data in a fixed format. They were designed to lock in the schema at the time of database design. But companies were getting data which was unstructured to a large extent (imagine logs). They could have formatted this data and stored it in a structured format in the relational database, but that eliminated any possibility of using data discarded during the formatting stage whose significance was realized only later. Companies were looking for persistent data storage where the schema was locked in not at the time of database design but at the time of database read.
To summarize, organizations were running into the following limitations of relational database storage:
Licensing cost prohibited the scaling out needed to store large data sets.
Licensing cost and the higher-grade servers needed were prohibitive for creating fault tolerance in the storage.
Relational databases were designed to lock in the schema at the time of database design.
As companies started coming up against these limitations, many of them started designing databases of their own and bringing them out in public in the form of open source databases. These databases were collectively called NoSQL databases. All these databases had the following attributes (in addition to the fact that they were open source):
They were designed to run on clusters made of commodity hardware. Unlike relational databases, they did not need high-end servers.
They were inherently designed to run on clusters. So as the size of the data increases, an organization could just add more commodity hardware and scale out rather than buying costly servers.
Fault tolerance was inherent in their design. Any data on one node of the cluster was backed up on another node (the number of backups was configurable, not only at the database level but at a much more granular level). This low-cost fault tolerance made them much more resilient on commodity hardware than relational databases on enterprise servers.
They were designed for unstructured data. So you could just load the data in whatever format you get it. You need not even know what information comes in the data. It was up to the application to know what to expect in the data.
NoSQL databases also challenged the very foundation of relational databases: that database updates must be ACID (Atomic, Consistent, Isolated and Durable). They questioned whether every business scenario really needs the database to be ACID compliant. We will get into much more detail on this in Chapter 2.
While many authors do not talk about NoSQL databases when they talk about big data technologies, NoSQL databases brought to the fore distributed storage for big datasets as we know it today.
Sometime in the early part of this century, Google published two papers.
One of the papers was about their distributed file storage system. This was not the first distributed file system in the world, but it had many architectural and design aspects to solve a very specific problem Google had at hand. It was designed for:
• Fault tolerance using commodity hardware. The data was distributed across a cluster of commodity machines (instead of high-end servers). Since the machines are not high-grade, there is a high risk of failure. The distributed file system takes care of backing up the data on each node onto other nodes and recovering it in case a machine fails.
• Scenarios where files written once to the distributed file system are read multiple times.
• Random reads (reading a specific record from the file) are not required or are an exception.
• Files are required to be read sequentially in big chunks rather than one record at a time. These big chunks are also read in a sequential manner rather than from random places in the file.
• Random writes (updating a particular record) are not needed. So you do not have a scenario where you update a record in the file.
• Updates to the file are about adding/appending more data, and that too in huge chunks rather than one record at a time.
• Scenarios where a modest number of huge files need to be stored rather than a huge number of modest/small files.
• Clients (of the distributed file system) which want to process a bulk of data faster (throughput) rather than a small amount of data quickly (latency).
The other paper from Google was about a framework they developed for processing their data, called MapReduce. In this framework the user specifies a Map function that transforms the data and a Reduce function that summarizes the output of the Map function. The MapReduce framework takes on the onus of
1. distributing the data to be processed across many nodes
2. distributing the Map and Reduce functions to all the nodes so that the code is closer to the data and hence IO is reduced (refer to the typical pattern of a big data solution we discussed earlier in this chapter)
3. scheduling the Map and Reduce functions to run on all the nodes
4. recovering from a failed machine – the framework takes care of restoring the data from a backup on another node and restarting the Map or Reduce function there if a machine fails.
The MapReduce framework was designed to run simple functions on a huge amount of data. It lets programmers write the Map and Reduce functions while it takes care of distributing the data and code, scheduling the run and recovering from failures.
I do not want you to get bogged down by the term MapReduce. It is similar to the typical processing of data in other applications. Here is more detail on what Map and Reduce functions are, to make you more comfortable before we move forward.
A Map function accepts a record in the form of a key-value pair, does some processing or formatting on the record, and generates an output in the form of another key-value pair. Do not think of the "value" in the key-value pair as a single field. A "value" could be a complete record with a lot of fields or information in it. E.g. the key could be an employee ID and the value all the details of that employee, or the key could be a transaction ID and the value all the details of the transaction. It is up to the Map function to decide what processing or formatting it wants to do on which fields in the value.
Similarly, the Reduce function reads the key-value output from all the Map functions running on different nodes and summarizes it to generate the final output.
A very simple example: say you have a huge file which has the details of the employees of many organizations from around the world. What you want to calculate is the average salary for each designation (assume there are standard designations). Your Map function will read the part of the input file provided to it; the key of each input record is the designation and the value is the rest of the information about that employee. For each input record it will generate an output with the key as the designation and the value as the salary from that record. It sounds simple, doesn't it? What is important is that the Map function is parallelizable. You can divide your input records across as many processing nodes as you have and run the Map function in parallel on all those nodes. The Map function does not depend on information from another record on another node while processing a specific record.
The Reduce function in our example will read the records from all the nodes where the Map function ran. Its input is the key-value output of the Map function. It will compute the average salary for each designation present in the file.
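To make this concrete, here is a minimal sketch of such a Map and Reduce pair written as Hadoop Streaming style Python scripts. The input format (comma-separated records with the designation and salary in fixed positions) is an assumption for illustration, not something prescribed by Hadoop.

```python
#!/usr/bin/env python3
# mapper.py - emits "designation<TAB>salary" for every employee record on stdin.
# Assumes each input line looks like: employee_id,name,designation,salary (illustrative format).
import sys

for line in sys.stdin:
    fields = line.strip().split(",")
    if len(fields) < 4:
        continue                     # skip malformed records
    designation, salary = fields[2], fields[3]
    print(f"{designation}\t{salary}")
```

```python
#!/usr/bin/env python3
# reducer.py - receives mapper output sorted by key and emits the average salary per designation.
import sys

current_key, total, count = None, 0.0, 0

def emit(key, total, count):
    if key is not None and count:
        print(f"{key}\t{total / count:.2f}")

for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current_key:           # keys arrive sorted, so a new key means the old one is complete
        emit(current_key, total, count)
        current_key, total, count = key, 0.0, 0
    total += float(value)
    count += 1

emit(current_key, total, count)      # flush the last key
```

Such a pair would typically be submitted through the Hadoop Streaming jar; the framework handles splitting the input, running the mapper on each node holding a block, sorting the output by key and feeding it to the reducer.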
This led to the development of Hadoop, an open source product which delivers capabilities similar to the ones shared by Google in their two papers. Hadoop has two components:
HDFS (Hadoop Distributed File System) – This is similar to Google's distributed file system (as described above). As the name suggests, HDFS is a distributed, fault-tolerant file system. It enables storing large files across a cluster of commodity machines.
MapReduce – MapReduce is a framework to process data in the form of key-value pairs by distributing the key-value pairs across a cluster of machines. It runs in two steps. The first step is Map, where input in the form of key-value pairs is processed to generate intermediate key-value pairs. The intermediate key-value pairs then go through a Reduce step which summarizes them to generate the final output.
Hadoop was quickly adopted across organizations. This eventually led to the development of a lot of other products which extended the functionality of Hadoop further, e.g. Flume, Sqoop, Pig, Hive etc. We will understand each of these open source products developed around Hadoop in subsequent chapters, in enough detail for us to be able to design, at a high level, a solution for a big data business problem.
Chapter 2 - NoSQL Database
How NoSQL databases are so scalable
When we say an application or a database is scalable, we generally mean horizontal scalability. The capacity (to process more data or take more load) of a scalable solution can easily be increased by adding more machines to the cluster. On the other hand, the capacity of a not-so-scalable solution either cannot be increased at all or can be increased only by replacing the existing machine/server with a bigger (costlier) one (vertical scalability).
The way relational and NoSQL databases store data is very different. This difference makes NoSQL databases very scalable and cluster-oriented. Let's understand this with an example.
Let's take the example of a professional networking website. Users maintain information about their educational institutes (the schools and colleges they passed from) in this application. Let's also assume that the typical access pattern from the application is such that every time the application accesses user information it accesses her school/college information as well.
A typical relational database design to store this information would be to save users and educational institutes in two different tables and maintain the relationship between the two using foreign keys (or using a third table to maintain the start and end dates of the relationship).
Typical design in a relational database:
User table (User ID, User Name, User DoB, …)
Education Institute table (Education Institute ID, Institute Name, Institute City, …)
Relationship table (User ID (foreign key), Education Institute ID (foreign key), Start Date, End Date)
Nowhere in this design have we told the database the typical access pattern of the data from the application, i.e. we have not told the database that every time the application accesses user information it will access her school/college information as well.
Now let's look at the database design of a NoSQL database. A NoSQL database would typically be designed in such a way that a user's school/college information is embedded within
the user information itself and stored at the same physical node in the cluster. So the user
information would be like,
User {Name, DoB, Current designation, Educational Institutes [(Institute1, Start Date, End Date), (Institute2, Start Date, End Date)…]}
Note that the number of educational institutes for a user can vary from zero to any number your application wants to allow. Relational databases are generally not conducive to storing such limitless, list-like information. If you tried to embed the educational institutes within the user information in a relational database, it would also become highly denormalized and inefficient.
The NoSQL database would store the user and her educational institute information at the same node in a cluster. If the educational institute information was not embedded in the user information but instead maintained separately with a relation between the two (like in a relational database), the two could be saved on different nodes. In that case, every time user information was accessed the application would have to connect to another node as well to get the information on the educational institute (due to the typical access pattern of our application described above). This would increase the IO and slow down the application.
This way of storing data makes NoSQL databases very cluster-oriented and scalable. As you get more users, you can add more nodes to the cluster and spread the data across the cluster. When the application needs the data for a user, the database gets the data from the node it is on.
You cannot scale a relational database in the same manner. The users and educational institutes are maintained as separate tables. If you spread user information across the nodes of the cluster, what should you do about the educational institutes? Typically many people would have gone to the same institute. A relational database would maintain this many-to-one relationship using foreign keys. But you cannot spread educational institutes across nodes, because a user on node 1 would have gone to an institute on node 2 (do that and you increase the IO).
Please note that NoSQL makes a very strong assumption about how the data will typically be accessed. If there are multiple ways in which data will typically be accessed, then NoSQL databases would probably not be a good option. In our example, what if the application also needs to generate reports by counting users by their educational institute? In that case the application would have to scan through all the users across all the nodes in the database to get the output, which would be very inefficient. In such a scenario a relational database would be a good option, or you could use NoSQL for general queries and create a materialized view for storing counts of users by educational institute.
I hope you can now imagine how NoSQL databases store data. They spread the data across the nodes in the cluster, but they ensure that data that is typically accessed together stays on the same node (the latter needs to be ensured by a good design). As the data in the application increases, one can add more nodes to the database to scale the solution horizontally.
Please note that, conceptually, it is not as if a NoSQL database stores one table on one node and another table on another node. The scalability comes from the fact that it can distribute each row of a table to a different node. Imagine (pardon the very simplistic example) you have a pack of Pringles, a pack of chocolates and 10 dishes (plates to serve them). Unless you open the packs of Pringles and chocolates you cannot use the 10 dishes; you will be able to use only 2 dishes, so the guests who want the Pringles and chocolates would put all the load on those two dishes. But if you open the packs you can spread the Pringles and chocolates across all 10 dishes. Some dishes could have only Pringles, some
only chocolates, some a combination of both. Some would have more quantity, and you can keep them near areas with more guests. Others can have less quantity and can be kept near guests who will not consume as much. That's scalability.
Atomicity
Leaving the technical terms aside, what atomicity means is that related database updates should be done in a manner such that either all are done or none is done. In a relational database, atomicity is ensured by the concept of a transaction. If updates to multiple tables need to be done such that either all are done or none, the relational database wraps these updates in a transaction and makes the updates. If there are issues after updating a few tables and the rest of the tables could not be updated, the database rolls back the transaction, i.e. the updates already made to the tables as part of the transaction are rolled back.
Let's take an example. Say there is a simple application that records the transactions done on a bank account and maintains the final balance. There are two tables in the database: one stores the transactions, the other stores the final balance. If a user executes a transaction on an account, the application needs to make two updates. First, it needs to insert the transaction into the transaction table. Second, it needs to update the final balance in the final balance table. The application will indicate to the relational database that these two updates constitute one transaction. The database will update the tables in any order but ensure that either both updates are done or none. If, after updating the transaction table first, it runs into issues and is not able to update the final balance table, it will roll back the updates made to the first table and inform the application, which must have code to handle such exceptions.
NoSQL databases manage atomicity a little differently. They are atomic to an extent, but not thoroughly.
Let's continue with the same account update example we used for understanding atomicity in relational databases. If it were a NoSQL database, there are two ways in which it could be designed. The final balance could be embedded in the table which lists the transactions in an account, or the final balance could be a separate table with a relationship between the transaction and final balance tables.
Design 1
{Bank Account ID, Final Balance, Transactions [(Transaction1 ID, Date, Credit/Debit Ind, Amount, Other transaction details), (Transaction2 ID, Date, Credit/Debit Ind, Amount, Other transaction details)…]}
Design 2
Table 1 (Bank Account ID, Final Balance)
Table 2 (Bank Account ID, Transactions [(Transaction1 ID, Date, Credit/Debit Ind, Amount, Other transaction details), (Transaction2 ID, Date, Credit/Debit Ind, Amount, Other transaction details)…])
In Design 1 the final balance is embedded within the same table which has the list of transactions. So the updates made to the transactions and to the final balance will either both be done or neither will be; it will not happen that one of them is done and the other is not. So atomicity is ensured as much as in a relational database.
In the case of Design 2, the final balance of an account could be stored on a node different from the list of transactions for that account. The NoSQL database would not be able to ensure atomicity across nodes. The application will have to ensure that either both the updates are
made or none. A NoSQL database will ensure that the updates made to one table are either all done or none, but it cannot ensure this across tables.
So atomicity in a NoSQL database is ensured by design. While designing the database, if we keep the data on which we want atomic updates together, atomicity is ensured because the data update is part of a single update. Anything more than this needs to be ensured by the application.
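As a minimal sketch of Design 1 in code, here is how a single-document (and therefore atomic) update might look with MongoDB's Python driver, pymongo. The connection string, collection and field names are illustrative assumptions, not taken from the book.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")     # assumed local instance
accounts = client["bank"]["accounts"]                 # hypothetical database/collection names

txn = {"TransactionID": "T1001", "Date": "30/06/2016",
       "CreditDebitInd": "D", "Amount": 500}

# The transaction list and the final balance live in one document,
# so this single update is applied atomically - either both changes happen or neither does.
accounts.update_one(
    {"BankAccountID": "ACC-1"},
    {"$push": {"Transactions": txn}, "$inc": {"FinalBalance": -txn["Amount"]}},
)
```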
Consistency
NoSQL databases were designed to run on clusters of commodity-grade machines. Machines of commodity grade have a higher chance of failure, so any data saved on a node must be backed up on other nodes. NoSQL databases generally store a copy of each piece of data on 3 different nodes (this is called the replication factor and is configurable). This adds complexity to the game. If your data is replicated across 3 nodes you need to ensure they are in sync; if you don't, they will go out of sync and different users accessing data from different nodes will read different versions of the data. But if your application has to wait till all the nodes are updated before confirming to the user that the update has been made, it will make the application slower. In a relational database an update on only one node would be needed (or two in the case of a hot backup). But in NoSQL there would be 3 IOs needed (or more if your database is configured that way), which would make the application less responsive.
So NoSQL databases use the concept of a quorum. When updates are made, the NoSQL database does not wait for the update to reach all the nodes; it waits only for a majority to be updated. So if the replication factor is 3, the NoSQL database waits for the update to be confirmed by only 2 nodes (the quorum for updates). The third node becomes consistent later. This concept is called eventual consistency, as the different nodes eventually become consistent. What if one of these nodes fails before the update is made to the 3rd node? The NoSQL database takes the latest update from the remaining two nodes on which the data is saved and replicates it on another node.
What about the quorum while reading the data? The NoSQL database does not read the data from all the nodes and give the result to the user; that would make the application slower. The number of nodes from which it reads the data (the quorum for reads) should be 1 more than the number of nodes which were not part of the quorum for updates. So if your data was replicated across 3 nodes, 2 nodes were part of the update quorum and 1 was not, the NoSQL database would read the data from 1 + (number of nodes which were not part of the update quorum), i.e. 1 + 1 = 2 nodes.
The general rule is that
Qu + Qr > Rf
and Qu > Rf / 2
where Qu = quorum for updates, Qr = quorum for reads and Rf = replication factor.
Please note that any operation that needs more nodes to participate in its quorum will become slower than the complementary operation (read is complementary to write and vice versa), while the complementary operation becomes faster, per the above
equations. So if your replication factor is 7 (Rf) and you configure the quorum for updates as 6 (Qu), then you need to read data from only 2 (Qr) nodes (refer to the equations above). Reads will be much faster than updates. Based on the requirements of your application you can configure all 3 parameters of the above equations in the database. If you need faster updates, a smaller update quorum (but still a majority) is good. If you instead need faster reads, you need a higher quorum for updates and a lower quorum for reads. Some databases allow you to configure the values of the 3 parameters not only at the database level but also at the transaction level. So while executing a transaction you can indicate to the database whether you want the transaction to be confirmed by a majority or by a smaller number of nodes.
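A small Python sketch shows how the two rules above can be checked for any proposed configuration:

```python
def quorum_is_valid(rf: int, qu: int, qr: int) -> bool:
    """rf = replication factor, qu = update quorum, qr = read quorum."""
    return (qu + qr > rf) and (qu > rf / 2)

print(quorum_is_valid(rf=3, qu=2, qr=2))   # True  - a common default
print(quorum_is_valid(rf=7, qu=6, qr=2))   # True  - slower writes, faster reads
print(quorum_is_valid(rf=3, qu=2, qr=1))   # False - a read could hit only the stale node
```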
Types of NoSQL databases
NoSQL databases are generally classified into four categories based on how the information is stored and accessed –
Key-value store
This is one of the easiest NoSQL database categories to comprehend. These are typical key-hash stores. You can store and fetch any value for a given key. The database does not care what is inside the value. You can store XML, JSON or anything you want as a value. You can even store different formats for different keys in the same table (called a bucket here). The database does not care. The onus of making sense of the value read from the database lies with your application. This also means the database cannot do any validations on the value, it cannot create an index on the value, and you cannot fetch data based on any information within the value. All access is done only by the key, which makes it very fast.
These are typically used to store session information, shopping cart information and user profiles, all of which require fast access.
A table in a key-value data store is generally called a bucket, so a database can have multiple buckets. Buckets are used to categorize keys and store them separately. E.g. if you have three different values for a key, you can merge the values into one value and store it (Design 1). In that case the onus of reading the value and splitting it into 3 different values lies with your application. Or you can have 3 buckets and store the 3 values separately (Design 2). The first design involves less IO and is hence faster. The second design has more IO and is hence slower, but it is less complex.
Design 1: one bucket in the database; the three values are merged and stored against Key1 as a single value (Key1 → Value1, Value2, Value3).
Design 2: three buckets in the database; each bucket stores one of the values against Key1 (Bucket 1: Key1 → Value1, Bucket 2: Key1 → Value2, Bucket 3: Key1 → Value3).
Atomic updates are ensured for a given key-value pair. But if the application needs atomic updates across buckets in a database, the application itself will have to ensure that.
Examples of these databases are Redis, Memcached DB, Berkeley DB and Amazon's DynamoDB.
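A minimal sketch of this style of access, using the redis-py client against an assumed local Redis server (the key name and JSON payload are illustrative):

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

# The database does not care what the value is - here the application stores a JSON blob.
session = {"user_id": "U42", "cart": ["item1", "item2"], "logged_in": True}
r.set("session:U42", json.dumps(session))

# All access is by key; interpreting the value is the application's job.
raw = r.get("session:U42")
print(json.loads(raw)["cart"])
```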
Document store
These are similar to key-value stores, but these databases do not rely on a separate key, and the value part (called a document in this case) is not opaque. So you can fetch data based on different fields within the document. You can very well save your key within the document and fetch based on that field. Indexes can be created on fields within the document. The schema of the information within the document can vary across documents saved in different rows. Tables are called collections.
[Figure: a collection with two rows/documents – Document 1: {Name: ABC, LastName: XYZ, DoB: DD/MM/YYYY}; Document 2: {Name: DEF, LastName: HKJ, Place: Mumbai}.]
Please note that the schemas of the documents in the two rows are different.
MongoDB and CouchDB are famous examples of this category.
These databases are generally used for event logging by enterprise applications and as a datastore for document management systems.
Atomicity is maintained at the single-document level, just as a key-value store maintains atomicity at the single key-value level.
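A minimal document-store sketch using pymongo against an assumed local MongoDB instance (collection and field names are illustrative). Note that the two documents have different schemas, queries can use any field, and an index is created on a field inside the document:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed local instance
users = client["demo"]["users"]                     # a collection, i.e. the table equivalent

# Two documents with different schemas can live in the same collection.
users.insert_one({"Name": "ABC", "LastName": "XYZ", "DoB": "01/01/1990"})
users.insert_one({"Name": "DEF", "LastName": "HKJ", "Place": "Mumbai"})

users.create_index("LastName")                      # index on a field inside the document
print(users.find_one({"Place": "Mumbai"}))          # fetch by any field, not just a key
```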
Column family
These databases are more difficult to understand than key-value and document stores, but they are the most interesting ones. They are also the most relevant from a Hadoop perspective, because Hbase, a data store on top of HDFS, belongs to this category.
These databases are accessed by a key to read multiple values/columns. Different columns are grouped together into column families. Columns that may be accessed together are generally grouped into a column family and are stored together.
Try not to imagine the columns here like the columns in a relational database. The columns in a relational database have the same name across multiple rows, and if a row does not have a value for a column it is saved as null. But in a column-family database, if a row does not have a column it simply does not have that column at all.
The example below shows a typical column-family data store:
Key User1 → ColumnFamily UserDtls {Name: Manjeet, MiddleName: Singh, LastName: Nagi}; ColumnFamily Education {Institute: AMC University, StartDate: 30/06/2012, EndDate: 30/06/2016}
Key User2 → ColumnFamily UserDtls {Name: XYZ, LastName: ABC}; ColumnFamily Education {Institute: AMC University, StartDate: 30/06/2012, EndDate: 30/06/2016}
So in the example above, access to the database is by Key (User ID). The columns (Name, MiddleName and LastName) have been grouped together into a column family (UserDtls) as they will be accessed together, and the columns (Institute, StartDate, EndDate) have been grouped into another column family (Education). Please note that the columns in the first row in the column family 'UserDtls' are different from those in the second row.
Indexes can be created on different columns. While adding a new column family to a database requires a database restart, an application can very easily add new columns within a column family. Atomicity of updates is maintained at the column-family level for a given key. Since different column families for a given key can be stored on different nodes, atomicity cannot be maintained across updates to different column families for a given row.
Cassandra and Hbase are examples of these databases.
Before we move on to the next category of NoSQL database, I want to reiterate that the term column should not be visualized as similar to a column in a relational database. In relational databases the columns are similar to the columns in a spreadsheet in the sense that
- all the rows have all the columns defined on a table
- if a row does not have a value for a column, the value is saved as null
- the name of the column is not saved with each row
In a column-family table the columns should be imagined like the attributes in an XML document. Here
- all the rows need not have all the columns
- if a row does not have a value for a column, it will not save the value as null; it will not have that column at all
- the name of the column is saved with each row
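To make the column-family model concrete, here is a minimal sketch against Hbase using the happybase Python client. It assumes an HBase Thrift server running locally; the table, column-family and column names are illustrative:

```python
import happybase

connection = happybase.Connection("localhost")      # assumed HBase Thrift server

# Column families are fixed when the table is created; the columns inside them are not.
connection.create_table("users", {"UserDtls": dict(), "Education": dict()})
table = connection.table("users")

# Row 1 has a MiddleName column; row 2 simply does not have that column at all.
table.put(b"User1", {
    b"UserDtls:Name": b"Manjeet", b"UserDtls:MiddleName": b"Singh", b"UserDtls:LastName": b"Nagi",
    b"Education:Institute": b"AMC University", b"Education:StartDate": b"30/06/2012",
})
table.put(b"User2", {b"UserDtls:Name": b"XYZ", b"UserDtls:LastName": b"ABC"})

print(table.row(b"User1", columns=[b"UserDtls"]))   # read one column family for a key
```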
Graph
This category is the most difficult to comprehend. These databases store entities and the relationships between them. Conceptually, entities are nodes in a graph and relationships are depicted as directional edges in the graph. Edges can have additional attributes which depict further properties of the relationship. Neo4J is a good example of this category.
The figure below depicts the kinds of relationships that are generally stored in such databases.
[Figure: a sample graph. Person nodes (Person1, Person2, Person3) are connected by relationships such as ReportsTo, IsFriendsWith and Knows; a person Likes a book (Book1, on Big Data) or a movie (Movie1); a person has a WorksIn relationship to an organization (Org1), with StartDate and EndDate as attributes of that relationship. Each node depicts an entity, edge labels show the attributes of the relationship between entities, and one-way and two-way relationships are drawn as single- and double-headed arrows.]
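A minimal sketch of storing such a graph in Neo4J through its Python driver (the connection details, labels and relationship names are illustrative assumptions):

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Entities become nodes; relationships become directed edges with their own attributes.
    session.run(
        "MERGE (p:Person {name: $person}) "
        "MERGE (o:Org {name: $org}) "
        "MERGE (p)-[:WORKS_IN {startDate: $start, endDate: $end}]->(o)",
        person="Person1", org="Org1", start="30/06/2012", end="30/06/2016",
    )
    # Traversal query: which people work in Org1?
    result = session.run(
        "MATCH (p:Person)-[:WORKS_IN]->(:Org {name: $org}) RETURN p.name AS name", org="Org1"
    )
    print([record["name"] for record in result])

driver.close()
```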
Something important to understand is how this category differs from the other NoSQL databases in terms of scalability. A key-value data store ensures scalability by spreading the key-values across all the nodes of a cluster. It can do so because it understands (by virtue of the database design) that it is the key-value combination that will mostly be accessed together. A document store achieves this by spreading documents across the nodes of the cluster. Similarly, a column-family data store achieves this by spreading key-column family combinations across the cluster.
But a graph data store cannot do this, as the nodes in a relationship are linked to each other. So graph databases cannot spread the data across nodes in a peer-to-peer manner. They achieve scalability by using a master-slave configuration of the cluster. This can be achieved in many ways:
1. Read operations are directed to slave nodes and write operations to the master node. Once the master is updated, a confirmation is provided to the user about the database update; the slave nodes are updated after this. Adding more and more slave nodes makes reads more scalable. If writes need to be made more scalable, then the data needs to be sharded across multiple masters, and the logic to do so is very specific to the domain.
2. Writes are directed to slaves as well, but they provide confirmation to the user only after the master has been updated. This makes writes scalable as well, without really sharding the data.
As must be clear by now, graph databases are used more for networking problems (social or professional networking being one such problem).
As must be clear from the name "NoSQL", none of these databases use SQL for database access. Each has its own syntax for database operations. We have not gone into those languages, as the objective of this book is not to get to the level of code. Having said that, the languages for each of these databases are not very difficult to grasp.
So why did we spend so much time on NoSQL databases when the book is primarily about the Hadoop ecosystem? One of the open source products in the Hadoop ecosystem, Hbase, is a column-family store built on top of HDFS. We will understand Hbase in detail in Chapter 8.
Chapter 3 - The backbone (Hadoop, HDFS, MapReduce)
Hadoop is an open source software/framework/product that implements a distributed file system (HDFS) and processes your map-reduce (MapReduce) solutions.
I do not want you to get bogged down by the term MapReduce. It is similar to the typical processing of data in other applications. If you recall set theory and functions from our old school days, you will realize that we learnt about Map in school itself.
A Map is a function that processes an input data item to produce an output data item, e.g.
f(x) = x²
The above is a map function. It processes any number x and produces its square. Another way to look at the above function is using set theory (again, something we learnt in school): there are two sets, A and B, and the function (or map) f(x) maps each number in set A (1, 2, 3, 4) to its square in set B (1, 4, 9, 16).
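The same idea in a couple of lines of Python, just to fix the intuition:

```python
square = lambda x: x * x          # the map function f(x) = x^2
set_a = [1, 2, 3, 4]
set_b = list(map(square, set_a))  # apply f to every element of set A
print(set_b)                      # [1, 4, 9, 16]
```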
In our enterprise applications the functions or maps are complicated, but they are still functions/maps. E.g. let's say we have a mutual fund transaction processing batch system which receives transactions from sales agents and processes them. The first program or script in the transaction processing system would typically do some formatting on the transaction, validate it and persist it in the data store. So our first program is a function as depicted below:
f(Input Transaction) = Formatted, Validated, Persisted Transaction
Or we can imagine our program as a map, as shown below.
Set A is our set of input transactions, i.e. our input file. f(x) is our program, which maps each input transaction Tn to Tnfvp, a formatted, validated transaction persisted in the data store.
[Figure: set A = {T1, T2, T3, T4, T5}; f(x) maps each Tn in set A to the corresponding Tnfvp in set B.]
A Reduce function is just a program which reads a group of records and produces a summary of those records. Extending the same example, there could be a program at the end of the transaction processing system which sums all the transactions and produces the sum total of the amount of transactions processed that day (Fig 1 below). Or it could produce sum totals of transactions separately for each mutual fund product (Fig 2 below).
[Fig 1: T1fvp…T4fvp reduce to a single sum of the amount of transactions processed. Fig 2: the same records reduce to separate sums of the amount of transactions processed for Product A and for Product B.]
Hadoop MapReduce is most suitable where the processing of a data record is not dependent, especially during the Map function, on other data records in the input. If your Map program (in the example) needs information from other transactions when processing a transaction Tn, then Hadoop MapReduce is not the best solution for it. The reason is that Hadoop MapReduce distributes the transactions across the nodes of the cluster and sends the Map function (your program in this case) to all these nodes to process those transactions. If your program needs information from other transactions while processing a transaction Tn, there will be network overhead to get details from transactions on the other nodes. This network IO could slow down the Map.
In the case of Reduce, the program may need inputs from other transactions (say, if it is summing all the transactions), but Reduce generally runs on very few nodes. Even when Reduce runs on more than 1 node (but still very few nodes compared to Map), the data is divided amongst the nodes in such a manner that Reduce on one node will not need information from data on another node executing Reduce. In the example above, if you want to sum the transactions by product you could send the transactions for Product A to node 1 and those for Product B to node 2 and run Reduce on both nodes. Reduce will calculate the sum for each product separately on each node.
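In plain Python, the Reduce idea of Fig 2 is nothing more than a per-product summation over the key-value pairs emitted by the Map step (the records below are made up for illustration):

```python
from collections import defaultdict

# (product, amount) pairs as they might arrive from the Map step
mapped_records = [("ProductA", 100.0), ("ProductB", 250.0),
                  ("ProductA", 75.5), ("ProductB", 20.0)]

totals = defaultdict(float)
for product, amount in mapped_records:   # all records for one product can be reduced on one node
    totals[product] += amount

print(dict(totals))                      # {'ProductA': 175.5, 'ProductB': 270.0}
```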
Hadoop consists of two components – HDFS and MapReduce. We will understand each of
these in more detail in the sections below
HDFS
HDFS (Hadoop Distributed File System) is, as its name suggests, an open source distributed file system to store huge amounts of data. It splits the files that need to be stored into small blocks and stores those blocks on different nodes of a cluster, while letting the users (applications, software and frameworks which use HDFS for storage) still view each file as a single, unified, un-split file. So the distribution of the file across the nodes of the cluster is not visible to the user.
At this stage it is important to reiterate that HDFS is suitable only for certain scenarios. These are:
• Scenarios where files written once to the distributed file system are read multiple times.
• Random reads (reading a specific record from the file) are not required or are an exception.
• Files are required to be read sequentially in big chunks rather than one record at a time. These big chunks are also read in a sequential manner rather than from random places in the file.
• Random writes (updating a particular record) are not needed. So you do not have a scenario where you update a record in the file.
• Updates to the file are about adding/appending more data, and that too in huge chunks rather than one record at a time.
• Scenarios where a modest number of huge files need to be stored rather than a huge number of modest/small files.
• Clients (of the distributed file system) which want to process a bulk of data faster (throughput) rather than a small amount of data quickly (latency).
HDFS works on a master-slave architecture. The master node (generally 1) has the Namenode and SecondaryNode daemons (or processes) running on it. All the other nodes in the HDFS cluster
are slave nodes and have the DataNode process/daemon running on them. The actual blocks of any data file are saved on the slave machines where DataNodes are running. The master node only has metadata about each block of these files.
Namenode, SecondaryNode and DataNode
Namenode and SecondaryNode are the processes/daemons that run on the master node of the cluster. The Namenode stores metadata about the files stored on HDFS; it stores information about each block of each file. It does not read or write blocks of files on DataNodes. During a write operation it only tells the HDFS client the nodes where the blocks of a file can be stored. Similarly, during read operations it only tells the HDFS client the DataNodes where the blocks of each file are stored. It is the HDFS client that stores or reads the blocks of the files by connecting to each DataNode.
The metadata is stored in a file named fsimage on the disk. When the Namenode is started, the metadata is loaded into memory. After this, all metadata updates (about new files added, old files updated or deleted) are kept in memory. This is risky for the obvious reason that if the Namenode goes down, all the updates since the last restart would be lost. So the Namenode also stores the updates in a local file named edits. This eliminates the risk only to some extent: if the Namenode goes down and needs to be restarted, it will have to merge the edits file into the fsimage file, which slows down the restart of the Namenode. This risk is further brought down by adding a SecondaryNode. The SecondaryNode daemon/process merges the edits file on the primary node with the fsimage on the primary node and replaces the existing fsimage file with this new merged file.
Challenges or limitations of the HDFS architecture
Since the Namenode stores all the metadata, if it goes bad your whole cluster becomes useless; the Namenode is a single point of failure. Hence the physical machine on which the Namenode and SecondaryNode daemons run should be of a robust standard and not of the same specification as the machines on which DataNodes run, which can be commodity machines. For the same reason, the Namenode should also be backed up frequently to ensure the metadata can be restored in case the Namenode cannot be restarted after a failure.
[Figure: HDFS architecture – the HDFS client talks to the master node, which runs the Namenode and SecondaryNode daemons, and to the slave nodes, each of which runs a DataNode daemon.]
Also, as we know, the Namenode loads all the metadata from the fsimage file into memory at start-up and operates on this in-memory data. The metadata of each block of a file takes about 200 bytes. This adds a limitation to the usage of HDFS: storing a huge file broken into small blocks works fine, but storing too many small files (smaller than the block size on HDFS) creates a metadata overload which clogs the memory on the Namenode. This is primarily the reason HDFS is not a suitable distributed storage for smaller files.
As you may have observed by now, the Namenode can become a bottleneck, as all the read and write operations on the cluster need to access the Namenode for the metadata. This bottleneck problem was solved in later versions of Hadoop (Hadoop 2.0/YARN).
Each block of a file is saved on the slave nodes running the daemon/process called DataNode. DataNodes also send regular messages (called heartbeats) to the Namenode. This heartbeat informs the Namenode whether a specific DataNode is up and running.
Replication of data
Since the DataNodes run on commodity machines and the chances of these machines going down are high, each block of a file is replicated on 3 (the default, which can be changed) different DataNodes. The first replica is stored on a node chosen at random. The second replica is stored on a DataNode which is on a different rack; this protects against a rack failure. The third replica is saved on a different machine on the same rack as the second replica. The chances of multiple racks going down are low, hence the third replica can be saved on the same rack as the second replica without increasing the risk of failure, whereas saving it on a machine on a third rack would increase the network IO and make read and write operations slower, as the different copies of the replicas are accessed during reads and writes. Please note the number of replicas can be configured at the HDFS level as well as for each file. Increasing the number of replicas makes HDFS operations slower as the IO increases.
Typical read-write operations in HDFS
When a file needs to be written to HDFS, users/applications interface with the HDFS client. The client starts receiving the file. Once the received data reaches the size of a block, the client works with the Namenode to find out which DataNodes each block of the file can be saved on. Once it gets this information, it sends the block of the file to the first DataNode, which starts writing the block to its disk and at the same time starts sending it to the second DataNode, where its replica needs to be saved. The second DataNode starts writing it to its disk and starts sending it to the third. On completion of the write, the third node confirms to the second, which confirms to the first, which eventually confirms to the HDFS client, which in turn confirms to the Namenode. Once the Namenode gets the confirmation, it persists the metadata information and makes the file visible on HDFS. This process is repeated for each block of the file, and the complete file is saved in this manner. A checksum for each block is calculated and saved in HDFS to validate the integrity of each block when the file needs to be read.
A similar process is followed at the time of a read. When a file needs to be read, the HDFS client gets the DataNode information from the Namenode for each block and reads the block from the DataNode. The checksum is calculated again and matched with the checksum saved at the time of the write to validate integrity. If the read from a DataNode fails (the node is down, or the checksum fails), the block is read from a node holding its replica.
In the above read-write operation we assumed a replication factor of 3. This factor can be configured at the HDFS level or per file. Even after a file has been written to HDFS, its
replication factor can be changed. If the replication factor of a file is reduced, HDFS deletes some of the block replicas to bring the file down to the new factor.
When a file is deleted by the HDFS client, only the metadata is updated, to mark it as a deleted file. This makes deletion fast; the actual deletion of the data happens later.
All the DataNodes send messages, called heartbeats, to the Namenode every 3 seconds. If the Namenode stops receiving heartbeats from a DataNode, it assumes that node has failed. Since the Namenode maintains the metadata of the file blocks saved on each DataNode, and also knows on which other nodes they are replicated, it recreates those blocks on nodes that are still running and updates its metadata accordingly.
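To make the client's role concrete, a minimal sketch of writing and reading a file through the HDFS Java API is shown below; the block placement, pipelined replication and checksum checks described above happen behind these calls (the path and content are made-up):

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/demo/hello.txt");    // hypothetical path

        // Write: the client buffers the data, asks the Namenode for target DataNodes
        // block by block, and streams each block down the replication pipeline.
        FSDataOutputStream out = fs.create(file, true);
        out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        out.close();                                      // close() waits for the pipeline acknowledgements

        // Read: the client asks the Namenode for block locations and reads from a
        // DataNode, re-verifying the stored checksums as it goes.
        FSDataInputStream in = fs.open(file);
        IOUtils.copyBytes(in, System.out, 4096, false);
        in.close();
        fs.close();
    }
}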
MapReduce
MapReduce is an open source framework to execute your Map (and Reduce) programs on a cluster of machines. MapReduce copies your Map program (provided by you) to each node on which a block of the input file is stored and runs it on that node to process the block of input data. Once all the nodes in the cluster have run their Map programs, MapReduce copies the output from all the nodes to a smaller set of nodes, copies the Reduce program (again provided by you) to them, and runs it on each of these nodes to process and summarize the output of the Map step. Though this is a simplified view of MapReduce, this is essentially what it does. As we progress through this chapter and the next we will see a more complete and detailed view of MapReduce.
Just like HDFS, MapReduce works on a master-slave configuration. The master machine has a daemon, named JobTracker, running on it. All the other machines in the cluster are slave machines and have a daemon, named TaskTracker, running on them.
JobTracker and TaskTracker
The JobTracker is responsible for coordinating with all the TaskTrackers on the slave nodes where the Map and Reduce programs run. It checks with the Namenode (of HDFS) where the blocks of the input files are kept, sends the Map and Reduce programs to those nodes, and asks the TaskTracker on each of the slave nodes to run them. It keeps receiving heartbeats from the TaskTrackers to check that they are fine. If a TaskTracker stops sending heartbeats, the JobTracker assumes it has failed and reschedules the Map/Reduce program running on that node on another node which has a replica of that data.
Just like the Namenode in HDFS, if the JobTracker goes down, the whole cluster running MapReduce becomes unusable. So the JobTracker should be run on a machine with better specifications than the machines running TaskTrackers.
Chapter 4 - The backbone II (MapReduce continued)
Sample MapReduce solution
Let’s look at a very simple MapReduce solution. Let’s say you have billions of successful sale transactions of all the mutual fund products of all the mutual fund companies in the USA since 1975. You need to sum the transactions by year of sale. Your input file has a record for each transaction. Each record has the transaction date, the transaction amount and other transaction details. While this problem can be solved by processing the transactions on a single machine, the chances of it overwhelming even a high-end machine are very high, and even if it completes successfully it would take a lot of time. You can solve this problem much more easily by distributing the transactions over a cluster and processing them in parallel.
You need to write a Map program which will read a transaction and emit the year of sale and the transaction amount to its output file, and a Reduce program which will take multiple records (for a given year) with year of sale and transaction amount as input, sum the transaction amounts, and emit the year of sale and the summed transaction amount as output. So,
[Diagram] Map program: input record (TransactionID, TransactionDate, TransactionAmount, Mutual Fund Product, ...) -> output record (Year of Transaction, TransactionAmount).
[Diagram] Reduce program: input records (Year of Transaction, TransactionAmount1), (Year of Transaction, TransactionAmount2), (Year of Transaction, TransactionAmount3), ... -> output record (Year of Transaction, Sum of Transaction Amounts).
For the sake of simplicity, our Reduce program assumes that all the transactions it receives
belong to the same year. It just needs to sum all the transactions and emit the sum along
with the year from any of the transactions.
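A minimal sketch of what such Map and Reduce programs could look like with the Hadoop Java API is given below. The record layout (comma-separated fields with the date as the second field in yyyy-mm-dd form and the amount as the third) and the class names are assumptions made purely for illustration:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: read one transaction record and emit (year of sale, transaction amount).
class YearAmountMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        String[] fields = record.toString().split(",");   // assumed layout: id,date,amount,...
        String year = fields[1].substring(0, 4);           // assumed date format: yyyy-mm-dd
        double amount = Double.parseDouble(fields[2]);
        context.write(new Text(year), new DoubleWritable(amount));
    }
}

// Reduce: sum all the amounts received for a year and emit (year, total amount).
class YearSumReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text year, Iterable<DoubleWritable> amounts, Context context)
            throws IOException, InterruptedException {
        double total = 0;
        for (DoubleWritable amount : amounts) {
            total += amount.get();
        }
        context.write(year, new DoubleWritable(total));
    }
}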
Once your programs are ready you will provide the following inputs to Hadoop (a minimal driver sketch illustrating these inputs follows the list):
1. Name of the Map program
2. Name of the Reduce program
3. Name of the Partition program. If we skip this, Hadoop uses the default Partition class available to it. We will learn about it later in the chapter.
4. Name of the Combiner program. If we skip this, Hadoop skips the Combiner step. We will learn about it later in the chapter.
5. The jar file (in case your programs are in Java) and the path from which to pick it up
6. The path to your input file with billions of transactions
7. The number of reducers you want to run. We will specify 1 reducer for each year since 1975, so a total of 42 reducers. This will ensure each reducer receives transactions of only 1 year.
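As a rough illustration of how these inputs are handed to Hadoop, a minimal driver sketch is given below; the class names (YearAmountMapper, YearSumReducer) are the hypothetical ones from the earlier sketch, and the paths are made-up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class YearSumDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sum transactions by year");
        job.setJarByClass(YearSumDriver.class);          // the jar holding the Map and Reduce programs

        job.setMapperClass(YearAmountMapper.class);      // input 1: the Map program
        job.setReducerClass(YearSumReducer.class);       // input 2: the Reduce program
        // inputs 3 and 4 (Partitioner and Combiner) are optional; defaults are used if omitted

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        job.setNumReduceTasks(42);                        // input 7: one reducer per year since 1975

        FileInputFormat.addInputPath(job, new Path("/data/transactions"));   // input 6: the input file(s)
        FileOutputFormat.setOutputPath(job, new Path("/data/yearly-sums"));  // where the results go

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}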
Hadoop will take the input file, split it into multiple blocks and store these on multiple nodes of the cluster (as described in the section Typical read-write operations in HDFS). The JobTracker will then copy your jar (which has the Map and Reduce programs) to each of the nodes which has a block of the input file (it gets this information from the Namenode of HDFS). Hadoop will then run the following steps to execute your Map and Reduce programs. Please note in the diagram below which phases run on the Map nodes and which on the Reduce nodes.
[Diagram: on each node holding a block of the input file, the Map, Partition and Combine phases run; their outputs are then shuffled to the Reduce nodes, where the Sort and Reduce phases run.]
Map
This phase will run your Map program on each node which has a block of your data file (not on the replicas). The output of this phase will be a file on each node with the year of sale as key and the transaction amount as value. The output file on each node may have records for multiple years from 1975 to 2016.
Partition
In this phase MapReduce takes the output of the Map on each node and partitions it into as many files as there are Reducers (42 in our case). It does this by partitioning the output file of each Map by key, so the output file of each Map step will be split into (at most) 42 files, each of which contains the transactions of one year on that node. Partitioning the output of a Map by its key is the default Partition behavior; it can be customized to partition by some other criteria, as we will see in the next chapter. If we do not give Hadoop any Partition class, it uses the default class, which partitions the Map output by its key.
[Diagram: partitioning of the Map output on three nodes.
Node 1: Map output (Key1,value1), (Key1,value2), (Key2,value3) is partitioned into an output file for Reducer 1 containing (Key1,value1), (Key1,value2) and an output file for Reducer 2 containing (Key2,value3).
Node 2: Map output (Key2,value4), (Key3,value5), (Key3,value6) is partitioned into an output file for Reducer 2 containing (Key2,value4) and an output file for Reducer 3 containing (Key3,value5), (Key3,value6).
Node 3: Map output (Key1,value7), (Key2,value8), (Key2,value9) is partitioned into an output file for Reducer 1 containing (Key1,value7) and an output file for Reducer 2 containing (Key2,value8), (Key2,value9).]
Partition comes into action only if the number of Reducers is going to be greater than 1. If only one Reducer is going to be used there is no need for partitioning, as all the records from all the Maps go to that single Reducer.
Partition ensures that records for a specific key go to the same Reducer from all the nodes. Reducer nodes will connect to the nodes on which the Maps are running and collect only the files generated for them (based on the file name). But it does not ensure all the Reducers get an equal load. Roughly, it divides the keys equally between Reducers; if some key has more records in the Map output than other keys, then the Reducer assigned that key will take more time to complete. There are ways to ensure the load is divided equally between the Reducers, and we will see how this is done later in this chapter.
The default Partition program does the partitioning by calculating an index for the key of each record to be written to the output:
Index = Hash of Key % Number of reducers to run
A hash is simply a function that maps an input to a number: a given key always produces the same hash value, and a good hash function spreads different keys evenly across values (so two different keys will rarely produce the same value). % is the simple modulo function from mathematics: A % B gives the remainder left when A is divided by B.
Different index values are assigned to different Reducers. Based on the index calculated for a key, all the records with that index are written to the output file of the Reducer that has that index value assigned to it. Different keys may go to a single Reducer, but a given key will not go to multiple Reducers.
We can override this default behavior of the Partition program by extending the default class and customizing the partitioning method. E.g. in our case we can override the default behavior by partitioning simply by the key (which is the year of transaction) instead of the default behavior of calculating the index as above.
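A minimal sketch of such a custom Partitioner, assuming the map output key is the year as Text (as in the earlier hypothetical Mapper), could look like this:

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical custom partitioner: route each year directly to its own reducer
// instead of going through the default hash-and-modulo calculation.
public class YearPartitioner extends Partitioner<Text, DoubleWritable> {
    @Override
    public int getPartition(Text year, DoubleWritable amount, int numPartitions) {
        int offsetFrom1975 = Integer.parseInt(year.toString()) - 1975;
        return offsetFrom1975 % numPartitions;   // with 42 reducers, each year lands on its own reducer
    }
}

// Registered in the driver with: job.setPartitionerClass(YearPartitioner.class);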
Combiner
This is an optional step. Please note that there are millions of transactions on each Map node. Our Map program does not remove any transaction from further processing, so the output of each Map will also have millions of transactions (though each with only two fields, year of sale and transaction amount). That means there are billions of records spread across the partitioned outputs of the Mappers on multiple nodes. Sending these records to 42 Reduce nodes would cause a lot of network IO and slow down the overall processing. This is where a Combiner can help.
Since the Reducer is going to sum all the transactions it receives, with the assumption that they all belong to the same year, we can run the same summation on each partitioned output of each Mapper. A Combiner does exactly this: it takes all the records in one partition, sums the transaction values and emits the year of sale as key and the sum of the transaction amounts as value. So for each partitioned output (which has thousands of records), the Combiner generates only one record. This greatly reduces the amount of data that needs to be transmitted over the network. If you think about the behavior of the Combiner, it is like running the Reducer on the Map node before transmitting the data to the Reduce nodes.
[Diagram: the Combiner condenses each partitioned Map output on its node. Node 1: the file for Reducer 1 becomes (Key1, Sum(value1,value2)) and the file for Reducer 2 becomes (Key2,value3). Node 2: the file for Reducer 2 becomes (Key2,value4) and the file for Reducer 3 becomes (Key3, Sum(value5,value6)). Node 3: the file for Reducer 1 becomes (Key1,value7) and the file for Reducer 2 becomes (Key2, Sum(value8,value9)).]
The diagram above shows how the Combiner works, though it does not convey the scale by which it reduces the number of records that need to be transmitted to the Reducer nodes. Imagine thousands of transactions for a key in a file being summarized by the Combiner into a single record: the amount of data to be transmitted to the reducer node drops significantly.
As said earlier, the Combiner step is optional and we have to tell Hadoop to run it. Whether a Combiner can be used in your solution is very specific to the problem you are trying to solve in MapReduce: if some processing can be done locally on the Map output to reduce the amount of data before transmitting it to the Reducer nodes, then you should think about running a Combiner. We also need to write the Combiner program and add it to the jar we provide to Hadoop, and then inform Hadoop that a Combiner needs to be run. This is done by providing the Combiner class to Hadoop, just like we provide the Map and/or Reduce class.
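Since summing is associative and commutative, the Combiner for our example can reuse the Reducer logic. A minimal sketch, reusing the hypothetical classes from the earlier sketches, is shown below:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical combiner: locally sums the (year, amount) records in each partitioned
// Map output before they are shipped to the Reduce nodes. Because summation is
// associative and commutative, the logic is identical to the Reducer's.
class YearSumCombiner extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text year, Iterable<DoubleWritable> amounts, Context context)
            throws IOException, InterruptedException {
        double partialTotal = 0;
        for (DoubleWritable amount : amounts) {
            partialTotal += amount.get();
        }
        context.write(year, new DoubleWritable(partialTotal));
    }
}

// Registered in the driver alongside the Map and Reduce classes:
// job.setCombinerClass(YearSumCombiner.class);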
Shuffle
Meanwhile, MapReduce would have identified the 42 nodes that need to run the Reduce program and assigned a key (year of sale) to each of them. The TaskTracker on each of these nodes keeps scanning the nodes on which the Maps are running, and as soon as it finds an output file generated for its processing (based on the name of the file) it copies that file to its own node. Once a Reducer node has all the files meant for it, MapReduce moves to the next step.
[Diagram: the combined output files for Reducer 1 (on Nodes 1 and 3) are copied to Node 4; the combined output files for Reducers 2 and 3 (on Nodes 1, 2 and 3) are copied to Node 5.]
In the diagram above we assumed that there are only 2 nodes for the Reduce phase, and that MapReduce assigned Key1 to the Reducer on Node 4 and Key2 and Key3 to the Reducer on Node 5. We could have assumed 3 nodes for the Reduce phase as well and assigned one key to each of them, but keeping only two nodes for the Reduce phase and assigning two keys (Key2 and Key3) to the Reducer on Node 5 will help you understand the Sort phase better.
Sort
Each Reduce node will have received files from multiple Map nodes. So in this step MapReduce merges all the files received by a Reducer into one and sorts all of its input records by key (year of transaction in this case).
[Diagram: on Node 4 the Sort phase merges (Key1, Sum(value1,value2)) and (Key1,value7) into a single sorted file; on Node 5 it merges the Key2 and Key3 records into a single file ordered by key.]
Please note the Sort phase is run by default; the Reduce phase must get its data sorted by keys. We can override the default behavior of sorting by key by extending the default class, for example to sort the input to Reduce by the keys as well as the values (or a part of the value) if our scenario expects that.
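As a rough illustration of where such logic plugs in, the sketch below assumes a hypothetical composite Text key of the form "year,amount" and registers a comparator that orders by year first and then by amount; it is only a sketch under that assumption, not the default behavior:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Hypothetical comparator: order composite "year,amount" keys by year first,
// then by amount, so a Reducer sees each year's records in amount order.
public class YearThenAmountComparator extends WritableComparator {
    protected YearThenAmountComparator() {
        super(Text.class, true);   // true = create Text instances for deserialized comparison
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String[] left = a.toString().split(",");
        String[] right = b.toString().split(",");
        int byYear = left[0].compareTo(right[0]);
        if (byYear != 0) {
            return byYear;
        }
        return Double.compare(Double.parseDouble(left[1]), Double.parseDouble(right[1]));
    }
}

// Registered in the driver with: job.setSortComparatorClass(YearThenAmountComparator.class);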
Reduce
The Reducer will sum all the transactions in a file to generate {Year of Sale, Sum of Transactions} as output.
[Diagram: on Node 4 the Reduce phase turns the sorted Key1 records into (Key1, Sum(Sum(value1,value2), value7)); on Node 5 it turns the sorted Key2 and Key3 records into (Key2, sum of all the Key2 values) and (Key3, Sum(value5,value6)).]
Please note that if you are sending multiple keys to your Reduce phase then your Reduce
program should be able to handle that. In the diagram above we have assumed this. But in
the example we have been going through in this chapter, we assumed each instance of
Reducer running will get only 1 key.
Chapter 5 – A quick view of the ecosystem around Hadoop
By now we have understood the capabilities of Hadoop quite well. HDFS (Hadoop Distributed File System) offers distributed storage for huge amounts of data using a cluster of commodity hardware. The distribution of the data across the cluster is transparent to the end users and to the applications interacting with HDFS. The data is also replicated within the cluster to provide failover in case a machine in the cluster goes down.
MapReduce sits on top of HDFS and provides the capability to run MapReduce programs on the data stored in the HDFS cluster.
Over a period of time a lot of open source products have cropped up which either enhance the capability of Hadoop further or overcome the limitations of the Hadoop framework. These new products can be grouped into the following four categories:
1. Ingestion – While we have huge storage available in HDFS, transferring the huge amount of data from the sources available within enterprises can be daunting. Products like Sqoop, Flume and Kafka offer the capability to move data from enterprise sources into HDFS and vice versa. While Sqoop is used for importing data from SQL data sources within the enterprise, Kafka and Flume are used to import data from non-SQL data sources (logs etc.). Kafka and Flume have some finer differences between them, which we will see as we move forward.
2. Processing – While MapReduce offers the capability to process data stored on the HDFS cluster, in order to use MapReduce one must know how to code, and the coding required to develop MapReduce programs is quite involved. Many times you need your business users to be able to process the data stored on HDFS, and even for technology teams, developing MapReduce programs in Java or any other language can be inefficient. So frameworks or products were required which could ease the task of processing data stored on HDFS. Pig and Hive are such products. Hive offers a language, HQL, very similar to SQL, using which we can query the data in HDFS. Pig offers an easy to learn and use language called Pig Latin, with which ETL (extract, transform, load) style procedural programs can be developed to process the data on HDFS. Both HQL queries and Pig Latin programs eventually get converted into MapReduce programs at the back end and get executed. Thus Pig and Hive offer a higher level of abstraction compared to the Java program one would have to write to develop a MapReduce program directly.
3. Real-time systems – MapReduce is designed for high-throughput processing rather than low-latency processing. It can process huge amounts of data, but it has some kick-start time; it is not designed to process data quickly and turn around. The initial time Hadoop needs to identify the nodes for Map and Reduce, transfer the code to these nodes and kick-start the processing makes it unsuitable for real-time processing, where you need the response to your query/program quickly. Hbase offers such capability. Hbase basically uses the distributed storage offered by HDFS to provide key-value datastore services (refer to Chapter 2 – NoSQL databases to recall what a key-value store is). So it is a key-value type of NoSQL database using HDFS for storing the keys and the values.
4. Coordination – There are two products in this category that are used when designing solutions for big data processing with Hadoop. Oozie is a workflow scheduler to manage Hadoop jobs. Zookeeper is used for coordination amongst the different products in the Hadoop ecosystem.
So, keeping these products in mind, the ecosystem developed around Hadoop looks like this:
[Diagram: the Hadoop ecosystem. On top of HDFS and MapReduce sit Hive (SQL-like processing capability), Pig (ETL-like procedural capability) and Hbase (key-value store using HDFS). Data is brought in by Sqoop (ingest data from SQL data sources in the enterprise) and by Flume and Kafka (ingest data from non-SQL data sources in the enterprise). Alongside sit Oozie (Hadoop job scheduling) and Zookeeper (coordination amongst products).]
The subsequent chapters will each pick up one product from the ecosystem and explain it in detail. Since we have already understood MapReduce, which is for processing of data, we will take up the processing category (Hive, Pig, Hbase) first. Amongst this category, we will take up Hive first: understanding Hive is easy, as the programming is done using HQL, which is very similar to SQL, which most of us understand well. Next we will take up Pig,
which again is easy to understand, as the programming language Pig Latin is very easy. Hbase is more difficult to understand compared to Hive and Pig, so we will take it up last in this category.
Next we will take up the Ingestion category of products. We will take up Sqoop first, the reason again being that this product is related to the SQL world, which we can all relate to. Next we will move to Flume, as it originated before Kafka; once we understand Flume we can identify its limitations and see how Kafka overcomes them. At last, we will move to Oozie and Zookeeper, as understanding the other products in detail will help us appreciate these two products better.
Chapter 6 – Hive
Why Hive?
If we look back at the example of transaction processing we took up in Chapter 4, we are essentially doing the following to the transactions:
1. Select certain fields from each transaction.
2. Group the transactions by year (by sending them to different Reducers).
3. Sum the transaction amount for each group.
If you are even remotely aware of the SQL world, the equivalent in SQL is something like
Select Transaction.Year, SUM (Transaction.Amount)
From Transaction
Group By Transaction.Year
In case we wanted to filter out some transactions from processing, we could have added the filter in the Map (just an if condition). Let’s assume we want to process only those transactions which have ‘Purchase’ in a field named ‘type’. In the Map program that you develop for the processing you would add an if condition to process only those transactions which have the value ‘Purchase’ in the field named ‘type’. The equivalent SQL would be
Select Transaction.Year, SUM (Transaction.Amount)
From Transaction
Where Transaction.type=’Purchase’
Group By Transaction.Year
Let’s also consider a scenario where the transaction has another field named “ProductCode”, which has a numeric code for the financial product on which the transaction was done. We also have a file which has a mapping between the “ProductCode” and the “ProductName”. If we need the field “ProductName” in the final output from the Reducer, and also want to sum the transactions by Year and ProductName instead of only the Year of transaction, the MapReduce processing would be modified as below.
Map:
1. Select transaction with ‘Purchase’ code in the ‘type’ field of transaction for further
processing in Map
2. Output year, product code and amount for each transaction with ‘Purchase’ in the
transaction type field.
Partition:
1. Partition transactions by year so that transactions for each year go to a different
Reducer.
Combiner:
1. Sum the transactions on each Partition by Year and ProductCode.
Shuffle:
Each Reducer picks its files from the Map nodes.
Sort:
1. Sort the transactions by Year and Product Code
Reducer:
1. Load the file which has ProductCode-ProductName mapping into the memory.
2. Sum the input transactions by Year and Product Code. This time this step will sum the transactions coming from different Map nodes (in the Combiner the same processing summed only the transactions on each node).
3. Just before writing a sum, read the ProductCode-ProductName mapping from memory (loaded earlier in the Reducer) to resolve the ProductCode in the output record to a ProductName.
4. Write the Sum of transactions by year and product name to the output.
The SQL equivalent of the above processing would be
Select Transaction.Year, Product.ProductName, SUM (Transaction.Amount)
From Transaction, Product
Where Transaction.type=’Purchase’
And Transaction.ProductCode=Product.ProductCode
Group By Transaction.Year, Product.ProductName
By now you would have noticed that it takes only a few lines of SQL to do the processing that we are trying to do in MapReduce. When it comes to writing Java programs for MapReduce:
1. The number of lines of code is large.
2. There are many libraries that need to be imported.
3. You need to be aware of which out-of-the-box class to extend for your specific requirement.
4. There are variables to be defined, set and reset, and all the other complications involved in any programming.
5. There are steps for building the jar.
When you have so much raw data residing on HDFS, is there no easier way to process it? Is there no way a business person, or a person with a limited technology skill set, can process and analyze the data? Is there a tool/framework which can
1. take queries in a form similar to the SQL written above,
2. do the laborious work of developing the Map, Reduce, Partition and Combiner classes,
3. schedule as many Maps and Reducers as needed, and
4. produce the end result for the user?
That is what Hive does. It does all the four points written above and much more. Welcome to the world of Hive! Hive is a tool operating at a higher level than Hadoop. It takes away the
difficult task of writing MapReduce programs. It develops those programs based on the instructions given to it in the form of HQL (Hive Query Language), which is similar to SQL. Thus it brings the power of Hadoop within the reach of people who are not programmers but know what logic needs to be applied to the data to analyze and process it. HQL, like SQL, is much easier to pick up than Java or any other programming language, and if one already knows SQL the learning curve is gentler still.
Please note that since Hive is only a layer above Hadoop, it inherits the limitations of Hadoop:
1. Hive does not support row-level updates, inserts and deletes.
Hive architecture
The following diagram shows the architecture of Hive
Hive sits on top of Hadoop, thus taking away all the complications of writing Map and
Reduce programs to process data. There are three ways to access Hive:
CLI: This is a typical command line interface where a user can write a few queries to load,
read and process data.
HWI: Hive Web Interface is an interface on the web serving the same objective as CLI
Thriftserver: It exposes Hive functionality to other applications that access Hive via JDBC or
ODBC drivers.
[Diagram: Hive architecture. The command line interface, the web interface and the Thriftserver (reached via JDBC/ODBC drivers) sit on top of the Hive Driver and Metastore; Hive in turn sits on top of Hadoop (MapReduce and HDFS).]
Metastore: While all the data accessed by Hive is saved on HDFS, the data about databases
and tables is stored in Metastore. We will learn about what kind of data is stored in the
Metastore in the subsequent sections.
Creating tables and types of tables
Hive is a layer sitting above Hadoop. It only extends the functionality of Hadoop by letting the user provide inputs in the form of the Hive Query Language (HQL) rather than low-level programs. In this and the subsequent sections we will take a few examples of HQL to understand what each does for the user and what it does in Hadoop at the back end. While we do this, we will avoid trying to understand every option or variation possible with an HQL command; the essence is to explain the core functionality of a product without really getting into a lot of code.
Let’s assume that in addition to the transaction file (with all the details of the mutual fund transactions) we also have another file which provides the mapping between the mutual fund product ID (the financial product on which the transaction was executed) and the mutual fund name (the name of the product). In order to load and process the data available in these two files we will first create the database and tables to store the data in these files.
Create database if not exists transproc
The above command will create a directory in HDFS with the name transproc. A database is just an umbrella folder to contain and organize all the tables. An entry will also be made in the Metastore about this new database.
Once your database is created you can create tables within the database with commands very similar to the one we used for creating the database:
Create table if not exists transproc.transactions (transid STRING, transamount FLOAT, …)
Create table if not exists transproc.prodinfo (prodid string, prodname string)
The above commands would create two subdirectories within the transproc directory and also make entries in the Metastore for the two new tables. Note the ‘if not exists’ clause in these commands: Hive will not create the database or table if it already exists.
Internal and External tables
There are two types of tables in Hive.
If Hive keeps the ownership of data stored in a table then the table is called internal or
managed table. In case of an internal/managed table, when a table is dropped the data from
HDFS as well as the reference to the table in the Metastore is deleted.
If Hive does not keep the ownership of the data stored in a table then the table is called an
external table. In case of an external table, when a table is dropped only the reference to the
table from the Metastore is deleted but the data from the HDFS is not deleted. So the table
stops existing for Hive but the data is still retained in Hadoop.
External tables are used to exchange data between multiple applications. E.g. in our case of mutual fund transaction processing it may be that the product data (the product ID to product name mapping) is not owned by the department which has the responsibility of processing the transactions (a typical scenario). In such a case, the product department would make the product information available in a flat file in some HDFS location. The transaction processing application would define an external table on top of this data. When the transaction processing is done it could delete the external table, but that would not delete the product data in the flat file; that data might be referenced by other applications as well. If we do not mention in our HQL command whether the table is internal or external, Hive assumes it to be internal.
The command to create an external table is
Create external table if not exists transproc.prodinfo (prodid string, prodname string) row format delimited fields terminated by ‘,’ location ‘location of the external file’
Here the ‘row format delimited fields terminated by’ clause informs Hive to expect the fields in the external file separated by ‘,’, and the ‘location’ clause provides the location of the external file.
Internal partitioned tables
Let’s look back at the query that creates the transaction table:
Create table if not exists transproc.transactions (transid STRING, transamount FLOAT, ProductID STRING, SubproductID STRING, …)
Assume that SubproductID indicates a variety of a Product, so a Product can have different varieties, each indicated by the SubproductID.
Now let’s assume that we know the access pattern for this table. By access pattern I mean we know that when the data is accessed it will mostly be accessed for a specific Product ID and/or Subproduct ID. Let’s say we also know that the data would generally not be accessed for many or all the Product IDs at the same time.
Now the above HQL command for creating a table would create one single directory for the
table. All the data for the table would be in one directory. Every time the table is accessed
Hive (and HDFS in the back-end) would have to find the data for that particular product
and/or subproduct id to fetch it. The directory structure created in HDFS by the above
command would be
../transproc/transactions
Instead, if we know that typically data would be accessed using product ID and/or
subproduct ID we can segregate the data within the directory into separate subdirectories
for product ID and subproduct ID. This is called partitioning.
The command to partition the table is:
Create table if not exists transproc.transactions (transid STRING, transamount FLOAT, …)
Partitioned by (ProductID STRING, SubproductID STRING)
The above command will create a directory like
../transproc/transactions
As and when data is added to this table, separate subdirectories will be created within the transactions directory in HDFS for each ProductID and SubproductID combination.
Load data local inpath ‘path from where data needs to be picked up’
Into table transactions
Partition (ProductID=’A’, SubproductID=’1’)
The above command will create a subdirectory like
../transproc/transactions/ProductID=’A’/SubproductID=’1’
Anytime data is loaded into this table the command to load the data would have to specify
partition information and the data will be loaded into the directory structure for that
partition.
Please also note that the table schema no longer has the columns which are part of the partition. There is no need to save ProductID and SubproductID in the table itself, as this information can be derived from the path of the partition.
If data has to be read for a specific Product ID and Subproduct ID combination, the HQL command would be
Select * from transproc.transactions where ProductID=’A’ and SubproductID=’1’
This command makes Hive read only the specific subdirectory we created earlier. Partitioning improves the performance of Hive, as it has to read only a specific subdirectory to fetch the data.
If the command above is modified like the one given below, Hive will read all the subdirectories within the directory ../transproc/transactions/ProductID=’A’
Select * from transproc.transactions where ProductID=’A’
If the typical access pattern is not to access the data for a specific Product ID and Subproduct ID combination, then it is not a good idea to create partitions. If you create partitions by Product ID and Subproduct ID but end up writing queries that read data across multiple Product IDs and Subproduct IDs, Hive will have to scan multiple subdirectories, which will impact its performance.
External partitioned tables
Just like internal tables, external tables can be partitioned. Since the data is not managed by Hive, it assumes that the data at the external location is segregated as per the partition keys:
Create external table if not exists transproc.prodinfo (subprodid string, subprodname
string) partitioned by (prodid string) row format delimited fields terminated by ‘,’
Please note we do not declare the location of the data for a partitioned external table as we would in the case of a non-partitioned external table. That needs to be done separately, using an alter command, for each partition.
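A typical alter command for this, sketched here with placeholder values, would look like
Alter table transproc.prodinfo add partition (prodid=’P1’) location ‘location of the data for this partition’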
Loading Data
You can load data into tables from a file. If we need to load data into our transaction table the HQL command would be
Load data local inpath “path of the file here”
Overwrite into table transproc.transactions
Partition (ProductID=’A’, SubproductID=’1’)
Please note the table should have been defined with partitions on Product ID and Subproduct ID. The Overwrite clause will (as is obvious from the name) overwrite the existing data present in the target table/partition. Hive will create a directory in HDFS for this ProductID and SubproductID combination if it does not already exist. If the table is not partitioned, you can skip the Partition clause.
You can even read data from one table and insert it into another table. E.g. if we assume the transaction records were present in another table, where they were loaded for an initial clean-up by business, we can write a query like the one below to load the data into our transaction table
From PreProdTransactions
Insert Overwrite table transactions
Partition (ProductID=’A’, SubproductID=’1’)
Select * where PreProdTransactions.ProductID=’A’ and PreProdTransactions.SubproductID=’1’
Insert Overwrite table transactions
Partition (ProductID=’A’, SubproductID=’2’)
Select * where PreProdTransactions.ProductID=’A’ and PreProdTransactions.SubproductID=’2’
Insert Overwrite table transactions
Partition (ProductID=’B’, SubproductID=’1’)
Select * where PreProdTransactions.ProductID=’B’ and PreProdTransactions.SubproductID=’1’
Insert Overwrite table transactions
Partition (ProductID=’B’, SubproductID=’2’)
Select * where PreProdTransactions.ProductID=’B’ and PreProdTransactions.SubproductID=’2’
The above query will scan the PreProdTransactions table once and then create the partitions of the transactions table based on the Partition clauses. A more concise way of writing the above query is
From PreProdTransactions
Insert Overwrite table transactions
Partition (ProductID, SubproductID)
Select …, PreProdTransactions.ProductID, PreProdTransactions.SubproductID
In this case Hive itself will analyze the data present in the PreProdTransactions table and create as many partitions in the transactions table as there are unique combinations of ProductID and SubproductID in the PreProdTransactions table.
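Please note that on many Hive versions such dynamic partition inserts have to be enabled first; a typical snippet (exact defaults vary by version) is
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;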
Reading Data from Hive
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Aqusag Technologies
 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfComplete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Software Company
 
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
SOFTTECHHUB
 
Technology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data AnalyticsTechnology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data Analytics
InData Labs
 
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded DevelopersLinux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Toradex
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
 
Heap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and DeletionHeap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and Deletion
Jaydeep Kale
 
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Impelsys Inc.
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
Drupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy ConsumptionDrupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy Consumption
Exove
 
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul
 
AI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global TrendsAI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global Trends
InData Labs
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
Build Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For DevsBuild Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For Devs
Brian McKeiver
 
Big Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur MorganBig Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur Morgan
Arthur Morgan
 
Cybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure ADCybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure AD
VICTOR MAESTRE RAMIREZ
 
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Aqusag Technologies
 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfComplete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Software Company
 
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
SOFTTECHHUB
 
Technology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data AnalyticsTechnology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data Analytics
InData Labs
 
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded DevelopersLinux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Toradex
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
 
Heap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and DeletionHeap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and Deletion
Jaydeep Kale
 
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Impelsys Inc.
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
Drupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy ConsumptionDrupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy Consumption
Exove
 
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul
 
AI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global TrendsAI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global Trends
InData Labs
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 

Big data and hadoop ecosystem essentials for managers

Having said that, why are companies getting such huge amounts of data, at such velocity and in so much variety? The following developments over the past couple of decades have led us to big data.

Digitization of organizations – Over the past three decades or so, organizations have become more and more digitized. Every activity an organization performs has been digitized, and so has every interface it has with consumers, partnering vendors and government agencies. All of this creates a lot of data. But this alone would not have generated data in the volume, velocity and variety needed to qualify as big data had the developments described in the following paragraphs not taken place.

Web 2.0 – Web 2.0 introduced technologies that made billions of people not just consumers of websites but content generators. Blogging and social websites are examples of Web 2.0, and even sites not typically classified as social or blogging sites have features that let billions of people generate content, e.g. sharing articles from news websites or commenting on specific content on a page. Web 2.0 is a hotly debated term; it is not as if a new version of the web or any related technology was released. The web in the last decade of the last century was about a flow of information from website owners to billions of web users. The web then slowly evolved to let those billions of users generate the content. The content on the web today is much more democratic: it is by the people and for the people.

Mobile devices – With the advent of mobile devices, users perform many more activities and spend more hours on the web than before. Add to that the fact that mobile devices capture much more user context (location, for example) than desktops ever did.
This contextual information, if processed and analyzed, can help organizations understand their consumers much better and provide far more meaningful services and products. The sheer amount of time users spend on their mobile devices also means that much more information is generated than when users were on desktops.

More digitization of the organizations – With more and more retail, social interaction and information consumption moving to the web, organizations need to analyze, literally, every click a user makes in order to understand her better. In a brick-and-mortar store a user can be observed physically and assisted by in-store assistants; on an e-tail website the only way to observe a user is to capture and analyze every click she makes. E-tail does offer one advantage over brick-and-mortar shops: user activity can be saved for later analysis, which is not possible in a physical shop. That analysis needs a lot of data, in the form of logs, to be processed.

Typical pattern of a big data solution

As we all know from experience, the time an application takes to process data is typically orders of magnitude smaller than the time taken for IO of that data from a data repository (disk or database). The time taken to read (IO) data over the network (say, from network storage or a database on another server) is, in turn, many times larger than the time taken to read the same data from a local disk. So, typically, when a big data problem is solved:

1. The data is distributed across multiple machines (called nodes). Transferring a petabyte of data to a single machine would take far longer than dividing the data into smaller chunks and transferring them to, say, 100 smaller machines in parallel. The IO is now done on 100 nodes in parallel, which reduces the IO time significantly.
2. Now that the data is distributed across multiple nodes, the code (application/binary/jar etc.) is copied to all the nodes. This is unconventional compared to a typical application, where data is brought from multiple sources to the single machine on which the application resides and is processed there. In big data solutions it is the application that moves closer to the data.
3. Finally, the output from all the nodes is brought to a smaller number of nodes (often only one) for final processing or summarization.

So, as you can see, the solution to a big data problem is about distributed storage (#1 above) and distributed processing (#2 and #3). The evolution of big data solutions happened in roughly the same order: first came solutions for distributed storage, and then solutions for distributed processing.

How did it all start?

Commercial relational databases have ruled the roost for persistent storage since the 70s and 80s. These databases had advantages that made them very popular, but they also had limitations that did not come to the fore until the late 90s. In the late 90s and the early part of this century, companies had more and more data to store. The option available with relational databases was to buy bigger and bigger machines, which are really costly.
Also, in order to keep their websites available 100% of the time (you do not expect Google to be down, do you?), companies needed to scale out and add more active hot backups. This made relational databases costly not only from a hardware perspective (relational databases in companies always ran on high-grade servers) but also from a licensing perspective, because the licensing cost of a relational database was directly proportional to the number of machines it ran on. E-tailers and search giants had data that needed hundreds or thousands of machines to store with fault tolerance.

Relational databases were also designed to store data in a fixed format, locking in the schema at the time of database design. But companies were getting data that was largely unstructured (imagine logs). They could have formatted this data and stored it in a structured form in a relational database, but that would eliminate any possibility of later using data discarded during the formatting stage, data whose significance is often realized only later. Companies were looking for persistent storage where the schema is not locked in at design time but applied at read time.

To summarize, organizations were running into the following limitations of relational database storage:
- Licensing cost prohibited the scaling out needed to store large data sets.
- The high-grade servers needed for fault-tolerant storage were prohibitively expensive.
- Relational databases lock in the schema at the time of database design.

As companies ran up against these limitations, many of them started designing databases of their own and releasing them to the public as open source. Together, these came to be called NoSQL databases. All of them share the following attributes (in addition to being open source):
- They are designed to run on clusters of commodity hardware. Unlike relational databases, they do not need high-end servers. Because they are inherently designed for clusters, as the size of the data grows an organization can simply add more commodity hardware and scale out rather than buy costly servers.
- Fault tolerance is inherent in their design. Any data on one node of the cluster is backed up on other nodes (the number of copies is configurable, not only at the database level but at a much more granular level). This low-cost fault tolerance makes them more resilient on commodity hardware than relational databases are on enterprise servers.
- They are designed for unstructured data, so you can load data in whatever format you receive it. You need not even know what information the data carries; it is up to the application to know what to expect.

NoSQL databases also challenged the very foundation of relational databases: that every database update must be ACID (Atomic, Consistent, Isolated and Durable). They questioned whether every business scenario really needs ACID compliance. We will get into much more detail on this in Chapter 2.
While many authors do not talk about NoSQL databases when they discuss big data technologies, NoSQL databases brought to the fore distributed storage for big datasets as we know it today.

Sometime in the early part of this century, Google published two papers. One described their distributed file storage system. It was not the first distributed file system in the world, but it had many architectural and design choices aimed at the very specific problem Google had at hand. It was designed for:
- Fault tolerance using commodity hardware. The data is distributed across a cluster of commodity machines (instead of high-end servers). Since the machines are commodity grade, there is a high risk of failure; the distributed file system takes care of backing up the data on each node onto other nodes and recovering it when a machine fails.
- Scenarios where files written once to the distributed file system are read multiple times.
- Workloads where random reads (reading a specific record from a file) are not required, or are the exception.
- Files that are read sequentially in big chunks rather than one record at a time, with those chunks themselves read in sequence rather than from random places in the file.
- Workloads with no random writes, i.e. no scenario in which a particular record inside a file is updated in place.
- Updates that append more data to a file, again in large chunks rather than one record at a time.
- Scenarios where a modest number of huge files needs to be stored, rather than a huge number of modest or small files.
- Clients (of the distributed file system) that want to process bulk data fast (throughput) rather than small amounts of data quickly (latency).

The other paper from Google described a framework they developed for processing their data, called MapReduce. In this framework the user writes a Map function that transforms the data and a Reduce function that summarizes the output of the Map function. The MapReduce framework takes on the onus of:
1. distributing the data to be processed across many nodes;
2. distributing the Map and Reduce functions to all the nodes, so that the code is close to the data and IO is reduced (refer to the typical pattern of a big data solution discussed earlier in this chapter);
3. scheduling the Map and Reduce functions to run on all the nodes;
4. recovering from a failed machine: the framework restores the data from a backup on another node and restarts the Map or Reduce function there.

The MapReduce framework was designed to run simple functions over a huge amount of data. It lets programmers write the Map and Reduce functions while it takes care of distributing the data and code, scheduling the run and recovering from failures. Do not get bogged down by the term MapReduce; it is similar to typical data processing in other applications. Here is more detail on what Map and Reduce functions are, to make you comfortable before we move forward.
A Map function accepts a record in the form of a key-value pair, does some processing or formatting on it, and produces output in the form of another key-value pair. Do not think of the "value" in the key-value pair as a single field. A value could be a complete record with many fields in it: the key could be an employee ID and the value all the details of that employee, or the key could be a transaction ID and the value all the details of the transaction. It is up to the Map function to decide what processing or formatting it wants to do on which fields in the value.

Similarly, the Reduce function reads the key-value output from all the Map functions running on the different nodes and summarizes it to generate the final output.

A very simple example: say you have a huge file with the details of employees of many organizations around the world, and you want to calculate the average salary for each designation (assume designations are standardized). Your Map function reads the part of the input file given to it; for each input record (an employee's details) it emits an output key-value pair with the designation as the key and the salary from that record as the value. It sounds simple, doesn't it? What matters is that the Map function is parallelizable: you can divide your input records across as many processing nodes as you have and run the Map function in parallel on all of them, because the Map function never needs information from a record on another node while processing a specific record.

The Reduce function in our example reads the records from all the nodes where the Map function ran. Its input is the key-value output of the Map functions, and it computes an average salary for each designation present in the file.

This led to the development of Hadoop, an open source product that delivers capabilities similar to the ones Google shared in those two papers. Hadoop has two components:

HDFS (Hadoop Distributed File System) – Similar to Google's distributed file system described above. As the name suggests, HDFS is a distributed, fault-tolerant file system. It enables storing large files across a cluster of commodity machines.

MapReduce – A framework to process data in the form of key-value pairs by distributing those pairs across a cluster of machines. It runs in two steps. The first step, Map, processes the input key-value pairs to generate intermediate key-value pairs. The intermediate pairs then go through a Reduce step, which summarizes them to generate the final output.

Hadoop was quickly adopted across organizations, which eventually led to the development of many other products that extend Hadoop's functionality, e.g. Flume, Sqoop, Pig and Hive. We will cover each of these open source products built around Hadoop in subsequent chapters, in enough detail to be able to design, at a high level, a solution to a big data business problem.
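To make the average-salary example above concrete, here is a minimal, single-machine sketch of the Map and Reduce steps in Python. It is not the Hadoop API, and the record layout (a dict with designation and salary fields) is an assumption made purely for illustration; in Hadoop the same two functions would run distributed across nodes.

```python
from collections import defaultdict

def map_record(record):
    # Map step: emit (designation, salary) for one employee record.
    # Assumed record layout: {"designation": ..., "salary": ...}
    return (record["designation"], record["salary"])

def reduce_average(mapped_pairs):
    # Reduce step: average the salaries grouped by designation.
    totals = defaultdict(lambda: [0.0, 0])  # designation -> [sum, count]
    for designation, salary in mapped_pairs:
        totals[designation][0] += salary
        totals[designation][1] += 1
    return {d: s / c for d, (s, c) in totals.items()}

employees = [
    {"designation": "Engineer", "salary": 100000},
    {"designation": "Engineer", "salary": 120000},
    {"designation": "Manager", "salary": 150000},
]
print(reduce_average(map(map_record, employees)))
# {'Engineer': 110000.0, 'Manager': 150000.0}
```

The Map step touches one record at a time, which is exactly what makes it easy to run in parallel on many nodes; only the Reduce step needs to see all the values for a given key.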
Chapter 2 - NoSQL databases

How NoSQL databases are so scalable

When we say an application or a database is scalable, we generally mean horizontal scalability: the capacity (to process more data or take more load) of a scalable solution can easily be increased by adding more machines to the cluster. The capacity of a not-so-scalable solution, on the other hand, either cannot be increased at all or can only be increased by replacing the existing machine with a bigger, costlier one (vertical scalability).

The way relational and NoSQL databases store data is very different, and that difference is what makes NoSQL databases so scalable and cluster-oriented. Let's understand this with an example.

Take a professional networking website. Users maintain information about their educational institutes (the schools and colleges they attended) in the application. Let's also assume the typical access pattern is that every time the application accesses a user's information, it accesses her school/college information as well.

A typical relational design stores users and educational institutes in two different tables and maintains the relationship between them using foreign keys (or a third table that also records the start and end dates of the relationship):

User (UserID, UserName, UserDoB, ...)
EducationInstitute (EducationInstituteID, InstituteName, InstituteCity, ...)
Relationship (UserID foreign key, EducationInstituteID foreign key, StartDate, EndDate)

Nowhere in this design have we told the database the typical access pattern of the application, i.e. that every time the application reads a user's information it will read her school/college information as well.

Now let's look at the design in a NoSQL database. A NoSQL database would typically be designed so that a user's school/college information is embedded within the user record itself and stored on the same physical node in the cluster.
So the user information would look like:

User {Name, DoB, CurrentDesignation, EducationalInstitutes [(Institute1, StartDate, EndDate), (Institute2, StartDate, EndDate), ...]}

Note that the number of educational institutes for a user can vary from zero to any number your application wants to allow. Relational databases are generally not conducive to storing such open-ended, list-like information; embedding the educational institutes within the user row of a relational database would make it highly denormalized and inefficient.

The NoSQL database stores the user and her educational institute information on the same node of the cluster. If the educational institute information were not embedded in the user information but maintained separately with a relation between the two (as in a relational database), the two could end up on different nodes. In that case, every time user information was accessed, the application would have to connect to another node as well to get the educational institute information (because of the typical access pattern described above). This would increase the IO and slow down the application.

This way of storing data makes NoSQL databases very cluster-oriented and scalable. As you get more users, you add more nodes to the cluster and spread the data across them. When the application needs the data for a user, the database fetches it from whichever node it lives on.

You cannot scale a relational database in the same manner. Users and educational institutes are maintained as separate tables. If you spread user information across the nodes of a cluster, what should you do about the educational institutes? Typically many people will have gone to the same institute, a many-to-one relationship the relational database maintains with a foreign key. You cannot simply spread educational institutes across nodes, because a user on node 1 may have attended an institute stored on node 2, and with that the IO goes up again.

Please note that NoSQL makes a very strong assumption about how the data will typically be accessed. If there are several different typical access patterns, a NoSQL database may not be a good option. In our example, what if the application also needs to generate reports counting users by educational institute? It would have to scan all users across all nodes to produce that output, which would be very inefficient. In such a scenario a relational database could be a better option, or you could use NoSQL for the general queries and maintain a materialized view that stores the count of users per educational institute.

I hope you can now picture how NoSQL databases store data. They spread the data across the nodes of a cluster, but they ensure that data which is typically accessed together stays on the same node (the latter has to be ensured by a good design). As the data in the application grows, you add more nodes to the database to scale the solution horizontally.
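As an illustration only (field names are assumptions, not a specific product's API), the embedded design above can be pictured as a single document that one lookup on one node fully satisfies:

```python
# One self-contained user document: profile and education travel together,
# so a single key lookup on a single node answers the typical query.
user_doc = {
    "name": "User1",
    "dob": "1990-01-01",
    "current_designation": "Engineer",
    "educational_institutes": [
        {"institute": "Institute1", "start": "2008-06-01", "end": "2012-06-01"},
        {"institute": "Institute2", "start": "2012-07-01", "end": "2014-06-01"},
    ],
}

def get_profile_with_education(store, user_id):
    # 'store' stands in for the node that owns this key; no joins, no second node.
    return store[user_id]

print(get_profile_with_education({"user1": user_doc}, "user1")["educational_institutes"])
```

Contrast this with the relational design, where answering the same query means joining three tables that may not even live on the same machine.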
Please note that, conceptually, it is not as if a NoSQL database stores one table on one node and another table on another node. The scalability comes from the fact that it can distribute each row of a table to a different node. Imagine (pardon the very simplistic example) you have a pack of Pringles, a pack of chocolates and 10 dishes to serve them in. Unless you open the packs you can use only 2 of the 10 dishes, and your guests put all the load on those two. If you open the packs you can spread the Pringles and chocolates across all 10 dishes: some dishes hold only Pringles, some only chocolates, some a combination of both. Some hold more and can be kept near the areas with more guests; others hold less and can be kept near guests who will not consume as much. That's scalability.

Atomicity

Leaving the technical terms aside, atomicity means that related database updates are done in such a manner that either all of them happen or none does. In a relational database, atomicity is ensured by the concept of a transaction. If updates to multiple tables must all happen or not happen at all, the relational database wraps these updates in a transaction. If there are issues after a few tables have been updated and the rest cannot be, the database rolls back the transaction, i.e. the updates already made as part of the transaction are undone.

Let's take an example. Say there is a simple application that records the transactions done on a bank account and maintains the final balance. There are two tables in the database: one stores the transactions, the other stores the final balance. When a user executes a transaction on an account, the application needs to make two updates: insert the transaction into the transaction table and update the final balance in the balance table. The application indicates to the relational database that these two updates constitute one transaction. The database may update the tables in any order, but it ensures that either both updates are done or neither is. If, after updating the transaction table, it runs into issues and cannot update the final balance table, it rolls back the update already made to the first table and informs the application, which must have code to handle such exceptions.

NoSQL databases manage atomicity a little differently: they are atomic to an extent, but not thoroughly. Continuing with the same bank account example, a NoSQL database could be designed in two ways. The final balance could be embedded in the table that lists the transactions of an account, or the final balance could be a separate table with a relationship between the transaction and final balance tables.

Design 1
{BankAccountID, FinalBalance, Transactions [(Transaction1 ID, Date, Credit/Debit indicator, Amount, other details), (Transaction2 ID, Date, Credit/Debit indicator, Amount, other details), ...]}

Design 2
Table 1 (BankAccountID, FinalBalance)
Table 2 (BankAccountID, Transactions [(Transaction1 ID, Date, Credit/Debit indicator, Amount, other details), (Transaction2 ID, Date, Credit/Debit indicator, Amount, other details), ...])

In Design 1 the final balance is embedded within the same table that holds the list of transactions, so the updates to the transaction list and to the final balance are either both done or neither is. Atomicity is ensured as much as in a relational database.

In Design 2, the final balance of an account could be stored on a node different from the one holding the list of transactions for that account. The NoSQL database cannot ensure atomicity across nodes, so the application has to ensure that either both updates are made or none is.
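A small sketch of the two designs (plain Python, purely illustrative; field names are assumptions): in Design 1 one write touches one record, while in Design 2 the application has to coordinate two writes itself.

```python
# Design 1: balance embedded with the transactions; one record, one atomic write.
account_embedded = {
    "account_id": "A1",
    "final_balance": 500,
    "transactions": [{"id": "T1", "type": "credit", "amount": 500}],
}

def post_transaction_design1(account, txn):
    # Single-record update: the store can apply this all-or-nothing.
    account["transactions"].append(txn)
    account["final_balance"] += txn["amount"] if txn["type"] == "credit" else -txn["amount"]

# Design 2: balance and transactions kept separately; two writes the
# application itself must keep in step (possibly across two nodes).
balances = {"A1": 500}
transactions = {"A1": [{"id": "T1", "type": "credit", "amount": 500}]}

def post_transaction_design2(account_id, txn):
    transactions[account_id].append(txn)            # write 1
    delta = txn["amount"] if txn["type"] == "credit" else -txn["amount"]
    balances[account_id] += delta                   # write 2: if this fails, the app must undo write 1

post_transaction_design1(account_embedded, {"id": "T2", "type": "debit", "amount": 200})
post_transaction_design2("A1", {"id": "T2", "type": "debit", "amount": 200})
print(account_embedded["final_balance"], balances["A1"])  # 300 300
```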
A NoSQL database ensures that the updates made to a single table (a single record, in fact) are either all done or none is, but it cannot ensure this across tables. So atomicity in a NoSQL database is ensured by design: if, while designing the database, we keep the data on which we want atomic updates together, atomicity is ensured because the change is part of a single update. Anything beyond that has to be ensured by the application.

Consistency

NoSQL databases were designed to run on clusters of commodity-grade machines, and such machines have a higher chance of failure. So any data saved on one node must be backed up on other nodes. NoSQL databases generally store a copy of each piece of data on 3 different nodes (this is called the replication factor and is configurable).

This adds complexity. If your data is replicated across 3 nodes, you need to keep the copies in sync; if you don't, different users reading from different nodes will see different versions of the data. But if your application must wait until all the nodes are updated before confirming to the user that the update has been made, the application becomes slower. A relational database needs an update on only one node (or two, with a hot backup), whereas here 3 IOs are needed (or more, depending on configuration), which makes the application less responsive.

So NoSQL databases use the concept of a quorum. When updates are made, the NoSQL database does not wait for all the nodes to be updated; it waits only for a majority. If the replication factor is 3, the database waits for the update to be confirmed by only 2 nodes (the quorum for updates). The third node becomes consistent later. This is called eventual consistency, as the different nodes eventually become consistent. What if one of the nodes fails before the update reaches the third node? The NoSQL database takes the latest update from the remaining two nodes holding the data and replicates it onto another node.

What about the quorum while reading the data? The NoSQL database does not read the data from all the nodes to serve a result; that would make the application slower. The number of nodes it reads from (the quorum for reads) should be one more than the number of nodes that were not part of the update quorum. So if your data is replicated across 3 nodes, 2 nodes were part of the update quorum and 1 was not, the database reads from 1 + (number of nodes not in the update quorum) = 1 + 1 = 2 nodes.

The general rules are:

Qu + Qr > Rf
Qu > Rf / 2

where Qu is the quorum for updates, Qr is the quorum for reads and Rf is the replication factor.

Please note that any operation that needs more nodes to participate in its quorum becomes slower than the complementary operation (read is complementary to write and vice versa), while the complementary operation becomes faster, as the equations above show.
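As a quick sanity check of these rules, here is a tiny sketch (illustrative only, not tied to any particular database) that validates a quorum configuration and reports which side of the trade-off it favours:

```python
def check_quorum(rf, qu, qr):
    # Validate an (Rf, Qu, Qr) configuration against Qu + Qr > Rf and Qu > Rf / 2.
    valid = (qu + qr > rf) and (qu > rf / 2)
    favours = "reads" if qr < qu else "writes" if qu < qr else "neither"
    return valid, favours

print(check_quorum(rf=3, qu=2, qr=2))  # (True, 'neither')
print(check_quorum(rf=7, qu=6, qr=2))  # (True, 'reads')   -> cheap reads, expensive writes
print(check_quorum(rf=3, qu=1, qr=1))  # (False, 'neither') -> copies could silently diverge
```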
For example, if your replication factor is 7 (Rf) and you configure the quorum for updates as 6 (Qu), then you need to read data from only 2 (Qr) nodes, and reads will be much faster than updates. Based on the requirements of your application you can configure all three parameters in the database. If you need faster updates, go for a smaller (but still majority) update quorum; if you need faster reads instead, go for a higher update quorum and a lower read quorum. Some databases allow you to configure the values of these three parameters not only at the database level but also at the transaction level, so while executing a transaction you can indicate to the database whether you want it confirmed by a majority or by fewer nodes.

Types of NoSQL databases

NoSQL databases are generally classified into four categories based on how the information is stored and accessed.

Key-value store

This is one of the easiest NoSQL categories to comprehend. These are typical key-hash stores: you can store and fetch any value for a given key. The database does not care what is inside the value. You can store XML, JSON or anything you want as the value; you can even store different formats for different keys in the same table (called a bucket here). The onus of making sense of the value read from the database lies with your application. This also means the database cannot do any validation on the value, it cannot create an index on the value, and you cannot fetch data based on any information within the value. All access is by key, which makes it very fast. Key-value stores are typically used for session information, shopping cart information and user profiles, all of which require fast access.

A table in a key-value store is generally called a bucket, and a database can have multiple buckets. Buckets are used to categorize keys and store them separately. For example, if you have three different values for a key, you can merge them into one value and store it (design 1); in that case the onus of reading the value and splitting it into three parts lies with your application. Or you can have three buckets and store the three values separately (design 2). The first design involves less IO and is hence faster; the second involves more IO and is slower, but the design is less complex.

Design 1 (one bucket, merged value)
Bucket 1: Key1 -> (Value1, Value2, Value3)

Design 2 (three buckets, separate values)
Bucket 1: Key1 -> Value1
Bucket 2: Key1 -> Value2
Bucket 3: Key1 -> Value3
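The two bucket designs can be sketched with plain Python dictionaries (a conceptual model only, not a specific key-value product): one lookup for the merged value versus three lookups, one per bucket.

```python
# Design 1: one bucket, the three values merged under a single key -> one read.
bucket_merged = {"key1": ("value1", "value2", "value3")}
v1, v2, v3 = bucket_merged["key1"]          # application splits the merged value

# Design 2: three buckets, one value each -> three reads, but simpler values.
bucket_a = {"key1": "value1"}
bucket_b = {"key1": "value2"}
bucket_c = {"key1": "value3"}
v1, v2, v3 = bucket_a["key1"], bucket_b["key1"], bucket_c["key1"]
```

In a real key-value store the dictionaries above would be buckets (or key namespaces) on the server, and each extra lookup is typically an extra network round trip, which is why design 1 is faster.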
Atomic updates are ensured for a given key-value pair; if the application needs atomic updates across buckets in a database, the application has to ensure that itself. Examples of these databases are Redis, Memcached, Berkeley DB and Amazon's DynamoDB.

Document store

These are similar to key-value stores, but the value part (called a document here) is not opaque to the database. You can fetch data based on the different fields within the document, and you can very well save your key as a field inside the document and fetch by that field. Indexes can be created on fields within the document. The schema of the information within the document can vary across the documents saved in different rows. Tables are called collections here. For example, two documents in the same collection might look like:

Document 1: {Name: ABC, LastName: XYZ, DoB: DD/MM/YYYY}
Document 2: {Name: DEF, LastName: HKJ, Place: Mumbai}

Please note that the schema of the documents in the two rows is different. MongoDB and CouchDB are famous examples of this category. These databases are commonly used for event logging by enterprise applications and as the datastore for document management systems.
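Here is a hedged sketch of the same two documents in MongoDB using the pymongo driver. It assumes a MongoDB instance is running locally; the database and collection names are made up for illustration.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
people = client["demo_db"]["people"]          # a collection of documents

# Two documents with different schemas can live in the same collection.
people.insert_one({"Name": "ABC", "LastName": "XYZ", "DoB": "01/01/1990"})
people.insert_one({"Name": "DEF", "LastName": "HKJ", "Place": "Mumbai"})

# Unlike a pure key-value store, we can query on any field inside the document.
print(people.find_one({"Place": "Mumbai"}))
```

The ability to query and index on fields inside the value is exactly what separates a document store from a plain key-value store.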
Atomicity is maintained at the level of a single document, just as a key-value store maintains atomicity at the level of a single key-value pair.

Column-family store

These databases are more difficult to understand than key-value and document stores, but they are the most interesting ones, and the most relevant from a Hadoop perspective because HBase, a data store on top of HDFS, belongs to this category.

These databases are accessed by a key to read multiple values, called columns. Related columns are grouped together into column families; columns that are likely to be accessed together are generally placed in the same column family and stored together. Try not to picture these columns as the columns of a relational database. Columns in a relational database have the same name across all rows, and if a row has no value for a column the value is stored as null. In a column-family database, if a row does not have a column it simply does not have that column.

The example below shows a typical column-family layout with two rows:

Row key User1:
 ColumnFamily UserDtls {Name: Manjeet, MiddleName: Singh, LastName: Nagi}
 ColumnFamily Education {Institute: AMC University, StartDate: 30/06/2012, EndDate: 30/06/2016}

Row key User2:
 ColumnFamily UserDtls {Name: XYZ, LastName: ABC}
 ColumnFamily Education {Institute: AMC University, StartDate: 30/06/2012, EndDate: 30/06/2016}

Access to the database is by key (the user ID). The columns Name, MiddleName and LastName have been grouped into one column family (UserDtls) because they will be accessed together, and the columns Institute, StartDate and EndDate have been grouped into another column family (Education). Note that the columns present in the first row's UserDtls family differ from those in the second row's.
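A conceptual sketch of the same layout in plain Python (not the HBase API, which we cover in chapter 8): each row key maps to "family:qualifier" columns, and rows are free to carry different columns.

```python
# Row key -> {"family:qualifier": value}; missing columns are simply absent, never null.
users = {
    "User1": {
        "UserDtls:Name": "Manjeet", "UserDtls:MiddleName": "Singh", "UserDtls:LastName": "Nagi",
        "Education:Institute": "AMC University",
        "Education:StartDate": "30/06/2012", "Education:EndDate": "30/06/2016",
    },
    "User2": {
        "UserDtls:Name": "XYZ", "UserDtls:LastName": "ABC",   # no MiddleName column at all
        "Education:Institute": "AMC University",
        "Education:StartDate": "30/06/2012", "Education:EndDate": "30/06/2016",
    },
}

def read_family(row_key, family):
    # Fetch only the columns of one family for a given row key.
    row = users[row_key]
    return {col: val for col, val in row.items() if col.startswith(family + ":")}

print(read_family("User2", "UserDtls"))  # {'UserDtls:Name': 'XYZ', 'UserDtls:LastName': 'ABC'}
```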
Indexes can be created on different columns. While adding a new column family to a database requires a database restart, an application can very easily add new columns within an existing column family.

Atomicity of updates is maintained at the level of a column family for a given key. Since different column families for a given key can be stored on different nodes, atomicity cannot be maintained across updates to different column families of the same row. Cassandra and HBase are examples of these databases.

Before we move on to the next category, I want to reiterate that the term column should not be visualized like a column in a relational database. In relational databases the columns are similar to the columns of a spreadsheet, in the sense that:
- all the rows have all the columns defined on the table
- if a row has no value for a column, the value is saved as null
- the name of the column is not saved with each row

In a column-family table the columns should instead be imagined like the attributes of an XML element:
- all the rows need not have all the columns
- if a row has no value for a column, it does not save a null; it simply does not have that column at all
- the name of the column is saved with each row

Graph

This category is the most difficult to comprehend. These databases store entities and the relationships between them. Conceptually, entities are nodes in a graph and relationships are directional edges between them; edges can carry additional attributes that describe further properties of the relationship. Neo4j is a good example of this category.

[Figure: an example graph in which person nodes are connected to other people, an organization, a book and a movie by edges such as reportsTo, isFriendsWith, knows and likes, and a worksIn edge carrying StartDate and EndDate attributes. One-way and two-way relationships are drawn with single- and double-headed arrows.]
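To make the figure above concrete, here is a small sketch using the networkx library (an in-memory graph library, not Neo4j; the node names and relationships mirror the figure and are purely illustrative):

```python
import networkx as nx

g = nx.DiGraph()  # directed graph: edges have a direction, like the arrows in the figure

# Entities become nodes; relationships become edges with optional attributes.
g.add_edge("Person1", "Person2", relation="reportsTo")
g.add_edge("Person1", "Person3", relation="isFriendsWith")
g.add_edge("Person3", "Movie1", relation="likes")
g.add_edge("Person1", "Org1", relation="worksIn",
           start_date="30/06/2012", end_date="30/06/2016")

# Traversals follow edges rather than joining tables.
for _, target, attrs in g.out_edges("Person1", data=True):
    print("Person1", attrs["relation"], target, attrs.get("start_date", ""))
```

In a real graph database such as Neo4j the same model is expressed through its own query language, but the node/edge/attribute picture is identical.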
  • 17. Big Data & Hadoop ecosystem essentials for managers (Manjeet SinghNagi - https://in.linkedin.com/in/manjeetnagi) Big Data & Hadoop ecosystem essentials for managers (Manjeet SinghNagi - https://in.linkedin.com/in/manjeetnagi) shows attributes of relationship between entities Depicts a one way relationship Depicts two way relationships Something important to understand is how this category is different from the other NoSQL databases in terms of scalability. A key-value data store ensures scalability by spreading the key-values across all the nodes of a cluster. It can do so because it understands (by virtue of the database design) that it is a key-value combination that will be accessed mostly together. A document store achieves this by spreading documents across the nodes on the cluster. Similarly, a column-family data store achieves this by spreading key-column family combination across cluster. But a graph data store cannot achieve this as nodes in a relationship are linked to each other. So graph databases cannot spread the data across nodes in a peer-to-peer manner. They achieve scalability by using master-slave configuration of the cluster. This can be achieved in many ways: 1. Reads operations can be directed to slave nodes. Writes operations are directed to master node. Once master is updated a confirmation is provided to user about the database updates. Slave nodes are updated after this. Add more and more slave nodes makes reads more scalable. If writes need to be made more scalable then the data needs to be sharded across multiple masters and the logic to do so is very specific to the domain. 2. Writes are directed to salves as well but they provide confirmation to user only after master has been updated. This makes writes as well scalable without really sharding the data. As must be clear by now, graph databases are used more for networking problem (social or professional networking being one such problem). As must be clear from the name “NoSQL”, none of these databases use SQL for database access. All of them have their syntax for database operations. We have not gone into those languages as the objective of the book is not to get to the level of code. Having said that, the languages for each of the database are not very difficult to grasp. So why did we understand so much about the NoSQL databases when the book is primarily about Hadoop ecosystem. One of the open source in the Hadoop ecosystem, Hbase, is a column-family store built on top of HDFS. We will understand Hbase in a detailed manner in chapter 8.
Chapter 3 - The backbone I (Hadoop, HDFS, MapReduce)
Hadoop is an open source framework that implements a distributed file system (HDFS) and processes your map-reduce (MapReduce) solutions. I do not want you to get bogged down by the term MapReduce. It is similar to the typical processing of data in other applications. If you can recall set theory and functions from our old school days, you will realize that we learnt about Map back in school. A Map is a function that processes input data to produce output data, e.g.
f(x) = x²
The above is a map function. It processes any number x and produces its square. Another way to look at the above function is using set theory (again something we learnt in school).
[Diagram: two sets, Set A = {1, 2, 3, 4} and Set B = {1, 4, 9, 16}, with f(x) mapping each number in Set A to its square in Set B.]
In the diagram, there are two sets, A and B. The function or map f(x) maps each number in Set A to its square in Set B. In our enterprise applications the functions or maps are more complicated, but they are still functions/maps. For example, let's say we have a mutual fund transaction processing batch system which receives transactions from the sales agents and processes them. The first program or script in the transaction processing system would typically do some formatting on the transaction, validate it and persist it in the data store. So our first program is a function as depicted below.
f(Input Transaction) = Formatted, Validated, Persisted Transaction
Or we can imagine our program as a map, as shown below.
[Diagram: Set A = {T1, T2, T3, T4, T5} is mapped by f(x) to Set B = {T1fvp, T2fvp, T3fvp, T4fvp}.]
Set A is our set of input transactions, i.e. our input file. f(x) is our program, which maps each input transaction Tn to Tnfvp, a formatted, validated transaction persisted in the data store.
A Reduce function is just a program which reads a group of records and produces a summary of those records. Extending the same example, there could be a program at the end of the transaction processing system which sums all the transactions and produces the total amount of transactions processed that day (Fig 1 below). Or it could produce the total separately for each mutual fund product (Fig 2 below).
[Fig 1: the formatted transactions T1fvp…T4fvp are reduced to a single value, the sum of the amount of transactions processed. Fig 2: the same transactions are reduced to one sum per product, e.g. the sum of transactions for Product A and the sum for Product B.]
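If it helps to see both ideas in code before we move to Hadoop itself, here is a minimal sketch in plain Java (nothing Hadoop-specific): the "map" step squares numbers and formats raw transaction records, and the "reduce" step summarizes amounts, both overall (Fig 1) and per product (Fig 2). The record layout and values are invented purely for illustration.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MapReduceIdeaSketch {
    public static void main(String[] args) {
        // f(x) = x squared: the school-days map, Set A -> Set B.
        List<Integer> setA = List.of(1, 2, 3, 4);
        List<Integer> setB = setA.stream().map(x -> x * x).collect(Collectors.toList());
        System.out.println(setB);                                  // [1, 4, 9, 16]

        // The "map" step of our batch system: each raw record is turned into a
        // formatted record (Tn -> Tnfvp). The layout "product,amount" is an
        // assumption made only for this sketch.
        List<String> raw = List.of(" productA , 100.0 ", " productA , 250.0 ",
                                   " productB , 75.0 ",  " productB , 25.0 ");
        List<String[]> formatted = raw.stream()
                .map(line -> line.split(","))
                .map(f -> new String[] { f[0].trim().toUpperCase(), f[1].trim() })
                .collect(Collectors.toList());

        // Fig 1: the "reduce" step as one overall summary - total amount processed.
        double total = formatted.stream()
                .mapToDouble(f -> Double.parseDouble(f[1]))
                .sum();
        System.out.println("Total processed: " + total);           // 450.0

        // Fig 2: the "reduce" step as one summary per mutual fund product.
        Map<String, Double> byProduct = formatted.stream()
                .collect(Collectors.groupingBy(f -> f[0],
                        Collectors.summingDouble(f -> Double.parseDouble(f[1]))));
        System.out.println(byProduct);       // e.g. {PRODUCTA=350.0, PRODUCTB=100.0}
    }
}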
Hadoop MapReduce is most suitable where the processing of a data record does not depend, especially during the Map function, on other data records in the input. If your Map program (in the example) needs information from other transactions while processing a transaction Tn, then Hadoop MapReduce is not the best solution for it. The reason is that Hadoop MapReduce distributes the transactions across the nodes of the cluster and sends the Map function (your program in this case) to all these nodes to process those transactions. If your program needs information from other transactions while processing a transaction Tn, there will be network overhead to get details from the transactions on other nodes, and this network IO could slow down the Map. In the case of Reduce, the program may need inputs from other transactions (say, if it is summing all the transactions), but Reduce generally runs on very few nodes. Even when Reduce runs on more than one node (still far fewer nodes than Map), the data is divided amongst the nodes in such a manner that Reduce on one node does not need information from data on another node executing Reduce. In the example above, if you want to sum the transactions by product, you could send the transactions for Product A to node 1 and those for Product B to node 2 and run Reduce on both nodes. Reduce will calculate the sum for each product separately on each node.
Hadoop consists of two components – HDFS and MapReduce. We will understand each of these in more detail in the sections below.
HDFS
HDFS (Hadoop Distributed File System) is, as its name suggests, an open source distributed file system to store huge amounts of data. It splits the files that need to be stored into small blocks and stores those blocks on different nodes of a cluster, while letting the users (applications, software and frameworks which use HDFS for storage) still view each file as a single, unified, un-split file. So the distribution of a file across the nodes of the cluster is not visible to the user. At this stage it is important to reiterate that HDFS is suitable only for certain scenarios. These are:
- Files written once to the distributed file system are read multiple times.
- Random reads (reading a specific record from a file) are not required or are an exception.
- Files are read sequentially in big chunks rather than one record at a time, and those chunks are read in sequence rather than from random places in the file.
- Random writes (updating a particular record) are not needed, so there is no scenario of updating a record in the file.
- Updates to a file only add/append more data, again in huge chunks rather than one record at a time.
- A modest number of huge files needs to be stored rather than a huge number of modest/small files.
- Clients (of the distributed file system) want to process bulk data fast (throughput) rather than small amounts of data quickly (latency).
HDFS works on a master-slave architecture. The master node (generally one) has the Namenode and SecondaryNode daemons (or processes) running on it. All the other nodes in the HDFS cluster are slave nodes and have the DataNode process/daemon running on them. The blocks of any data file are saved on the slave machines where DataNodes are running; the master node only holds metadata about each block of these files.
Namenode, SecondaryNode and DataNode
Namenode and SecondaryNode are the processes/daemons that run on the master node of the cluster. The Namenode stores metadata about the files stored on HDFS - information about each block of each file. It does not read or write blocks of files on DataNodes. During a write operation it only tells the HDFS client on which nodes the blocks of a file can be stored; similarly, during read operations it only tells the HDFS client on which DataNodes the blocks of each file are stored. It is the HDFS client that stores or reads the blocks of the files by connecting with each DataNode.
The metadata is stored in a file named fsimage on disk. When the Namenode is started, the metadata is loaded into memory. After this, all metadata updates (about new files added, old files updated or deleted) are kept in memory. This is risky for the obvious reason that if the Namenode goes down, all the updates since the last restart would be lost. So the Namenode also stores the updates in a local file named edits. This eliminates the risk only to some extent: if the Namenode goes down and needs to be restarted, it has to merge the edits file into the fsimage file, which slows down the restart. The risk is further reduced by adding a SecondaryNode. The SecondaryNode daemon/process merges the edits file on the primary node with the fsimage on the primary node and replaces the existing fsimage file with this new merged file.
Challenges or limitations of the HDFS architecture
Since the Namenode stores all the metadata, and if it goes bad the whole cluster becomes useless, the Namenode is a single point of failure. Hence the physical machine on which the Namenode and SecondaryNode daemons run should be of robust standard and not of the same specification as the machines running DataNodes, which can be commodity machines. For the same reason, the Namenode should also be backed up frequently to ensure the metadata can be restored in case the Namenode cannot be restarted after a failure.
[Diagram: a master node running the Namenode and SecondaryNode, several slave nodes each running a DataNode, and an HDFS client talking to all of them.]
Also, as we know, the Namenode loads all the metadata from the fsimage file into memory at start-up and operates on this in-memory data during operations. The metadata of each block of a file takes about 200 bytes. This adds a limitation on the usage of HDFS: storing a huge file broken into small blocks works fine, but storing too many small files (smaller than the HDFS block size) creates a metadata overload which clogs the memory on the Namenode. This is primarily the reason HDFS is not a suitable distributed storage for a large number of small files.
As you would have observed by now, the Namenode can also become a bottleneck, as all read and write operations on the cluster need to access it for metadata. This bottleneck problem was addressed in later versions of Hadoop (Hadoop 2.0/YARN).
Each block of a file is saved on the slave nodes running the daemons/processes called DataNodes. DataNodes also send regular messages (called heartbeats) to the Namenode. This heartbeat informs the Namenode whether a specific DataNode is up and running.
Replication of data
Since the DataNodes run on commodity machines and the chance of these machines going down is high, each block of a file is replicated on 3 (the default, which can be changed) different DataNodes. The first replica is stored on a node chosen at random. The second replica is stored on a DataNode on a different rack; this protects against a rack failure. The third replica is saved on a different machine on the same rack as the second. The chance of multiple racks going down is small, so the third replica can be saved on the same rack as the second replica without materially increasing the risk, while saving it on a third rack would increase network IO and make read and write operations slower, as different copies of the replicas are accessed during reads and writes. Please note the number of replicas can be configured at the HDFS level as well as for each file. Increasing the number of replicas makes HDFS operations slower as the IO increases.
Typical read-write operations in HDFS
When a file needs to be written to HDFS, users/applications interface with the HDFS client. The client starts receiving the file. Once the received data reaches the size of a block, the client works with the Namenode to find out on which DataNodes each block of the file can be saved. Once it gets this information, it sends the block to the first DataNode, which starts writing the block to its disk and at the same time starts sending it to the second DataNode where its replica needs to be saved. DataNode 2 starts writing it to its disk and starts sending it to DataNode 3. On completion of the write, DataNode 3 confirms to DataNode 2, which confirms to DataNode 1, which eventually confirms to the HDFS client, which in turn confirms to the Namenode. Once the Namenode gets this confirmation it persists the metadata information and makes the file visible on HDFS. This process is repeated for each block until the complete file is saved. A checksum for each block is calculated and saved in HDFS to validate the integrity of the block when the file is read. A similar process is followed at the time of read.
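Before looking at the read path, here is a hedged sketch of what a write looks like from application code, using Hadoop's Java FileSystem API. The namenode URI, port and file path are placeholders, and the block placement, replication pipeline and checksums just described all happen behind the single create() call.

import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode address; in a real cluster this usually comes
        // from core-site.xml rather than being hard-coded.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // The client asks the Namenode where each block may go and then streams
        // the data to the chosen DataNodes; none of that is visible here.
        Path file = new Path("/data/transactions/2016/part-0001");
        try (FSDataOutputStream out = fs.create(file)) {
            out.write("T1,2016-03-01,100.50,ProductA\n".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}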
When a file needs to be read, the HDFS client gets the DataNode information for each block from the Namenode and reads the block from that DataNode. The checksum is calculated again and matched against the checksum saved at the time of the write to validate integrity. If the read from a DataNode fails (the node is down, or the checksum does not match), the block is read from a node holding a replica. In the above read-write operations we assumed a replication factor of 3; this factor can be configured at the HDFS level or per file.
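The read path and the per-file replication factor can be illustrated the same way; again a hedged sketch with placeholder paths, using only the FileSystem API.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        Path file = new Path("/data/transactions/2016/part-0001");

        // The client gets the block locations from the Namenode and reads the
        // blocks from the DataNodes; checksums are verified during the read.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        // Change the replication factor of this one file after it was written.
        fs.setReplication(file, (short) 2);
        fs.close();
    }
}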
Even after a file has been written to HDFS, its replication factor can be changed. If we reduce the replication factor of a file, HDFS deletes some of the block replicas to bring the file down to the new factor. When a file is deleted by the HDFS client, only the metadata is updated to mark it as deleted; this makes deletion fast, and the actual deletion of the data happens later. All the DataNodes send messages, called heartbeats, to the Namenode every 3 seconds. If the Namenode does not receive the heartbeat from a DataNode it assumes that node has failed. Since the Namenode maintains metadata about which file blocks are saved on each DataNode and on which other nodes they are replicated, it recreates those blocks on other running nodes and updates its metadata.
MapReduce
MapReduce is an open source framework to execute your Map (and Reduce) programs on a cluster of machines. MapReduce copies your Map program (provided by you) to each node on which a block of the input file is stored and runs it on that node to process that block of input data. Once all the nodes in the cluster have run their Map programs, MapReduce copies the output from all the nodes to a smaller set of nodes, copies the Reduce program (again provided by you) to them, and runs the Reduce program on each of these nodes to process and summarize the output of the Map step. Though this is a simplified view of MapReduce, this is essentially what it does. As we progress through this chapter and the next we will see a more detailed view of MapReduce.
Just like HDFS, MapReduce works on a master-slave configuration. The master machine has a daemon named JobTracker running on it. All the other machines in the cluster are slave machines and have a daemon named TaskTracker running on them.
JobTracker and TaskTracker
The JobTracker is responsible for coordinating with all the TaskTrackers on the slave nodes where the Map and Reduce programs run. It checks with the Namenode (of HDFS) where the blocks of the input files are kept, sends the Map and Reduce programs to those nodes, and asks the TaskTracker on each of the slave nodes to run them. It keeps receiving heartbeats from the TaskTrackers to check that they are fine. If a TaskTracker does not send its heartbeat, the JobTracker assumes it has failed and reschedules the Map/Reduce task running on that node onto another node which has a replica of that data. Just like the Namenode in HDFS, if the JobTracker goes down the whole cluster running MapReduce becomes useless. So the JobTracker must run on a machine with better specifications than a machine running a TaskTracker.
Chapter 4 - The backbone II (MapReduce continued)
Sample MapReduce solution
Let's look at a very simple MapReduce solution. Say you have billions of successful sale transactions of all the mutual fund products of all the mutual fund companies in the USA since 1975, and you need to sum the transactions by year of sale. Your input file has a record for each transaction; each record has the transaction date, the transaction amount and other transaction details. While this problem can be solved by processing the transactions on a single machine, the chance of it overwhelming even a high-end machine is very high, and even if it completes successfully it would take a lot of time. You can solve this problem much more easily by distributing the transactions over a cluster and processing them in parallel.
You need to write a Map program which reads a transaction and emits the year of sale and the transaction amount to the output file. You need to write a Reduce program which takes multiple records (for a given year) with year of sale and transaction amount as input, sums the transaction amounts, and emits the year of sale and the summed transaction amount as output. So:
Map program: (TransactionID, TransactionDate, TransactionAmount, Mutual Fund Product, …) -> (Year of Transaction, TransactionAmount)
Reduce program: (Year of Transaction, TransactionAmount1), (Year of Transaction, TransactionAmount2), (Year of Transaction, TransactionAmount3), … -> (Year of Transaction, Sum of Transaction Amounts)
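As a hedged illustration, a Map program for this job could look like the sketch below. The record layout (comma-separated, with the date as the second field in yyyy-mm-dd form and the amount as the third) is an assumption made purely for this example; the class name YearAmountMapper is also just illustrative.

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Reads one transaction record per line and emits (year of sale, amount).
public class YearAmountMapper
        extends Mapper<LongWritable, Text, Text, DoubleWritable> {

    private final Text year = new Text();
    private final DoubleWritable amount = new DoubleWritable();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length < 3) {
            return;                                   // skip malformed records
        }
        year.set(fields[1].substring(0, 4));          // year taken from the date
        amount.set(Double.parseDouble(fields[2]));    // transaction amount
        context.write(year, amount);
    }
}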
For the sake of simplicity, our Reduce program assumes that all the transactions it receives belong to the same year. It just needs to sum all the transactions and emit the sum along with the year taken from any of the transactions. Once your programs are ready you will provide the following inputs to Hadoop (the driver sketch after this list shows how these are typically supplied in code):
1. Name of the Map program.
2. Name of the Reduce program.
3. Name of the Partition program. If we skip this, Hadoop uses the default Partition class available to it. We will learn about it later in the chapter.
4. Name of the Combiner program. If we skip this, Hadoop skips the Combiner step. We will learn about it later in the chapter.
5. The jar file (in case your programs are in Java) and the path to pick it up from.
6. Path to your input file with billions of transactions.
7. Number of reducers you want to run. We will specify 1 reducer for each year since 1975, so a total of 42 reducers. This ensures each reducer receives transactions of only one year.
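The list above maps almost one-to-one onto a Hadoop driver program. Here is a hedged sketch: YearAmountMapper, YearSumReducer and YearPartitioner are the illustrative classes sketched in this chapter (the mapper just above, the partitioner and reducer a little later), not Hadoop built-ins, and the input and output paths are taken from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class YearlySalesJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "sum transactions by year");

        job.setJarByClass(YearlySalesJob.class);         // 5. the jar to ship
        job.setMapperClass(YearAmountMapper.class);      // 1. Map program
        job.setReducerClass(YearSumReducer.class);       // 2. Reduce program
        job.setPartitionerClass(YearPartitioner.class);  // 3. optional Partitioner
        job.setCombinerClass(YearSumReducer.class);      // 4. optional Combiner

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);

        job.setNumReduceTasks(42);                       // 7. one reducer per year

        FileInputFormat.addInputPath(job, new Path(args[0]));    // 6. input file(s)
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}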
Hadoop will take the input file, split it into multiple blocks and store these on multiple nodes of the cluster (as described in Typical read-write operations in HDFS). The JobTracker will then copy your jar (which has the Map and Reduce programs) to each of the nodes which has a block of the input file (it gets this information from the Namenode of HDFS). Hadoop will then run the following phases to execute your Map and Reduce programs. Note which phase runs on a Map node and which on a Reduce node.
[Diagram: on each Map node the phases Map, Partition and Combine run; Shuffle then moves the partitioned output to the Reduce nodes, where Sort and Reduce run.]
Map
This phase runs your Map program on each node which has a block of your data file (not on the replicas). The output of this phase is a file on each node with year of sale as key and transaction amount as value. The output file on each node may have records for multiple years from 1975 to 2016.
Partition
In this phase MapReduce takes the output of the Map on each node and partitions it into as many files as there are reducers (42 in our case). It does this by partitioning the output file of each Map by key. So the output file of each Map step is partitioned into (at most) 42 files, each of which holds the transactions of one year on that node. Partitioning the output file of a Map by the key is the default Partition behavior; it can be customized to partition by some other criteria, as we will see later in this chapter. If we do not give Hadoop any Partition class, it uses the default class, which partitions the Map output by the key.
Partition comes into action only if the number of reducers is going to be greater than 1. If only one reducer is used, there is no need for partitioning, as all the records from all the Maps go to that single reducer.
[Diagram: on Node 1 the Map output (Key1,value1), (Key1,value2), (Key2,value3) is partitioned into an output file for Reducer 1 with the Key1 records and an output file for Reducer 2 with the Key2 record; Node 2 partitions (Key2,value4), (Key3,value5), (Key3,value6) into files for Reducer 2 and Reducer 3; Node 3 partitions (Key1,value7), (Key2,value8), (Key2,value9) into files for Reducer 1 and Reducer 2.]
Partition ensures that the records for a specific key go to the same reducer from all the nodes. Reducer nodes connect to the nodes on which the Maps are running and collect only the files generated for them (based on the file name). But Partition does not ensure all the reducers get an equal load. Roughly, it divides the keys equally between reducers. If some key has more records in the Map output than other keys, the reducer assigned that key will take more time to complete. There are ways to ensure the load is divided equally between the reducers; we will see how this is done later in this chapter.
The default Partition program does the partitioning by calculating an index for the key of each record to be written to the output:
Index = Hash of Key % Number of reducers to run
A hash is simply a function that deterministically maps an input to a number: a given key always generates the same output when run through the hash function (different keys can occasionally produce the same hash, which only means those keys end up at the same reducer). % is the simple modulo function from mathematics: A % B gives the remainder left when A is divided by B. Different index values are assigned to different reducers. Based on the index value calculated for a key, all the records with that index are written to the output file for the reducer which has that index value assigned to it. Different keys may go to a single reducer, but a given key will not go to multiple reducers.
We can very well override this default behavior of the Partition program by extending the default class and customizing the partitioning method. E.g. in our case we can override the default behavior by partitioning simply on the key (which is the year of transaction) instead of calculating an index from a hash.
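For example, a custom Partitioner along those lines could look like the sketch below (this is the illustrative YearPartitioner referenced in the driver sketch earlier). The default partitioner effectively computes hash(key) % number of reducers, while this one assumes the keys are years from 1975 onwards and the job runs with 42 reducers.

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class YearPartitioner extends Partitioner<Text, DoubleWritable> {
    @Override
    public int getPartition(Text key, DoubleWritable value, int numPartitions) {
        int year = Integer.parseInt(key.toString());
        return (year - 1975) % numPartitions;   // 1975 -> reducer 0, 1976 -> 1, ...
    }
}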
Combiner
This is an optional step. Please note that there are millions of transactions on each Map node. Our Map program does not remove any transaction from further processing, so the output of each Map will also have millions of records (though each with only two fields, year of sale and transaction amount). That means there are billions of records spread across the partitioned outputs of the Mappers on multiple nodes. Sending all these records to the 42 reduce nodes would cause a lot of network IO and slow down the overall processing. This is where a Combiner can help. Since the Reducer is going to sum all the transactions it receives, on the assumption that they all belong to the same year, we can run the same summation on each partitioned output of each Mapper. So thousands of records in each partitioned output of a Mapper are summed into one record. A Combiner sums the transactions in each partitioned output of the Partition step: it takes all the records in one partition, sums the transaction values and emits year of sale as key and the sum of transaction amounts as value. So for each partitioned output (which has thousands of records), the Combiner generates only one record. This reduces the amount of data that needs to be transmitted over the network. If you dwell on the behavior of the Combiner, it is like running the reducer on the Map node before transmitting the data to the Reduce node.
[Diagram: on each of Node 1, 2 and 3 the partitioned output files for each reducer pass through a Combiner, which collapses them into a single combined record per key, e.g. (Key1, Sum(value1,value2)) in the combined output file for Reducer 1 on Node 1.]
The diagram shows how the combiner works, but it does not show the scale by which it reduces the number of records that need to be transmitted to the Reducer nodes. Imagine thousands of transactions for a key in a file: the combiner generates only one record which summarizes all of them, so the amount of data to be transmitted to the reducer node drops significantly. As said earlier, the Combiner step is optional and we have to tell Hadoop to run it. Whether a combiner can be used in your solution is very specific to the problem you are trying to solve in MapReduce. If some processing can be done on the Map output locally to reduce the amount of data before transmitting it to the reducer nodes, then you should think about running Combiners. We also need to write a combiner program, add it to the jar we provide to Hadoop, and inform Hadoop that a combiner needs to be run. This is done by providing the combiner class to Hadoop, just like we provide the Map and/or Reduce class.
Shuffle
Meanwhile, MapReduce will have identified the 42 nodes that need to run the Reduce program and assigned a key (year of sale) to each of them. The TaskTracker on each of these nodes keeps scanning the nodes on which the Maps are running, and as soon as it finds an output file generated for its processing (based on the name of the file) it copies the file to its node. Once a reducer node has all the files for its processing, MapReduce moves to the next step.
In the diagram below we assume that there are only 2 nodes for the Reduce phase and that MapReduce assigned Key1 to Reducer 1 on Node 4 and Key2 and Key3 to Reducer 2 on Node 5. We could have assumed 3 nodes for the Reduce phase and assigned one key to each, but keeping only two nodes and assigning two keys (Key2 and Key3) to the reducer on Node 5 will help you understand the Sort phase better.
[Diagram: the combined output files for each reducer on Nodes 1, 2 and 3 are copied across the network to Node 4 (Reducer 1, Key1) and Node 5 (Reducer 2, Key2 and Key3).]
Sort
Each reduce node will have received files from multiple Map nodes. So in this step MapReduce merges all those files into one and sorts all the input records to a Reducer by key (year of transaction in this case).
Please note the Sort phase runs by default; the Reduce phase must get its data sorted by keys. We can override the default behavior of sorting by key by extending the default class, for example to sort the input to Reduce by keys as well as values (or a part of the value) if our scenario expects that.
[Diagram: on Node 4 and Node 5 the records received during Shuffle are merged and sorted by key before being handed to Reduce.]
Reduce
The Reducer will sum all the transactions in a file to generate {Year of Sale, Sum of Transaction Amounts} as output.
Please note that if you are sending multiple keys to your Reduce phase, your Reduce program should be able to handle that. In the diagram below we have assumed this. In the example we have been working through in this chapter, however, we assumed each running Reducer instance gets only one key.
[Diagram: on Node 4 the sorted input (Key1, Sum(value1,value2)), (Key1,value7) reduces to (Key1, Sum(Sum(value1,value2),value7)); on Node 5 the Key2 records reduce to (Key2, Sum(value3,value4,value5)) and the Key3 record passes through as (Key3, Sum(value5,value6)).]
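For completeness, here is a hedged sketch of the Reduce program (the illustrative YearSumReducer used in the driver sketch). Hadoop calls reduce() once per key with all of that key's values, so the same code works whether a reducer instance receives one year or several, and the same class can double as the Combiner.

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class YearSumReducer
        extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

    private final DoubleWritable sum = new DoubleWritable();

    @Override
    protected void reduce(Text year, Iterable<DoubleWritable> amounts, Context context)
            throws IOException, InterruptedException {
        double total = 0;
        for (DoubleWritable amount : amounts) {
            total += amount.get();
        }
        sum.set(total);
        context.write(year, sum);   // emits (year of sale, summed amount)
    }
}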
Chapter 5 – A quick view of the ecosystem around Hadoop
By now we have understood the capabilities of Hadoop quite well. HDFS (Hadoop Distributed File System) offers distributed storage for huge amounts of data using a cluster of commodity hardware. The distribution of the data across the cluster is transparent to the end users or the applications interacting with HDFS. The data is also replicated within the cluster to provide failover in case a machine in the cluster goes kaput. MapReduce sits on top of HDFS and provides the capability to run MapReduce programs on data stored on the HDFS cluster.
[Diagram: MapReduce layered on top of HDFS.]
Over a period of time a lot of open source products have cropped up which either enhance the capability of Hadoop further or overcome the limitations of the Hadoop framework. These new products fall into the following four categories:
1. Ingestion – While we have huge storage available in HDFS, transferring huge amounts of data from the sources available within enterprises can be daunting. Products like Sqoop, Flume and Kafka offer the capability to move data from enterprise sources into HDFS and vice versa. While Sqoop is used for importing data from SQL data sources within the enterprise, Kafka and Flume are used to import data from non-SQL data sources (logs etc.). Kafka and Flume have some finer differences between them, and we will see those as we move forward.
2. Processing – While MapReduce offers the capability to process data stored on the HDFS cluster, in order to use MapReduce one must know how to code, and the coding required to develop MapReduce programs is quite involved. Many times you need your business users to be able to process the data stored on HDFS, and even for technology teams, developing MapReduce programs in Java or any other language can be inefficient. So frameworks or products were needed to ease the task of processing data stored on HDFS. Pig and Hive are such products. Hive offers a language, HQL, much like SQL, with which we can query the data in HDFS. Pig offers an easy-to-learn language called Pig Latin, with which ETL (extract, transform, load) style procedural programs can be developed to process the data on HDFS. Both HQL queries and Pig Latin programs eventually get converted into MapReduce programs at the back end and executed. Thus Pig and Hive offer a higher level of abstraction compared to the Java program one would otherwise have to write to develop a MapReduce program.
3. Real-time systems – MapReduce is designed for high-throughput processing rather than low-latency processing. It can process huge amounts of data, but it has some kick-start time; it is not designed to process data quickly and turn around. The initial start-up time Hadoop needs to identify the nodes for Map and Reduce, transfer the code to these nodes and kick off the processing makes it unsuitable for real-time processing where you need the response to your query/program quickly. Hbase offers such capability. Hbase uses the distributed storage offered by HDFS to offer key-based datastore services (refer to Chapter 2 – NoSQL databases); as described there, it is a column-family NoSQL database that uses HDFS to store its data.
4. Coordination – There are two products in this category that are used for designing big data solutions with Hadoop. Oozie is a workflow scheduler to manage Hadoop jobs. Zookeeper is used for coordination amongst the different products in the Hadoop ecosystem.
Keeping these products in mind, the ecosystem developed around Hadoop looks like this:
[Diagram: on top of HDFS and MapReduce sit Hive (SQL-like processing capability), Pig (ETL-like procedural capability) and Hbase (key-value store using HDFS); alongside them are Sqoop (ingest data from SQL data sources in the enterprise), Flume and Kafka (ingest data from non-SQL data sources in the enterprise), Oozie (Hadoop job scheduling) and Zookeeper (coordination amongst products).]
The subsequent chapters will each pick one product from the ecosystem and explain it in detail. Since we have already understood MapReduce, which is for processing data, we will take up the processing category (Hive, Pig, Hbase) first. Within this category we will take up Hive first: understanding Hive is easy, as the programming is done using HQL, which is very similar to SQL, which most of us understand well. Next we will take up Pig, which again is easy to understand as the programming language Pig Latin is very simple. Hbase is more difficult to understand compared to Hive and Pig, so we will take it up last in this category. Next we will take up the ingestion category of products. We will take up Sqoop first, the reason again being that this product is related to the SQL world, which we can all relate to. Next we will move to Flume, as it originated before Kafka; once we understand Flume we can identify its limitations and see how Kafka overcomes them. At last we will move to Oozie and Zookeeper, as understanding the other products in detail will help us appreciate these two products better.
Chapter 6 – Hive
Why Hive?
If we look back at the example of transaction processing we took up in chapter four, we are essentially doing the following to the transactions:
1. Select certain fields from each transaction.
2. Group the transactions by year (by sending them to different reducers).
3. Sum the transaction amount for each group.
If you are even remotely aware of the SQL world, the equivalent in SQL is something like
Select Transaction.Year, SUM(Transaction.Amount)
From Transaction
Group By Transaction.Year
In case we wanted to filter out some transactions from processing, we could have added the filter in the Map (just an if condition). Let's assume we want to process only those transactions which have 'Purchase' in a field named 'type'. In the Map program you develop for the processing, you would add an if condition to process only those transactions which have the value 'Purchase' in the field named 'type'. The SQL equivalent would be
Select Transaction.Year, SUM(Transaction.Amount)
From Transaction
Where Transaction.type='Purchase'
Group By Transaction.Year
Let's also consider a scenario where the transaction has another field named "ProductCode", which holds a numeric code for the financial product on which the transaction was done. We also have a file which has a mapping between "ProductCode" and "ProductName". If we need the field "ProductName" in the final output from the Reducer, and we also want to sum the transactions by Year and ProductName instead of only the year of transaction, the MapReduce processing would be modified as below.
Map:
1. Select transactions with 'Purchase' in the 'type' field for further processing in the Map.
2. Output year, product code and amount for each such transaction.
Partition:
1. Partition the transactions by year so that the transactions for each year go to a different reducer.
Combiner:
1. Sum the transactions in each partition by Year and ProductCode.
Shuffle:
1. Each Reducer picks up its files from the Map nodes.
Sort:
1. Sort the transactions by Year and ProductCode.
Reducer:
1. Load the file which has the ProductCode-ProductName mapping into memory.
2. Sum the input transactions by Year and ProductCode. This time this step sums the transactions coming from the different Map nodes (in the Combiner the same processing summed only the transactions on each individual node).
3. Just before writing a sum, read the ProductCode-ProductName mapping from memory (loaded earlier in the Reducer) to resolve the ProductCode in the output record to a ProductName.
4. Write the sum of transactions by year and product name to the output.
The SQL equivalent of the above processing would be
Select Transaction.Year, Product.ProductName, SUM(Transaction.Amount)
From Transaction, Product
Where Transaction.type='Purchase'
And Transaction.ProductCode=Product.ProductCode
Group By Transaction.Year, Product.ProductName
By now you would have noticed that it takes only a few lines of SQL to do the processing we are attempting in MapReduce. When it comes to writing Java programs for MapReduce:
1. The number of lines of code is large.
2. There are many libraries that need to be imported.
3. You need to know which out-of-the-box class to extend for your specific requirement.
4. There are variables to be defined, set and reset, and all the other complications involved in any programming.
5. There are steps for building the jar.
When you have so much raw data residing on HDFS, is there no easier way to process it? Is there no way a business person, or a person with a limited technology skill set, can process and analyze the data? Is there a tool/framework which can
1. take queries in a form similar to the SQL written above,
2. do the laborious work of developing the Map, Reduce, Partition and Combiner classes,
3. schedule as many Maps and Reducers as needed, and
4. produce the end result for the user?
That is what Hive does. It does all four points above and much more. Welcome to the world of Hive! Hive is a tool operating at a higher level than Hadoop. It takes away the difficult task of writing MapReduce programs.
It develops those programs based on the instructions given to it in the form of HQL (Hive Query Language), which is similar to SQL. Thus it brings the power of Hadoop within reach of people who are not programmers but know what logic needs to be applied to the data to analyze and process it. HQL, like SQL, is much easier to pick up than Java or any other programming language, and if one already knows SQL the learning curve is much gentler. Please note that since Hive is only a layer above Hadoop, it inherits the limitations of Hadoop:
1. Hive does not support row-level updates, inserts and deletes.
Hive architecture
The following diagram shows the architecture of Hive. Hive sits on top of Hadoop, taking away all the complications of writing Map and Reduce programs to process data.
[Diagram: the command line interface, web interface and Thrift server (with JDBC/ODBC) all talk to the Hive Driver, which uses the Metastore and submits work to Hadoop (MapReduce on HDFS).]
There are three ways to access Hive:
CLI: a typical command line interface where a user can write queries to load, read and process data.
HWI: the Hive Web Interface, an interface on the web serving the same purpose as the CLI.
Thriftserver: exposes Hive functionality to other applications that access Hive via JDBC or ODBC drivers.
Metastore: while all the data accessed by Hive is saved on HDFS, the data about databases and tables is stored in the Metastore. We will see what kind of data is stored in the Metastore in the subsequent sections.
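As a hedged illustration of the Thriftserver/JDBC access path, an application could talk to Hive roughly as sketched below. The driver class, URL, port, credentials and table are placeholders, and the exact driver and URL scheme depend on the Hive version deployed (a HiveServer2-style setup is assumed here).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        // Assumed HiveServer2 JDBC driver; older Thrift setups use a different class.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver-host:10000/transproc", "user", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT transid, transamount FROM transactions LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
            }
        }
    }
}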
Creating tables and types of tables
Hive is a layer sitting above Hadoop. It only extends the functionality of Hadoop by letting the user provide inputs in the form of Hive Query Language (HQL) rather than low-level programs. In this and the subsequent sections we will take a few examples of HQL to understand what it does for the user and what it does in Hadoop at the back end. While doing this, we will avoid trying to cover each and every option or variation possible with an HQL command; the intent is to explain the core functionality of the product without really getting into a lot of code. Our logical flow will be to first understand the HQL for defining databases and tables, then move on to the HQL for loading data into these tables, and finally understand the HQL for processing the data. All along, we will also understand what these HQL statements eventually do on Hadoop.
Let's assume that in addition to the transaction file (with all the details of the mutual fund transactions) we also have another file which provides the mapping between the mutual fund product ID (the financial product on which the transaction was executed) and the mutual fund name (the name of the product). In order to load and process the data available in these two files we will first create the database and tables to store this data.
Create database if not exists transproc
Here "if not exists" tells Hive not to create the database if it already exists, and transproc is the name of the database. The command creates a directory in HDFS named transproc. A database is just an umbrella folder to contain and organize all the tables. An entry is also made in the Metastore about this new database. Once your database is created you can create tables within it with commands very similar to the one we used for creating the database:
Create table if not exists transproc.transactions (transid STRING, transamount FLOAT, …)
Create table if not exists transproc.prodinfo (prodid STRING, prodname STRING)
Here transproc is the database within which the table needs to be created, transactions (or prodinfo) is the name of the table, and the column list describes the layout of the table. The above commands create two subdirectories within the transproc directory and also register the two new tables with the Metastore.
Internal and External tables
There are two types of tables in Hive. If Hive keeps ownership of the data stored in a table, the table is called an internal or managed table. In the case of an internal/managed table, when the table is dropped, the data in HDFS as well as the reference to the table in the Metastore is deleted. If Hive does not keep ownership of the data stored in a table, the table is called an external table. In the case of an external table, when the table is dropped, only the reference to the table in the Metastore is deleted; the data in HDFS is not. So the table stops existing for Hive, but the data is still retained in Hadoop.
External tables are used to exchange data between multiple applications. E.g. in our case of mutual fund transaction processing, it may be that the product data (product ID to product name mapping) is not owned by the department responsible for processing the transactions (a typical scenario). In such a case, the product department would make the product information available in a flat file at some HDFS location. The transaction processing application would define an external table on top of this data. When the transaction processing is done it could drop the external table, but that would not delete the product data in the flat file; that data might be referenced by other applications as well. If we do not mention in our HQL command whether the table is internal or external, Hive assumes it to be internal. The command to create an external table is
Create external table if not exists transproc.prodinfo (prodid STRING, prodname STRING) row format delimited fields terminated by ',' location 'location of the external file'
Here "row format delimited fields terminated by ','" informs Hive that the fields in the external file are separated by commas, and the location clause provides the location of the external file.
Internal partitioned tables
Let's look back at the query that creates the transaction table:
Create table if not exists transproc.transactions (transid STRING, transamount FLOAT, ProductID STRING, SubproductID STRING, …)
Assume that SubproductID indicates a variety of a product. So a product can have different varieties, each indicated by the subproduct ID. Now let's assume that we know the access pattern for this table. By access pattern I mean we know that when the data is accessed, it will mostly be accessed for a specific Product ID and/or Subproduct ID, and that the data would generally not be accessed for many or all the product IDs at the same time. The above HQL command for creating a table would create one single directory for the table; all the data for the table would be in that one directory. Every time the table is accessed, Hive (and HDFS at the back end) would have to find the data for that particular product and/or subproduct ID in order to fetch it. The directory structure created in HDFS by the above command would be
../transproc/transactions
Instead, if we know that the data will typically be accessed by product ID and/or subproduct ID, we can segregate the data within the directory into separate subdirectories for product ID and subproduct ID. This is called partitioning. The command to create the partitioned table is:
Create table if not exists transproc.transactions (transid STRING, transamount FLOAT, …) Partitioned by (ProductID STRING, SubproductID STRING)
The above command creates a directory like
../transproc/transactions
As and when data is added to this table, separate subdirectories are created within the transactions directory in HDFS for each ProductID and SubproductID combination.
Load data local inpath 'path from where data needs to be picked up' Into table transactions Partition (ProductID='A', SubproductID='1')
The above command creates a subdirectory like
../transproc/transactions/ProductID='A'/SubproductID='1'
Any time data is loaded into this table, the load command has to specify the partition information and the data is loaded into the directory structure for that partition. Please also note that the table schema no longer has the columns which are part of the partition. There is no need to save ProductID and SubproductID in the table itself, as this information can be derived from the path of the partition.
If data has to be read for a specific ProductID and SubproductID combination, the HQL command would be
Select * from transproc.transactions where ProductID='A' and SubproductID='1'
This command makes Hive read only the specific subdirectory we created earlier. Partitioning improves the performance of Hive, as it only has to read a specific subdirectory to fetch the data. If the command is modified as below, Hive will read all the subdirectories within the subdirectory ../transproc/transactions/ProductID='A'
Select * from transproc.transactions where ProductID='A'
If the typical access pattern is not to access the data for a specific ProductID and SubproductID combination, then it is not a good idea to create these partitions. If you create partitions by ProductID and SubproductID but end up writing queries that read data across multiple ProductIDs and SubproductIDs, Hive will have to scan multiple subdirectories, and that will hurt its performance.
External partitioned tables
Just like internal tables, external tables can be partitioned. Since the data is not managed by Hive, it assumes that the data at the external location is segregated as per the partition keys.
Create external table if not exists transproc.prodinfo (subprodid STRING, subprodname STRING) partitioned by (prodid STRING) row format delimited fields terminated by ','
Please note we do not declare the location of the data for a partitioned external table as we would for a non-partitioned external table. That needs to be done separately, using an alter command for each partition.
Loading Data
You can load data into tables from a file. If we need to load data into our transaction table, the HQL command would be
Load data local inpath 'path of the file here' Overwrite into table transproc.transactions Partition (ProductID='A', SubproductID='1')
Please note the table should have been defined with partitions on ProductID and SubproductID. The Overwrite clause will overwrite (as is obvious from the name) the existing data present in the table/partition. Hive will create a directory in HDFS for this ProductID and SubproductID combination if it does not already exist. If the table is not partitioned, you can skip the partition clause.
You can even read data from one table and insert it into another table. E.g. if we assume the transaction records were present in another table, where they were loaded for initial clean-up by the business, we can write a query like the one below to load the data into our transactions table:
From PreProdTransactions
Insert Overwrite table transactions Partition (ProductID='A', SubproductID='1')
Select * where PreProdTransactions.ProductID='A' and PreProdTransactions.SubproductID='1'
Insert Overwrite table transactions Partition (ProductID='A', SubproductID='2')
Select * where PreProdTransactions.ProductID='A' and PreProdTransactions.SubproductID='2'
Insert Overwrite table transactions Partition (ProductID='B', SubproductID='1')
Select * where PreProdTransactions.ProductID='B' and PreProdTransactions.SubproductID='1'
Insert Overwrite table transactions Partition (ProductID='B', SubproductID='2')
Select * where PreProdTransactions.ProductID='B' and PreProdTransactions.SubproductID='2'
The above query scans the PreProdTransactions table once and then populates the partitions of the transactions table based on the partition clauses. A more concise way of writing the above query is
From PreProdTransactions
Insert Overwrite table transactions Partition (ProductID, SubproductID)
Select …, PreProdTransactions.ProductID, PreProdTransactions.SubproductID
In this case Hive itself analyzes the data present in the PreProdTransactions table and creates as many partitions in the transactions table as there are unique combinations of ProductID and SubproductID in the PreProdTransactions table.
Reading Data from Hive