Big Data Analysis
PREPARED BY
ASST.PROF. SANTOSH KUMAR RATH
GOVERNMENT COLLEGE OF ENGINEERING KALAHANDI,
BHAWANIPATNA
3 Vs of Big Data
Volume
We currently see exponential growth in data storage, as data is now much more than just text. We can find data in the form of videos, music and large images on our social media channels. It is very common for enterprises to have storage systems of terabytes and even petabytes. As the database grows, the applications and architecture built to support the data need to be re-evaluated quite often. Sometimes the same data is re-evaluated from multiple angles, and even though the original data is the same, the newly found intelligence creates an explosion of the data. This big volume indeed represents Big Data.
Velocity
The data growth and social media explosion have changed how we look at data. There was a time when we used to believe that yesterday's data is recent. As a matter of fact, newspapers still follow that logic. However, news channels and radio have changed how fast we receive the news. Today, people rely on social media to keep them updated with the latest happenings. On social media, sometimes a message only a few seconds old (a tweet, a status update, etc.) is no longer something that interests users. They often discard old messages and pay attention to recent updates. The data movement is now almost real time, and the update window has reduced to fractions of a second. This high-velocity data represents Big Data.
Variety
Data can be stored in multiple formats: for example, in a database, Excel, CSV or Access file, or, for that matter, in a simple text file. Sometimes the data is not even in a traditional format as we assume; it may be in the form of video, SMS, PDF or something we might not have thought about. It is the organization's need to arrange it and make it meaningful. It would be easy to do so if we had data in the same format; however, that is not the case most of the time. The real world has data in many different formats, and that is the challenge we need to overcome with Big Data. This variety of data indeed represents Big Data.
In a Big Data architecture, various components are associated with each other. Many different data sources are part of the architecture; hence extraction, transformation and integration form some of the most essential layers of the architecture.
Most of the data is stored in relational as well as non-relational data marts and data warehousing solutions. As per the business need, various data are processed and converted into proper reports and visualizations for end users. Just like the software, the hardware is among the most important parts of the Big Data architecture: the hardware infrastructure is extremely important, and failover instances as well as redundant physical infrastructure are usually implemented.
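Since many different sources feed the architecture, the extract, transform and integration layer is worth a small illustration. Below is a minimal ETL sketch in Java; the file names (sales_raw.csv, sales_clean.csv) and the three-column record format are purely hypothetical assumptions for the example.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

public class MiniEtl {
    public static void main(String[] args) throws IOException {
        // Extract: read raw records from a source file (hypothetical path).
        List<String> raw = Files.readAllLines(Path.of("sales_raw.csv"));

        // Transform: drop blank lines, normalize case,
        // and keep only well-formed three-column rows.
        List<String> clean = raw.stream()
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .map(String::toLowerCase)
                .filter(line -> line.split(",", -1).length == 3)
                .collect(Collectors.toList());

        // Load: write the integrated records to the target store (here, a file).
        Files.write(Path.of("sales_clean.csv"), clean);
    }
}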
NoSQL in Data Management
NoSQL is a very famous buzzword, and it really means Not Relational SQL or Not Only SQL. This is because in a Big Data architecture the data can be in any format. It can be unstructured, relational, in any other format, or from any other data source. Relational technology alone is not enough to bring all this data together; hence new tools, architectures and algorithms have been invented that take care of all kinds of data. These are collectively called NoSQL.
Lots of people think that NoSQL means there is no SQL, which is not true; the two sound the same, but the meanings are totally different. NoSQL does use SQL, but it uses more than SQL to achieve its goal. As per Wikipedia's definition of a NoSQL database: "A NoSQL database provides a mechanism for storage and retrieval of data that uses looser consistency models than traditional relational databases."
Why use NoSQL?
A traditional relational database usually deals with predictable, structured data. As the world has moved forward with unstructured data, we often see the limitations of the traditional relational database in dealing with it. For example, nowadays we have data in the form of SMS messages, wave files, photos and videos. It is a bit difficult to manage them by using a traditional relational database. People often use a BLOB field to store such data. A BLOB can store the data, but when we have to retrieve or process it, the BLOB is extremely slow at handling unstructured data. A NoSQL database is the type of database that can handle the unstructured, unorganized and unpredictable data that our business needs.
Along with support for unstructured data, the other advantages of a NoSQL database are high performance and high availability.
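To make the schema-less idea concrete, here is a toy in-memory document store in Java. It only illustrates the NoSQL concept of storing records with different shapes side by side; every class and method below is hypothetical, and real products (MongoDB, Cassandra, etc.) are far more sophisticated.

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TinyDocumentStore {
    // Each "document" is a free-form map of field names to values: no fixed schema.
    private final Map<String, Map<String, Object>> docs = new ConcurrentHashMap<>();

    public void put(String key, Map<String, Object> document) {
        docs.put(key, document);
    }

    public Map<String, Object> get(String key) {
        return docs.get(key);
    }

    public static void main(String[] args) {
        TinyDocumentStore store = new TinyDocumentStore();

        // Two records with completely different fields coexist,
        // which a fixed relational schema would not allow.
        Map<String, Object> tweet = new HashMap<>();
        tweet.put("user", "@alice");
        tweet.put("text", "hello big data");

        Map<String, Object> photo = new HashMap<>();
        photo.put("user", "@bob");
        photo.put("bytes", new byte[] {1, 2, 3});
        photo.put("caption", "sunset");

        store.put("doc:1", tweet);
        store.put("doc:2", photo);
        System.out.println(store.get("doc:2").get("caption")); // prints: sunset
    }
}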
Hadoop
What is Hadoop?
Apache Hadoop is an open-source, free, Java-based software framework that offers a powerful distributed platform to store and manage Big Data. It is licensed under the Apache v2 license. It runs applications on large clusters of commodity hardware and processes thousands of terabytes of data on thousands of nodes. Hadoop is inspired by Google's MapReduce and Google File System (GFS) papers. The major advantage of the Hadoop framework is that it provides reliability and high availability.
What are the core components of Hadoop?
There are two major components of the Hadoop framework, and each of them performs one of its important tasks.
Hadoop MapReduce is the method to split a larger data problem into smaller chunks and distribute them to many different commodity servers. Each server has its own set of resources and processes its chunk locally. Once a commodity server has processed its data, it sends the result back collectively to the main server. This is effectively a process by which we handle large data effectively and efficiently. (We will understand this in a later section.)
Hadoop Distributed File System (HDFS) is a virtual file system. There is a big difference between HDFS and any other file system. When we move a file onto HDFS, it is automatically split into many small pieces. These small chunks of the file are replicated and stored on other servers (usually 3) for fault tolerance and high availability. (We will understand this in a later section.)
Besides the above two core components, the Hadoop project also contains other modules, such as Hadoop Common (the shared libraries and utilities) and Hadoop YARN (the resource management and job scheduling framework).
A small Hadoop cluster includes a single master node and multiple worker or slave nodes. As discussed earlier, the entire cluster contains two layers: one is the MapReduce layer and the other is the HDFS layer. Each of these layers has its own relevant components. The master node consists of a JobTracker, TaskTracker, NameNode and DataNode. A slave or worker node consists of a DataNode and a TaskTracker. It is also possible for a slave or worker node to act only as a data node or only as a compute node; as a matter of fact, that is a key feature of Hadoop.
Why Use Hadoop?
There are many advantages of using Hadoop. Let me quickly list them here:
Robust and Scalable: We can add new nodes as needed, as well as modify them.
Affordable and Cost Effective: We do not need any special hardware for running Hadoop. We can just use commodity servers.
Adaptive and Flexible: Hadoop is built keeping in mind that it will handle structured as well as unstructured data.
Highly Available and Fault Tolerant: When a node fails, the Hadoop framework automatically fails over to another node.
MapReduce
What is MapReduce?
MapReduce was designed by Google as a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. Though MapReduce was originally Google's proprietary technology, it has become quite a generalized term in recent times.
MapReduce comprises a Map() procedure and a Reduce() procedure. The Map() procedure performs filtering and sorting operations on the data, whereas the Reduce() procedure performs a summary operation on the data. This model is based on modified concepts of the map and reduce functions commonly available in functional programming. Libraries providing the Map() and Reduce() procedures have been written in many different languages. The most popular free implementation of MapReduce is Apache Hadoop, which we discussed above.
A MapReduce job typically flows through the following phases:
Input Reader: reads the input data and divides it into splits for the map tasks.
Map Function: processes each input record and emits intermediate key/value pairs.
Partition Function: assigns each intermediate key to one of the reducers.
Compare Function: sorts the intermediate keys before they reach a reducer.
Reduce Function: aggregates all values that share a key into a summary result.
Output Writer: writes the final output to stable storage.
The word-count sketch below shows how the Map and Reduce functions fit together.
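To make this concrete, here is the canonical word-count example written against the standard Apache Hadoop MapReduce Java API: the mapper emits (word, 1) pairs, and the reducer sums the counts for each word. Input and output paths are supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: split each line into words and emit (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum all the counts emitted for the same word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}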
Let us understand how HDFS works. A Client App or HDFS client connects to the NameNode as well as to the DataNodes. The Client App's access to the DataNodes is regulated by the NameNode: the NameNode grants access by allowing the client to connect to the appropriate DataNodes directly. A big data file is divided into multiple data blocks (let us assume that those data blocks are A, B, C and D). The Client App then writes the data blocks directly to the DataNodes. The Client App does not have to write to all the nodes; it just has to write to any one of them, and the NameNode decides on which other DataNodes the data has to be replicated. In our example, the Client App writes directly to DataNode 1 and DataNode 3. The data blocks are then automatically replicated to other nodes, and all the information about which data block is placed on which DataNode is written back to the NameNode.
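A minimal sketch of this flow using the HDFS Java client API is shown below. The client calls create() once and simply writes bytes; block placement and replication are handled behind the scenes by the NameNode and DataNodes. The NameNode address (hdfs://namenode:9000) and the file path are assumptions for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; replace with your cluster's fs.defaultFS value.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/sample.txt");

            // The client writes once; HDFS splits the file into blocks and
            // replicates each block (usually to 3 DataNodes) automatically.
            try (FSDataOutputStream out = fs.create(file)) {
                out.writeUTF("hello hdfs");
            }

            // The NameNode tracks the replication factor for the file.
            System.out.println("Replication factor: "
                    + fs.getFileStatus(file).getReplication());
        }
    }
}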
High Availability During Disaster
Now, as multiple DataNodes hold the same data blocks, if any DataNode faces a disaster, the entire process continues: the other DataNodes assume the role of serving the specific data blocks that were on the failed node. This system provides very high tolerance to disaster and provides high availability.
If you notice, there is only a single NameNode in our architecture. If that node fails, our entire Hadoop application will stop performing, as it is the single node where we store all the metadata. As this node is very critical, it is usually replicated on another cluster as well as on another data rack. Though that replicated node is not operational in the architecture, it has all the necessary data to perform the tasks of the NameNode in case the NameNode fails.
The entire Hadoop architecture is built to function smoothly even when there are node failures or hardware malfunctions. It is built on the simple concept that the data is so big that it is impossible to come up with a single piece of hardware which can manage it properly. We need lots of commodity (cheap) hardware to manage our big data, and hardware failure is part and parcel of commodity servers. To reduce the impact of hardware failure, the Hadoop architecture is built to overcome the limitations of non-functioning hardware.
NewSQL
NewSQL stands for new scalable and high-performance SQL database vendors. The products sold by NewSQL vendors are horizontally scalable. NewSQL is not a kind of database; rather, it is about vendors who support emerging data products with relational database properties (like ACID transactions) along with high performance. Products from NewSQL vendors usually rely on in-memory data for speedy access and offer immediate scalability.
NewSQL is our shorthand for the various new scalable/high performance SQL database vendors.
We have previously referred to these products as ScalableSQL to differentiate them from the
incumbent relational database products. Since this implies horizontal scalability, which is not
necessarily a feature of all the products, we adopted the term NewSQL in the new report. And to
clarify, like NoSQL, NewSQL is not to be taken too literally: the new thing about
the NewSQL vendors is the vendor, not the SQL.
In other words, NewSQL incorporates the concepts and principles of Structured Query Language (SQL) and NoSQL languages. It combines the reliability of SQL with the speed and performance of NoSQL.
Categories of NewSQL
There are three major categories of NewSQL:
New Architecture: In this framework, each node owns a subset of the data, and queries are split into smaller queries that are sent to the nodes to process the data, e.g. NuoDB, Clustrix, VoltDB.
MySQL Engines: Highly optimized storage engines that keep the interface of MySQL are examples of this category, e.g. InnoDB, Akiban.
Transparent Sharding: These systems automatically split the database across multiple nodes.
Since NewSQL products keep the relational interface, a client typically talks to them through standard SQL, as sketched below.
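The sketch below uses only the standard java.sql (JDBC) API, which most NewSQL products expose just like a traditional RDBMS. The JDBC URL, the credentials and the orders table are hypothetical placeholders; substitute the driver and URL of whichever NewSQL product is in use.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class NewSqlJdbcExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical JDBC URL; each NewSQL vendor ships its own driver.
        String url = "jdbc:vendor://host:port/mydb";

        try (Connection conn = DriverManager.getConnection(url, "user", "password")) {
            // NewSQL systems retain full ACID transactions.
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO orders (id, amount) VALUES (?, ?)")) {
                ps.setInt(1, 1);
                ps.setDouble(2, 99.50);
                ps.executeUpdate();
            }
            conn.commit();

            // Plain SQL queries work exactly as they would on a classic RDBMS.
            try (PreparedStatement ps = conn.prepareStatement("SELECT COUNT(*) FROM orders");
                 ResultSet rs = ps.executeQuery()) {
                if (rs.next()) {
                    System.out.println("Rows in orders: " + rs.getInt(1));
                }
            }
        }
    }
}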