
ISSN No. 0976-5697
Volume 8, No. 5, May-June 2017
International Journal of Advanced Research in Computer Science
RESEARCH PAPER
Available Online at www.ijarcs.info

MapReduce with Hadoop for Simplified Analysis of Big Data

Ch. Shobha Rani
Research Scholar
Department of Computer Science
Kakatiya University, Warangal, Telangana

Dr. B. Rama
Assistant Professor
Department of Computer Science
Kakatiya University, Warangal, Telangana

Abstract: With the development of web-based applications and mobile computing technology, data and the computations and analyses performed on it have grown rapidly and continuously in recent years. Various fields around the globe face a big problem with this large-scale data, which strongly supports decision making. Traditional relational DBMSs are unable to handle this Big Data, and most classical data mining methods are also not suitable for it. Efficient algorithms are required to process Big Data. Among the many parallel approaches, MapReduce has been adopted by many large and popular IT companies such as Google, Yahoo and Facebook. In the Big Data world, MapReduce has been playing a vital role in meeting the increasing demands on computing resources caused by voluminous data sets. MapReduce is a popular programming model suitable for Big Data analysis in distributed and parallel computing, and its high scalability is one of the reasons for adopting this model. Hadoop is an open-source, distributed programming framework which enables the storage and processing of large data sets. [1] In this paper we focus especially on MapReduce with Hadoop for the analytical processing of Big Data.

Keywords: Big Data, Hadoop, MapReduce, Big Data Analytics.

I. INTRODUCTION

In the current era, enormous amounts of data are being generated continuously, day by day. With this rapid expansion of data, we are moving from the petabyte age to the exabyte and zettabyte age. At the same time, new technologies progressing at high speed make it possible to organize and manipulate the voluminous amounts of data presently being generated. With this trend there is a greater demand for new data storage and analysis methods. [2] In particular, the real-world task of extracting knowledge from huge data sets has become of utmost importance.
"Big Data" is the biggest observable phenomenon that has captured the attention of the modern computing industry since the global expansion of the Internet. Big Data is gaining popularity today because the technological revolutions that have emerged provide the capability to process data of multiple formats and structures without worrying about the constraints associated with traditional systems and database platforms.

II. IMPORTANCE OF BIG DATA

Big Data can be defined as large volumes of data, either structured or unstructured, generated at high speed globally by various new technological devices. Big Data includes the data that is generated every second by sensors, mobile devices, and consumer-driven data from social networks. Big Data is evolving from various facets within organizations: legal, sales, marketing, procurement, finance, human resources departments, etc.

Fig. 1: 5 V's of Big Data

Along with the three V's, there also exist ambiguity, viscosity, and virality.
• Ambiguity — comes into existence when the metadata lags behind the clarity of the data in Big Data. For example, in a graph, 1 and 0 can depict degree or can depict status as true and false.
• Viscosity — measures the resistance (slow-down) to flow in the volume of data. Resistance can manifest in dataflow, business rules, and even be a limitation of technology. For example, social network predictions come under this category, where a number of enterprises just cannot understand what impact there is on business and how it resists the usage of the data in many cases.
• Virality — measures and describes how quickly data is shared in a people-to-people (peer) network.

• Big Data is non-relational.
• MapReduce is complementary to the DBMS, not a competing technology. [3]
• Parallel DBMSs are for efficient querying of large data sets. [4]

• Big Data exists mostly in a real-time manner rather than in traditional Data Warehouse applications. [8]
• Traditional DW architectures (like Exadata and Teradata) are not well suited for Big Data applications.
• Architectures like shared-nothing and massively parallel processing are very well suited for Big Data applications.
• MR-style systems are suitable for complex analytics and especially for ETL tasks.
• Parallel DBMSs require data to fit into the traditional relational representation of rows and columns.
• In contrast, the MapReduce architecture does not require data files to stick to a particular schema such as the relational data model. That is, the MR programmer can structure the data in any manner, or even have no structure at all. This is possible because MR supports both structured and unstructured data.

Big Data Pillars
• Big Table – consisting of relational tables.
• Big Text – comprising text in the form of structured and semi-structured data, natural language, and semantic data. [4]
• Big Metadata – collects and stores the data about the data stored in Big Data.
• Big Graphs – graphs include connections between objects, their semantic discovery, the degree of separation, linguistic analytics, and subject predicates.

III. MAPREDUCE

MapReduce is an emerging programming paradigm designed for processing extremely large volumes of data in parallel by splitting a job into various independent tasks. [3] A MapReduce program in general is a combination of a Map() function and a Reduce() function. The job of Map() is to perform filtering and sorting operations, such as sorting customers by first name into queues by generating one queue for each name, while Reduce() performs summary/aggregate operations, such as counting the number of customers in each queue, thereby yielding the name counts. [3] The "MapReduce system", also known as the MapReduce "framework" or "architecture", orchestrates the processing on the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for data redundancy and fault tolerance. [3]
MapReduce is a framework for processing voluminous data split and distributed across huge data sets using a large number of computers (nodes). The group of nodes is collectively treated as a cluster if all nodes have similar hardware configurations and work on the same local network, or as a grid if the nodes are geographically distributed with varying hardware specifications. Processing may occur on data that is stored either in system log files (unstructured) or in a database (structured). MapReduce takes advantage of data locality to minimise the data transfer distance.

Fig. 2: MapReduce workflow

Map Phase: In the map phase, the master node takes the input, divides it into smaller sub-tasks, and distributes them to worker nodes. A worker node may do this again repeatedly, leading to a multi-level tree structure. The worker node processes its smaller task and passes the intermediate result back to its master node.
Reduce Phase: During the reduce phase, the master node collects the intermediate outputs of all the sub-tasks generated by the various worker nodes and combines them in some way to form the final output – the solution to the problem it was originally trying to solve.
a) Input reader: The input reader splits the input file into splits of appropriate size (in practice typically 64 MB to 512 MB, as per HDFS), and one split is assigned to one Map function by the MapReduce framework. The input reader takes its input from stable storage (in our case, typically the Hadoop Distributed File System) and generates the output as key/value pairs.
b) Map function: Each Map function takes a series of key/value pairs generated by the input reader, processes each, and in turn produces zero or more output key/value pairs. [5] The input and output types of the map can be, and often are, different from each other.
c) Partition function: Each Map function output is assigned to a particular reducer by the application's partition function for sharding purposes. The partition function is given the key and the number of reducers as input and returns the index of the desired reducer.
d) Comparison function: The input for every Reduce is fetched from the machine where the Map ran and is sorted using the comparison function.
e) Reduce function: The framework calls the application's Reduce function for each unique key in sorted order. It iterates through the values associated with that key and produces zero or more outputs. [6]
f) Output writer: It writes the output of the Reduce function to stable storage, usually the Hadoop Distributed File System.

Performance:
MapReduce programs do not necessarily produce their output quickly. The main benefit of this programming model is to make use of the optimized shuffle operation of the platform, so that the programmer only has to write the Map and Reduce functions of the program. In practice, however, the author of a MapReduce program also has to take the shuffle of the intermediate results into account. [8] The partition function and the amount of data generated by the Map function highly influence the performance of the program. In addition to the partitioner, a Combiner function helps to reduce the amount of data written to storage (disk) and transmitted over the network.
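To make the Map and Reduce roles above concrete, the following sketch shows the classic word-count job written against Hadoop's Java MapReduce API (org.apache.hadoop.mapreduce). It is an illustrative example rather than code from the paper: the Map function emits a (word, 1) pair for every token it reads, and the Reduce function sums the counts delivered for each unique key by the shuffle/sort step described above.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map function: input key = byte offset of the line, input value = the line text.
    // Output: one (word, 1) intermediate pair per token.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce function: called once per unique word with all of its counts.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}

The same reducer class can also be registered as a combiner, which is what the performance note above refers to: partial sums are computed on the map side, cutting down the data written to disk and shuffled across the network.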


IV. HADOOP

Apache Hadoop [1] is an open-source software framework in Java used mainly for distributed storage and processing of extremely large data sets on computer clusters. Apache Hadoop is mainly composed of a storage part (the Hadoop Distributed File System, HDFS) and a processing part (MapReduce). Splitting files into large blocks and distributing them to the nodes in the cluster is taken care of by Hadoop itself. It is not the job of the programmer to handle the distribution over the cluster; Hadoop looks after it. In Hadoop the data processing is done with MapReduce by transferring the program code to the nodes in parallel, based on the data requirements of each node, i.e., the processing code travels to the node. [7]
The Hadoop framework is encapsulated with the following modules: [1]
• Hadoop Common
• Hadoop Distributed File System (HDFS)
• Hadoop MapReduce

Fig. 3: HDFS architecture

V. MAPREDUCE IMPLEMENTATION

While designing MapReduce programs, the user may not specify the number of mappers, since it depends on the file size and the block size, whereas the number of reducers can be configured by the user based on the number of mappers. In general the Partitioner decides which reducer to use, or else Hadoop takes over that job. With the help of the combiner the network traffic is greatly reduced.
If map() is not defined by the user, the output of the RecordReader is sent to the identity mapper (without any logic) and then to reduce; if no reducer is defined in the program, the output of the identity reducer is stored in the data node itself and is not sent to HDFS.
When multiple mappers are running, there may be a situation where some mappers run very slowly. Hadoop identifies such slow-running tasks and launches the same task on another data node; this concept is called speculative execution in Hadoop.
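As an illustration of these configuration points (again a sketch, not code taken from the paper, using hypothetical paths and the WordCount classes from the earlier example), a typical Hadoop driver might look as follows. The number of map tasks is not set explicitly because it follows from the input splits (file size and block size), while the number of reduce tasks, the combiner, and speculative execution are configured through the Job and Configuration API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Speculative execution: re-launch unusually slow map/reduce tasks on another node.
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", true);

        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCount.TokenizerMapper.class);
        // Registering the reducer as a combiner aggregates on the map side
        // and reduces the volume of data shuffled over the network.
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);

        // Reducers are configured explicitly; mappers follow from the input splits.
        job.setNumReduceTasks(2);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path must not already exist

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Such a driver is typically packaged into a jar and launched with the hadoop jar command against input that has already been copied into HDFS.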

Fig. 4: MapReduce program execution sequence (INPUT → SPLITTING → MAPPING → SHUFFLE/SORT → REDUCE → OUTPUT)

Consider a sample input file consisting of the text:

hello hadoop bye hadoop
hello google goodbye google

The internal execution process will be as follows:
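Following the Fig. 4 sequence, a reconstructed trace of this word-count example (an illustration, as the original figure is not reproduced here) is:

Splitting: the file is divided into two records, one per line: "hello hadoop bye hadoop" and "hello google goodbye google".
Mapping: each record is tokenized and emitted as intermediate (word, 1) pairs:
(hello, 1) (hadoop, 1) (bye, 1) (hadoop, 1)
(hello, 1) (google, 1) (goodbye, 1) (google, 1)
Shuffle/Sort: the pairs are grouped by key and sorted:
(bye, [1]) (goodbye, [1]) (google, [1, 1]) (hadoop, [1, 1]) (hello, [1, 1])
Reduce: the values for each key are summed, giving the final output:
bye 1, goodbye 1, google 2, hadoop 2, hello 2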


VII. CONCLUSION

With the advent of new technologies emerging at a rapid rate, one must be very careful to understand the global competition and the big data analysis that supports decision making. This paper analyzes the concept of big data analysis and how it can be simplified compared with existing traditional relational database technologies. The paper describes the Hadoop environment, its architecture, and how it can be implemented using MapReduce along with its various functions. As Big Data analysis is still in its infancy, we are sure that this paper helps researchers to better understand the concepts of Big Data and its processing and analysis. Big Data will definitely bring a major social change. Though tools such as R and SPSS are evolving for Big Data analytics, further research is still required to ensure integrity and security for the large data sets being processed. Big Data Analytics should be exploited for a sustainable and unbiased society.

REFERENCES

[1] Apache Hadoop, http://hadoop.apache.org, 2010.
[2] V. Patil, V. B. Nikam, "Study of Mining Algorithm in Cloud Computing using MapReduce Framework", Journal of Engineering, Computers & Applied Sciences (JEC&AS), Vol. 2, No. 7, July 2013.
[3] MapReduce, https://en.wikipedia.org/wiki/MapReduce
[4] D. Usha, A. P. S. Aslin Jenil, "A Survey of Big Data Processing in Perspective of Hadoop and MapReduce", International Journal of Current Engineering and Technology, Vol. 4, No. 2, April 2014.
[5] S. Ghemawat et al., "The Google File System", ACM SIGOPS Operating Systems Review, 37(5):29–43, 2003.
[6] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", in Proceedings of OSDI'04: Sixth Symposium on Operating System Design and Implementation, December 2004.
[7] T. White, "Hadoop: The Definitive Guide", Yahoo Press, 2010.
[8] P. Russom, "Big Data Analytics", TDWI Best Practices Report, pp. 1-40, 2011.

