Chicago Crime (2013) Analysis Using Pig and Visualization Using R

Hadoop Framework

Maharishi Arvind Institute of Engineering and Technology, Jaipur

Department of Computer Science

PROJECT REPORT

Chicago Crime (2013) Analysis Using Pig and Visualization Using R

Project Report for the Training Completion and Project Made During Training Period

Submitted By:

Saurabh Sharma 14EMTCS060

Maharishi Arvind Institute of Engineering and Technology

2017-18

Index

Contents

1) Acknowledgement

2) Abstract

3) Chapter 1 - Introduction of Hadoop

4) Chapter 2 - History of Hadoop

5) Chapter 3 -

3.1 MapReduce
3.2 Programming Model
3.3 Map
3.4 Reduce
3.5 HDFS
3.6 HDFS Concepts
3.7 Hadoop File System

6) Chapter 4 -

4.1 Hadoop Ecosystem

7) Chapter 5 -

5.1 Applications of Hadoop

8) Chapter 6 -

6.1 Modules of Project

9) Chapter 7 -

7.1 Visualization using R Programming Language

10) Chapter 8 - Conclusion

11) References

ACKNOWLEDGEMENT

I have put considerable effort into this project. However, it would not have been possible without the kind
support and help of many individuals and organizations. I would like to extend my sincere
thanks to all of them.
I am highly indebted to CEG for their guidance and constant supervision, for
providing the necessary information regarding the project, and for their support in completing
it.
I would like to express my gratitude towards my parents and the members of CEG for their kind
cooperation and encouragement, which helped me complete this project.
I would like to express my special gratitude and thanks to the industry professionals who gave me
their attention and time.
My thanks and appreciation also go to my colleagues who helped develop the project and to the people
who willingly helped me with their abilities.

ABSTRACT

My topic is Hadoop, a cluster computing framework. Apache Hadoop is a software
framework that supports data-intensive distributed applications under a free license. Hadoop
was inspired by Google's MapReduce and Google File System (GFS) papers. Hadoop,
however, was designed to solve a different problem: the fast, reliable analysis of both
structured data and complex data. As a result, many enterprises deploy Hadoop alongside
their legacy IT systems, which allows them to combine old data and new data sets in
powerful new ways. The Hadoop framework is used by major players
including Google, Yahoo and IBM, largely for applications involving search engines and
advertising. I am going to present the history, development and current state of this
technology. Hadoop is developed as a project of the Apache Software Foundation, with
commercial distributions from vendors such as Cloudera.

CHAPTER 1:
INTRODUCTION TO
HADOOP

Today, we're surrounded by data. People upload videos, take pictures on their cell phones,
text friends, update their Facebook status, leave comments around the web, click on ads, and
so forth. Machines, too, are generating and keeping more and more data.
The exponential growth of data first presented challenges to cutting-edge businesses such
as Google, Yahoo, Amazon, and Microsoft. They needed to go through terabytes and
petabytes of data to figure out which websites were popular, what books were in demand, and
what kinds of ads appealed to people. Existing tools were becoming inadequate to process
such large data sets. Google was the first to publicize MapReduce, a system they had used to
scale their data processing needs.
This system aroused a lot of interest because many other businesses were facing similar
scaling challenges, and it wasn't feasible for everyone to reinvent their own proprietary tool.
Doug Cutting saw an opportunity and led the charge to develop an open source version of this
MapReduce system called Hadoop. Soon after, Yahoo and others rallied around to support
this effort. Today, Hadoop is a core part of the computing infrastructure for many web
companies, such as Yahoo, Facebook, LinkedIn, and Twitter.
Hadoop is an open source framework for writing and running distributed applications that
process large amounts of data. Distributed computing is a wide and varied field, but the key
distinctions of Hadoop are that it is:
Accessible: Hadoop runs on large clusters of commodity machines or on cloud
computing services such as Amazon's Elastic Compute Cloud (EC2).
Robust: Because it is intended to run on commodity hardware, Hadoop is
architected with the assumption of frequent hardware malfunctions. It can gracefully handle
most such failures.
Scalable: Hadoop scales linearly to handle larger data by adding more nodes to the
cluster.
Simple: Hadoop allows users to quickly write efficient parallel code.

Hadoop's accessibility and simplicity give it an edge over writing and running large
distributed programs. Even college students can quickly and cheaply create their own Hadoop
cluster. On the other hand, its robustness and scalability make it suitable for even the most
demanding jobs at Yahoo and Facebook.

CHAPTER 2:

HISTORY OF HADOOP

Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text
search library. Hadoop has its origins in Apache Nutch, an open source web search engine,
itself a part of the Lucene project.

2.1 The Origin of the Name Hadoop:

The name Hadoop is not an acronym; it's a made-up name. The project's creator, Doug
Cutting, explains how the name came about:

"The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and
pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are
good at generating such. Googol is a kid's term."

Subprojects and contrib modules in Hadoop also tend to have names that are unrelated
to their function, often with an elephant or other animal theme (Pig, for example). Smaller
components are given more descriptive (and therefore more mundane) names. This is a good
principle, as it means you can generally work out what something does from its name. For
example, the jobtracker keeps track of MapReduce jobs.

Building a web search engine from scratch was an ambitious goal, for not only is the
software required to crawl and index websites complex to write, but it is also a challenge to
run without a dedicated operations team, since there are so many moving parts. It's expensive
too: Mike Cafarella and Doug Cutting estimated a system supporting a 1-billion-page index
would cost around half a million dollars in hardware, with a monthly running cost of $30,000.
Nevertheless, they believed it was a worthy goal, as it would open up and ultimately
democratize search engine algorithms. Nutch was started in 2002, and a working crawler and
search system quickly emerged.
However, they realized that their architecture wouldn't scale to the billions of pages on the
Web. Help was at hand with the publication of a paper in 2003 that described the architecture
of Google's distributed filesystem, called GFS, which was being used in production at
Google. GFS, or something like it, would solve their storage needs for the very large files
generated as a part of the web crawl and indexing process. In particular, GFS would free up
time being spent on administrative tasks such as managing storage nodes. In 2004, they set
about writing an open source implementation, the Nutch Distributed Filesystem (NDFS).
In 2004, Google published the paper that introduced MapReduce to the world. Early in
2005, the Nutch developers had a working MapReduce implementation in Nutch, and by the
middle of that year all the major Nutch algorithms had been ported to run using MapReduce
and NDFS. NDFS and the MapReduce implementation in Nutch were applicable beyond the
realm of search, and in February 2006 they moved out of Nutch to form an independent
subproject of Lucene called Hadoop. At around the same time, Doug Cutting joined Yahoo!,
which provided a dedicated team and the resources to turn Hadoop into a system that ran at
web scale. This was demonstrated in February 2008 when Yahoo! announced
that its production search index was being generated by a 10,000-core Hadoop cluster.

In January 2008, Hadoop was made its own top-level project at Apache, confirming its
success and its diverse, active community. By this time, Hadoop was being used by many
other companies besides Yahoo!, such as Last.fm, Facebook, and the New York Times.

CHAPTER 3:
KEY TECHNOLOGY

The key technologies behind Hadoop are the MapReduce programming model and the Hadoop
Distributed File System (HDFS). Operating on very large data sets is not practical in a serial
programming paradigm. MapReduce runs tasks in parallel to finish the work in less time, which
is the main aim of this technology. MapReduce also needs a suitable file system: in real
scenarios the data is measured in petabytes. To store and maintain this much data on
distributed commodity hardware, the Hadoop Distributed File System was created. It is
inspired by the Google File System.

3.1 MapReduce
MapReduce is a framework for processing highly distributable problems across huge
datasets using a large number of computers (nodes), collectively referred to as a cluster (if all
nodes use the same hardware) or a grid (if the nodes use different hardware). Computational
processing can occur on data stored either in a filesystem (unstructured) or in a database
(structured).

Figure 1. MapReduce Programming Model

"Map" step: The master node takes the input, partitions it up into smaller sub-problems, and
distributes them to worker nodes. A worker node may do this again in turn, leading to a
multi-level tree structure. The worker node processes the smaller problem, and passes the answer
back to its master node.
"Reduce" step: The master node then collects the answers to all the sub-problems and
combines them in some way to form the output, the answer to the problem it was originally
trying to solve.
MapReduce allows for distributed processing of the map and reduction operations.
Provided each mapping operation is independent of the others, all maps can be performed in
parallel, though in practice it is limited by the number of independent data sources and/or
the number of CPUs near each source. Similarly, a set of 'reducers' can perform the reduction
phase, provided all outputs of the map operation that share the same key are presented to the
same reducer at the same time. While this process can often appear inefficient compared to
algorithms that are more sequential, MapReduce can be applied to significantly larger
datasets than "commodity" servers can handle: a large server farm can use MapReduce to
sort a petabyte of data in only a few hours. The parallelism also offers some possibility of
recovering from partial failure of servers or storage during the operation: if one mapper or
reducer fails, the work can be rescheduled, assuming the input data is still available.
MapReduce is a programming model and an associated implementation for processing
and generating large data sets. Users specify a map function that processes a key/value pair to
generate a set of intermediate key/value pairs, and a reduce function that merges all
intermediate values associated with the same intermediate key. Many real world tasks are
expressible in this model.

3.2 PROGRAMMING MODEL

The computation takes a set of input key/value pairs, and produces a set of output
key/value pairs. The user of the MapReduce library expresses the computation as two
functions: Map and Reduce. Map, written by the user, takes an input pair and produces a set
of intermediate key/value pairs. The MapReduce library groups together all intermediate
values associated with the same intermediate key I and passes them to the Reduce function.
The Reduce function, also written by the user, accepts an intermediate key I and a set of
values for that key. It merges together these values to form a possibly smaller set of values.
Typically just zero or one output value is produced per Reduce invocation. The intermediate
values are supplied to the user's reduce function via an iterator. This allows us to handle lists
of values that are too large to fit in memory.
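
The model described above can be sketched in a few lines of Python. This is only a single-process illustration, not Hadoop's API: the names run_mapreduce, word_count_map and word_count_reduce and the two sample documents are assumptions made up for this example. It shows the three steps in order: map each input pair, group intermediate values by key, and pass each key to the reduce function with an iterator over its values.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process stand-in for the map, group-by-key and reduce phases."""
    # Map phase: every input pair may emit any number of intermediate pairs.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)

    # Reduce phase: each key is handed to reduce_fn with an iterator of its values.
    output = []
    for out_key in sorted(groups):
        output.extend(reduce_fn(out_key, iter(groups[out_key])))
    return output


def word_count_map(doc_name, text):
    # Emit (word, 1) for every word in the document.
    for word in text.split():
        yield word.lower(), 1


def word_count_reduce(word, counts):
    # Merge all counts for one word into a single output pair.
    yield word, sum(counts)


# Hypothetical input: (document name, document text) pairs.
documents = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog and the fox")]
word_counts = run_mapreduce(documents, word_count_map, word_count_reduce)
print(word_counts)
```

A real framework runs the map calls on many machines and moves the grouped values across the network; the sketch keeps only the data flow.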

3.3 MAP
map (in_key, in_value) -> (out_key, intermediate_value) list

Figure 2. Map Technology

Example: Upper-case Mapper

let map(k, v) = emit(k.toUpper(), v.toUpper())

(foo, bar) --> (FOO, BAR)
(Foo, other) --> (FOO, OTHER)
(key2, data) --> (KEY2, DATA)

3.4 REDUCE

reduce (out_key, intermediate_value list) -> out_value list

Figure 3. Reducing Technology

Example: Sum Reducer

let reduce(k, vals) =
    sum = 0
    foreach int v in vals:
        sum += v
    emit(k, sum)

(A, [42, 100, 312]) --> (A, 454)
(B, [12, 6, -2]) --> (B, 16)
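
For readers who want to run the two examples, here is a minimal Python rendering of the upper-case mapper and the sum reducer. The function names upper_case_map and sum_reduce are chosen for this sketch, and Python generators stand in for the emit call of the pseudocode.

```python
def upper_case_map(key, value):
    """Mapper: emit the key and value converted to upper case."""
    yield key.upper(), value.upper()


def sum_reduce(key, values):
    """Reducer: emit one pair holding the sum of all values for the key."""
    total = 0
    for v in values:
        total += v
    yield key, total


# The same inputs as in the examples of sections 3.3 and 3.4.
map_inputs = [("foo", "bar"), ("Foo", "other"), ("key2", "data")]
reduce_inputs = [("A", [42, 100, 312]), ("B", [12, 6, -2])]

mapped = [pair for k, v in map_inputs for pair in upper_case_map(k, v)]
reduced = [pair for k, vals in reduce_inputs for pair in sum_reduce(k, vals)]

print(mapped)   # [('FOO', 'BAR'), ('FOO', 'OTHER'), ('KEY2', 'DATA')]
print(reduced)  # [('A', 454), ('B', 16)]
```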

Hadoop Map-Reduce is a software framework for easily writing applications which
process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands
of nodes) of commodity hardware in a reliable, fault-tolerant manner.
A Map-Reduce job usually splits the input data-set into independent chunks which are
processed by the map tasks in a completely parallel manner. The framework sorts the outputs
of the maps, which are then input to the reduce tasks. Typically both the input and the output
of the job are stored in a file-system. The framework takes care of scheduling tasks,
monitoring them and re-executing the failed tasks.
A MapReduce job is a unit of work that the client wants to be performed: it consists of the
input data, the MapReduce program, and configuration information. Hadoop runs the job by
dividing it into tasks, of which there are two types: map tasks and reduce tasks. There are two
types of nodes that control the job execution process: a jobtracker and a number of
tasktrackers. The jobtracker coordinates all the jobs run on the system by scheduling tasks to
run on tasktrackers.

Figure 4. Hadoop MapReduce

3.5 HDFS (Hadoop Distributed File System)

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run
on commodity hardware. It has many similarities with existing distributed file systems.
However, the differences from other distributed file systems are significant. HDFS is highly
fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high
throughput access to application data and is suitable for applications that have large data sets.
HDFS relaxes a few POSIX requirements to enable streaming access to file system data.
HDFS was originally built as infrastructure for the Apache Nutch web search engine project.
HDFS is now an Apache Hadoop subproject.

Figure 5. HDFS Architecture

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a
master server that manages the file system namespace and regulates access to files by clients.
In addition, there are a number of DataNodes, usually one per node in the cluster, which
manage storage attached to the nodes that they run on. HDFS exposes a file system
namespace and allows user data to be stored in files. Internally, a file is split into one or more
blocks and these blocks are stored in a set of DataNodes. The NameNode executes file
system namespace operations like opening, closing, and renaming files and directories. It also
determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving
read and write requests from the file system's clients. The DataNodes also perform block
creation, deletion, and replication upon instruction from the NameNode.
The NameNode and DataNode are pieces of software designed to run on commodity
machines. These machines typically run a GNU/Linux operating system (OS). HDFS is built
using the Java language; any machine that supports Java can run the NameNode or the
DataNode software. Usage of the highly portable Java language means that HDFS can be
deployed on a wide range of machines. A typical deployment has a dedicated machine that
runs only the NameNode software. Each of the other machines in the cluster runs one
instance of the DataNode software. The architecture does not preclude running multiple
DataNodes on the same machine but in a real deployment that is rarely the case.
The existence of a single NameNode in a cluster greatly simplifies the architecture of the
system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is
designed in such a way that user data never flows through the NameNode.

Filesystems that manage the storage across a network of machines are called
distributed filesystems. Since they are network-based, all the complications of network
programming kick in, thus making distributed filesystems more complex than regular disk
filesystems. For example, one of the biggest challenges is making the filesystem tolerate node
failure without suffering data loss. Hadoop comes with a distributed filesystem called HDFS,
which stands for Hadoop Distributed Filesystem.

HDFS, the Hadoop Distributed File System, is a distributed file system designed to
hold very large amounts of data (terabytes or even petabytes), and provide high-
throughput access to this information. Files are stored in a redundant fashion across
multiple machines to ensure their durability to failure and high availability to very parallel
applications.

3.6 HDFS CONCEPTS

a) Blocks
A disk has a block size, which is the minimum amount of data that it can read or write.
Filesystems for a single disk build on this by dealing with data in blocks, which are an
integral multiple of the disk block size. Filesystem blocks are typically a few kilobytes in
size, while disk blocks are normally 512 bytes. This is generally transparent to the filesystem
user, who is simply reading or writing a file of whatever length. However, there are tools for
filesystem maintenance, such as df and fsck, that operate on the filesystem block
level. HDFS too has the concept of a block, but it is a much larger unit: 64 MB by default.
As in a filesystem for a single disk, files in HDFS are broken into block-sized chunks,
which are stored as independent units. Unlike a filesystem for a single disk, a file in HDFS
that is smaller than a single block does not occupy a full block's worth of underlying storage.
When unqualified, the term "block" in this report refers to a block in HDFS.
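
The block arithmetic can be checked with a short Python sketch. The helper split_into_blocks and the 200 MB and 1 MB file sizes are assumptions made for illustration; only the 64 MB default comes from the text above. The last block holds just the remaining bytes, which is why a small file does not consume a whole block of storage.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=64 * MB):
    """Return the sizes (in bytes) of the blocks a file of file_size bytes occupies."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The final block stores only the bytes that are left over.
        sizes.append(remainder)
    return sizes


large_file_blocks = split_into_blocks(200 * MB)
small_file_blocks = split_into_blocks(1 * MB)

print([size // MB for size in large_file_blocks])  # [64, 64, 64, 8]
print(sum(small_file_blocks) // MB)                # 1
```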

b) Namenodes and Datanodes

An HDFS cluster has two types of node operating in a master-worker pattern: a
namenode (the master) and a number of datanodes (workers). The namenode manages the
filesystem namespace. It maintains the filesystem tree and the metadata for all the files and
directories in the tree. This information is stored persistently on the local disk in the form of
two files: the namespace image and the edit log. The namenode also knows the datanodes on
which all the blocks for a given file are located; however, it does not store block locations
persistently, since this information is reconstructed from datanodes when the system starts. A
client accesses the filesystem on behalf of the user by communicating with the namenode and
datanodes.

Figure 6. HDFS Architecture

The client presents a POSIX-like filesystem interface, so the user code does not need to know
about the namenode and datanode to function. Datanodes are the workhorses of the
filesystem. They store and retrieve blocks when they are told to (by clients or the namenode),
and they report back to the namenode periodically with lists of blocks that they are storing.
Without the namenode, the filesystem cannot be used. In fact, if the machine running the
namenode were obliterated, all the files on the filesystem would be lost since there would be
no way of knowing how to reconstruct the files from the blocks on the datanodes. For this
reason, it is important to make the namenode resilient to failure.
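
The division of knowledge between namenode and datanodes can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not HDFS code: the block IDs, datanode names, file path and the helpers build_block_map and locate_file are all hypothetical. The namespace (file to block list) plays the role of the persistent metadata, while the block locations are rebuilt from the lists each datanode reports.

```python
from collections import defaultdict


def build_block_map(block_reports):
    """Rebuild the block -> datanodes mapping from the lists each datanode reports."""
    locations = defaultdict(set)
    for datanode, block_ids in block_reports.items():
        for block_id in block_ids:
            locations[block_id].add(datanode)
    return dict(locations)


def locate_file(file_blocks, block_map):
    """Return, for each block of a file in order, the sorted datanodes that hold it."""
    return [(block_id, sorted(block_map.get(block_id, ()))) for block_id in file_blocks]


# Persistent metadata kept by the namenode: which blocks make up which file.
namespace = {"/data/Crime2013.csv": ["blk_1", "blk_2"]}

# Reports sent by the datanodes; this part is never written to the namenode's disk.
reports = {"dn1": ["blk_1", "blk_2"], "dn2": ["blk_1"], "dn3": ["blk_2"]}

block_map = build_block_map(reports)
file_locations = locate_file(namespace["/data/Crime2013.csv"], block_map)
print(file_locations)
# [('blk_1', ['dn1', 'dn2']), ('blk_2', ['dn1', 'dn3'])]
```

If the namespace dictionary is lost, the block reports alone cannot say which blocks belong to which file, which is the failure scenario described above.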

c) The File System Namespace

HDFS supports a traditional hierarchical file organization. A user or an application can create
directories and store files inside these directories. The file system namespace hierarchy is
similar to most other existing file systems; one can create and remove files, move a file from
one directory to another, or rename a file. HDFS does not yet implement user quotas or
access permissions. HDFS does not support hard links or soft links. However, the HDFS
architecture does not preclude implementing these features.

d) Data Replication
HDFS is designed to reliably store very large files across machines in a large cluster. It stores
each file as a sequence of blocks; all blocks in a file except the last block are the same size.
The blocks of a file are replicated for fault tolerance. The block size and replication factor are
configurable per file. An application can specify the number of replicas of a file. The
replication factor can be specified at file creation time and can be changed later.

Data replication
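
The effect of the replication factor can be sketched in Python. This is a deliberately simplified illustration: the round-robin placement below is an assumption made for this example and is not HDFS's actual placement policy, and the datanode names, block IDs and the factor of 3 are example values. The point is that every block ends up on several distinct datanodes and that raw storage grows with the replication factor.

```python
MB = 1024 * 1024


def place_replicas(block_ids, datanodes, replication=3):
    """Assign each block to `replication` distinct datanodes, round-robin (simplified)."""
    if replication > len(datanodes):
        raise ValueError("not enough datanodes for the requested replication factor")
    placement = {}
    for i, block_id in enumerate(block_ids):
        placement[block_id] = [
            datanodes[(i + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


def raw_storage(file_size, replication):
    """Bytes consumed across the cluster when every block is stored `replication` times."""
    return file_size * replication


placement = place_replicas(["blk_1", "blk_2"], ["dn1", "dn2", "dn3", "dn4"], replication=3)
print(placement)
# {'blk_1': ['dn1', 'dn2', 'dn3'], 'blk_2': ['dn2', 'dn3', 'dn4']}
print(raw_storage(200 * MB, 3) // MB)  # 600
```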

3.7 HADOOP FILESYSTEMS

Hadoop has an abstract notion of filesystem, of which HDFS is just one implementation. The
Java abstract class org.apache.hadoop.fs.FileSystem represents a filesystem in Hadoop, and
there are several concrete implementations, which are described in the following table.

| Filesystem | URI scheme | Java implementation (all under org.apache.hadoop) | Description |
|---|---|---|---|
| Local | file | fs.LocalFileSystem | A filesystem for a locally connected disk with client-side checksums. Use RawLocalFileSystem for a local filesystem with no checksums. |
| HDFS | hdfs | hdfs.DistributedFileSystem | Hadoop's distributed filesystem. HDFS is designed to work efficiently in conjunction with Map-Reduce. |
| HAR | har | fs.HarFileSystem | A filesystem layered on another filesystem for archiving files. Hadoop Archives are typically used for archiving files in HDFS to reduce the namenode's memory usage. |
| KFS (CloudStore) | kfs | fs.kfs.KosmosFileSystem | CloudStore (formerly Kosmos filesystem) is a distributed filesystem like HDFS or Google's GFS, written in C++. |
| FTP | ftp | fs.ftp.FtpFileSystem | A filesystem backed by an FTP server. |

Table 8.2: Various Hadoop Filesystems
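
The role of the URI scheme can be illustrated with a small lookup in Python. This is a sketch of the idea only, not how Hadoop itself resolves filesystems: the dictionary FILESYSTEM_CLASSES simply restates the table, while the helper implementation_for, the host name namenode:8020 and the file paths are assumptions made for the example.

```python
from urllib.parse import urlparse

# URI scheme -> Java implementation class, as listed in the table above.
FILESYSTEM_CLASSES = {
    "file": "org.apache.hadoop.fs.LocalFileSystem",
    "hdfs": "org.apache.hadoop.hdfs.DistributedFileSystem",
    "har": "org.apache.hadoop.fs.HarFileSystem",
    "kfs": "org.apache.hadoop.fs.kfs.KosmosFileSystem",
    "ftp": "org.apache.hadoop.fs.ftp.FtpFileSystem",
}


def implementation_for(uri, default_scheme="file"):
    """Pick the implementation class name from the scheme part of a URI."""
    # A plain path has no scheme, so it falls back to the default (assumed: local).
    scheme = urlparse(uri).scheme or default_scheme
    if scheme not in FILESYSTEM_CLASSES:
        raise ValueError(f"no filesystem listed for scheme {scheme!r}")
    return FILESYSTEM_CLASSES[scheme]


hdfs_class = implementation_for("hdfs://namenode:8020/user/data/Crime2013.csv")
local_class = implementation_for("/tmp/Crime2013.csv")
print(hdfs_class)   # org.apache.hadoop.hdfs.DistributedFileSystem
print(local_class)  # org.apache.hadoop.fs.LocalFileSystem
```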

3.8 HADOOP ARCHIVES

HDFS stores small files inefficiently, since each file is stored in a block, and block
metadata is held in memory by the namenode. Thus, a large number of small files can eat up
a lot of memory on the namenode. (Note, however, that small files do not take up any more
disk space than is required to store the raw contents of the file. For example, a 1 MB file
stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB.) Hadoop Archives,
or HAR files, are a file archiving facility that packs files into HDFS blocks more efficiently,
thereby reducing namenode memory usage while still allowing transparent access to files. In
particular, Hadoop Archives can be used as input to MapReduce.

CHAPTER 4
HADOOP ECOSYSTEMS

4.1 Avro

Apache Avro is a data serialization system.

Avro provides:
1. Rich data structures.

2. A compact, fast, binary data format.

3. A container file, to store persistent data.

4. Simple integration with dynamic languages. Code generation is not required to read
or write data files nor to use or implement RPC protocols. Code generation is an
optional optimization, only worth implementing for statically typed languages.

4.2 Chukwa

Chukwa is a Hadoop subproject devoted to large-scale log collection and analysis. Chukwa
is built on top of the Hadoop distributed filesystem (HDFS) and MapReduce framework and
inherits Hadoop's scalability and robustness. Chukwa also includes a flexible and powerful
toolkit for displaying, monitoring and analyzing results, in order to make the best use of this
collected data.

4.3 HBase

Just as Google's Bigtable leverages the distributed data storage provided by the Google File
System, HBase provides Bigtable-like capabilities on top of Hadoop Core.

4.4 Hive

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc
queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive
provides a mechanism to project structure onto this data and query the data using a SQL-like
language called HiveQL. At the same time this language also allows traditional map/reduce
programmers to plug in their custom mappers and reducers when it is inconvenient or
inefficient to express this logic in HiveQL.

4.5 Pig

Apache Pig is a platform for analyzing large data sets that consists of a high-level language
for expressing data analysis programs, coupled with infrastructure for evaluating these
programs. The salient property of Pig programs is that their structure is amenable to
substantial parallelization, which in turn enables them to handle very large data sets.

4.6 ZooKeeper

ZooKeeper is a centralized service for maintaining configuration information, naming,
providing distributed synchronization, and providing group services. All of these kinds of
services are used in some form or another by distributed applications. Each time they are
implemented there is a lot of work that goes into fixing the bugs and race conditions that are
inevitable. Because of the difficulty of implementing these kinds of services, applications
initially usually skimp on them, which makes them brittle in the presence of change and
difficult to manage. Even when done correctly, different implementations of these services
lead to management complexity when the applications are deployed.

4.7 Oozie

Oozie is a workflow scheduler system to manage Apache Hadoop jobs.

Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions.
Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and
data availability.
Oozie is integrated with the rest of the Hadoop stack supporting several types of Hadoop jobs
out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and
Distcp) as well as system specific jobs (such as Java programs and shell scripts).
Oozie is a scalable, reliable and extensible system.
Developers interested in getting more involved with Oozie may join the mailing lists, report
bugs, retrieve code from the version control system, and make contributions.
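
The idea of a workflow as a directed acyclic graph of actions can be sketched with Python's standard library. This is not Oozie's workflow definition format; the action names below are hypothetical and only loosely modelled on this project. Each action lists the actions it depends on, and a topological sort yields an order in which every action runs after its dependencies.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each action maps to the set of actions it depends on.
workflow = {
    "load_csv_to_hdfs": set(),
    "count_by_primary_type": {"load_csv_to_hdfs"},
    "count_by_district": {"load_csv_to_hdfs"},
    "export_results": {"count_by_primary_type", "count_by_district"},
}

# static_order() raises graphlib.CycleError if the graph is not acyclic.
execution_order = list(TopologicalSorter(workflow).static_order())
print(execution_order)
```

The two counting actions have no dependency on each other, so a scheduler is free to run them in either order or in parallel.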

CHAPTER 5
APPLICATIONS OF
HADOOP

5.1 Amazon S3

Amazon S3 (Simple Storage Service) is a data storage service. You are billed monthly for
storage and data transfer. Transfer between S3 and Amazon EC2 is free. This makes use of S3
attractive for Hadoop users who run clusters on EC2.

Hadoop provides two filesystems that use S3:

1. S3 Native FileSystem (URI scheme: s3n)

2. S3 Block FileSystem (URI scheme: s3)

5.2 Facebook

Facebook's engineering team has posted some details on the tools it's using to analyze the
huge data sets it collects. One of the main tools it uses is Hadoop, which makes it easier to
analyze vast amounts of data.
Some interesting tidbits from the post:
Facebook has multiple Hadoop clusters deployed now, with the biggest having about 2500
CPU cores and 1 petabyte of disk space. They are loading over 250 gigabytes of compressed
data (over 2 terabytes uncompressed) into the Hadoop file system every day and have
hundreds of jobs running each day against these data sets. The list of projects that are using
this infrastructure has proliferated, from those generating mundane statistics about site
usage, to others being used to fight spam and determine application quality.

"Over time, we have added classic data warehouse features like partitioning,
sampling and indexing to this environment. This in-house data warehousing
layer over Hadoop is called Hive."

5.3 Yahoo!

Yahoo! recently launched the world's largest Apache Hadoop production application.
The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster of more
than 10,000 cores and produces data that is now used in every Yahoo! Web search
query.

The Webmap build starts with every Web page crawled by Yahoo! and produces a
database of all known Web pages and sites on the internet and a vast array of data
about every page and site. This derived data feeds the Machine Learned Ranking
algorithms at the heart of Yahoo! Search.

Some Webmap size data:

Number of links between pages in the index: roughly 1 trillion links

Size of output: over 300 TB, compressed

Number of cores used to run a single Map-Reduce job: over 10,000

Raw disk used in the production cluster: over 5 petabytes

CHAPTER 6
PROJECT MODULES

The project starts with loading the data into HDFS (Hadoop Distributed File System).

Loading the Crime2013.csv file.

The structure of the Crime2013.csv file is ...

The Crime2013.csv file contains around 177,000 records of the crimes that happened in
Chicago in 2013.

The analysis is organized around the problem statements listed below.

Problem Statement 1: The most frequently occurring primary type (e.g. theft, narcotics, etc.)

Pig Script:

Run This Script: pig c1.pig
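
As a rough cross-check of what this problem statement computes (this is not the Pig script c1.pig), the same group-and-count can be sketched in Python on a handful of rows. Everything concrete here is an assumption: the column names, the header row and the sample records are made up for illustration and do not come from Crime2013.csv.

```python
import csv
import io
from collections import Counter

# Made-up sample rows; the real file's columns and values may differ.
SAMPLE_CSV = """ID,Primary Type,District,Block,Date,Arrest
1,THEFT,011,001XX N STATE ST,01/15/2013 02:30:00 PM,false
2,NARCOTICS,011,011XX S PULASKI RD,01/15/2013 02:45:00 PM,true
3,THEFT,018,008XX N MICHIGAN AVE,01/16/2013 09:00:00 AM,false
4,BATTERY,007,063XX S HALSTED ST,01/16/2013 11:05:00 PM,true
5,THEFT,018,001XX N STATE ST,01/17/2013 01:10:00 PM,false
"""


def count_by(rows, column):
    """Group the rows by one column and count the rows in each group."""
    return Counter(row[column] for row in rows)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
primary_type_counts = count_by(rows, "Primary Type")

# Most frequent primary type first.
print(primary_type_counts.most_common())
```

Passing "District" or "Block" instead of "Primary Type" gives the counts asked for in the next two problem statements.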

Problem Statement 2: Districts with the most reported incidents.

Pig Script:

Run This Script: pig c2.pig

Problem Statement 3: Blocks with the most reported incidents.

Pig Script:

Run This Script: pig c3.pig

Problem Statement 4: Blocks with the most reported incidents, grouped by primary type.

Pig Script:

Run This Script: pig c4.pig
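
Grouping on two keys can be illustrated with a short Python sketch (again, not the Pig script c4.pig). The records below are invented (primary type, block) pairs and the helper top_block_per_type is a name chosen for this example. Incidents are first grouped by primary type, then counted per block inside each group.

```python
from collections import Counter, defaultdict

# Made-up (primary type, block) pairs standing in for rows of the data set.
incidents = [
    ("THEFT", "001XX N STATE ST"),
    ("THEFT", "001XX N STATE ST"),
    ("THEFT", "008XX N MICHIGAN AVE"),
    ("NARCOTICS", "011XX S PULASKI RD"),
    ("NARCOTICS", "011XX S PULASKI RD"),
    ("BATTERY", "001XX N STATE ST"),
]


def top_block_per_type(records):
    """For each primary type, return the block with the most incidents and its count."""
    per_type = defaultdict(Counter)
    for primary_type, block in records:
        per_type[primary_type][block] += 1
    return {ptype: blocks.most_common(1)[0] for ptype, blocks in per_type.items()}


top_blocks = top_block_per_type(incidents)
for ptype, (block, count) in sorted(top_blocks.items()):
    print(ptype, block, count)
```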

Problem Statement 5: A look at the date and time when the highest
number of incidents were reported.

Pig Script:

Run This Script: pig c5.pig
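
The date-and-time analysis amounts to bucketing timestamps and counting each bucket. The Python sketch below (not the Pig script c5.pig) buckets by calendar hour. The timestamp format string and the sample timestamps are assumptions for illustration; the format actually used in Crime2013.csv may differ.

```python
from collections import Counter
from datetime import datetime

# Assumed timestamp layout, e.g. "01/15/2013 02:30:00 PM".
TIMESTAMP_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def busiest_hours(timestamps, top=1):
    """Count incidents per calendar hour and return the `top` busiest hours."""
    hours = Counter(
        datetime.strptime(ts, TIMESTAMP_FORMAT).replace(minute=0, second=0)
        for ts in timestamps
    )
    return hours.most_common(top)


# Made-up timestamps standing in for the Date column.
sample_timestamps = [
    "01/15/2013 02:30:00 PM",
    "01/15/2013 02:45:00 PM",
    "01/15/2013 11:05:00 PM",
    "03/02/2013 09:00:00 AM",
]

busiest = busiest_hours(sample_timestamps)
print(busiest)
```

Filtering the input to rows where an arrest was made before counting gives the corresponding view for arrests.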

Problem Statement 6: Arrests by primary type.

Pig Script:

Run This Script: pig c6.pig

Problem Statement 7: Arrests by district.

Pig Script:

Run This Script: pig c7.pig

Problem Statement 8: A look at the date and time when the highest
number of arrests took place.

Pig Script:

Run This Script: pig c8.pig

CHAPTER 7
VISUALIZATION USING R PROGRAMMING LANGUAGE

INTRODUCTION TO R:

R is a programming language and software environment for statistical analysis, graphical
representation and reporting. R was created by Ross Ihaka and Robert Gentleman at the
University of Auckland, New Zealand, and is currently developed by the R Development
Core Team.

The core of R is an interpreted computer language which allows branching and looping as
well as modular programming using functions. R allows integration with the procedures
written in the C, C++, .Net, Python or FORTRAN languages for efficiency.

R is freely available under the GNU General Public License, and pre-compiled binary
versions are provided for various operating systems like Linux, Windows and Mac.

R is free software distributed under a GNU-style copyleft, and is an official part of the GNU
project, where it is referred to as GNU S.

Evolution of R
R was initially written by Ross Ihaka and Robert Gentleman at the Department of
Statistics of the University of Auckland in Auckland, New Zealand. R made its first
appearance in 1993.

A large group of individuals has contributed to R by sending code and bug reports.

Since mid-1997 there has been a core group (the "R Core Team") who can modify the R source
code archive.

FEATURES OF R
As stated earlier, R is a programming language and software environment for statistical
analysis, graphical representation and reporting. The following are the important features of
R:

R is a well-developed, simple and effective programming language which
includes conditionals, loops, user-defined recursive functions and input and
output facilities.

R has an effective data handling and storage facility.

R provides a suite of operators for calculations on arrays, lists, vectors and
matrices.

R provides a large, coherent and integrated collection of tools for data
analysis.

R provides graphical facilities for data analysis and display, either directly on
screen or printed on paper.

In conclusion, R is one of the world's most widely used statistical programming languages. It is
a popular choice among data scientists and is supported by a vibrant and talented community of
contributors. R is taught in universities and deployed in mission-critical business applications.

Visualization 1: Number of Crimes by Type.

Visualization 2: Crime by Time Every 6 Hours.
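
The plot for this visualization is not reproduced in this text. The data preparation behind a "every 6 hours" chart is a binning step: each incident's hour of day is mapped to one of four 6-hour windows, and the windows are counted. The Python sketch below shows that binning only. The bin labels and sample hours are assumptions for illustration, and the original work was done in R, so this is not the author's code.

```python
from collections import Counter

# Assumed labels for the four 6-hour windows of a day.
BIN_LABELS = ["00-06", "06-12", "12-18", "18-24"]


def six_hour_bin(hour):
    """Map an hour of day (0..23) to its 6-hour window label."""
    if not 0 <= hour <= 23:
        raise ValueError(f"hour out of range: {hour}")
    return BIN_LABELS[hour // 6]


# Hypothetical hours of day at which incidents were reported.
hours = [0, 5, 6, 13, 17, 18, 23, 23]

bin_counts = Counter(six_hour_bin(h) for h in hours)

# Keep the bins in chronological order so a bar chart reads left to right.
ordered = [(label, bin_counts.get(label, 0)) for label in BIN_LABELS]
print(ordered)
```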

Visualization 3: Crime by Week.

Visualization 4: Crime by Month.

Visualization 5: Heat Map of Crime by Time Every 6 Hours.
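
A heat map is drawn from a two-dimensional table of counts. The sketch below builds such a table in Python, assuming (this is a guess about the lost figure) that rows are crime types and columns are 6-hour windows. The events are invented samples, and the original plot was produced in R, so this only illustrates the shape of the data a heat map needs.

```python
from collections import defaultdict

# Assumed column order: the four 6-hour windows of a day.
TIME_BINS = ["00-06", "06-12", "12-18", "18-24"]


def heatmap_matrix(events, row_order, col_order):
    """Build a rows x cols matrix of counts from (row_key, col_key) events."""
    counts = defaultdict(int)
    for row_key, col_key in events:
        counts[(row_key, col_key)] += 1
    # Cells with no events default to 0 so the matrix is fully populated.
    return [[counts[(r, c)] for c in col_order] for r in row_order]


# Hypothetical (crime type, time window) events, for illustration only.
events = [
    ("THEFT", "12-18"),
    ("THEFT", "12-18"),
    ("THEFT", "18-24"),
    ("NARCOTICS", "18-24"),
    ("BATTERY", "00-06"),
]
crime_types = ["BATTERY", "NARCOTICS", "THEFT"]

matrix = heatmap_matrix(events, crime_types, TIME_BINS)
for crime_type, row in zip(crime_types, matrix):
    print(crime_type, row)
```

Swapping the column keys for weekdays or months would give the input for the weekly and monthly heat maps in the same way.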

Visualization 6: Heat Map for Crime by Week.

Visualization 7: Heat Map of Crime by Month.

CHAPTER 8:
CONCLUSION

From the above description we can understand the growing need for Big Data
processing in the future, and Hadoop can be a good choice for maintaining and
efficiently processing large data.

This technology has a bright future because the need for data grows day by
day, and security is also a major concern. Nowadays many multinational
organizations prefer Hadoop over RDBMS.

Major companies such as Facebook, Amazon, Yahoo and LinkedIn are adopting
Hadoop, and in the future many more names may join the list.

Hence Hadoop is a well-suited approach for handling large data in a smart
way, and its future is bright.

Future work:

1. This analysis can be carried out further on a Hadoop cluster running in fully distributed mode.

2. A similar analysis can be carried out in other sectors.

REFERENCES:

http://www.cloudera.com/hadoop-training-thinking-at-scale

http://developer.yahoo.com/hadoop/tutorial/module1.html

http://hadoop.apache.org/core/docs/current/api/

http://hadoop.apache.org/core/version_control.html

http://wikipidea.in/apachehadoop.com
