big-data-analytics

The document provides an introduction to Big Data, its characteristics, and the significance of Hadoop in processing large datasets. It discusses the three Vs of Big Data: Volume, Velocity, and Variety, and contrasts traditional data approaches with Big Data methodologies. Additionally, it outlines the core components of Hadoop, including MapReduce and HDFS, and their roles in managing and analyzing vast amounts of data.

Table of Contents

1. Introduction to Big Data
   1.1 Introduction to Big Data
   1.2 Big Data Characteristics
   1.3 Big Data Sources
   1.4 Traditional vs. Big Data Business Approach

2. Introduction to Hadoop
   2.1 What is Hadoop?
   2.2 Why Use Hadoop?
   2.3 Core Hadoop Components

3. NoSQL
   3.1 What is NoSQL?
   3.2 NoSQL is not
CHAPTER-1
Introduction to Big Data
1.1 Introduction to Big Data
1. Big Data is one of the most talked-about technology trends today.
2. The real challenge for large organizations is to get the maximum value out of the data they already have and to predict what kind of data to collect in the future.
3. How to take existing data and make it meaningful, so that it provides accurate insight into past activity, is a key discussion point in many executive meetings in organizations.
4. With the explosion of data, the challenge has moved to the next level, and Big Data is now becoming a reality in many organizations.
5. The goal of every organization and expert is the same: to get the maximum out of the data. The route and the starting point, however, differ for each organization and expert.
6. As organizations evaluate and architect Big Data solutions, they also learn about the approaches and opportunities related to Big Data.
7. There is no single solution to Big Data, and no single vendor can claim to know everything about Big Data.
8. Big Data is a very broad concept, and there are many players – different architectures, different vendors and different technologies.
9. The three Vs of Big data are Velocity, Volume and Variety.

Figure 1.1: Big Data Sphere

Figure 1.2: Big Data – Transactions, Interactions, Observations

1.2 Big Data Characteristics


1. The three Vs of Big data are Velocity, Volume and Variety

Figure 1.2: Characteristics of Big Data

i. Volume
 Data storage has grown exponentially, because data is now much more than text data.
 Data can be found in the form of videos, music and large images on our social media channels.
 It is very common for enterprises to have Terabytes or even Petabytes of storage.
 As the database grows, the applications and the architecture built to support the data need to be re-evaluated quite often.
 Sometimes the same data is re-evaluated from multiple angles; even though the original data is the same, the newly found intelligence creates an explosion of data.
 This big volume is what represents Big Data.

ii. Velocity
 Data growth and the social media explosion have changed how we look at data.
 There was a time when we believed that yesterday's data was recent.
 As a matter of fact, newspapers still follow that logic.
 However, news channels and radio have changed how fast we receive the news.
 Today, people rely on social media to keep them updated with the latest happenings. On social media, a message that is only a few seconds old (a tweet, a status update, etc.) is often no longer of interest to users.
 They discard old messages and pay attention to recent updates. Data movement is now almost real time, and the update window has been reduced to fractions of a second.

iii. Variety
 Data can be stored in multiple formats: for example, a database, Excel, CSV or Access file, or, for that matter, a simple text file.
 Sometimes the data is not even in a format we would consider traditional; it may be in the form of video, SMS, PDF or something we have not thought about. It is the organization's task to arrange it and make it meaningful.
 This would be easy if all data were in the same format, but that is rarely the case. The real world has data in many different formats, and that is the challenge we need to overcome with Big Data. This variety of data represents Big Data.

Figure 1.4: Volume, Velocity, Variety

1.3 Big Data Sources
1. Hadoop
 Apache Hadoop is one of the main supporting elements of Big Data technologies. It simplifies the processing of large amounts of structured or unstructured data in an inexpensive manner.
 Hadoop is an open source project from Apache that has been improving continuously over the years.
 "Hadoop is basically a set of software libraries and frameworks to manage and process big amounts of data from a single server to thousands of machines. It provides an efficient and powerful error detection mechanism based on the application layer rather than relying upon hardware."
 In December 2011 Apache released Hadoop 1.0.0; more information and an installation guide can be found in the Apache Hadoop documentation. Hadoop is not a single project but includes a number of other technologies in it.

2. HDFS (Hadoop Distributed File System)


 HDFS is a Java-based file system used to store structured or unstructured data over large clusters of distributed servers.
 The data stored in HDFS has no restriction or rule applied to it; the data can be either fully unstructured or purely structured.
 In HDFS, the work of making data meaningful is done by the developer's code only.
 The Hadoop Distributed File System provides a highly fault-tolerant environment and is designed for deployment on low-cost hardware.
 HDFS is a part of the Apache Hadoop project; more information and an installation guide can be found in the Apache HDFS documentation.
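
To illustrate how a developer's code works with HDFS, here is a minimal sketch using the Hadoop FileSystem Java API to write a small file and read it back. The NameNode address (namenode-host:8020) and the file path are placeholder assumptions, and the hadoop-client library is assumed to be on the classpath.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS points at the cluster's NameNode; this address is a placeholder.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");
        FileSystem fs = FileSystem.get(conf);

        // Write a small file; HDFS splits and replicates blocks behind this API.
        Path file = new Path("/tmp/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read the same file back through the same abstraction.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
        fs.close();
    }
}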

3. Hive
 Hive was originally developed by Facebook and has since been made open source.
 Hive works as a bridge between SQL and Hadoop; it is basically used to run SQL-like queries on Hadoop clusters.
 Apache Hive is essentially a data warehouse that provides ad-hoc querying, data summarization and analysis of huge data sets stored in Hadoop-compatible file systems.
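
As a small illustration of this SQL-to-Hadoop bridge, the sketch below submits a HiveQL query from Java through the HiveServer2 JDBC interface. The host, port, table name (web_logs) and blank credentials are illustrative assumptions, and the hive-jdbc driver is assumed to be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Load the HiveServer2 JDBC driver (shipped with the hive-jdbc artifact).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://hiveserver-host:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hadoop", "");
             Statement stmt = conn.createStatement();
             // HiveQL looks like SQL; Hive turns it into jobs over data in HDFS.
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}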

1.4 Traditional vs. Big Data Business Approach

1. Schema-less and Column-oriented Databases (NoSQL)


i. We have been using table- and row-based relational databases over the years; these databases are just fine for online transactions and quick updates.
ii. When unstructured and large amounts of data come into the picture, we need databases without a hard-coded schema attached.
iii. There are a number of databases that fit into this category; these databases can store unstructured, semi-structured or even fully structured data.
iv. Apart from other benefits, the finest thing about schema-less databases is that they make data migration very easy.
v. MongoDB is a very popular and widely used NoSQL database these days.
vi. NoSQL and schema-less databases are used when the primary concern is to store and process large volumes of varied data.

"NoSQL (not only SQL) is a type of database that does not primarily rely upon a schema-based structure and does not use SQL for data processing."

Figure 1.6: Big Data Architecture


vii. The traditional approach works on structured data that has a basic layout and a provided structure.

viii. The structured approach designs the database as per the requirements, in tuples and columns.
ix. Working on live incoming data, which can be input from an ever-changing scenario, cannot be handled by the traditional approach.
x. The Big Data approach is iterative.

Figure 1.8: Streaming Data
xi. Big Data analytics works on unstructured data, where no specific pattern of the data is defined.
xii. The data is not organized in rows and columns.
xiii. The live flow of data is captured and the analysis is done on it.
xiv. Efficiency increases when the data to be analyzed is large.

Figure 1.9: Big Data Architecture

Case Study of Big Data Solutions

Figure 1.10: Big Data Infrastructure

1. The image above gives a good overview of how, in a Big Data infrastructure, the various components are associated with each other.
2. In Big Data, many different data sources are part of the architecture; hence extraction, transformation and integration form one of the most essential layers of the architecture.
3. Most of the data is stored in relational as well as non-relational data marts and data warehousing solutions.
4. As per the business need, the various data are processed and converted into the proper format.
5. Just like the software, the hardware is almost the most important part of the Big Data infrastructure.
6. In a Big Data architecture the hardware infrastructure is extremely important, and failover instances as well as redundant physical infrastructure are usually implemented.
Life cycle of Data

Figure 1.11: Life Cycle of Data


i. The life cycle of the data is as shown in Figure 1.11.
ii. The analysis of data is done by knowledge experts, and their expertise is applied to the development of an application.
iii. After the analysis and the application, the data is streamed and a data log is created for the acquisition of data.
iv. The data is mapped and clustered together in the data log.
v. The clustered data from the data acquisition stage is then aggregated by applying various aggregation algorithms.
vi. The integrated data again goes through analysis.
vii. These steps are repeated until the desired and expected output is produced.

CHAPTER-2
INTRODUCTION TO HADOOP

2.1 What is Hadoop?


1. Hadoop is an open source framework that supports the processing of large data sets in a distributed computing environment.
2. Hadoop consists of MapReduce, the Hadoop Distributed File System (HDFS) and a number of related projects such as Apache Hive, HBase and ZooKeeper. MapReduce and the Hadoop Distributed File System (HDFS) are the main components of Hadoop.
3. Apache Hadoop is an open-source, free, Java-based software framework that offers a powerful distributed platform to store and manage Big Data.
4. It is licensed under an Apache V2 license.
5. It runs applications on large clusters of commodity hardware and processes thousands of terabytes of data on thousands of nodes. Hadoop was inspired by Google's MapReduce and Google File System (GFS) papers.
6. The major advantage of the Hadoop framework is that it provides reliability and high availability.

2.2 Why Use Hadoop?


There are many advantages of using Hadoop:
1. Robust and Scalable – We can add new nodes as needed, as well as modify them.
2. Affordable and Cost Effective – We do not need any special hardware for running Hadoop. We can just use commodity servers.
3. Adaptive and Flexible – Hadoop is built keeping in mind that it will handle structured and unstructured data.

4. Highly Available and Fault Tolerant – When a node fails, the Hadoop
framework automatically fails over to another node.

2.3 Core Hadoop Components


There are two major components of the Hadoop framework, and each of them performs one of the important tasks for it.
1. Hadoop MapReduce is the method used to split a larger data problem into smaller chunks and distribute them to many different commodity servers. Each server has its own set of resources, and it processes its chunk locally. Once a commodity server has processed the data, it sends the result back to the main server. This is effectively a process by which large data is handled effectively and efficiently (a minimal sketch of this model appears after this section's lists).
2. Hadoop Distributed File System (HDFS) is a virtual file system. There is a big difference between HDFS and other file systems: when we move a file onto HDFS, it is automatically split into many small pieces. These small chunks of the file are replicated and stored on other servers (usually 3) for fault tolerance and high availability.
3. Namenode: The NameNode is the heart of the Hadoop system. The NameNode manages the file system namespace. It stores the metadata information of the data blocks. This metadata is stored permanently onto the local disk in the form of a namespace image and an edit log file. The NameNode also knows the location of the data blocks on the data nodes; however, it does not store this information persistently. The NameNode creates the block-to-DataNode mapping when it is restarted. If the NameNode crashes, then the entire Hadoop system goes down.
4. Secondary Namenode: The responsibility of the secondary NameNode is to periodically copy and merge the namespace image and edit log. If the NameNode crashes, the namespace image stored in the secondary NameNode can be used to restart the NameNode.
5. DataNode: It stores the blocks of data and retrieves them. The DataNodes also report the block information to the NameNode periodically.

6. Job Tracker: The JobTracker's responsibility is to schedule the clients' jobs. The JobTracker creates map and reduce tasks and schedules them to run on the DataNodes (task trackers). The JobTracker also checks for any failed tasks and reschedules them on another DataNode. The JobTracker can run on the NameNode or on a separate node.
7. Task Tracker: The TaskTracker runs on the DataNodes. The TaskTracker's responsibility is to run the map or reduce tasks assigned by the JobTracker and to report the status of the tasks back to the JobTracker.
Besides the above two core components, the Hadoop project also contains the following modules as well.
1. Hadoop Common: Common utilities for the other Hadoop modules
2. Hadoop YARN: A framework for job scheduling and cluster resource management
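
To make the split-process-collect description of MapReduce above concrete, here is the classic word-count pattern written against the Hadoop MapReduce Java API. It is a minimal sketch rather than production code; the input and output paths are supplied as command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: each node tokenizes its local chunk and emits (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: the partial counts from all mappers are summed per word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}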

Lustre
 Traditionally, Lustre is configured to manage remote data storage disk devices within a Storage Area Network (SAN), which is two or more remotely attached disk devices communicating via a Small Computer System Interface (SCSI) protocol.
 This includes Fibre Channel, Fibre Channel over Ethernet (FCoE), Serial Attached SCSI (SAS) and even iSCSI.
 With Hadoop HDFS, the software needs a dedicated cluster of computers on which to run.
 However, people who run high performance computing clusters for other purposes often do not run HDFS, which leaves them with a lot of computing power, tasks that could almost certainly benefit from a bit of MapReduce, and no way to put that power to work running Hadoop.

Tachyon  Tachyon is an open source memory-centric distributed 1. Tachyon site


storage system enabling reliable data sharing at memory-
speed across cluster jobs, possibly written in different
computation frameworks, such as Apache Spark and

AVANTHI INSTITUTE OF ENGINEERING AND TECHNOLOGY


Gunthapally(V), Abdullapurment(M), R.R. District-501512. 3
Apache MapReduce.
 In the big data ecosystem, Tachyon lies between
computation frameworks or jobs, such as Apache Spark,
Apache MapReduce, or Apache Flink, and various kinds of
storage systems, such as Amazon S3, OpenStack Swift,
GlusterFS, HDFS, or Ceph.
 Tachyon brings significant performance improvement to
the stack; for example, Baidu uses Tachyon to improve
their data analytics performance by 30 times.
 Beyond performance, Tachyon bridges new workloads with
data stored in traditional storage systems. Users can run
Tachyon using its standalone cluster mode, for
example on Amazon EC2, or launch Tachyon with
Apache Mesos or Apache Yarn.
 Tachyon is Hadoop compatible. This means that
existing Spark and MapReduce programs can run on top
of Tachyon without any code changes.
 The project is open source (Apache License 2.0) and is
deployed at
multiple companies. It is one of the fastest growing
open source projects.

GridGain  GridGain is open source project licensed under Apache 2.0. 1. GridGain site
One of the
main pieces of this platform is the In-Memory
Apache Hadoop

AVANTHI INSTITUTE OF ENGINEERING AND TECHNOLOGY


Gunthapally(V), Abdullapurment(M), R.R. District-501512. 3
Accelerator which aims to accelerate HDFS and
Map/Reduce by bringing both, data and computations
into memory.
 This work is done with the GGFS - Hadoop compliant
in-memory file system. For I/O intensive jobs GridGain
GGFS offers performance close to 100x faster than
standard HDFS.
 Paraphrasing Dmitriy Setrakyan from GridGain Systems
talking about GGFS regarding Tachyon
 GGFS allows read-through and write-through to/from
underlying HDFS or any other Hadoop compliant file
system with zero code change. Essentially GGFS
entirely removes ETL step from integration.
 GGFS has ability to pick and choose what folders stay in
memory, what folders stay on disc, and what folders get
synchronized with underlying (HD) FS either
synchronously or asynchronously.
 GridGain is working on adding native MapReduce
component which will provide native complete Hadoop
integration without changes in API, like Spark
currently forces you to do. Essentially GridGain
MR+GGFS will allow to bring Hadoop completely or
partially in-memory
in Plug-n-Play fashion without any API changes.
XtreemFS  XtreemFS is a general purpose storage system and covers 1. XtreemFS site
most storage needs in a single deployment. 2. Flink on

 It is open-source, requires no special hardware or kernel XtreemFS.


modules, and can be mounted on Linux, Windows and OS Spark
X. XtreemFS

 XtreemFS runs distributed and offers resilience through

AVANTHI INSTITUTE OF ENGINEERING AND TECHNOLOGY


Gunthapally(V), Abdullapurment(M), R.R. District-501512. 3
replication. XtreemFS Volumes can be accessed through a
FUSE component that offers normal file interaction with
POSIX like semantics.
 An implementation of Hadoop File System interface is
included which makes XtreemFS available for use with
Hadoop, Flink and Spark out of the box. XtreemFS is
licensed under the New BSD license.
 The XtreemFS project is developed by Zuse Institute Berlin.
Distributed Programming
Apache  MapReduce is a programming model for processing large 1. Apache
MapReduce MapReduce
data sets with a parallel, distributed algorithm on a
2. Google
cluster.
MapReduce
 Apache MapReduce was derived from Google MapReduce:
Simplified paper
Data Processing on Large Clusters paper. The current 3. Writing
Apache
YARN
applications

AVANTHI INSTITUTE OF ENGINEERING AND TECHNOLOGY


Gunthapally(V), Abdullapurment(M), R.R. District-501512. 3
MapReduce version is built over Apache YARN
Framework. YARN
stands for “Yet-Another-Resource-Negotiator”.
 It is a new framework that facilitates writing arbitrary
distributed processing frameworks and applications.
YARN’s execution model is more generic than the
earlier MapReduce implementation.
 YARN can run applications that do not follow the
MapReduce model, unlike the original Apache Hadoop
MapReduce (also called MR1). Hadoop YARN is an
attempt to take Apache Hadoop beyond
MapReduce for data-processing.
Apache Pig  Pig provides an engine for executing data flows in parallel 1. pig.apache.or
on Hadoop. g/
 It includes a language, Pig Latin, for expressing these data 2. Pig
flows.
examples
 Pig Latin includes operators for many of the traditional
by Alan
data operations (join, sort, filter, etc.), as well as the
Gates
ability for users to develop their own functions for
reading, processing, and writing data.
 Pig runs on Hadoop. It makes use of both the Hadoop
Distributed File
System, HDFS, and Hadoop’s processing system, MapReduce.
 Pig uses MapReduce to execute all of its data processing.
 It compiles the Pig Latin scripts that users write into a
series of one or more MapReduce jobs that it then
executes.
 Pig Latin looks different from many of the programming
languages you have seen.
 There are no if statements or for loops in Pig Latin.
 This is because traditional procedural and object-
oriented
programming languages describe control flow, and data
AVANTHI INSTITUTE OF ENGINEERING AND TECHNOLOGY
Gunthapally(V), Abdullapurment(M), R.R. District-501512. 3
flow is a side effect of the program. Pig Latin instead
focuses on data flow.
JAQL  JAQL is a functional, declarative programming 1. JAQL in
language designed especially for working with large Google
volumes of structured, semi- structured and Code
unstructured data. 2. What
is Jaql?
by IBM
 As its name implies, a primary use of JAQL is to handle
data stored as JSON documents, but JAQL can work on
various types of data.
 For example, it can support XML, comma-separated values
(CSV) data and flat files.
 A "SQL within JAQL" capability lets programmers work
with structured
SQL data while employing a JSON data model that's less
restrictive than its Structured Query Language
counterparts.

Apache Apex
 The Apache Apex platform is supplemented by Apache Apex-Malhar, which is a library of operators that implement common business logic functions needed by customers who want to quickly develop applications.
 These operators provide access to HDFS, S3, NFS, FTP, and other file systems; Kafka, ActiveMQ, RabbitMQ, JMS, and other message systems; MySQL, Cassandra, MongoDB, Redis, HBase, CouchDB and other databases, along with JDBC connectors.
 The library also includes a host of other common business logic patterns that help users to significantly reduce the time it takes to go into production. Ease of integration with all other big data technologies is one of the primary missions of Apache Apex-Malhar.
Netflix PigPen  PigPen is map-reduce for Clojure which compiles to Apache Pig. 1. PigPen on
GitHub
Clojure is dialect of the Lisp programming language created by
Rich Hickey, so is a functional general-purpose language, and
runs on the Java Virtual Machine, Common Language Runtime,
and JavaScript engines.
 In PigPen there are no special user defined functions (UDFs).
Define Clojure functions, anonymously or named, and use them
like you would in any Clojure program.
 This tool is open sourced by Netflix, Inc. the American
provider of on-
demand Internet streaming media.
AMPLab SIMR  Apache Spark was developed thinking in Apache YARN. 1. SIMR on
GitHub
 It has been relatively hard to run Apache Spark on Hadoop
MapReduce v1 clusters, i.e. clusters that do not have YARN
installed.
 Typically, users would have to get permission to install
Spark/Scala on some subset of the machines, a process that could

AVANTHI INSTITUTE OF ENGINEERING AND TECHNOLOGY


Gunthapally(V), Abdullapurment(M), R.R. District-501512. 3
be time consuming. SIMR allows anyone with access to a Hadoop
MapReduce v1 cluster to run Spark out of the box.
 A user can run Spark directly on top of Hadoop MapReduce v1
without
any administrative rights, and without having Spark or Scala
installed on any of the nodes.
Facebook Corona  “The next version of Map-Reduce" from Facebook, based in 1. Corona on
Github
own fork of Hadoop. The current Hadoop implementation of
the MapReduce technique uses a single job tracker, which
causes scaling issues for very
large data sets.
23

AVANTHI INSTITUTE OF ENGINEERING AND TECHNOLOGY


Gunthapally(V), Abdullapurment(M), R.R. District-501512. 3
 The Apache Hadoop developers have been creating their own
next- generation MapReduce, called YARN, which Facebook
engineers looked at but discounted because of the highly-
customised nature of the company's deployment of Hadoop
and HDFS. Corona, like YARN,
spawns multiple job trackers (one for each job, in Corona's case).
Apache Twill  Twill is an abstraction over Apache Hadoop® YARN that 1. Apache
reduces the complexity of developing distributed applications, Twill
allowing developers to focus more on their business logic. Incubator
 Twill uses a simple thread-based model that Java programmers
will find familiar.
 YARN can be viewed as a compute fabric of a cluster,
which means YARN applications like Twill will run on any
Hadoop 2 cluster.YARN is an open source application that
allows the Hadoop cluster to turn into a collection of virtual
machines.
 Weave, developed by Continuuity and initially housed on
Github, is a complementary open source application that
uses a programming model similar to Java threads, making
it easy to write distributed applications. In order to remove
a conflict with a similarly named project on Apache, called
"Weaver," Weave's name changed to Twill when it moved
to Apache incubation.
 Twill functions as a scaled-out proxy. Twill is a
middleware layer in between YARN and any application on
YARN. When you develop a Twill app, Twill handles APIs in
YARN that resemble a multi-threaded application familiar
to Java. It is very easy to build multi-processed
distributed applications in Twill.

Damballa Parkour  Library for develop MapReduce programs using the LISP like 1. Parkour
language Clojure. Parkour aims to provide deep Clojure GitHub
integration for Hadoop. Project
 Programs using Parkour are normal Clojure programs, using
standard Clojure functions instead of new framework
abstractions.
 Programs using Parkour are also full Hadoop programs, with
complete access to absolutely everything possible in raw
Java Hadoop
MapReduce.
Apache Hama  Apache Top-Level open source project, allowing you to do 1. Hama site
advanced
analytics beyond MapReduce.

Haeinsa  Haeinsa is linearly scalable multi-row, multi-table
transaction library for
HBase. Use Haeinsa if you need strong ACID semantics on
your HBase cluster. Is based on Google Perlocator
concept.
SenseiDB  Open-source, distributed, realtime, semi-structured 1. SenseiDB site

database. Some Features: Full-text search, Fast


realtime updates, Structured and faceted search, BQL:
SQL-like query language, Fast key-value lookup, High
performance under concurrent heavy update and query
volumes,
Hadoop integration
Sky  Sky is an open source database used for flexible, high 1. SkyDB site

performance analysis of behavioral data. For certain kinds


of data such as clickstream data and log data, it can be
several orders of magnitude faster than
traditional approaches such as SQL databases or Hadoop.
BayesDB  BayesDB, a Bayesian database table, lets users query 1. BayesDB site
the probable implications of their tabular data as easily as
an SQL database lets them query the data itself. Using the
built-in Bayesian Query Language (BQL), users with no
statistics training can solve basic data science problems,
such as detecting predictive relationships between
variables, inferring missing values, simulating
probable observations, and identifying
statistically similar database entries.
InfluxDB  InfluxDB is an open source distributed time series 1. InfluxDB site

database with no external dependencies. It's useful for


recording metrics, events, and performing analytics. It
AVANTHI INSTITUTE OF ENGINEERING AND TECHNOLOGY
Gunthapally(V), Abdullapurment(M), R.R. District-501512. 3
has a built-in HTTP API so you don't have to write
any server side code to get up and running. InfluxDB is
designed to be scalable, simple to install and manage, and
fast to get data in and out. It aims to answer queries in real-
time. That means every data point is indexed as it comes
in and is immediately available in queries that
should return under 100ms.
SQL-On-Hadoop
Apache Hive  Data Warehouse infrastructure developed by 1. Apache
Facebook. Data HIVE site
summarization, query, and analysis. It’s provides SQL-like 2. Apache

language (not SQL92 compliant): HiveQL. HIVE GitHub


Project
Apache HCatalog  HCatalog’s table abstraction presents users with a
relational view of
data in the Hadoop Distributed File System (HDFS) and
ensures that users need not worry about where or in
what format their data is

AVANTHI INSTITUTE OF ENGINEERING AND TECHNOLOGY


Gunthapally(V), Abdullapurment(M), R.R. District-501512. 3
stored. Right now HCatalog is part of Hive. Only old
versions are
separated for download.

Trafodion:  Trafodion is an open source project sponsored by HP, 1. Trafodion wiki


Transactional incubated at HP Labs and HP-IT, to develop an enterprise-class
SQL-on- SQL-on-HBase solution
HBase targeting big data transactional or operational workloads.

Apache HAWQ  Apache HAWQ is a Hadoop native SQL query engine that 1. Apache
HAWQ site
combines key technological advantages of MPP database
2. HAWQ
evolved from Greenplum
GitHub
Database, with the scalability and convenience of Hadoop.
Project

Apache Drill  Drill is the open source version of Google's Dremel 1. Apache
system which is available as an infrastructure service called Incubator Drill
Google BigQuery. In recent years open source systems
have emerged to address the need for scalable batch
processing (Apache Hadoop) and stream processing
(Storm, Apache S4). Apache Hadoop, originally inspired
by Google's internal MapReduce system, is used by
thousands of organizations processing large-scale datasets.

AVANTHI INSTITUTE OF ENGINEERING AND TECHNOLOGY


Gunthapally(V), Abdullapurment(M), R.R. District-501512. 3
Apache Hadoop is designed to achieve very high throughput,
but is not designed to achieve the sub-second latency
needed for interactive data analysis and exploration.
Drill, inspired by Google's internal Dremel system, is intended
to address this
need

Cloudera Impala  The Apache-licensed Impala project brings scalable parallel 1. Cloudera
database technology to Hadoop, enabling users to issue low- Impala site
latency SQL queries to data stored in HDFS and Apache 2. Impala
HBase without requiring data movement or GitHub
transformation. It's a Google Dremel clone (Big Query Project
google).

Data Ingestion

Apache Flume  Flume is a distributed, reliable, and available service for 1. Apache
efficiently collecting, aggregating, and moving large amounts Flume
of log data. It has a simple and flexible architecture based project site
on streaming data flows. It is robust and fault tolerant with
tunable reliability mechanisms and many failover and
recovery mechanisms. It uses a simple extensible data
model that allows for online analytic application.

Apache Sqoop  System for bulk data transfer between HDFS and structured 1. Apache
datastores
Sqoop
as RDBMS. Like Flume but from HDFS to RDBMS.
project site

Facebook Scribe  Log agregator in real-time. It’s a Apache Thrift Service. TODO

Apache Chukwa  Large scale log aggregator, and analytics. TODO


Apache Kafka  Distributed publish-subscribe system for processing large 1. Apache
amounts of streaming data. Kafka is a Message Queue developed Kafka
by LinkedIn that persists messages to disk in a very 2. GitHub source
performant manner. Because messages are persisted, it has code

the interesting ability for clients to rewind a stream and


consume the messages again. Another upside of the disk
persistence is that bulk importing the data into HDFS for offline
analysis can be done very quickly and efficiently. Storm,
developed by BackType (which was acquired by Twitter a
year ago), is more about
transforming a stream of messages into new streams.
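
As a small, hedged illustration of this publish-subscribe model, the sketch below uses the Kafka Java producer API to append one record to a topic. The broker address (broker-host:9092) and the topic name (web-logs) are placeholder assumptions.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker address; serializers turn keys and values into bytes.
        props.put("bootstrap.servers", "broker-host:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // The record is appended to the topic's on-disk log, so consumers
            // can replay it later (for example, when bulk-loading into HDFS).
            producer.send(new ProducerRecord<>("web-logs", "host-1", "GET /index.html 200"));
        }
    }
}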
Netflix Suro  Suro has its roots in Apache Chukwa, which was initially TODO
adopted by
Netflix. Is a log agregattor like Storm, Samza.
Apache Samza  Apache Samza is a distributed stream processing framework. TODO

It uses Apache Kafka for messaging, and Apache Hadoop YARN


to provide fault tolerance, processor isolation, security, and
resource management.
Developed by http://www.linkedin.com/in/jaykreps Linkedin.
Cloudera  Cloudera Morphlines is a new open source framework that TODO
Morphline
reduces the time and skills necessary to integrate, build, and
change Hadoop processing applications that extract,
transform, and load data into Apache Solr, Apache HBase,
HDFS, enterprise data warehouses, or
analytic online dashboards.
HIHO  This project is a framework for connecting disparate data TODO

sources with the Apache Hadoop system, making them


interoperable. HIHO connects Hadoop with multiple
RDBMS and file systems, so that data
can be loaded to Hadoop and unloaded from Hadoop
Apache NiFi  Apache NiFi is a dataflow system that is currently under 1. Apache NiFi

incubation at the Apache Software Foundation. NiFi is based on

AVANTHI INSTITUTE OF ENGINEERING AND TECHNOLOGY


Gunthapally(V), Abdullapurment(M), R.R. District-501512. 3
the concepts of flow- based programming and is highly
configurable. NiFi uses a component based extension model to
rapidly add capabilities to complex dataflows. Out of the
box NiFi has several extensions for dealing with file-based
dataflows such as FTP, SFTP, and HTTP integration as well as
integration with HDFS. One of NiFi’s unique features is a
rich, web-
based interface for designing, controlling, and monitoring a
dataflow.
Apache  Apache ManifoldCF provides a framework for connecting 1. Apache
ManifoldCF source ManifoldCF
content repositories like file systems, DB, CMIS, SharePoint,
FileNet ... to target repositories or indexes, such as Apache Solr
or ElasticSearch.

AVANTHI INSTITUTE OF ENGINEERING AND TECHNOLOGY


Gunthapally(V), Abdullapurment(M), R.R. District-501512. 3
It's a kind of crawler for multi-content repositories, supporting
a lot of
sources and multi-format conversion for indexing by means of
Apache Tika Content Extractor transformation filter.
Service Programming
Apache Thrift  A cross-language RPC framework for service creations. It’s the 1. Apache Thrift
service base for Facebook technologies (the original Thrift
contributor). Thrift provides a framework for developing and
accessing remote services.
 It allows developers to create services that can be consumed
by any application that is written in a language that there are
Thrift bindings for.
 Thrift manages serialization of data to and from a service, as well
as the protocol that describes a method invocation, response,
etc. Instead of writing all the RPC code -- you can just get
straight to your service logic.
Thrift uses TCP and so a given service is bound to a particular
port.
Apache Zookeeper  It’s a coordination service that gives you the tools you need 1. Apache
Zookeeper
to write correct distributed applications.
2. Google
 ZooKeeper was developed at Yahoo! Research. Several
Chubby
Hadoop projects are already using ZooKeeper to coordinate
paper
the cluster and provide highly-available distributed services.
 Perhaps most famous of those are Apache HBase, Storm,
Kafka. ZooKeeper is an application library with two principal
implementations of the APIs—Java and C—and a service
component implemented in Java that runs on an ensemble of
dedicated servers.
 Zookeeper is for building distributed systems, simplifies
the development process, making it more agile and enabling
more robust implementations. Back in 2006, Google published a
paper on "Chubby", a distributed lock service which gained

AVANTHI INSTITUTE OF ENGINEERING AND TECHNOLOGY


Gunthapally(V), Abdullapurment(M), R.R. District-501512. 3
wide adoption within their data centers.
 Zookeeper, not surprisingly, is a close clone of Chubby designed to
fulfill
many of the same roles for HDFS and other Hadoop infrastructure.
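
A minimal sketch of these coordination primitives, using the ZooKeeper Java client to publish and read a small piece of shared state under a znode. The ensemble address (zk-host:2181) and the znode path are assumptions made for illustration.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        // Wait until the session to the (placeholder) ensemble is established.
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Publish a small piece of cluster state as a persistent znode ...
        String path = "/demo-config";
        if (zk.exists(path, false) == null) {
            zk.create(path, "v1".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // ... and read it back; real applications would register watches here.
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data));
        zk.close();
    }
}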
Apache Avro  Apache Avro is a framework for modeling, serializing and 1. Apache Avro
making Remote Procedure Calls (RPC).
 Avro data is described by a schema, and one interesting feature
is that the schema is stored in the same file as the data it
describes, so files are
self-describing.

 Avro does not require code generation. This framework can


compete with other similar tools like:
 Apache Thrift, Google Protocol Buffers, ZeroC ICE, and so on.
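
The short Java sketch below illustrates the self-describing-file property noted above: a record is written with an inline schema and read back without any code generation. The User schema and the file name are made up for the example.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
    public static void main(String[] args) throws Exception {
        // A hypothetical record schema, declared inline as JSON.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},{\"name\":\"age\",\"type\":\"int\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Ada");
        user.put("age", 36);

        // The schema is embedded in the container file, so the file is self-describing.
        File out = new File("users.avro");
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, out);
            writer.append(user);
        }

        // The reader recovers the schema from the file itself; no generated classes are needed.
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(out, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord rec : reader) {
                System.out.println(rec);
            }
        }
    }
}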
Apache Curator  Curator is a set of Java libraries that make using Apache TODO
ZooKeeper
much easier.
Apache karaf  Apache Karaf is an OSGi runtime that runs on top of any TODO
OSGi framework and provides you a set of services, a
powerful provisioning
concept, an extensible shell and more.
Twitter Elephant  Elephant Bird is a project that provides utilities (libraries) 1. Elephant
Bird
for working with LZOP-compressed data. It also provides a Bird
container format that supports working with Protocol GitHub
Buffers, Thrift in MapReduce, Writables, Pig LoadFuncs,
Hive SerDe, HBase miscellanea. This open
source library is massively used in Twitter.

Linkedin Norbert  Norbert is a library that provides easy cluster management 1. Linedin
and workload distribution. Project
 With Norbert, you can quickly distribute a simple 2. GitHub source
client/server architecture to create a highly scalable architecture code

capable of handling heavy traffic.


 Implemented in Scala, Norbert wraps ZooKeeper, Netty
and uses Protocol Buffers for transport to make it easy to
build a cluster aware application.
 A Java API is provided and pluggable load balancing
strategies are
supported with round robin and consistent hash strategies
provided out of the box.
Scheduling
Apache Oozie  Workflow scheduler system for MR jobs using DAGs (Direct 1. Apache
Acyclical Graphs). Oozie Coordinator can trigger jobs by Oozie
time (frequency) and 2. GitHub source
data availability code
Linkedin Azkaban  Hadoop workflow management. A batch job scheduler can be TODO
seen as a combination of the cron and make Unix utilities
combined with a
friendly UI.
Apache Falcon  Apache Falcon is a data management framework for TODO
simplifying data lifecycle management and processing
pipelines on Apache Hadoop.
 It enables users to configure, manage and orchestrate data
motion,
pipeline processing, disaster recovery, and data retention
workflows.

Instead of hard-coding complex data lifecycle capabilities,


Hadoop applications can now rely on the well-tested Apache
Falcon framework for these functions.
 Falcon’s simplification of data management is quite useful to
anyone building apps on Hadoop. Data Management on Hadoop

AVANTHI INSTITUTE OF ENGINEERING AND TECHNOLOGY


Gunthapally(V), Abdullapurment(M), R.R. District-501512. 3
encompasses data motion, process orchestration, lifecycle
management, data discovery, etc. among other concerns that
are beyond ETL.
 Falcon is a new data processing and management platform for
Hadoop that solves this problem and creates additional
opportunities by building on existing components within the
Hadoop ecosystem (ex. Apache Oozie, Apache Hadoop
DistCp etc.) without reinventing the
wheel.
Schedoscope  Schedoscope is a new open-source project providing a GitHub source
code
scheduling framework for painfree agile development,
testing, (re)loading, and monitoring of your datahub, lake, or
whatever you choose to call your Hadoop data warehouse these
days. Datasets (including dependencies) are defined using a scala
DSL, which can embed MapReduce jobs, Pig scripts, Hive
queries or Oozie workflows to build the dataset. The tool
includes a test framework to verify logic and a command line
utility to
load and reload data.
Machine Learning
Apache Mahout  Machine learning library and math library, on top of TODO
MapReduce.
WEKA  Weka (Waikato Environment for Knowledge Analysis) is a TODO

popular suite of machine learning software written in Java,


developed at the University of Waikato, New Zealand.
Weka is free software available
under the GNU General Public License.
Cloudera Oryx  The Oryx open source project provides simple, real-time 1. Oryx at
GitHub
large-scale machine learning / predictive analytics
2. Cloudera
infrastructure.
forum for
 It implements a few classes of algorithm commonly used in
business Machine
applications: collaborative filtering / recommendation, Learning

AVANTHI INSTITUTE OF ENGINEERING AND TECHNOLOGY


Gunthapally(V), Abdullapurment(M), R.R. District-501512. 3
classification / regression, and clustering.
MADlib  The MADlib project leverages the data-processing 1. MADlib
Community
capabilities of an RDBMS to analyze data. The aim of this
project is the integration of
statistical data analysis into databases.

Figure 2.4: Yahoo Hadoop Cluster (nearly 1 petabyte of user data)


i. A small Hadoop cluster includes a single master node and multiple worker or slave nodes. As discussed earlier, the entire cluster contains two layers.
ii. One layer is the MapReduce layer and the other is the HDFS layer.
iii. Each of these layers has its own relevant components.

The master node consists of a JobTracker, TaskTracker, NameNode and DataNode.


iv. A slave or worker node consists of a DataNode and a TaskTracker.
v. It is also possible that a slave or worker node acts only as a data node or only as a compute node; as a matter of fact, that is a key feature of Hadoop.

Figure 2.4: NameNode Cluster

B. Hadoop Cluster Architecture:

Figure 2.5: Hadoop Cluster Architecture

A Hadoop cluster would consist of:


 110 different racks
 Each rack would have around 40 slave machines
 At the top of each rack there is a rack switch
 Each slave machine (a rack server in a rack) has cables coming out of it from both ends
 The cables are connected to the rack switch at the top, which means that each top-of-rack switch will have around 80 ports
 There are 8 global core switches
 The rack switches have uplinks connected to the core switches, thus connecting all the other racks with uniform bandwidth and forming the cluster
 In the cluster, you have a few machines that act as the NameNode and the JobTracker. They are referred to as masters. These masters have a different configuration, favoring more DRAM and CPU and less local storage.
Hadoop cluster has 3 components:
1. Client
2. Master
3. Slave

Figure 2.6: Hadoop Core Component


1. Client:

i. It is neither master nor slave; rather, it plays the role of loading the data into the cluster, submitting MapReduce jobs that describe how the data should be processed, and then retrieving the data to see the response after job completion.

Figure 2.6: Hadoop Client

2. Masters:
The Masters consist of three components: NameNode, Secondary NameNode and JobTracker.

Figure 2.7: MapReduce - HDFS


i. NameNode:
 The NameNode does NOT store the files, only the files' metadata. In a later section we will see that it is actually the DataNode that stores the files.

Figure 2.8: NameNode


 The NameNode oversees the health of the DataNodes and coordinates access to the data stored in the DataNodes.
 The NameNode keeps track of all file-system-related information, such as:
 Which section of a file is saved in which part of the cluster
 The last access time for the files
 User permissions, i.e. which user has access to a file

ii. JobTracker:
JobTracker coordinates the parallel processing of data using MapReduce.

iii. Secondary Name Node:

Figure 2.9: Secondary NameNode
 The job of the Secondary NameNode is to contact the NameNode periodically after a certain time interval (by default one hour).
 The NameNode, which keeps all file system metadata in RAM, has no built-in capability to persist that metadata onto disk.
 Hence, if the NameNode crashes, you lose everything that was in RAM, and you don't have any backup of the file system.
 What the Secondary NameNode does is contact the NameNode every hour and pull a copy of the metadata information out of the NameNode.
 It shuffles and merges this information into a clean file and sends it back to the NameNode, while keeping a copy for itself.
 Hence the Secondary NameNode is not a backup; rather, it does the job of housekeeping.
 In case of NameNode failure, the saved metadata can be used to rebuild it easily.

3. Slaves:
i. Slave nodes make up the majority of machines in a Hadoop cluster and are responsible for:
 Storing the data
 Processing the computation
Figure 2.10: Slaves
ii. Each slave runs both a DataNode and a TaskTracker daemon, which communicate with their respective masters.
iii. The TaskTracker daemon is a slave to the JobTracker, and the DataNode daemon is a slave to the NameNode.

CHAPTER-3
NO-SQL
3.1 What is NoSQL? NoSQL business drivers; NoSQL case studies
1. NoSQL is a whole new way of thinking about a database.
2. Though NoSQL is not a relational database, the reality is that a relational database model may not be the best solution for all situations.
3. Example: Imagine that you have coupons that you want to push to mobile customers who purchase a specific item. This customer-facing system of engagement requires location data, purchase data, wallet data, and so on, and you want to engage the mobile customer in real time.
4. NoSQL is not a relational database.
5. The easiest way to think of NoSQL is as a database which does not adhere to the traditional relational database management system (RDBMS) structure. Sometimes you will also see it referred to as 'not only SQL'.
6. It is not built on tables and does not employ SQL to manipulate data. It also may not provide full ACID (atomicity, consistency, isolation, durability) guarantees, but it still has a distributed and fault-tolerant architecture.
7. The NoSQL taxonomy supports key-value stores, document stores, Big Table-style stores, and graph databases.
8. MongoDB, for example, uses a document model; a document can be thought of as a row in an RDBMS. A document, being a set of fields (key-value pairs), maps nicely to programming-language data types.
9. A MongoDB database holds collections, where a collection is a set of documents. Embedded documents and arrays reduce the need for joins, which is key for high performance and speed.
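
To illustrate the document model and embedded arrays just described, here is a minimal sketch using the MongoDB Java (sync) driver. The connection string, database, collection and field names are illustrative assumptions rather than part of the original text.

import java.util.Arrays;
import org.bson.Document;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;

public class MongoExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> orders =
                client.getDatabase("shop").getCollection("orders");

            // A document is schema-free: plain fields plus an embedded array,
            // with no table definition or migration step required up front.
            orders.insertOne(new Document("customer", "C-1001")
                    .append("total", 49.90)
                    .append("items", Arrays.asList("coupon", "book")));

            // Query by field value; embedded data returns with the document, so no join is needed.
            Document first = orders.find(new Document("customer", "C-1001")).first();
            System.out.println(first.toJson());
        }
    }
}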

I. Why NoSQL?
1. It offers high performance with high availability, a rich query language and easy scalability.
2. NoSQL is gaining momentum, and is supported by Hadoop, MongoDB and others.

3. The NoSQL Database site is a good reference for someone looking for more
information.

Figure 3.1: Web Application Data Growth

NoSQL is:

 More than rows in tables—NoSQL systems store and retrieve data from many
formats; key-value stores, graph databases, column-family (Bigtable) stores,
document stores and even rows in tables.
 Free of joins—NoSQL systems allow you to extract your data using simple
interfaces without joins.
 Schema free—NoSQL systems allow you to drag-and-drop your data into a folder
and then query it without creating an entity-relational model.
 Compatible with many processors—NoSQL systems allow you to store your
database on multiple processors and maintain high-speed performance.
 Usable on shared-nothing commodity computers—Most (but not all) NoSQL
systems leverage low cost commodity processors that have separate RAM and
disk.
 Supportive of linear scalability—NoSQL supports linear scalability; when you add
more processors you get a consistent increase in performance.
 Innovative—NoSQL offers options beyond a single way of storing, retrieving and manipulating data. NoSQL supporters (also known as NoSQLers) have an inclusive attitude about NoSQL and recognize SQL solutions as viable options. To the NoSQL community, NoSQL means "not only SQL".

3.2 NoSQL is not:

 About the SQL language—the definition of NoSQL is not an application that uses a
language other than SQL. SQL as well as other query languages are used with
NoSQL databases.
 Not only open source—although many NoSQL systems have an open source model, commercial products use NoSQL concepts as well as open source initiatives. You can still have an innovative approach to problem solving with a commercial product.
 Not only Big Data—many, but not all, NoSQL applications are driven by the inability of a current application to efficiently scale when Big Data is an issue. While volume and velocity are important, NoSQL also focuses on variability and agility.
 About cloud computing—many NoSQL systems reside in the cloud to take advantage of its ability to rapidly scale when the situation dictates. NoSQL systems can run in the cloud as well as in your corporate data center.
 About a clever use of RAM and SSD—Many NoSQL systems focus on the efficient
use of RAM or solid-state disks to increase performance. While important, NoSQL
systems can run on standard hardware.
 An elite group of products—NoSQL is not an exclusive club with a few
products. There are no membership dues or tests required to join.

Table 3.1: Types of NoSQL data stores

Key-value store—A simple data storage system that uses a key to access a value.
 Typical usage: image stores; key-based file systems; object caches; systems designed to scale.
 Examples: Memcache, Redis, Riak, DynamoDB.

Column family store—A sparse matrix system that uses a row and column as keys.
 Typical usage: web crawler results; Big Data problems that can relax consistency rules.
 Examples: HBase, Cassandra, Hypertable.

Graph store—For relationship-intensive problems.
 Typical usage: social networks; fraud detection; relationship-heavy data.
 Examples: Neo4J, AllegroGraph, Bigdata (RDF store), InfiniteGraph (Objectivity).

Document store—Storing hierarchical data structures directly in the database.
 Typical usage: high-variability data; document search; integration hubs; web content management; publishing.
 Examples: MongoDB (10gen), CouchDB, Couchbase, MarkLogic, eXist-db.
CHAPTER-4

CONCLUSION

Big Data Analytics is a transformative field addressing the exponential growth and
complexity of data. Its core lies in the three Vs: Volume (vast data storage in formats
like text, video, and images), Velocity (real-time data processing), and Variety (diverse
data forms, from structured to unstructured). Traditional data approaches are
inadequate, leading to frameworks like Hadoop, which offers scalable, fault-tolerant
distributed systems through components like HDFS and MapReduce. NoSQL
databases complement this ecosystem by providing schema-less, high-performance
solutions for varied data. Technologies like Apache Pig, Hive, and Spark simplify data
querying and analysis, while tools like Kafka and Flume handle real-time data
ingestion. Advanced solutions such as Tachyon and GridGain enhance memory-centric
performance, and innovative NoSQL models (e.g., MongoDB, Cassandra) tackle
specific data needs. Overall, Big Data Analytics integrates storage, computation, and
machine learning to uncover actionable insights, drive efficiency, and support
decision-making in dynamic environments.

CHAPTER-5
REFERENCES

1. A comprehensive reference model for Big Data Analytics, encompassing four sub-models: the Business Question Extraction Model, the Big Data Analytics Evolution Model, the Analytics Algorithm Reference Model, and the Goal-Oriented Analytics Process Model. Source: IEEE Xplore.

2. The Big Data Value (BDV) Reference Model, serving as a common framework to position big data technologies within the overall IT infrastructure.

3. Shahriar Akter and Samuel Fosso Wamba, "Big Data Analytics in E-commerce: A Systematic Review and Agenda for Future Research", Electronic Markets. This systematic review explores the role of big data analytics in e-commerce, highlighting current applications and proposing directions for future research.

