Big Data Analytics
2. Introduction to Hadoop
2.1 What is Hadoop?
2.2 Why Use Hadoop?
ii. Velocity
The growth of data and the explosion of social media have changed how we look at data. There was a time when we believed that yesterday's data was recent; as a matter of fact, newspapers still follow that logic. However, news channels and radio have changed how fast we receive the news.
Today, people rely on social media to keep them updated with the latest happenings. On social media, a message that is even a few seconds old (a tweet, a status update, etc.) may no longer interest users. They often discard old messages and pay attention to recent updates. Data movement is now almost real time, and the update window has shrunk to fractions of a second.
3. Hive
Hive was originally developed by Facebook and has been open source for some time now.
Hive acts as a bridge between SQL and Hadoop: it is used to run SQL-like queries on Hadoop clusters.
Apache Hive is essentially a data warehouse that provides ad-hoc queries, data summarization, and analysis of huge data sets stored in Hadoop-compatible file systems.
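As a small sketch of that bridge in practice, the snippet below submits a HiveQL query from Python using the third-party PyHive library; the host name, port, and the weblogs table are illustrative assumptions, not part of Hive itself.

    # A minimal PyHive sketch (pip install pyhive); host/port/table are placeholders.
    from pyhive import hive

    conn = hive.Connection(host="hadoop-master", port=10000, database="default")
    cursor = conn.cursor()

    # An ordinary SQL-like (HiveQL) query; Hive turns it into jobs that run
    # on the Hadoop cluster over data stored in HDFS.
    cursor.execute("SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page")
    for page, hits in cursor.fetchall():
        print(page, hits)

    cursor.close()
    conn.close()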
"NoSQL (not only Sql) is a type of databases that does not primarily rely upon schema based structure and
does not use Sql for data processing."
viii. The structured approach designs the database as tuples and columns, as per the requirements.
ix. Live, continuously arriving data from an ever-changing scenario cannot be handled by the traditional approach (a short schema-less sketch follows this list).
x. The Big data approach is iterative.
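To see the contrast concretely, here is a minimal sketch using MongoDB's pymongo driver, assuming a local MongoDB server; the database and field names are illustrative. Unlike fixed tuples and columns, documents in the same collection can carry different fields, which suits live, changing data.

    # A schema-less insert sketch (pip install pymongo); server and names are placeholders.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    events = client["demo"]["events"]

    # No schema is declared up front: each document defines its own fields,
    # so records from a changing, live source can be stored as they arrive.
    events.insert_one({"type": "tweet", "user": "alice", "text": "hello"})
    events.insert_one({"type": "status", "user": "bob", "likes": 3, "tags": ["news"]})

    for doc in events.find({"user": "alice"}):
        print(doc)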
2. The image above gives a good overview of how the various components of a Big Data infrastructure are associated with each other.
3. A Big Data architecture draws on many different data sources, so extraction, transformation, and integration form one of the most essential layers of the architecture.
4. Most of the data is stored in relational as well as non-relational data marts and
data warehousing solutions.
5. As per the business need, various data are processed as well as converted into the proper format.
6. Just like the software, the hardware is among the most important parts of the Big Data infrastructure.
7. In a Big Data architecture the hardware infrastructure is extremely important, and failover instances as well as redundant physical infrastructure are usually implemented.
Life cycle of Data
6. Job Tracker: The Job Tracker's responsibility is to schedule the clients' jobs. The Job Tracker creates map and reduce tasks and schedules them to run on the DataNodes (task trackers). The Job Tracker also checks for failed tasks and reschedules them on another DataNode. The Job Tracker can run on the NameNode or on a separate node.
7. Task Tracker: The Task Tracker runs on the DataNodes. The Task Tracker's responsibility is to run the map or reduce tasks assigned by the Job Tracker and to report the status of those tasks back to the Job Tracker; a minimal sketch of such map and reduce tasks follows.
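To make those map and reduce tasks concrete, below is a minimal word-count sketch in the style of Hadoop Streaming, which runs any executable that reads standard input and writes standard output; the file names mapper.py and reducer.py are illustrative placeholders.

    #!/usr/bin/env python3
    # mapper.py: emit "word<TAB>1" for every word read from stdin.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(word + "\t1")

Hadoop sorts the mapper output by key before the reduce phase, so all counts for a given word reach the reducer consecutively:

    #!/usr/bin/env python3
    # reducer.py: sum the sorted "word<TAB>count" pairs produced by the mappers.
    import sys

    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(current + "\t" + str(total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(current + "\t" + str(total))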
Besides the above two core components, the Hadoop project also contains the following modules:
1. Hadoop Common: Common utilities for the other Hadoop modules
2. Hadoop Yarn: A framework for job scheduling and cluster resource management
GridGain: GridGain is an open source project licensed under Apache 2.0. One of the main pieces of this platform is the In-Memory Hadoop Accelerator. References: 1. GridGain site.

Apache HAWQ: Apache HAWQ is a Hadoop-native SQL query engine that combines the key technological advantages of an MPP database, having evolved from Greenplum Database, with the scalability and convenience of Hadoop. References: 1. Apache HAWQ site; 2. HAWQ GitHub project.

Apache Drill: Drill is the open source version of Google's Dremel system, which Google makes available as an infrastructure service called Google BigQuery. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). Apache Hadoop, originally inspired by Google's internal MapReduce system, is used by thousands of organizations processing large-scale datasets. References: 1. Apache Incubator Drill site.

Cloudera Impala: The Apache-licensed Impala project brings scalable parallel database technology to Hadoop, enabling users to issue low-latency SQL queries to data stored in HDFS and Apache HBase without requiring data movement or transformation. It is a clone of Google Dremel (Google BigQuery); see the usage sketch after this table. References: 1. Cloudera Impala site; 2. Impala GitHub project.

Data Ingestion

Apache Flume: Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple, extensible data model that allows for online analytic applications. References: 1. Apache Flume project site.

Facebook Scribe: A log aggregator that works in real time; it is an Apache Thrift service. References: TODO.
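As a small usage sketch for one of the engines above, the snippet below queries Impala from Python with the third-party impyla client; the host, port, and the weblogs table are illustrative assumptions, and 21050 is merely Impala's usual client port.

    # A minimal impyla sketch (pip install impyla); host/port/table are placeholders.
    from impala.dbapi import connect

    conn = connect(host="impala-host", port=21050)
    cur = conn.cursor()

    # Low-latency SQL issued directly over data already stored in HDFS/HBase,
    # with no data movement or transformation beforehand.
    cur.execute("SELECT COUNT(*) FROM weblogs WHERE status = 404")
    print(cur.fetchone()[0])

    cur.close()
    conn.close()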
2. Masters:
The Masters consist of three components: the NameNode, the Secondary NameNode, and the JobTracker.
ii. JobTracker:
JobTracker coordinates the parallel processing of data using MapReduce.
To know more about JobTracker, please read the article All You Want to Know about
MapReduce (The Heart of Hadoop)
3. Slaves:
i. Slave nodes are the majority of machines in a Hadoop cluster and are responsible for storing the data and processing the computation.
Figure 2.10: Slaves
ii. Each slave runs both a DataNode and a Task Tracker daemon, which communicate with their masters.
iii. The Task Tracker daemon is a slave to the Job Tracker, and the DataNode daemon is a slave to the NameNode (a small client-side sketch follows).
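As a client-side illustration of this master/slave split, the sketch below uses the third-party hdfs package (a WebHDFS client); the NameNode address and paths are illustrative assumptions. The client asks the NameNode for metadata, while the file bytes themselves are served by the DataNodes.

    # A minimal WebHDFS sketch (pip install hdfs); host and paths are placeholders.
    from hdfs import InsecureClient

    client = InsecureClient("http://namenode:9870")

    # Listing a directory is a metadata operation answered by the NameNode.
    print(client.list("/user/demo"))

    # Reading a file streams its blocks from the DataNodes that hold them.
    with client.read("/user/demo/log.txt") as reader:
        data = reader.read()
    print(data[:100])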
I. Why NoSQL?
1. It offers high performance with high availability, a rich query language, and easy scalability.
2. NoSQL is gaining momentum and is supported by Hadoop, MongoDB, and others.
3. The NoSQL Database site is a good reference for someone looking for more
information.
Figure 3.1: Web application Data Growth

NoSQL is:
Not only SQL: the community takes an inclusive attitude about NoSQL and recognizes SQL solutions as viable options. To the NoSQL community, NoSQL means "not only SQL."
Not about the SQL language: the definition of NoSQL is not "an application that uses a language other than SQL." SQL as well as other query languages are used with NoSQL databases.
Not only open source: although many NoSQL systems have an open source model, commercial products use NoSQL concepts as well as open source initiatives. You can still have an innovative approach to problem solving with a commercial product.
Not only Big Data: many, but not all, NoSQL applications are driven by the inability of a current application to scale efficiently when Big Data is an issue. While volume
Graph store: for relationship-intensive problems.
Products: Neo4J, AllegroGraph (RDF store), InfiniteGraph (Objectivity).
Typical use cases: social networks, fraud detection, relationship-heavy data, big data.
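As a minimal sketch of the graph-store idea, the snippet below uses the official neo4j Python driver against an assumed local server; the URI, credentials, and node properties are illustrative placeholders. Relationships are stored as first-class data, which is what makes relationship-heavy queries natural.

    # A minimal graph sketch (pip install neo4j); URI/credentials are placeholders.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    with driver.session() as session:
        # Create two people and a KNOWS relationship between them.
        session.run(
            "CREATE (a:Person {name: 'Alice'})-[:KNOWS]->(b:Person {name: 'Bob'})"
        )
        # Traverse the relationship directly, with no join reconstruction.
        result = session.run(
            "MATCH (:Person {name: 'Alice'})-[:KNOWS]->(friend) RETURN friend.name"
        )
        for record in result:
            print(record["friend.name"])

    driver.close()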
CONCLUSION
Big Data Analytics is a transformative field addressing the exponential growth and
complexity of data. Its core lies in the three Vs: Volume (vast data storage in formats
like text, video, and images), Velocity (real-time data processing), and Variety (diverse
data forms, from structured to unstructured). Traditional data approaches are
inadequate, leading to frameworks like Hadoop, which offers scalable, fault-tolerant
distributed systems through components like HDFS and MapReduce. NoSQL
databases complement this ecosystem by providing schema-less, high-performance
solutions for varied data. Technologies like Apache Pig, Hive, and Spark simplify data
querying and analysis, while tools like Kafka and Flume handle real-time data
ingestion. Advanced solutions such as Tachyon and GridGain enhance memory-centric
performance, and innovative NoSQL models (e.g., MongoDB, Cassandra) tackle
specific data needs. Overall, Big Data Analytics integrates storage, computation, and
machine learning to uncover actionable insights, drive efficiency, and support
decision-making in dynamic environments.
3. Summary: This work presents the Big Data Value (BDV) Reference Model, serving as a
common framework to position big data technologies within the overall IT
infrastructure.
4. **"Big Data Analytics in E-commerce: A Systematic Review and Agenda for Future
Research"**
5. Authors: Shahriar Akter, Samuel Fosso Wamba
6. Source: Electronic Markets
7. Summary: This systematic review explores the role of big data analytics in e-commerce,
highlighting current applications and proposing directions for future research.