BDA-U2
Introducing Technologies for Handling Big Data: Distributed and Parallel Computing for Big
Data, Introducing Hadoop, Cloud Computing and Big Data, In‐Memory Computing
Technology for Big Data, Understanding Hadoop Ecosystem.
HADOOP
Hadoop is a distributed system, comparable in spirit to a distributed database.
Hadoop is a ‘software library’ that allows its users to process large datasets across distributed
clusters of computers, thereby enabling them to gather, store and analyze huge sets of data.
It provides various tools and technologies, collectively termed the Hadoop Ecosystem.
A Hadoop cluster consists of a single Master Node and multiple Worker Nodes.
Cloud computing and big data are two closely related technologies that complement each other to
drive innovation, efficiency, and scalability in data-driven applications. As organizations generate
and process massive volumes of data, cloud computing provides the infrastructure and tools to
handle big data in a scalable, cost-effective, and flexible manner. Here’s a detailed overview of
how cloud computing supports big data:
● Elastic Resources: Cloud platforms provide the ability to scale resources up or down based
on demand. This is crucial for big data workloads, which can be highly variable in size. Whether
it's data storage or processing power, cloud infrastructure can dynamically adjust to handle large
datasets without the need for physical hardware changes.
● Pay-as-You-Go Model: The cloud's pricing model allows companies to pay only for the
resources they use, making it more economical to manage big data projects, which can have
unpredictable workloads and data growth patterns.
● Distributed Storage Systems: Cloud platforms offer distributed storage solutions that are
essential for big data. Systems like Amazon S3, Google Cloud Storage, and Azure Blob Storage
provide scalable storage that can handle petabytes or even exabytes of data.
● Data Lakes: Many organizations build data lakes in the cloud. A data lake is a central
repository that allows organizations to store all structured and unstructured data at scale, making it
easier to perform advanced analytics or machine learning tasks (a short storage sketch appears after this list).
● Cloud-Native Big Data Tools: Cloud computing platforms provide a range of big data
processing tools that leverage distributed architectures. For example:
○ Amazon EMR (Elastic MapReduce): A managed Hadoop and Spark service for
processing vast amounts of data.
○ Google BigQuery: A serverless, fully managed data warehouse that allows for fast SQL
queries on large datasets.
○ Azure Synapse Analytics: A cloud analytics service that combines big data and data
warehousing.
● Stream Processing: The cloud supports real-time data processing tools, such as Amazon
Kinesis or Azure Stream Analytics, to handle streaming data from sources like IoT devices, social
media, and transaction logs.
● Advanced Analytics: Cloud platforms provide a wide range of tools for performing
advanced data analytics on big data. Services like AWS Redshift, Google BigQuery, and Azure
Data Lake Analytics allow users to query massive datasets quickly.
● Machine Learning Services: Many cloud providers offer machine learning platforms that
can process big data and build predictive models. For instance, AWS SageMaker, Google AI
Platform, and Azure Machine Learning allow data scientists and developers to train machine
learning models on large datasets without managing the underlying infrastructure.
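To make the distributed storage and data lake bullets above more concrete, here is a minimal Python sketch of landing one raw file in a cloud data lake on Amazon S3 using the boto3 SDK. The bucket name, key layout and file are hypothetical, and AWS credentials are assumed to be configured in the environment; this is an illustrative sketch, not a complete ingestion pipeline.

import boto3

# Hypothetical names: replace with your own data-lake bucket and key layout.
BUCKET = "example-data-lake-bucket"
KEY = "raw/sales/2024/sales_2024.csv"

def upload_to_data_lake(local_path: str) -> None:
    """Upload one raw file into the 'raw' zone of a cloud data lake (S3)."""
    s3 = boto3.client("s3")           # credentials are read from the environment
    s3.upload_file(local_path, BUCKET, KEY)
    print(f"Uploaded {local_path} to s3://{BUCKET}/{KEY}")

if __name__ == "__main__":
    upload_to_data_lake("sales_2024.csv")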
Security and Compliance
● Security Features: Cloud providers invest heavily in security, offering features like data
encryption (in-transit and at rest), multi-factor authentication, and identity management. These are
crucial for big data applications, especially in industries with strict regulatory requirements.
● Compliance: Many cloud providers offer compliance certifications, such as GDPR,
HIPAA, and SOC 2, making it easier for organizations to ensure that their big data applications
meet legal and regulatory standards.
● Global Accessibility: Cloud computing enables distributed teams to access and work with
big data from anywhere in the world. This global accessibility allows for real-time collaboration on
data projects, improving productivity and innovation.
● Data Sharing: Cloud environments support secure data sharing across different teams,
departments, or even with external partners. This facilitates collaborative analytics and
decision-making.
Availability and Disaster Recovery
● High Availability: Cloud providers offer robust disaster recovery options, ensuring that big
data is available even in the event of hardware failures or natural disasters. Cloud regions and
availability zones provide redundancy, ensuring that data is backed up and recoverable.
● Data Replication: Big data stored in the cloud can be replicated across multiple locations
to protect against data loss, making cloud-based solutions more resilient than traditional
on-premises systems.
Major cloud platforms for big data include:
● Amazon Web Services (AWS): Offers services like S3, EMR, Redshift, and SageMaker to
support big data storage, processing, and machine learning.
● Google Cloud Platform (GCP): Includes BigQuery, Dataflow, Pub/Sub, and AI tools
designed for big data analytics and real-time processing.
● Microsoft Azure: Provides services like Azure Data Lake, Synapse Analytics, and Azure
ML to cater to big data and advanced analytics workloads.
In-memory computing (IMC) is a transformative technology that enables faster data processing by
storing data directly in the main memory (RAM) instead of traditional disk-based storage. This
technology is particularly beneficial for big data applications, which require handling vast amounts
of data quickly and efficiently. Below are some key aspects of in-memory computing technology as
it relates to big data:
Data Analytics
● Fast Analytics: In-memory computing is ideal for big data analytics platforms, where
massive datasets are processed and analyzed for insights. Data can be quickly aggregated,
filtered, and queried without the traditional bottlenecks of disk I/O.
● Machine Learning: Many machine learning models require rapid access to data for
training. In-memory computing provides the speed needed for real-time model training and
updating.
Data Caching
● In-Memory Data Grids: These systems cache frequently accessed data in memory,
reducing the need to read from disk and speeding up response times. Solutions like Redis or
Memcached are widely used for this purpose.
● Session and Query Caching: For big data environments, in-memory caching can store
session information, query results, or interim computations, improving application
performance.
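As a small illustration of the caching idea described above, the following Python sketch caches a query result in Redis with a time-to-live, so repeated requests are served from memory instead of being recomputed from disk. The cache key, the 300-second TTL and the placeholder query are assumptions for this example, and a Redis server is assumed to be running on localhost.

import json
import redis   # redis-py client (assumed installed); Redis server on localhost

r = redis.Redis(host="localhost", port=6379, db=0)

def top_departments():
    """Return a (possibly expensive) query result, using Redis as a cache."""
    cache_key = "query:top_departments"           # hypothetical cache key
    cached = r.get(cache_key)
    if cached is not None:                        # cache hit: served from memory
        return json.loads(cached)
    result = expensive_disk_query()               # placeholder for the real query
    r.setex(cache_key, 300, json.dumps(result))   # cache the result for 300 seconds
    return result

def expensive_disk_query():
    # Stand-in for a slow scan over disk-resident data.
    return [{"department": "CSE", "students": 120}]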
Challenges
● Cost: RAM is more expensive than traditional storage, and the cost of scaling in-memory
systems can be prohibitive for some organizations.
● Data Persistence: RAM is volatile, meaning data is lost when the system is powered down.
To mitigate this, hybrid solutions that combine in-memory processing with disk-based
backups are often used.
● Complexity: Implementing and managing in-memory systems, particularly in distributed
environments, adds complexity to the architecture.
In-memory computing is playing a critical role in enabling organizations to handle the massive
scale and speed requirements of big data in today's information-driven world.
HADOOP ECOSYSTEM
Below are the Hadoop components that together form the Hadoop ecosystem.
HDFS
Hadoop Distributed File System (HDFS) is the core component, or you can say the backbone, of the
Hadoop Ecosystem.
HDFS is the component that makes it possible to store different types of large data sets (i.e.
structured, unstructured and semi-structured data).
HDFS creates a level of abstraction over the underlying resources, so that the whole of
HDFS appears as a single unit.
It helps us in storing our data across various nodes and maintaining the log file about the
stored data (metadata).
HDFS has two core components, i.e. NameNode and DataNode.
1. The NameNode is the main node and it doesn’t store the actual data. It contains
metadata, just like a log file or a table of contents. Therefore, it requires less
storage but high computational resources.
2. On the other hand, all your data is stored on the DataNodes and hence it requires
more storage resources. These DataNodes are commodity hardware (like your
laptops and desktops) in the distributed environment. That’s the reason, why
Hadoop solutions are very cost effective.
3. You always communicate with the NameNode while writing data. The NameNode then
responds to the client with the DataNodes on which the data should be stored and
replicated.
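As a hedged illustration of this client/NameNode/DataNode interaction, the sketch below uses the third-party hdfs Python package, which talks to the NameNode over WebHDFS. The NameNode address, user name, file path and sample data are assumptions for the example.

from hdfs import InsecureClient   # third-party 'hdfs' package (WebHDFS client)

# Hypothetical NameNode WebHDFS address and user.
client = InsecureClient("http://namenode-host:9870", user="hadoop")

# Write a small file; HDFS splits the data into blocks and replicates them on DataNodes.
client.write("/user/hadoop/students.csv",
             data=b"1,Asha,CSE\n2,Ravi,ECE\n",
             overwrite=True)

# List the directory and read the file back (metadata comes from the NameNode,
# the actual bytes are streamed from the DataNodes).
print(client.list("/user/hadoop"))
with client.read("/user/hadoop/students.csv") as reader:
    print(reader.read())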
YARN
Consider YARN as the brain of your Hadoop Ecosystem. It performs all your processing activities
by allocating resources and scheduling tasks.
It has two major components, i.e. Resource Manager and Node Manager.
1. Resource Manager is again a main node in the processing department.
2. It receives the processing requests, and then passes the parts of requests to
corresponding Node Managers accordingly, where the actual processing takes
place.
3. Node Managers are installed on every DataNode and are responsible for the
execution of tasks on that DataNode.
1. The ResourceManager has two components: the Scheduler and the Applications Manager.
2. Scheduler: Based on your application's resource requirements, the Scheduler runs
scheduling algorithms and allocates the resources.
3. Applications Manager: The Applications Manager accepts job submissions, negotiates the
containers (i.e. the DataNode environments where processes execute) for executing the
application-specific ApplicationMaster, and monitors its progress. ApplicationMasters are
daemons which reside on DataNodes and communicate with containers for the execution of
tasks on each DataNode.
MAPREDUCE
MapReduce is Hadoop's programming model for parallel processing: a Map function processes input
records and emits key-value pairs, and a Reduce function aggregates all values that share a key.
We have a sample case of students and their respective departments, and we want to calculate the
number of students in each department. Initially, the Map program executes and emits a key-value
pair of the form (department, 1) for each student record. These key-value pairs are the input to
the Reduce function. The Reduce function then aggregates the pairs for each department, calculates
the total number of students in each department, and produces the final result.
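The student-count example can be sketched in plain Python to show the shape of the Map and Reduce functions. This is a local simulation of the idea only, not Hadoop's Java MapReduce API, and the sample records are made up.

from collections import defaultdict

# Sample (student, department) records -- made-up data for illustration.
records = [("Asha", "CSE"), ("Ravi", "ECE"), ("Meena", "CSE"), ("John", "MECH")]

def map_phase(records):
    """Map: emit a (department, 1) key-value pair for every student record."""
    for _student, department in records:
        yield (department, 1)

def reduce_phase(pairs):
    """Reduce: aggregate the counts for each department key."""
    totals = defaultdict(int)
    for department, count in pairs:
        totals[department] += count
    return dict(totals)

print(reduce_phase(map_phase(records)))   # {'CSE': 2, 'ECE': 1, 'MECH': 1}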
APACHE PIG
Pig has two parts: Pig Latin, the language, and the Pig runtime, the execution
environment. You can think of them like Java and the JVM.
It supports the Pig Latin language, which has a SQL-like command structure.
But don't be shocked when I say that at the back end of a Pig job, a MapReduce job executes.
The compiler internally converts Pig Latin to MapReduce. It produces a sequential set of
MapReduce jobs, and that is an abstraction (which works like a black box).
PIG was initially developed by Yahoo.
It gives you a platform for building data flow for ETL (Extract, Transform and Load),
processing and analyzing huge data sets.
In Pig, the load command first loads the data. Then we perform various functions on it like
grouping, filtering, joining, sorting, etc. At last, you can either dump the data on the screen
or store the result back in HDFS.
APACHE HIVE
Facebook created HIVE for people who are fluent with SQL. Thus, HIVE makes them
feel at home while working in a Hadoop Ecosystem.
Basically, HIVE is a data warehousing component which performs reading, writing and
managing large data sets in a distributed environment using SQL-like interface.
The query language of Hive is called Hive Query Language (HQL), which is very similar
to SQL.
It has 2 basic components: Hive Command Line and JDBC/ODBC driver.
The Hive Command line interface is used to execute HQL commands.
While the Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC)
drivers are used to establish connections to Hive from external applications and data stores.
Secondly, Hive is highly scalable, as it can serve both purposes: large data set
processing (i.e. batch query processing) and real-time processing (i.e. interactive query
processing).
It supports all primitive data types of SQL.
You can use predefined functions, or write tailored user-defined functions (UDFs), to
accomplish your specific needs.
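As a hedged illustration of querying Hive programmatically, the sketch below uses the third-party PyHive package, which connects to HiveServer2. The host, port, user, database and table names are assumptions for the example.

from pyhive import hive   # third-party PyHive package (connects to HiveServer2)

# Hypothetical HiveServer2 endpoint and table.
conn = hive.Connection(host="hiveserver2-host", port=10000,
                       username="hadoop", database="default")
cursor = conn.cursor()

# HQL looks very much like SQL: count students per department.
cursor.execute("SELECT department, COUNT(*) FROM students GROUP BY department")
for department, total in cursor.fetchall():
    print(department, total)

cursor.close()
conn.close()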
APACHE MAHOUT
Now, let us talk about Mahout which is renowned for machine learning. Mahout provides an
environment for creating machine learning applications which are scalable.
Machine learning algorithms allow us to build self-learning machines that evolve by themselves
without being explicitly programmed. Based on user behaviour, data patterns and past
experiences, they make important future decisions. You can call it a branch of Artificial
Intelligence (AI).
It performs collaborative filtering, clustering and classification. Some people also consider
frequent itemset mining as one of Mahout's functions. Let us understand them individually:
1. Collaborative filtering: Mahout mines user behaviours, their patterns and their
characteristics, and based on these it predicts and makes recommendations to the users. The
typical use case is an e-commerce website.
2. Clustering: It organizes a similar group of data together like articles can contain blogs,
news, research papers etc.
3. Classification: It means classifying and categorizing data into various sub- departments
like articles can be categorized into blogs, news, essay, research papers and other
categories.
4. Frequent itemset mining: Here Mahout checks which objects are likely to appear
together and makes suggestions if one of them is missing. For example, a cell phone
and a cover are generally bought together. So, if you search for a cell phone, it will also
recommend the cover and cases.
Mahout provides a command line to invoke various algorithms. It has a predefined set of libraries
which already contain different inbuilt algorithms for different use cases.
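To make the frequent itemset idea concrete, here is a small plain-Python sketch that counts which pairs of items appear together in past purchases and suggests the most frequent partner of an item. This only illustrates the concept, not Mahout's actual API, and the transactions are made up.

from collections import Counter
from itertools import combinations

# Made-up purchase transactions for illustration.
transactions = [
    {"cell phone", "cover", "screen guard"},
    {"cell phone", "cover"},
    {"laptop", "mouse"},
    {"cell phone", "earphones", "cover"},
]

# Count how often each unordered pair of items is bought together.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

def suggest(item):
    """Suggest the item most frequently bought together with 'item'."""
    partners = Counter()
    for (a, b), count in pair_counts.items():
        if item == a:
            partners[b] += count
        elif item == b:
            partners[a] += count
    return partners.most_common(1)

print(suggest("cell phone"))   # [('cover', 3)]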
APACHE SPARK
Apache Spark is a framework for real time data analytics in a distributed computing
environment.
Spark is written in Scala and was originally developed at the University of
California, Berkeley.
It executes in-memory computations to increase speed of data processing over Map-
Reduce.
It can be up to 100x faster than Hadoop MapReduce for large-scale data processing by exploiting
in-memory computations and other optimizations. Therefore, it requires more memory and
processing power than MapReduce.
Spark comes packed with high-level libraries, including support for R, SQL, Python, Scala,
Java, etc. These standard libraries enable seamless integration into complex workflows. On top of
this, it also ships with built-in libraries like MLlib, GraphX, SQL + DataFrames and Spark
Streaming that further increase its capabilities.
Apache Spark is best suited for real-time processing, whereas Hadoop was designed to store
unstructured data and execute batch processing over it. When we combine Apache Spark's
abilities, i.e. high processing speed, advanced analytics and multiple integration support, with
Hadoop's low-cost operation on commodity hardware, we get the best results.
That is the reason why, Spark and Hadoop are used together by many companies for processing
and analyzing their Big Data stored in HDFS.
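A minimal PySpark sketch of the same department-count job, assuming Spark is installed and a students.csv file with a 'department' column exists at the given HDFS path; both the path and the column name are assumptions for illustration.

from pyspark.sql import SparkSession

# Start (or reuse) a Spark session.
spark = SparkSession.builder.appName("StudentCounts").getOrCreate()

# Hypothetical input: a CSV in HDFS with a header row containing 'department'.
students = spark.read.csv("hdfs:///data/students.csv", header=True)

# In-memory, distributed aggregation: number of students per department.
counts = students.groupBy("department").count()
counts.show()

spark.stop()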
APACHE HBASE
HBase is an open-source, non-relational, distributed database that runs on top of HDFS and
supports fast, random read/write access to very large tables. For better understanding, let us take
an example: you have billions of customer emails and you need to find out the number of customers
who have used the word 'complaint' in their emails. The request needs to be processed quickly
(i.e. in real time). So, here we are handling a large data set while retrieving only a small amount
of data. HBase was designed to solve these kinds of problems.
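As a hedged sketch of how such a lookup might be issued from Python, the example below uses the third-party happybase package, which talks to HBase through its Thrift gateway. The table name, column family and host are assumptions, and a real billion-row search would rely on proper row-key design or indexing rather than a full scan.

import happybase   # third-party happybase package; needs the HBase Thrift server

# Hypothetical Thrift gateway and table layout.
connection = happybase.Connection("hbase-thrift-host")
emails = connection.table("customer_emails")

# Scan the (hypothetical) 'msg:body' column and count emails mentioning "complaint".
complaints = 0
for row_key, data in emails.scan(columns=[b"msg:body"]):
    body = data.get(b"msg:body", b"").decode("utf-8", errors="ignore")
    if "complaint" in body.lower():
        complaints += 1

print("Emails mentioning 'complaint':", complaints)
connection.close()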
APACHE DRILL
Apache Drill is used to drill into any kind of data. It is an open-source application which works
in a distributed environment to analyze large data sets.
So, basically, the main aim behind Apache Drill is to provide scalability so that we can process
petabytes and exabytes of data efficiently (or, you can say, in minutes).
The main power of Apache Drill lies in combining a variety of data stores just by using
a single query.
APACHE ZOOKEEPER
Apache Zookeeper is the coordinator of any Hadoop job, which includes a combination of
various services in the Hadoop Ecosystem.
Apache Zookeeper coordinates with various services in a distributed environment.
Before Zookeeper, it was very difficult and time consuming to coordinate between the different
services in the Hadoop Ecosystem. The services earlier had many problems with interactions, such
as maintaining a common configuration while synchronizing data. Even when the services were
configured, changes in their configurations made the system complex and difficult to handle.
Grouping and naming were also time-consuming.
Due to the above problems, Zookeeper was introduced. It saves a lot of time by performing
synchronization, configuration maintenance, grouping and naming.
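A minimal sketch of the kind of shared-configuration coordination ZooKeeper provides, using the third-party kazoo Python client; the znode path, value and ZooKeeper address are assumptions for the example.

from kazoo.client import KazooClient   # third-party kazoo ZooKeeper client

# Hypothetical ZooKeeper ensemble address.
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Store a piece of shared configuration that every service in the cluster can read.
zk.ensure_path("/app/config")
if not zk.exists("/app/config/db_url"):
    zk.create("/app/config/db_url", b"jdbc:mysql://db-host:3306/analytics")

# Any other service can now read (and watch) the same configuration znode.
value, stat = zk.get("/app/config/db_url")
print(value.decode(), "version:", stat.version)

zk.stop()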
APACHE OOZIE
Consider Apache Oozie as a clock and alarm service inside the Hadoop Ecosystem. For Apache
jobs, Oozie works as a scheduler: it schedules Hadoop jobs and binds them together as
one logical unit of work.
1. Oozie workflow: This is a sequential set of actions to be executed. You can think of it as
a relay race, where each athlete waits for the previous one to complete their part.
2. Oozie Coordinator: These are the Oozie jobs which are triggered when the data is made
available to it. Think of this as the response-stimuli system in our body. In the same
manner as we respond to an external stimulus, an Oozie coordinator responds to the
availability of data and it rests otherwise.
APACHE FLUME
Flume is a service which helps in ingesting unstructured and semi-structured data
into HDFS.
It gives us a solution which is reliable and distributed, and helps us in collecting,
aggregating and moving large amounts of data.
It helps us ingest online streaming data from various sources like network traffic,
social media, email messages, log files, etc. into HDFS.
Now, let us understand the architecture of Flume.
There is a Flume agent which ingests the streaming data from various data sources into HDFS.
The web server acts as the data source; Twitter is among the famous sources of streaming data.
A Flume agent has three components:
1. Source: it accepts the data from the incoming streamline and stores the data in the
channel.
2. Channel: it acts as the local storage or the primary storage. A Channel is a temporary
storage between the source of data and persistent data in the HDFS.
3. Sink: Then, our last component i.e. Sink, collects the data from the channel and commits
or writes the data in the HDFS permanently.
APACHE SQOOP
Sqoop is a tool for transferring bulk data between HDFS and structured data stores such as
relational databases. When we submit a Sqoop export job, it is mapped into Map Tasks, each of
which brings a chunk of data from HDFS. These chunks are exported to a structured data
destination. Combining all these exported chunks of data, we receive the whole data set at the
destination, which in most cases is an RDBMS (MySQL/Oracle/SQL Server).
APACHE SOLR & LUCENE
Apache Solr and Apache Lucene are the two services which are used for searching and indexing
in Hadoop Ecosystem.
Solr uses the Lucene Java search library as its core for search and full-text indexing.
APACHE AMBARI
Ambari is an Apache Software Foundation project which aims at making the Hadoop ecosystem more
manageable.
It includes software for provisioning, managing and monitoring Apache Hadoop clusters.
The major ecosystem components, with the year each appeared, are summarized below:
Hadoop HDFS - 2007 - A distributed file system for reliably storing huge amounts of
unstructured, semi-structured and structured data in the form of files.
Hadoop MapReduce - 2007 - A distributed algorithm framework for the parallel processing of
large datasets on HDFS filesystem. It runs on Hadoop cluster but also supports other database
formats like Cassandra and HBase.
Cassandra - 2008 - A key-value pair NoSQL database, with column family data representation and
asynchronous masterless replication.
HBase - 2008 - A key-value pair NoSQL database, with column family data representation,
with master-slave replication. It uses HDFS as underlying storage.
Zookeeper - 2008 - A distributed coordination service for distributed applications. It is based on
a Paxos algorithm variant called Zab.
Pig - 2009 - Pig is a scripting interface over MapReduce for developers who prefer scripting
interface over native Java MapReduce programming.
Hive - 2009 - Hive is a SQL interface over MapReduce for developers and analysts who prefer SQL
interface over native Java MapReduce programming.
Mahout - 2009 - A library of machine learning algorithms, implemented on top of MapReduce, for
finding meaningful patterns in HDFS datasets.
Sqoop - 2010 - A tool to import data from RDBMS/DataWarehouse into HDFS/HBase and export
back.
YARN - 2011 - A system to schedule applications and services on an HDFS cluster and manage the
cluster resources like memory and CPU.
Flume - 2011 - A tool to collect, aggregate, reliably move and ingest large amounts of data into
HDFS.
Storm - 2011 - A system to process high-velocity streaming data with 'at least once' message
semantics.
Spark - 2012 - An in-memory data processing engine that can run a DAG of operations. It
provides libraries for Machine Learning, SQL interface and near real-time Stream Processing.
Kafka - 2012 - A distributed messaging system with partitioned topics for very high scalability.
Solr Cloud - 2012 - A distributed search engine with a REST-like interface for full-text search. It
uses Lucene library for data indexing.
HADOOP ENVIRONMENT
Hadoop is changing the perception of handling Big Data, especially unstructured data. Let us see how
the Apache Hadoop software library, which is a framework, plays a vital role in handling Big Data.
Apache Hadoop enables surplus data to be streamlined for any distributed processing system across
clusters of computers using simple programming models. It is truly made to scale up from single
servers to many machines, each offering local computation and storage.
Instead of depending on hardware to provide high availability, the library itself is built to detect and
handle breakdowns at the application layer, thus providing a highly available service on top of a
cluster of computers, each of which may be vulnerable to failures.
Hadoop Community Package Consists of
●File system and OS level abstractions
●A MapReduce engine (either MapReduce or YARN)
●The Hadoop Distributed File System (HDFS)
●Java ARchive (JAR) files
●Scripts needed to start Hadoop
●Source code, documentation and a contribution section