
UNIT II

Introducing Technologies for Handling Big Data: Distributed and Parallel Computing for Big
Data, Introducing Hadoop, Cloud Computing and Big Data, In‐Memory Computing
Technology for Big Data, Understanding Hadoop Ecosystem.

Distributed and Parallel Computing for Big Data:


Among the technologies used to handle, process, and analyze big data, the most popular and effective innovations have been distributed and parallel processing, Hadoop, in-memory computing, and the big data cloud. Hadoop is the most popular of these; organizations use it to extract maximum value from their data at a rapid pace, while cloud computing helps companies save cost and manage resources better. Big data cannot be handled by traditional data storage and processing systems. For handling such data, distributed and parallel technologies are more suitable: multiple computing resources are connected in a network, and computing tasks are distributed across these resources. This:
● Increases the speed of processing.
● Increases efficiency.
● Is more suitable for processing huge amounts of data in a limited time.
Parallel Computing: Parallel computing also improves the processing capability of a computer system by adding additional computational resources to it. Complex computations are divided into subtasks, which are handled individually by processing units running in parallel. The underlying concept is that processing capability increases with the level of parallelism, as the sketch below illustrates.
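The following is a minimal, self-contained Java sketch (not part of the original text) of the divide-into-subtasks idea: a large sum is split into chunks that worker threads process in parallel. The array size, chunk scheme, and thread count are illustrative assumptions.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSum {
    public static void main(String[] args) throws Exception {
        long[] data = new long[10_000_000];
        for (int i = 0; i < data.length; i++) data[i] = i;

        int workers = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        // 1) Break up the given task into subtasks (one chunk per worker).
        int chunk = data.length / workers;
        List<Future<Long>> parts = new ArrayList<>();
        for (int w = 0; w < workers; w++) {
            final int start = w * chunk;
            final int end = (w == workers - 1) ? data.length : start + chunk;
            // 2) Assign each subtask to a processing unit running in parallel.
            parts.add(pool.submit(() -> {
                long sum = 0;
                for (int i = start; i < end; i++) sum += data[i];
                return sum;
            }));
        }

        // 3) Combine the partial results into the final answer.
        long total = 0;
        for (Future<Long> part : parts) total += part.get();
        pool.shutdown();
        System.out.println("Total = " + total);
    }
}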

BIG DATA PROCESSING TECHNIQUES:


As data volumes increase, organizations are forced to adopt a data analysis strategy that can analyze the entire dataset in a very short time. This is achieved with powerful hardware components and new software programs. The procedure followed by the software applications is:
1) Break up the given task.
2) Survey the available resources.
3) Assign the subtasks to the nodes.
If a resource develops a technical problem and fails to respond, its processing and analytical tasks are delegated to other resources.
Latency: the aggregate delay in the system caused by delays in the completion of individual tasks. Such system delay also affects data management and communication, reducing the productivity and profitability of an organization.
PARALLEL COMPUTING TECHNIQUES:
1) Cluster or Grid Computing
• Primarily used in Hadoop.
• Based on connecting multiple servers in a network (clusters); the servers share the workload among them.
• The overall cost may be very high.
2) Massively Parallel Processing (MPP)
• Used in data warehouses; a single machine working as a grid is used in the MPP platform.
• Capable of handling storage, memory and computing activities.
• Software written specifically for the MPP platform is used for optimization.
• MPP platforms such as EMC Greenplum and ParAccel are suited for high-value use cases.
3) High Performance Computing (HPC)
• Offers high performance and scalability by using in-memory computing (IMC).
• Suitable for processing floating-point data at high speeds.
• Used in research and business organizations where the result is more valuable than the cost, or where the strategic importance of the project is of high priority.

HADOOP
Hadoop is a distributed system, much like a distributed database.
Hadoop is a 'software library' that allows its users to process large datasets across distributed clusters of computers, thereby enabling them to gather, store and analyze huge sets of data.
It provides various tools and technologies, collectively termed the Hadoop Ecosystem.
A Hadoop cluster consists of a single master node and multiple worker nodes.

Cloud computing and big data

Cloud computing and big data are two closely related technologies that complement each other to
drive innovation, efficiency, and scalability in data-driven applications. As organizations generate
and process massive volumes of data, cloud computing provides the infrastructure and tools to
handle big data in a scalable, cost-effective, and flexible manner. Here’s a detailed overview of
how cloud computing supports big data:

1. Scalability and Flexibility

● Elastic Resources: Cloud platforms provide the ability to scale resources up or down based
on demand. This is crucial for big data workloads, which can be highly variable in size. Whether
it's data storage or processing power, cloud infrastructure can dynamically adjust to handle large
datasets without the need for physical hardware changes.
● Pay-as-You-Go Model: The cloud's pricing model allows companies to pay only for the
resources they use, making it more economical to manage big data projects, which can have
unpredictable workloads and data growth patterns.

2. Storage for Big Data

● Distributed Storage Systems: Cloud platforms offer distributed storage solutions that are
essential for big data. Systems like Amazon S3, Google Cloud Storage, and Azure Blob Storage
provide scalable storage that can handle petabytes or even exabytes of data.
● Data Lakes: Many organizations build data lakes in the cloud. A data lake is a central
repository that allows organizations to store all structured and unstructured data at scale, making it
easier to perform advanced analytics or machine learning tasks.

3. Big Data Processing

● Cloud-Native Big Data Tools: Cloud computing platforms provide a range of big data
processing tools that leverage distributed architectures. For example:
○ Amazon EMR (Elastic MapReduce): A cloud-native implementation of Hadoop and
Spark for processing vast amounts of data.
○ Google BigQuery: A serverless, fully managed data warehouse that allows for fast SQL
queries on large datasets.
○ Azure Synapse Analytics: A cloud analytics service that combines big data and data
warehousing.
● Stream Processing: The cloud supports real-time data processing tools, such as Amazon
Kinesis or Azure Stream Analytics, to handle streaming data from sources like IoT devices, social
media, and transaction logs.

4. Data Analytics and Machine Learning

● Advanced Analytics: Cloud platforms provide a wide range of tools for performing
advanced data analytics on big data. Services like AWS Redshift, Google BigQuery, and Azure
Data Lake Analytics allow users to query massive datasets quickly.
● Machine Learning Services: Many cloud providers offer machine learning platforms that
can process big data and build predictive models. For instance, AWS SageMaker, Google AI
Platform, and Azure Machine Learning allow data scientists and developers to train machine
learning models on large datasets without managing the underlying infrastructure.

5. Cost Efficiency

● No Upfront Investment: Traditional big data infrastructures often require a significant upfront investment in hardware, storage, and maintenance. In contrast, cloud computing allows organizations to avoid this capital expenditure, opting instead for an operational expense model where they pay only for what they use.
● Cost Control Mechanisms: Cloud providers offer tools that allow users to monitor and
optimize the cost of their big data operations. Auto-scaling, data lifecycle policies, and cold storage
options help manage costs as data grows over time.

6. Data Security and Compliance

● Security Features: Cloud providers invest heavily in security, offering features like data
encryption (in-transit and at rest), multi-factor authentication, and identity management. These are
crucial for big data applications, especially in industries with strict regulatory requirements.
● Compliance: Many cloud providers offer compliance certifications, such as GDPR,
HIPAA, and SOC 2, making it easier for organizations to ensure that their big data applications
meet legal and regulatory standards.

7. Collaboration and Accessibility

● Global Accessibility: Cloud computing enables distributed teams to access and work with
big data from anywhere in the world. This global accessibility allows for real-time collaboration on
data projects, improving productivity and innovation.
● Data Sharing: Cloud environments support secure data sharing across different teams,
departments, or even with external partners. This facilitates collaborative analytics and
decision-making.

8. Data Integration

● Hybrid and Multi-Cloud Architectures: Many organizations use hybrid or multi-cloud strategies to manage their big data. Hybrid cloud solutions combine on-premises infrastructure with cloud resources, allowing companies to keep sensitive data on-premises while leveraging cloud computing for analytics and processing.
● Data Integration Services: Cloud platforms offer tools for integrating data from multiple
sources, including databases, APIs, and IoT devices. Services like AWS Glue, Google Cloud
Dataflow, and Azure Data Factory enable the seamless extraction, transformation, and loading
(ETL) of big data.

9. Disaster Recovery and Data Backup

● High Availability: Cloud providers offer robust disaster recovery options, ensuring that big
data is available even in the event of hardware failures or natural disasters. Cloud regions and
availability zones provide redundancy, ensuring that data is backed up and recoverable.
● Data Replication: Big data stored in the cloud can be replicated across multiple locations
to protect against data loss, making cloud-based solutions more resilient than traditional
on-premises systems.

10. Examples of Cloud Providers for Big Data

● Amazon Web Services (AWS): Offers services like S3, EMR, Redshift, and SageMaker to
support big data storage, processing, and machine learning.
● Google Cloud Platform (GCP): Includes BigQuery, Dataflow, Pub/Sub, and AI tools
designed for big data analytics and real-time processing.
● Microsoft Azure: Provides services like Azure Data Lake, Synapse Analytics, and Azure
ML to cater to big data and advanced analytics workloads.

CLOUD DEPLOYMENT MODELS:


Depending upon the architecture used in forming the network, the services and applications used, and the target consumers, cloud services form various deployment models. They are:
Public Cloud
Private Cloud
Community Cloud
Hybrid Cloud
• Public Cloud (End-User Level Cloud)
- Owned and managed by a company other than the one using it.
- Administered by a third party.
- Eg: Verizon, Amazon Web Services, and Rackspace.
- The workload is categorized on the basis of service category; hardware customization is possible to provide optimized performance.
- The process of computing becomes very flexible and scalable through customized hardware resources.
- The primary concerns with a public cloud are security and latency.
• Private Cloud (Enterprise Level Cloud)
- Remains entirely in the ownership of the organization using it.
- Infrastructure is solely designed for a single organization.
- Can automate several processes and operations that require manual handling in a public cloud.
- Can also provide firewall protection to the cloud, solving latency and security concerns.
- A private cloud can be either on-premises or hosted externally. On-premises: the service is hosted and used exclusively by the organization itself. Hosted externally: the cloud is hosted by a third party but is used exclusively by a single organization and not shared with other organizations.
• Community Cloud
-Type of cloud that is shared among various organizations with a common tie.
-Managed by third party cloud services.
-Available on or off premises.
Eg.: In any state, a community cloud can be provided so that almost all government organizations of that state can share the resources available on the cloud. Because of this sharing of resources on the community cloud, the data of all citizens of that state can be easily managed by the government organizations.
• Hybrid Cloud
- Various internal or external service providers offer services to many organizations.
- In hybrid clouds, an organization can use both types of cloud, i.e. public and private, together, in situations such as cloud bursting: normally the organization uses its own computing infrastructure and accesses the public cloud only for high-load requirements.
- The organization using the hybrid cloud can manage an internal private cloud for general use and migrate the entire application, or part of it, to the public cloud during peak periods.

CLOUD SERVICES FOR BIG DATA


In big data, IaaS, PaaS, and SaaS clouds are used in the following manner.
IaaS: The huge storage and computational power requirements of big data are fulfilled by the limitless storage space and computing ability provided by the IaaS cloud.
PaaS: The offerings of various vendors have started adding popular big data platforms such as MapReduce and Hadoop. These offerings save organizations from a lot of the hassles that occur in managing individual hardware components and software applications.
SaaS: Various organizations need to identify and analyze the voice of the customer, particularly on social media. The social media data and platform are provided by SaaS vendors. In addition, a private cloud facilitates access to enterprise data, which enables these analyses.

IN MEMORY COMPUTING TECHNOLOGY

In-memory computing (IMC) is a transformative technology that enables faster data processing by
storing data directly in the main memory (RAM) instead of traditional disk-based storage. This
technology is particularly beneficial for big data applications, which require handling vast amounts
of data quickly and efficiently. Below are some key aspects of in-memory computing technology as
it relates to big data:

1. Speed and Performance

● Real-Time Processing: In-memory computing significantly accelerates the processing of data, enabling real-time analytics. With data stored in RAM, processing speeds increase by orders of magnitude compared to accessing data from disk-based systems.
● Low Latency: Applications that need to handle high-throughput and low-latency
transactions, such as financial trading platforms or IoT sensors, greatly benefit from IMC.

2. Scalability

● Horizontal Scalability: In-memory systems are designed to scale horizontally across distributed architectures. As data volumes increase, adding more RAM across multiple nodes helps maintain performance.
● Cluster Computing: Technologies like Apache Spark and Apache Ignite use in-memory
computing to create distributed in-memory processing clusters, enabling the handling of big
data workloads.

3. Data Analytics

● Fast Analytics: In-memory computing is ideal for big data analytics platforms, where
massive datasets are processed and analyzed for insights. Data can be quickly aggregated,
filtered, and queried without the traditional bottlenecks of disk I/O.
● Machine Learning: Many machine learning models require rapid access to data for
training. In-memory computing provides the speed needed for real-time model training and
updating.

4. Data Caching

● In-Memory Data Grids: These systems cache frequently accessed data in memory,
reducing the need to read from disk and speeding up response times. Solutions like Redis or
Memcached are widely used for this purpose.
● Session and Query Caching: For big data environments, in-memory caching can store session information, query results, or interim computations, improving application performance; a brief caching sketch follows this list.
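As a simple illustration of in-memory caching, the following Java sketch uses the Jedis client for Redis; the hostname, port, key name, TTL, and the helper runExpensiveQuery() are illustrative assumptions, not part of the original text.

import redis.clients.jedis.Jedis;

public class QueryCache {
    public static void main(String[] args) {
        // Connect to a Redis server assumed to be running locally on the default port.
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            String key = "report:daily-sales";

            // Check the in-memory cache first.
            String cached = jedis.get(key);
            if (cached != null) {
                System.out.println("Served from cache: " + cached);
                return;
            }

            // Cache miss: compute (or fetch from a disk-based store), then cache with a TTL.
            String result = runExpensiveQuery();   // hypothetical helper standing in for a slow query
            jedis.set(key, result);
            jedis.expire(key, 300);                // keep the result in memory for 5 minutes
            System.out.println("Computed and cached: " + result);
        }
    }

    private static String runExpensiveQuery() {
        return "total=42, region=all";             // placeholder result for illustration
    }
}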

5. Use Cases in Big Data

● Real-Time Analytics: In fields like finance, telecommunications, or e-commerce, real-time decision-making powered by in-memory computing can provide competitive advantages.
● Fraud Detection: High-speed detection of fraudulent transactions is enabled by in-memory
systems that can analyze transactions in real-time.
● Personalization and Recommendations: E-commerce and media companies use
in-memory computing to provide instant, personalized recommendations based on user
behavior.

6. Challenges

● Cost: RAM is more expensive than traditional storage, and the cost of scaling in-memory
systems can be prohibitive for some organizations.
● Data Persistence: RAM is volatile, meaning data is lost when the system is powered down.
To mitigate this, hybrid solutions that combine in-memory processing with disk-based
backups are often used.
● Complexity: Implementing and managing in-memory systems, particularly in distributed
environments, adds complexity to the architecture.

7. Technologies and Tools

● Apache Spark: A fast, general-purpose cluster-computing system with in-memory data processing capabilities.
● Apache Ignite: An in-memory computing platform that includes caching, distributed data
storage, and processing.
● SAP HANA: A high-performance in-memory database designed for analytics and
transactional processing.
● Redis: An in-memory key-value store often used for caching and real-time analytics.

In-memory computing is playing a critical role in enabling organizations to handle the massive
scale and speed requirements of big data in today's information-driven world.
HADOOP ECOSYSTEM

The Hadoop Ecosystem is neither a programming language nor a service; it is a platform or framework which solves big data problems. You can consider it as a suite which encompasses a number of services (ingesting, storing, analyzing and maintaining data) inside it.

Below are the Hadoop components that together form the Hadoop ecosystem:

● HDFS -> Hadoop Distributed File System
● YARN -> Yet Another Resource Negotiator
● MapReduce -> Data Processing using programming
● Spark -> In-memory Data Processing
● PIG, HIVE -> Data Processing Services using Query (SQL-like)
● HBase -> NoSQL Database
● Mahout, Spark MLlib -> Machine Learning
● Apache Drill -> SQL on Hadoop
● Zookeeper -> Managing Cluster
● Oozie -> Job Scheduling
● Flume, Sqoop -> Data Ingesting Services
● Solr & Lucene -> Searching & Indexing
● Ambari -> Provision, Monitor and Maintain cluster
HDFS

​ Hadoop Distributed File System is the core component or you can say, the backbone of
Hadoop Ecosystem.
​ HDFS is the one, which makes it possible to store different types of large data sets (i.e.
structured, unstructured and semi structured data).
​ HDFS creates a level of abstraction over the resources, from where we can see the whole
HDFS as a single unit.
​ It helps us in storing our data across various nodes and maintaining the log file about the
stored data (metadata).
​ HDFS has two core components, i.e. NameNode and DataNode.
1. The NameNode is the main node and it doesn’t store the actual data. It contains
metadata, just like a log file or you can say as a table of content. Therefore, it
requires less storage and high computational resources.
2. On the other hand, all your data is stored on the DataNodes and hence it requires
more storage resources. These DataNodes are commodity hardware (like your
laptops and desktops) in the distributed environment. That’s the reason, why
Hadoop solutions are very cost effective.
3. You always communicate with the NameNode while writing data. The NameNode then tells the client on which DataNodes the data should be stored and replicated; the sketch below illustrates this interaction through the FileSystem API.
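To make the client/NameNode/DataNode interaction concrete, here is a minimal Java sketch using the Hadoop FileSystem API; the NameNode URI and file path are illustrative assumptions. The client only talks to the file-system abstraction, while HDFS handles block placement and replication on the DataNodes behind the scenes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; in a real cluster this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/students.txt");

            // Write a file; HDFS splits it into blocks and replicates them on DataNodes.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeBytes("CSE,120\nECE,95\nMECH,80\n");
            }

            // Read back metadata that the NameNode maintains about the file.
            FileStatus status = fs.getFileStatus(file);
            System.out.println("Size: " + status.getLen()
                    + " bytes, replication: " + status.getReplication());
        }
    }
}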
YARN
Consider YARN as the brain of your Hadoop Ecosystem. It performs all your processing activities by allocating resources and scheduling tasks.

It has two major components, i.e. ResourceManager and NodeManager.
1. The ResourceManager is again a main node in the processing department.
2. It receives the processing requests and then passes the parts of the requests to the corresponding NodeManagers, where the actual processing takes place.
3. NodeManagers are installed on every DataNode and are responsible for the execution of tasks on every single DataNode.
The ResourceManager itself has two components: the Scheduler and the Applications Manager.
1. Scheduler: Based on your application's resource requirements, the Scheduler runs scheduling algorithms and allocates the resources.
2. Applications Manager: The Applications Manager accepts job submissions, negotiates the containers (i.e. the DataNode environment where a process executes) for executing the application-specific ApplicationMaster, and monitors its progress. ApplicationMasters are daemons which reside on DataNodes and communicate with containers for the execution of tasks on each DataNode.
A small client sketch interacting with the ResourceManager follows.
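As an illustration only (not from the original text), the following Java sketch uses the YarnClient API to ask the ResourceManager for the NodeManagers and applications it knows about; it assumes a reachable cluster whose addresses are configured in yarn-site.xml on the classpath.

import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnClusterInfo {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());   // reads yarn-site.xml from the classpath
        yarnClient.start();

        // NodeManagers currently registered with the ResourceManager.
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println("Node: " + node.getNodeId()
                    + ", capacity: " + node.getCapability());
        }

        // Applications currently known to the ResourceManager.
        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println("App: " + app.getName() + " -> " + app.getYarnApplicationState());
        }

        yarnClient.stop();
    }
}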

MAPREDUCE

It is the core component of processing in the Hadoop Ecosystem, as it provides the logic of processing. In other words, MapReduce is a software framework which helps in writing applications that process large data sets using distributed and parallel algorithms inside the Hadoop environment.

​ In a MapReduce program, Map() and Reduce() are two functions.


1. The Map function performs actions like filtering, grouping and sorting.
2. The Reduce function aggregates and summarizes the results produced by the Map function.
3. The result generated by the Map function is a key-value pair (K, V), which acts as the input for the Reduce function.
Let us take an example to better understand a MapReduce program. We have a sample case of students and their respective departments, and we want to calculate the number of students in each department. Initially, the Map program will execute and count the students appearing in each department, producing key-value pairs as mentioned above. These key-value pairs are the input to the Reduce function. The Reduce function will then aggregate each department, calculate the total number of students in each department, and produce the final result, as the sketch below shows.
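Below is a minimal, hedged Java sketch of this student-count job using the Hadoop MapReduce API. The input format (one "student,department" record per line), the paths, and the class names are illustrative assumptions, not code from the original text.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DeptStudentCount {

    // Map: emit (department, 1) for every "student,department" input line.
    public static class DeptMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text dept = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length == 2) {
                dept.set(fields[1].trim());
                context.write(dept, ONE);          // key-value pair (K, V)
            }
        }
    }

    // Reduce: sum the counts for each department.
    public static class DeptReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable v : values) total += v.get();
            context.write(key, new IntWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "dept student count");
        job.setJarByClass(DeptStudentCount.class);
        job.setMapperClass(DeptMapper.class);
        job.setCombinerClass(DeptReducer.class);
        job.setReducerClass(DeptReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /user/demo/students
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /user/demo/dept-counts
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}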
APACHE PIG

​ PIG has two parts: Pig Latin, the language and the pig runtime, for the execution
environment. You can better understand it as Java and JVM.
​ It supports pig latin language, which has SQL like command structure.

10 lines of Pig Latin = approx. 200 lines of MapReduce Java code

But don’t be shocked when I say that at the back end of Pig job, a map-reduce job executes.

​ The compiler internally converts pig latin to MapReduce. It produces a sequential set of
MapReduce jobs, and that’s an abstraction (which works like black box).
​ PIG was initially developed by Yahoo.

​ It gives you a platform for building data flow for ETL (Extract, Transform and Load),
processing and analyzing huge data sets.

How Pig works?

In Pig, the LOAD command first loads the data. Then we perform various functions on it like grouping, filtering, joining and sorting. At last, you can either dump the data on the screen or store the result back in HDFS, as the sketch below shows.
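As a rough illustration (not the text's own example), the load, transform, store flow can be driven from Java through Pig's PigServer API; the file names, schema, and aliases below are assumptions.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigStudentCount {
    public static void main(String[] args) throws Exception {
        // Local mode for illustration; ExecType.MAPREDUCE would run the same script on the cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Load, group, and count students per department in Pig Latin.
        pig.registerQuery("students = LOAD 'students.csv' USING PigStorage(',') "
                + "AS (name:chararray, dept:chararray);");
        pig.registerQuery("by_dept = GROUP students BY dept;");
        pig.registerQuery("counts = FOREACH by_dept GENERATE group, COUNT(students);");

        // Store the result back (in cluster mode this would be an HDFS path).
        pig.store("counts", "dept_counts");
        pig.shutdown();
    }
}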
APACHE HIVE

​ Facebook created HIVE for people who are fluent with SQL. Thus, HIVE makes them
feel at home while working in a Hadoop Ecosystem.
​ Basically, HIVE is a data warehousing component which performs reading, writing and
managing large data sets in a distributed environment using SQL-like interface.

HIVE + SQL = HQL

● The query language of Hive is called Hive Query Language (HQL), which is very similar to SQL.
● It has two basic components: the Hive Command Line and the JDBC/ODBC driver.
● The Hive Command Line interface is used to execute HQL commands, while the Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) drivers are used by applications to connect to Hive (a JDBC sketch appears after this list).
● Secondly, Hive is highly scalable: it can serve both purposes, i.e. large data set processing (batch query processing) and real-time processing (interactive query processing).
​ It supports all primitive data types of SQL.
​ You can use predefined functions, or write tailored user defined functions (UDF) also to
accomplish your specific needs.
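A minimal Java sketch of the JDBC path mentioned above; the HiveServer2 host, port, database, table, and credentials are illustrative assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Assumed HiveServer2 endpoint and credentials.
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = con.createStatement()) {

            // HQL looks just like SQL: count students per department.
            ResultSet rs = stmt.executeQuery(
                    "SELECT dept, COUNT(*) AS total FROM students GROUP BY dept");
            while (rs.next()) {
                System.out.println(rs.getString("dept") + " -> " + rs.getLong("total"));
            }
        }
    }
}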
APACHE MAHOUT

Now, let us talk about Mahout which is renowned for machine learning. Mahout provides an
environment for creating machine learning applications which are scalable.

Machine learning algorithms allow us to build self-learning machines that evolve by themselves without being explicitly programmed. Based on user behaviour, data patterns and past experiences, they make important future decisions. You can call it a descendant of Artificial Intelligence (AI).

What Mahout does?

It performs collaborative filtering, clustering and classification. Some people also consider frequent itemset mining as a Mahout function. Let us understand them individually:

1. Collaborative filtering: Mahout mines user behaviours, their patterns and their characteristics, and based on these it predicts and makes recommendations to users. The typical use case is an e-commerce website.
2. Clustering: It organizes similar groups of data together; for example, articles can be grouped into blogs, news, research papers, etc.
3. Classification: It means classifying and categorizing data into various sub-categories; for example, articles can be categorized into blogs, news, essays, research papers and other categories.
4. Frequent itemset mining: Here Mahout checks which objects are likely to appear together and makes suggestions accordingly. For example, a cell phone and a cover are generally bought together, so if you search for a cell phone it will also recommend the cover and cases.

Mahout provides a command line to invoke various algorithms. It has a predefined set of library
which already contains different inbuilt algorithms for different use cases.
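For illustration, here is a small Java sketch of the collaborative-filtering case using Mahout's Taste recommender API; the CSV file of userID,itemID,preference triples, the neighbourhood size, and the user ID are assumptions.

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class ProductRecommender {
    public static void main(String[] args) throws Exception {
        // ratings.csv: lines of "userID,itemID,preference" (assumed input file).
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // Users are "similar" if their rating patterns correlate.
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);

        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 recommendations for user 42, based on similar users' behaviour.
        List<RecommendedItem> items = recommender.recommend(42L, 3);
        for (RecommendedItem item : items) {
            System.out.println("item " + item.getItemID() + " (score " + item.getValue() + ")");
        }
    }
}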
APACHE SPARK

● Apache Spark is a framework for real-time data analytics in a distributed computing environment.
● Spark is written in Scala and was originally developed at the University of California, Berkeley.
● It executes in-memory computations to increase the speed of data processing over MapReduce.
● It can be up to 100x faster than Hadoop MapReduce for large-scale data processing, by exploiting in-memory computations and other optimizations; therefore, it requires higher processing power than MapReduce.

Spark comes packed with high-level libraries, including support for R, SQL, Python, Scala, Java, etc. These standard libraries enable seamless integration in complex workflows. On top of this, it also allows various services such as MLlib, GraphX, SQL + DataFrames, and Streaming to integrate with it, increasing its capabilities.

Apache Spark best fits real-time processing, whereas Hadoop was designed to store unstructured data and execute batch processing over it. When we combine Apache Spark's abilities, i.e. high processing speed, advanced analytics and multiple integration support, with Hadoop's low-cost operation on commodity hardware, it gives the best results.

That is the reason why, Spark and Hadoop are used together by many companies for processing
and analyzing their Big Data stored in HDFS.
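A minimal Java sketch of Spark's in-memory processing; the local master, the HDFS path, and the log keywords being counted are assumptions for illustration.

import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

public class SparkLogScan {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("spark-log-scan")
                .master("local[*]")              // use the cluster's master URL in production
                .getOrCreate();

        // Read a (possibly huge) text file from HDFS and keep it in memory.
        Dataset<String> lines = spark.read().textFile("hdfs://namenode:9000/logs/app.log");
        lines.cache();                           // in-memory caching avoids re-reading from disk

        long errors = lines.filter((FilterFunction<String>) l -> l.contains("ERROR")).count();
        long warnings = lines.filter((FilterFunction<String>) l -> l.contains("WARN")).count();

        System.out.println("errors=" + errors + ", warnings=" + warnings);
        spark.stop();
    }
}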
APACHE HBASE

● HBase is an open-source, non-relational, distributed database. In other words, it is a NoSQL database.
● It supports all types of data, and that is why it is capable of handling anything and everything inside a Hadoop ecosystem.
● It is modelled after Google's BigTable, which is a distributed storage system designed to cope with large data sets.
● HBase was designed to run on top of HDFS and provides BigTable-like capabilities.
● It gives us a fault-tolerant way of storing sparse data, which is common in most big data use cases.
● HBase is written in Java, whereas HBase applications can be written using its REST, Avro and Thrift APIs.

For better understanding, let us take an example. You have billions of customer emails and you need to find out the number of customers who have used the word "complaint" in their emails. The request needs to be processed quickly (i.e. in real time). So, here we are handling a large data set while retrieving a small amount of data. HBase was designed for solving these kinds of problems; a small client sketch follows.
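A hedged Java sketch of random reads and writes with the HBase client API; the table name, column family, and row keys are assumptions, and the table is presumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class EmailStore {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table emails = connection.getTable(TableName.valueOf("emails"))) {

            // Write one sparse row: row key = customer id, one column family "msg".
            Put put = new Put(Bytes.toBytes("customer#1001"));
            put.addColumn(Bytes.toBytes("msg"), Bytes.toBytes("body"),
                    Bytes.toBytes("I have a complaint about my last order"));
            emails.put(put);

            // Random read by row key: fast even when the table holds billions of rows.
            Get get = new Get(Bytes.toBytes("customer#1001"));
            Result result = emails.get(get);
            String body = Bytes.toString(result.getValue(Bytes.toBytes("msg"), Bytes.toBytes("body")));
            System.out.println("body contains 'complaint'? " + body.contains("complaint"));
        }
    }
}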
APACHE DRILL

Apache Drill is used to drill into any kind of data. It is an open-source application which works in a distributed environment to analyze large data sets.

● It is a replica of Google Dremel.
● It supports different kinds of NoSQL databases and file systems, which is a powerful feature of Drill. For example: Azure Blob Storage, Google Cloud Storage, HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Swift, NAS and local files.

So, basically the main aim behind Apache Drill is to provide scalability so that we can process petabytes and exabytes of data efficiently (or, you can say, in minutes).

● The main power of Apache Drill lies in combining a variety of data stores just by using a single query.
● Apache Drill basically follows ANSI SQL.
● It has a powerful scalability factor, supporting millions of users and serving their query requests over large-scale data. A small JDBC sketch follows.
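As an illustrative sketch (not from the original text), Drill exposes a JDBC driver, so standard SQL can be fired at files or NoSQL stores directly; the ZooKeeper connect string, the dfs storage plugin, and the file path below are assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.drill.jdbc.Driver");

        // Connect through the ZooKeeper quorum that the Drill cluster registers with.
        try (Connection con = DriverManager.getConnection("jdbc:drill:zk=localhost:2181");
             Statement stmt = con.createStatement()) {

            // ANSI SQL directly over a JSON file, with no schema definition or ETL step.
            ResultSet rs = stmt.executeQuery(
                    "SELECT dept, COUNT(*) AS total FROM dfs.`/data/students.json` GROUP BY dept");
            while (rs.next()) {
                System.out.println(rs.getString("dept") + " -> " + rs.getLong("total"));
            }
        }
    }
}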
APACHE ZOOKEEPER

​ Apache Zookeeper is the coordinator of any Hadoop job which includes a combination of
various services in a Hadoop Ecosystem.
​ Apache Zookeeper coordinates with various services in a distributed environment.

Before Zookeeper, it was very difficult and time-consuming to coordinate between the different services in the Hadoop Ecosystem. The services earlier had many problems with interactions, such as sharing common configuration while synchronizing data. Even when the services are configured, changes in their configurations make coordination complex and difficult to handle. Grouping and naming services was also time-consuming.

Due to the above problems, Zookeeper was introduced. It saves a lot of time by performing
synchronization, configuration maintenance, grouping and naming.

Although it’s a simple service, it can be used to build powerful solutions.


APACHE OOZIE

Consider Apache Oozie as a clock and alarm service inside Hadoop Ecosystem. For Apache
jobs, Oozie has been just like a scheduler. It schedules Hadoop jobs and binds them together as
one logical work.

There are two kinds of Oozie jobs:

1. Oozie workflow: A sequential set of actions to be executed. You can think of it as a relay race, where each athlete waits for the previous one to complete his part.
2. Oozie Coordinator: These are Oozie jobs which are triggered when data is made available to them. Think of this as the response-stimuli system in our body: in the same manner as we respond to an external stimulus, an Oozie coordinator responds to the availability of data and rests otherwise.

APACHE FLUME

Ingesting data is an important part of our Hadoop Ecosystem.

​ The Flume is a service which helps in ingesting unstructured and semi-structured data
into HDFS.
​ It gives us a solution which is reliable and distributed and helps us in collecting,
aggregating and moving large amount of data sets.
​ It helps us to ingest online streaming data from various sources like network traffic,
social media, email messages, log files etc. in HDFS.

Now, let us understand the architecture of Flume. A Flume agent ingests the streaming data from various data sources into HDFS. A web server typically acts as the data source, and Twitter is among the famous sources for streaming data.

The flume agent has 3 components: source, sink and channel.

1. Source: it accepts the data from the incoming streamline and stores the data in the
channel.
2. Channel: it acts as the local storage or the primary storage. A Channel is a temporary
storage between the source of data and persistent data in the HDFS.
3. Sink: Then, our last component i.e. Sink, collects the data from the channel and commits
or writes the data in the HDFS permanently.

APACHE SQOOP

The major difference between Flume and Sqoop is that:

​ Flume only ingests unstructured data or semi-structured data into HDFS.


​ While Sqoop can import as well as export structured data from RDBMS or Enterprise
data warehouses to HDFS or vice versa.

Let us understand how Sqoop works. When we submit a Sqoop command, the main task gets divided into subtasks, each handled internally by an individual map task. A map task is the subtask which imports part of the data into the Hadoop Ecosystem. Collectively, all the map tasks import the whole data.

Export also works in a similar manner.

When we submit our job, it is mapped into map tasks which bring chunks of data from HDFS. These chunks are exported to a structured data destination. Combining all these exported chunks of data, we receive the whole data at the destination, which in most cases is an RDBMS (MySQL/Oracle/SQL Server).
APACHE SOLR & LUCENE

Apache Solr and Apache Lucene are the two services which are used for searching and indexing in the Hadoop Ecosystem.

● Apache Lucene is based on Java; it also helps in spell checking.
● If Apache Lucene is the engine, Apache Solr is the car built around it. Solr is a complete application built around Lucene.
● It uses the Lucene Java search library as its core for search and full indexing.
APACHE AMBARI

Ambari is an Apache Software Foundation project which aims at making the Hadoop ecosystem more manageable.

It includes software for provisioning, managing and monitoring Apache Hadoop clusters.

The Ambari provides:

1. Hadoop cluster provisioning:
▪ It gives us a step-by-step process for installing Hadoop services across a number of hosts.
▪ It also handles configuration of Hadoop services over the cluster.
2. Hadoop cluster management:
▪ It provides a central management service for starting, stopping and reconfiguring Hadoop services across the cluster.
3. Hadoop cluster monitoring:
▪ For monitoring health and status, Ambari provides us a dashboard.
▪ The Ambari Alert framework is an alerting service which notifies the user whenever attention is needed, for example if a node goes down or disk space is running low on a node.
1. The Hadoop Ecosystem owes its success to the whole developer community; many big companies like Facebook, Google, Yahoo and the University of California (Berkeley) have contributed to increasing Hadoop's capabilities.
2. Inside the Hadoop Ecosystem, knowledge about one or two tools (Hadoop components) would not help in building a solution. You need to learn a set of Hadoop components which work together to build a solution.
3. Based on the use cases, we can choose a set of services from Hadoop Ecosystem and create a tailored
solution for an organization.

Understanding Hadoop Ecosystem



Hadoop HDFS - 2007 - A distributed file system for reliably storing huge amounts of
unstructured, semi-structured and structured data in the form of files.
Hadoop MapReduce - 2007 - A distributed algorithm framework for the parallel processing of
large datasets on HDFS filesystem. It runs on Hadoop cluster but also supports other database
formats like Cassandra and HBase.
Cassandra - 2008 - A key-value pair NoSQL database, with column family data representation and
asynchronous masterless replication.
HBase - 2008 - A key-value pair NoSQL database, with column family data representation,
with master-slave replication. It uses HDFS as underlying storage.
Zookeeper - 2008 - A distributed coordination service for distributed applications. It is based on
Paxos algorithm variant called Zab.
Pig - 2009 - Pig is a scripting interface over MapReduce for developers who prefer scripting
interface over native Java MapReduce programming.
Hive - 2009 - Hive is a SQL interface over MapReduce for developers and analysts who prefer SQL
interface over native Java MapReduce programming.
Mahout - 2009 - A library of machine learning algorithms, implemented on top of MapReduce, for
finding meaningful patterns in HDFS datasets.
Sqoop - 2010 - A tool to import data from RDBMS/DataWarehouse into HDFS/HBase and export
back.
YARN - 2011 - A system to schedule applications and services on an HDFS cluster and manage the
cluster resources like memory and CPU.
Flume - 2011 - A tool to collect, aggregate, reliably move and ingest large amounts of data into
HDFS.
Storm - 2011 - A system to process high-velocity streaming data with 'at least once' message
semantics.
Spark - 2012 - An in-memory data processing engine that can run a DAG of operations. It
provides libraries for Machine Learning, SQL interface and near real-time Stream Processing.
Kafka - 2012 - A distributed messaging system with partitioned topics for very high scalability.
Solr Cloud - 2012 - A distributed search engine with a REST-like interface for full-text search. It
uses Lucene library for data indexing.
Hadoop Environment
Hadoop is changing the perception of handling big data, especially unstructured data. Let us see how the Apache Hadoop software library, which is a framework, plays a vital role in handling big data. Apache Hadoop enables surplus data to be streamlined for any distributed processing system across clusters of computers using simple programming models. It is truly made to scale up from single servers to many machines, each offering local computation and storage.
Instead of depending on hardware to provide high availability, the library itself is built to detect and handle breakdowns at the application layer, thereby providing a highly available service on top of a cluster of computers, each of which may be vulnerable to failures.
Hadoop Community Package Consists of
●File system and OS level abstractions
●A MapReduce engine (either MapReduce or YARN)
●The Hadoop Distributed File System (HDFS)
●Java ARchive (JAR) files
●Scripts needed to start Hadoop
●Source code, documentation and a contribution section
