HCIA Big Data
NOTES:
Panthera : Intel open-source project for instant analysis, document storage, and a SQL
engine
In-memory computing : Google PowerDrill
Streaming computing : IBM InfoSphere
HDFS features :
High fault tolerance
High throughput : supports applications with large amounts of data
Large file storage : TB- and PB-level data storage
NOTE :
-HDFS is built around the idea that the most efficient data processing pattern is the
write-once-read-many pattern.
-each file, directory, and data block occupies about 150 bytes of NameNode memory
-a file is divided into segments called blocks
-the default block size is 128 MB
-HDFS supports multiple client instances running concurrently
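The block-splitting rule above is simple arithmetic; a minimal sketch (the 128 MB default is from the notes, the helper name is my own):

```python
import math

BLOCK_SIZE_MB = 128  # HDFS default block size

def num_blocks(file_size_mb: float) -> int:
    """Number of HDFS blocks a file occupies; the last block may be partial."""
    return max(1, math.ceil(file_size_mb / BLOCK_SIZE_MB))

# A 300 MB file is split into 3 blocks: 128 + 128 + 44 MB.
print(num_blocks(300))  # -> 3
```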
HDFS is inapplicable to :
Storing massive numbers of small files : the number of files that can be stored in the system is
restricted by the NameNode memory capacity
Random writes : there is no support for multiple writers on a file
Low-latency reads
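The small-file limitation follows from the ~150 bytes of NameNode memory per file, directory, or block noted earlier; a rough back-of-the-envelope sketch (the function and constants are illustrative only):

```python
METADATA_BYTES = 150  # approx. NameNode memory per file, directory, or block

def namenode_memory_mb(n_files: int, blocks_per_file: int = 1) -> float:
    """Rough NameNode heap needed: one entry per file plus one per block."""
    objects = n_files * (1 + blocks_per_file)
    return objects * METADATA_BYTES / 1024 / 1024

# 10 million single-block small files cost far more NameNode memory
# than the same data packed into 10,000 large multi-block files.
print(namenode_memory_mb(10_000_000))
print(namenode_memory_mb(10_000, blocks_per_file=8))
```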
Key-Features of HDFS:
Federation storage:
-HDFS Federation allows a cluster to scale by adding NameNodes
-each NameNode manages a portion of the filesystem namespace (a namespace volume)
-each namespace has a block pool that contains all the blocks for the files in that
namespace
Data storage policy:
HA : High Availability :
-reflected in an active/standby NameNode pair, elected via ZooKeeper
-the active NameNode provides services
-it stores the EditLog, which records the user operations handled by the active NameNode
-the standby NameNode backs up metadata as a hot spare
-the standby NameNode loads the EditLog of the active NameNode to synchronize metadata
-ZooKeeper is used to store status files and active/standby status information
-ZKFC (ZooKeeper Failover Controller) is used for monitoring the active/standby status
Data organization:
Data is stored by block in HDFS
In HDFS, data can be accessed through Java APIs, HTTP, or shell commands
Space reclamation :
Data replication:
-HDFS replicates file blocks and stores the replicas on different DataNodes
-An application can specify the number of replicas, and that number can be changed later.
Replica policy on HDFS :
The first replica is placed on the same node as the client
The second replica is placed on a remote rack
The third replica is placed on another rack
Otherwise, the first and third replicas are placed on different nodes of the same rack
Other replicas are placed randomly across the cluster
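A minimal pure-Python sketch of one common variant of this placement policy (here the third replica shares the second replica's rack, as in stock Hadoop; the topology format and function name are my own, and the real NameNode logic is far more involved):

```python
import random

def place_replicas(client_node, topology):
    """Simplified 3-replica placement sketch.

    topology: {rack_name: [node, ...]}. Assumes at least two racks and
    at least two nodes per rack. Returns three distinct nodes.
    """
    rack_of = {n: r for r, nodes in topology.items() for n in nodes}
    first = client_node                                   # client's own node
    remote_rack = random.choice(
        [r for r in topology if r != rack_of[first]])     # a different rack
    second = random.choice(topology[remote_rack])         # node on remote rack
    third = random.choice(
        [n for n in topology[remote_rack] if n != second])  # same rack as 2nd
    return [first, second, third]

topo = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_replicas("n1", topo))
```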
metadata persistence:
-The standby NameNode notifies the active NameNode to generate a new file called
EditLog.new and to continue recording logs in that new file. Meanwhile, the standby
NameNode obtains the old EditLog and periodically (each hour) downloads the FSImage,
which stores the file-system image, from the active NameNode
-The standby NameNode merges the old EditLog with the FSImage and generates a new
metadata file called FSImage.ckpt, then uploads this file to the active NameNode
-The generated FSImage.ckpt is renamed FSImage, overwriting the original
FSImage file
-The EditLog.new file is also renamed EditLog
-This operation is triggered every hour or when the EditLog reaches 64 MB
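The merge step above can be sketched as replaying log entries over an image; this toy model uses a plain dict for the FSImage and a list of operations for the EditLog (both representations are my own simplification, not the real on-disk formats):

```python
def checkpoint(fsimage: dict, editlog: list) -> dict:
    """Toy checkpoint: replay the EditLog over the FSImage to produce
    the merged metadata (FSImage.ckpt in the notes above)."""
    ckpt = dict(fsimage)                 # start from the downloaded FSImage
    for op, path, value in editlog:      # replay each recorded user operation
        if op == "create":
            ckpt[path] = value
        elif op == "delete":
            ckpt.pop(path, None)
    return ckpt                          # uploaded back as the new FSImage

image = {"/a": "blk_1"}
log = [("create", "/b", "blk_2"), ("delete", "/a", None)]
print(checkpoint(image, log))  # {'/b': 'blk_2'}
```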
Robustness
HDFS Colocation :
file-level colocation :
The blocks of multiple associated files are distributed on the same storage node =>
facilitates quick file access, avoids the high cost of data migration, and reduces resource
consumption
DataNodes send heartbeat messages to the NameNode periodically to report the data
status
The NameNode checks whether the data blocks are completely reported
If data blocks are not reported because a DataNode disk is damaged, the
NameNode initiates replica reconstruction to recover the lost replicas
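The completeness check above boils down to a set difference between expected and reported blocks; a minimal sketch (data shapes and names are hypothetical, not the real heartbeat protocol):

```python
def blocks_to_rebuild(expected: dict, reports: dict) -> dict:
    """Compare the blocks each DataNode should hold (expected) with what
    its report contains, returning blocks whose replicas must be rebuilt."""
    missing = {}
    for node, blocks in expected.items():
        reported = set(reports.get(node, []))   # empty if the node/disk failed
        lost = set(blocks) - reported
        if lost:
            missing[node] = sorted(lost)
    return missing

expected = {"dn1": ["blk_1", "blk_2"], "dn2": ["blk_2"]}
reports = {"dn1": ["blk_1"], "dn2": ["blk_2"]}
print(blocks_to_rebuild(expected, reports))  # {'dn1': ['blk_2']}
```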
Data Balance Mechanism:
Metadata reliability :
Security Mode :
Chapter 3 : MapReduce
MapReduce :
A processing technique and a programming model for distributed parallel computing over
massive data sets
Highlights :
Easy to program
Capabilities can be improved by adding nodes
High fault tolerance using policies or data migration
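The programming model can be illustrated with the classic word count, written here in plain Python rather than the Hadoop Java API (function names and the in-memory "shuffle" are my own simplification):

```python
from collections import defaultdict
from itertools import chain

def map_phase(line: str):
    """Map task: emit a (word, 1) pair for every word in its input split."""
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    """Reduce task: after the shuffle groups pairs by key, sum the counts."""
    grouped = defaultdict(int)
    for key, value in pairs:
        grouped[key] += value
    return dict(grouped)

splits = ["big data big", "data"]
counts = reduce_phase(chain.from_iterable(map_phase(s) for s in splits))
print(counts)  # {'big': 2, 'data': 2}
```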
Architecture of YARN :
It consists of three components : ResourceManager, NodeManager, and ApplicationMaster
YARN HA Solution :
Redundant ResourceManager:
Similar to the HDFS HA solution : when the active ResourceManager fails, failover can be
triggered automatically or manually to switch the active and standby states.
-When automatic failover is not enabled, administrators must run a command to
manually switch one of the ResourceManager nodes to the active state.
-When automatic failover is enabled and the active ResourceManager goes down or becomes
unresponsive, another ResourceManager is automatically elected to be the active node.
If an ApplicationMaster goes down, ResourceManager will close all the containers that it
manages, including containers where tasks are running.
-Then ResourceManager will start a new ApplicationMaster on another computing node.
-YARN supports preserving the container status when starting a new
ApplicationMaster, so that the tasks in those containers can continue running without any
failures.
YARN supports management and allocation of two types of resources : Memory & CPU
-Users can configure the memory size and the number of CPU cores for each NodeManager
-yarn.nodemanager.resource.memory-mb :
Indicates the physical memory available to containers running on this
NodeManager (in MB; the value must be smaller than the memory size of the
NodeManager server)
-yarn.nodemanager.vmem-pmem-ratio :
Indicates the ratio of the maximum available virtual memory to the physical memory
of a container (a container's virtual memory utilization is not allowed to exceed this
ratio times its assigned physical memory)
-yarn.nodemanager.resource.cpu-vcores :
Indicates the number of CPU cores that can be allocated to containers (it is
recommended to set this parameter to 1.5 to 2 times the number of physical CPU
cores)
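These three parameters are set per NodeManager in yarn-site.xml; a sketch with illustrative values only (the values are assumptions for an example 8-core/8 GB node, not recommendations):

```xml
<!-- yarn-site.xml: illustrative values only -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value>  <!-- must be below the server's physical memory -->
</property>
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>2.1</value>   <!-- max virtual:physical memory ratio per container -->
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>8</value>     <!-- e.g. 1.5-2x the physical core count -->
</property>
```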
Capacity assurance : administrators can set upper and lower limits for the resource
usage of each queue (queues share resources)
Flexibility :
The remaining resources of a queue can be used by other queues that require
resources.
Multi-tenancy : multiple users can share a cluster, and multiple apps can run concurrently.
Administrators can add multiple restrictions to prevent cluster resources from being
exclusively occupied by a single application, user, or queue.
Dynamic update of configuration files :
Administrators can dynamically modify configuration parameters to manage clusters
online.
Task Limitation :
Overused containers are queued only when the overall memory usage of all containers on a
NodeManager reaches a certain threshold value.
This value is the memory threshold of the NodeManager, which can be calculated as
follows :
Label-based scheduling :
Before this feature, it was impossible to control which node a task is submitted to.
With label-based scheduling, we can specify the nodes to which tasks are submitted.
NOTES :
-ResourceManager is responsible for the unified management and allocation of all
resources in the cluster
-NodeManager is the agent on each node
-ApplicationMaster is responsible for all the jobs within an application's lifecycle
-Each Map task has a circular memory buffer, 100 MB by default (the size can be
configured by the user).
-The spill threshold of the buffer is 80%; when it is reached, the data in the buffer is
written to local disks.
-Shuffle : the data transfer process between Map tasks and Reduce tasks, in which each
Reduce task copies MOF files from the Map tasks and then sorts and merges the
MOF files
-Users can view queue information on the FusionInsight WebUI
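The map-side buffer behavior in the notes above can be sketched as follows; the 100 MB size and 80% threshold are from the notes, while the accounting model is a deliberate simplification (the real buffer tracks records and metadata separately and spills in the background):

```python
BUFFER_MB = 100          # default circular buffer size per Map task
SPILL_THRESHOLD = 0.8    # spill to local disk at 80% occupancy

def buffer_records(record_sizes_mb):
    """Accumulate record sizes in the buffer and flush to disk whenever
    occupancy crosses the spill threshold. Returns (spills, leftover)."""
    spills, used = [], 0.0
    for size in record_sizes_mb:
        used += size
        if used >= BUFFER_MB * SPILL_THRESHOLD:
            spills.append(used)   # write buffered data to a local spill file
            used = 0.0
    return spills, used

spills, remaining = buffer_records([30, 30, 30, 10])
print(spills, remaining)  # [90.0] 10.0
```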
Spark Highlights :
Spark EcoSystem :
Spark Vs MapReduce :
Spark performance is over 100 times higher than that of MapReduce (for multi-iteration
workloads)
Standalone : Spark's own resource management system; Spark also supports the resource
management systems YARN and Mesos
Spark Core : distributed computing framework similar to MapReduce
Spark SQL : Spark component mainly for processing structured data and running SQL
queries on data (formats : JSON, Parquet, ORC, ...)
Structured Streaming : engine built on Spark SQL to process streaming data; it is
programmed using Scala and is fault-tolerant
Spark Streaming : streaming engine for micro-batch processing; stream data is sliced
and then processed in the computation engine of Spark Core
MLlib : Spark's machine learning library
GraphX : graph-parallel computing
SparkR : R package that provides a lightweight frontend to use Apache Spark from R.
RDD (Resilient Distributed Dataset) is a fundamental data structure of Spark : a read-
only, partitioned collection of records
-An RDD is stored in memory by default and spills to disk in case of insufficient
memory
-Each RDD is divided into logical partitions, which may be computed on different nodes
of the cluster (this improves performance through data locality)
-An RDD can be created from a Hadoop file system such as HDFS, or from any storage
system that Hadoop is compatible with
-An RDD remembers how it evolved from another RDD through its lineage, so data can be
recovered quickly when data loss occurs
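The lineage-based recovery described above can be sketched with a toy RDD class, written in plain Python rather than Spark's Scala internals (class and method names mimic the real API but everything here is a simplified model):

```python
class RDD:
    """Toy RDD: remembers its parent and the function that derives it,
    so a lost partition can be recomputed from the lineage."""
    def __init__(self, data=None, parent=None, fn=None):
        self.data, self.parent, self.fn = data, parent, fn

    def map(self, fn):
        return RDD(parent=self, fn=fn)   # record lineage, compute nothing

    def compute(self):
        if self.data is not None:        # cached / source data available
            return self.data
        # Data lost or never materialized: walk the lineage back.
        return [self.fn(x) for x in self.parent.compute()]

source = RDD(data=[1, 2, 3])
doubled = source.map(lambda x: x * 2)
doubled.data = doubled.compute()   # materialize and cache the result
doubled.data = None                # simulate losing the cached partition
print(doubled.compute())           # recomputed from lineage: [2, 4, 6]
```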
RDD Dependencies :
Two types :
Narrow dependencies:
Each partition of the parent RDD is used by at most one partition of the child RDD
advantages :
support for executing multiple operations in pipeline mode
failure recovery with narrow dependencies is more efficient (only the lost parent
partitions need to be recomputed, and recomputation can be performed concurrently on
different nodes)
Wide dependencies:
Each partition of the parent RDD may be used by multiple partitions of the child RDD
NOTE :
The Spark scheduler reversely traverses the whole dependency chain from the end of the
DAG (Directed Acyclic Graph)
the number of tasks in a stage is determined by the number of RDD partitions at the end
of the stage
RDD Operators :
Transformation :
Invoked to generate a new RDD from one or more existing RDDs (map, flatMap, filter,
reduceByKey)
Action :
Triggers the actual computation and returns a result to the driver program or writes data
to external storage (collect, count, saveAsTextFile)
NOTE :
-all transformations in Spark are lazy
-transformations are only computed when an action requires results to be returned
to the driver program
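Lazy evaluation can be illustrated with a toy class in plain Python: transformations only record a plan, and the action executes the whole chain (the class is my own model, not Spark's implementation):

```python
class LazyRDD:
    """Toy lazy dataset: map/filter record a plan; collect() runs it."""
    def __init__(self, data, plan=()):
        self._data, self._plan = data, plan

    def map(self, fn):                    # transformation: no work yet
        return LazyRDD(self._data, self._plan + (("map", fn),))

    def filter(self, pred):               # transformation: no work yet
        return LazyRDD(self._data, self._plan + (("filter", pred),))

    def collect(self):                    # action: execute the plan now
        out = self._data
        for kind, fn in self._plan:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

rdd = LazyRDD([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
print(rdd.collect())  # [20, 30, 40] - nothing ran until collect()
```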
Driver : responsible for the app business logic and operation planning (DAG)
ApplicationMaster : manages application resources and applies for resources
based on the application's needs
Client : submits applications
ResourceManager : responsible for scheduling and allocating resources in the whole
cluster
NodeManager : responsible for resource management on its own node
Executor : actual executor of a task; an application is split across multiple executors for
computation.
YARN client mode :
-The client sends the Spark app request to ResourceManager, packages all the
information required to start ApplicationMaster, and sends that information to
ResourceManager
-ResourceManager returns the results to the client (application ID, upper and lower
limits of available resources)
-ResourceManager finds a proper node for ApplicationMaster and starts it on that node
-ApplicationMaster is a role on YARN; in Spark client mode its process name is
ExecutorLauncher
-ApplicationMaster applies for a series of containers to run tasks
-After receiving the newly allocated container list from ResourceManager,
ApplicationMaster sends information to the related NodeManagers to start the containers
-ResourceManager allocates containers to ApplicationMaster, then ApplicationMaster
communicates with the related NodeManager and starts Executor on the obtained container
-After an Executor is started, it registers with Driver and applies for tasks
-Driver allocates tasks to Executors
-Executors run the tasks and report their operating status to Driver
YARN cluster :
-The client generates the application information and sends it to
ResourceManager
-ResourceManager allocates a container (for ApplicationMaster) to the Spark application
-Then Driver is started on the container's node
-ApplicationMaster applies for resources from ResourceManager to run Executors
-ResourceManager allocates containers to ApplicationMaster
-ApplicationMaster communicates with the related NodeManager and starts Executor on
the obtained container
-After an Executor is started, it registers with Driver and applies for tasks
-Driver allocates tasks to Executors
-Executors run the tasks and report their operating status to Driver
SparkSQL :
Module in Spark for structured data processing, which can parse SQL statements into
RDDs and then use Spark Core to execute them
Datasets are similar to RDDs but use specialized encoders, rather than Java serialization
or Kryo, to serialize objects for processing or transmission over the network (objects
are stored in an encoded binary form)
RDD,Dataframe,Datasets :
RDD :
Type-safe, object-oriented
Disadvantages : high performance overhead for serialization and deserialization;
requires serialization and deserialization of both data and data structures
Dataset and DataFrame have exactly the same functions but differ in the type of the
data in each row.
For a DataFrame, the data in each row is of the Row type (use getAs, or pattern matching
on columns, to obtain a specific field)
DataFrame advantages :
Schema information reduces the serialization and deserialization overhead.
Disadvantages : not object-oriented, not type-safe at compile time.
Dataset :
Fast : performance is superior to RDD
Encoders are better than Kryo or Java serialization
Type-safe : similar to RDD
Dataset has the advantages of both RDD and DataFrame, and avoids their disadvantages.
NOTE :
Dataset, DataFrame, RDD can be converted to each other
Spark SQL vs Hive :
Differences :
Spark SQL uses Spark Core as its execution engine while Hive uses MapReduce
The execution speed of Spark SQL is 10 to 100 times faster than Hive
Spark SQL does not support buckets, but Hive does.
Dependencies:
Spark SQL depends on the metadata of Hive and is compatible with most Hive syntax and
functions
It can also use user-defined functions from Hive
NOTE : Structured Streaming uses standard SQL statements to query data from an
incremental, unbounded table
Consider the data input stream as an input table :
Every data item arriving on the stream is like a new row being appended to the
input table
Each query operation generates a result table
At each trigger interval (e.g. every 1 second), updated data is synchronized to the result
table
Whenever the result table is updated,
the updated result is written into an external storage system
3 output modes of Structured Streaming at the output phase :
Complete mode : the entire updated result table is written into the external storage
system (the connector of the external system handles the write).
Append mode : when an interval is triggered, only the rows newly added to the result table
are written into the external system (applicable only when existing rows of the result set
are never updated).
Update mode : when an interval is triggered, only the rows updated in the result table are
written into the external system.
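The three output modes above can be sketched by diffing the result table against its state at the previous trigger; this model with plain dicts is my own illustration, not the Spark sink API:

```python
def emit(result_table, previous, mode):
    """Decide which rows of the result table reach the sink at a trigger."""
    if mode == "complete":        # write the entire updated result table
        return dict(result_table)
    if mode == "append":          # only rows that did not exist before
        return {k: v for k, v in result_table.items() if k not in previous}
    if mode == "update":          # rows added or changed since last trigger
        return {k: v for k, v in result_table.items()
                if previous.get(k) != v}
    raise ValueError(mode)

prev = {"a": 1, "b": 2}
now = {"a": 3, "b": 2, "c": 1}
print(emit(now, prev, "append"))  # {'c': 1}
print(emit(now, prev, "update"))  # {'a': 3, 'c': 1}
```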
Spark Streaming :
Spark Streaming is a real-time computing framework built on Spark for processing
massive streaming data (data sources : Kafka, HDFS)
Workflow :
Spark Streaming receives live input data streams and divides the data into batches
These are then processed by the Spark engine to generate the final stream of results in
batches
NOTE : Spark Streaming provides a high-level abstraction called a discretized stream, or
DStream (a continuous stream of data)
DStreams can be created either from input data streams from sources such as Kafka or
Flume, or by applying high-level operations (reduce, join, window, ...) on other DStreams
A DStream is represented as a sequence of RDDs
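The micro-batch model above can be sketched as slicing a stream into fixed-size batches, where each batch plays the role of one RDD in the DStream (count-based batching here is a simplification; real Spark Streaming slices by time interval):

```python
def micro_batches(stream, batch_size):
    """Slice an incoming stream into small batches; the sequence of
    batches is the toy equivalent of a DStream's sequence of RDDs."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch           # one "RDD" of the DStream
            batch = []
    if batch:
        yield batch               # final partial batch

dstream = list(micro_batches(range(7), 3))
print(dstream)  # [[0, 1, 2], [3, 4, 5], [6]]
```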
Spark WebUI :
Shows service status, roles, and configuration items, and supports management operations
(starting and stopping Spark, downloading the Spark client, synchronizing configurations,
viewing the running instances' health status and the service overview)
NOTE : JDBCServer provides the JDBC interface; external JDBC requests are sent directly
to it to compute and parse structured data.
HBase uses MemStore and StoreFile to store table updates.
The secondary index implements indexing according to the values of
some columns.
HBase is a highly reliable, high-performance, column-oriented, scalable distributed
storage system.
The HBase cluster has two roles: HMaster and HRegionServer.
HBase vs RDB
HBase is column-oriented : data is stored, read, and computed by column
HBase supports dynamic extension of columns; an RDB, on the other hand, uses a
predefined data structure
HBase runs on common commodity hardware, whereas an RDB requires I/O-intensive and
costly hardware
HBase architecture :
HMaster : involves an active HMaster and a standby HMaster in HA mode
The active HMaster manages RegionServers in HBase, including : creating, deleting,
modifying and querying tables; balancing the load of RegionServers; adjusting the
distribution of regions; splitting regions and distributing them after they are split; and
migrating regions after a RegionServer fails
The standby HMaster takes over services when the active one fails
The original active HMaster serves as the standby HMaster after the fault is rectified.