HCIA Big Data

The document discusses the big data industry in China, outlining the national strategy, definitions, applications, and challenges of big data, as well as detailing the Hadoop Distributed File System (HDFS) and MapReduce processing techniques. It highlights the architecture, features, and operational mechanisms of HDFS, including data replication, fault tolerance, and security measures. Additionally, it covers YARN's resource management and scheduling capabilities within the Hadoop ecosystem, emphasizing its dynamic resource allocation and task management features.


Chapter 1: Big Data Industry and Technological Trends

Tasks of the national big data strategy in China:


Promote the innovation and development of big data
Build a digital economy with data as a key enabler
Improve the country's capability in governance with big data
Improve people's livelihood by applying big data
Protect national data security

Definition of big data :


Data sets that are so voluminous and complex that traditional data-processing application software is inadequate to deal with them.

Volume : Huge amount of data


Variety : Various types of data
Velocity : Data processing speed
Value : Low data value density

Data as a platform (DaaP)

Database vs. big data:

Data scale : Database - small (MB) ; Big data - large (GB, TB, PB)
Data type : Database - single (structured) ; Big data - diversified
Relationship : Database - schema first, then data ; Big data - data first, then schema
Processing tool : Database - one size fits all ; Big data - no size fits all

Example: Taobao generates 50,000 GB of data per day, with a total data storage volume of 40 million GB.

Big data applications


Politics
Finance
Education
Transportation
Tourism
Government and public security
Traffic planning
Sports

Industry applications of big data


Operation (telecom and finance), Management (finance), Supervision (government), Profession (government)
Big data challenges :
Business departments do not have clear requirements on big data
Data fragmentation (data is not shared within the enterprise)
Low data availability and poor quality
Data management technology and architecture
Data security
Insufficient big data talent
Balance between data openness and privacy

NOTES:
Panthera : Intel open-source project for instant analysis, document storage, and a SQL engine
In-memory computing : Google PowerDrill
Streaming computing : IBM InfoSphere

Huawei Big data solution : FusionInsight


FusionInsight architecture:
Hadoop : data processing environment
DataFarm : data insight pipeline; includes Porter for data integration services, Miner for data mining services, and Farmer for data service frameworks
Manager : distributed management framework that includes system management, service governance, and security management
LibrA : enterprise-level relational database; it supports row storage and column storage

Big data platform partners in China:

Industrial and Commercial Bank of China
China Merchants Bank (CMB)
Pacific Insurance
China Mobile
China Unicom

Chapter 2: HDFS - Hadoop Distributed File System


-HDFS is developed based on the Google File System (GFS)
-It is deployed on multiple independent physical machines
-The NameNode single point of failure problem is solved by using ZooKeeper to elect the primary node, which reflects the high reliability of HDFS
-HDFS relies on metadata persistence to back up the metadata held in memory
-HDFS can choose different strategies for data storage, including tiered storage and label storage
-The purpose of colocation is to store associated data, or data that is likely to be associated, on the same node

HDFS features :
High fault tolerance
High throughput : supports applications with large amounts of data
Large file storage : TB- and PB-level data storage

NOTE :
-HDFS is built around the idea that the most efficient data processing pattern is the write-once, read-many-times pattern
-The metadata of a directory, file, or data block occupies about 150 bytes of NameNode memory each
-Files are divided into segments called blocks
-The default block size is 128 MB
-HDFS supports multiple client instances running concurrently

HDFS is inapplicable to :
Storing massive numbers of small files : the number of files that can be stored in the system is restricted by the NameNode memory capacity
Random writes : there is no support for multiple writers on a file
Low-latency reads

HDFS system architecture :


-NameNode : stores and generates the metadata of the file system
-DataNode : stores the actual data and reports to the NameNode on the data blocks under its management;
creates, deletes, and replicates blocks according to the instructions of the NameNode;
a block is the minimum amount of data that HDFS can read or write
-Client : accesses HDFS on behalf of service applications

Process of writing data in HDFS :


The service application invokes an API provided by the HDFS Client for data writing
The HDFS Client creates the file by calling the create method of DistributedFileSystem
DistributedFileSystem sends an RPC request to the NameNode to create a new file in the namespace
DistributedFileSystem returns an FSDataOutputStream to the client
The HDFS Client obtains the data block number and location from the NameNode, connects to the DataNodes, and establishes a write pipeline (see the sketch below)
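
A minimal sketch of this write path through the HDFS Java API (the file path and content are illustrative; the cluster configuration is assumed to be on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();     // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);         // a DistributedFileSystem on HDFS
        // create() triggers the RPC to the NameNode and returns an FSDataOutputStream
        try (FSDataOutputStream out = fs.create(new Path("/tmp/demo.txt"))) {
            out.writeUTF("hello HDFS");               // data flows through the DataNode pipeline
        }
        fs.close();
    }
}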

Process of reading data in HDFS :

The HDFS Client calls the open method of DistributedFileSystem, which obtains the block locations from the NameNode and returns an FSDataInputStream; the client then reads the block data from the DataNodes (see the sketch below)
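
A matching sketch of the read path (same illustrative file as in the write example above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // open() asks the NameNode for the block locations, then streams from DataNodes
        try (FSDataInputStream in = fs.open(new Path("/tmp/demo.txt"))) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}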

Key Features of HDFS:

Federation storage:
-HDFS Federation allows a cluster to scale by adding NameNodes
-Each NameNode manages a portion of the file system namespace (each NameNode manages a namespace volume)
-Each namespace has a block pool that contains all the blocks for the files in that namespace
Data storage policy:

By default, the HDFS NameNode automatically selects DataNodes to store data replicas.


Three scenarios :
Store data on different storage devices :
-RAM_DISK, DISK, ARCHIVE, SSD
Store data with labels (tag storage)
Store data in highly reliable node groups :
The first replica is written to a mandatory rack group; if there are no available nodes in this mandatory rack group, the data write fails.
The second replica is written to a random node in the non-mandatory rack group where the local client is located.
If the client is located in the mandatory rack group, the second replica is written to a node in another rack group.

HA : High Availability:
-Reflected in the active/standby NameNode pair elected by ZooKeeper
-The active NameNode provides services
-It stores the EditLog, which records the user operations handled by the active NameNode
-The standby NameNode backs up the metadata as a hot spare
-The standby NameNode loads the EditLog of the active NameNode to synchronize metadata
-ZooKeeper is used to store status files and active/standby status information
-ZKFC (ZooKeeper Failover Controller) is used for monitoring the active/standby status

Data organization:
Data is stored by block in HDFS

Multiple access modes:

In HDFS, data can be accessed through Java APIs, HTTP, or shell commands

Space reclamation :

HDFS supports a recycle bin mechanism


Files can be recovered if accidentally deleted

NameNode and DataNode work in master/slave mode

Unified file system namespace:


HDFS presents itself as one unified file system externally.

Data replication:
-HDFS replicates file blocks and stores the replicas on different DataNodes
-An application can specify the number of replicas, and that number can be changed later
Replica policy on HDFS :
The first replica is placed on the same node as the client
The second replica is placed on a remote rack
The third replica is placed on another rack
Otherwise, the first and third replicas are placed on different nodes of the same rack
Other replicas are placed randomly across the cluster

Metadata persistence:
-The standby NameNode instructs the active NameNode to generate a new file named EditLog.new and to continue recording logs in that new file; meanwhile, the standby NameNode obtains the old EditLog and periodically (every hour) downloads from the active NameNode the FSImage, which stores the file system mirror
-The standby NameNode merges the old EditLog with the FSImage and generates a new metadata file named FSImage.ckpt, then uploads this file to the active NameNode
-The generated FSImage.ckpt is renamed to FSImage, overwriting the original FSImage file
-The EditLog.new file is also renamed to EditLog
-This operation is triggered every hour or when the EditLog reaches 64 MB

Robustness

HDFS Colocation :

File-level colocation :
The blocks of multiple associated files are distributed on the same storage node => this facilitates quick file access, avoids the high cost of data migration, and reduces resource consumption

HDFS Data Integrity Assurance :

Methods in case of failure of each component in HDFS :


Replica Reconstruction:

Each DataNode periodically sends heartbeat messages to the NameNode to report the data status
The NameNode checks whether the data blocks are completely reported
If data blocks are not reported because a disk of the DataNode is damaged, the NameNode initiates replica reconstruction to recover the lost replicas
Data Balance Mechanism:

Ensures the even distribution of data among all DataNodes

Metadata reliability :

Metadata in HDFS is stored on both the active and standby NameNodes

HDFS also provides a snapshot mechanism; data can be recovered in time if a misoperation occurs.

Security Mode :

In security mode, data in HDFS becomes read-only;
create and delete operations will fail.
HDFS provides this unique security mode to prevent faults from spreading when a DataNode or hard disk is faulty.
HDFS exits the security mode after the fault is rectified and the data is recovered.

HDFS Common Shell Commands :


dfs (-cat, -ls, -rm, -put, -get, -mkdir, -chmod/chown)

dfsadmin (-safemode, -report)

Chapter 3: MapReduce

MapReduce :
A processing technique and a programming model for distributed parallel computing on massive data sets
Highlights:
Easy to program
Capabilities can be improved by adding nodes
High fault tolerance, using policies such as data migration

YARN (Yet Another Resource Negotiator) :


Hadoop's new resource manager, which is used for managing and scheduling resources for applications

Working process of MapReduce :


Assuming that the files to be processed are stored in HDFS

-MapReduce submits a request to ResourceManager

-ResourceManager creates jobs (one application maps to one job)
-Before jobs are submitted, the files to be processed are split; by default, one block corresponds to one split, and each split is processed by one Map Task
-ResourceManager allocates resources for the job
-It selects an appropriate NodeManager in the cluster on which to schedule ApplicationMaster, based on the workloads of the NodeManagers
-After a NodeManager is selected, ApplicationMaster initiates the job and applies for resources from ResourceManager
-The NodeManager starts containers for task execution
-Map Task processing : a Map Task takes a set of data and converts it into another set of data, in which individual elements are grouped into key-value pairs
-Before data is written into the buffer, it is divided into partitions
-The number of Reduce Tasks is determined by the user
-The number of partitions is determined by the number of Reduce Tasks
-Data with the same key is sent to the same Reduce Task for processing
-Within each partition, the map output is sorted by key, then combined (optional) if necessary, and merged into a MapOutFile (MOF)
-Each time the memory buffer reaches the disk-spill threshold, a new spill file is created
-Before the task finishes, the spill files are merged into a single partitioned and sorted output file; this file is the MapOutFile
-When map tasks finish and the output progress of the MOF reaches 3%, the Reduce Tasks are started and copy MOF files from each Map Task
-The MOF files output by Map Tasks are mapped to Reduce Tasks
-Map Tasks may finish at different times, so each Reduce Task starts copying their outputs as soon as each one completes
-A Reduce Task then sorts and merges multiple MapOutFiles into one
-During the reduce phase, the reduce function is invoked for each key in the sorted output
-The output of this phase is written directly to the output file system, typically HDFS (a WordCount sketch follows)
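
The steps above can be made concrete with the classic WordCount program, a hedged sketch of the standard Hadoop example (input and output paths come from the command line):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context ctx) throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                ctx.write(word, ONE);                   // emit <word, 1> pairs
            }
        }
    }
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));       // one total per word
        }
    }
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);      // the optional combine step
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}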

Architecture of YARN :
It consists of the following components:

Client : submits jobs to ResourceManager

ResourceManager : manages the use of resources across the cluster (there is only one ResourceManager in a Hadoop cluster)
NodeManager : runs on every node in the cluster to launch and monitor containers, and periodically reports the node status to ResourceManager
ApplicationMaster : responsible for negotiating resources with the ResourceManager and working with the NodeManagers to execute and monitor the component tasks
Container : an abstraction of YARN resources; it executes an application-specific process within resource constraints (memory & CPU)
-It also reports the status of MapReduce tasks to ApplicationMaster

Task Scheduling of MapReduce on YARN:

-The client submits a job to ResourceManager

-ResourceManager receives the job and allocates the first container for the application program
-ResourceManager then asks a NodeManager to start ApplicationMaster in that container
-ApplicationMaster then registers with ResourceManager
-After that, ApplicationMaster applies for and obtains resources from ResourceManager in pulling mode using the RPC protocol for each task, and monitors the tasks until they finish
-Once it obtains resources, ApplicationMaster asks the NodeManagers to start the tasks
-Each task uses RPC to report its status and progress to ApplicationMaster (so ApplicationMaster can restart a task if it fails)
-Users can use RPC to obtain the running status of the application program from ApplicationMaster at any time
-After the application program ends, ApplicationMaster cancels its registration with ResourceManager and closes itself

YARN HA Solution :

Redundant ResourceManagers:

Similar to the HDFS HA solution: when the active ResourceManager fails, a failover can be triggered automatically or manually to switch the active and standby states.
-When automatic failover is not enabled, administrators must run a command to manually switch one of the ResourceManager nodes to the active state.
-When automatic failover is enabled and the active ResourceManager goes down or becomes unresponsive, another ResourceManager is automatically elected as the active node.

YARN ApplicationMaster Fault Tolerance Mechanism :

If an ApplicationMaster goes down, ResourceManager closes all the containers that it manages, including containers in which tasks are running
-ResourceManager then starts a new ApplicationMaster on another computing node
-YARN supports preserving the container status when starting the new ApplicationMaster, so that the tasks in these containers can continue running without any failures

Resource Management in YARN :

YARN supports the management and allocation of two types of resources : memory & CPU
-Users can configure the memory size and the number of CPU cores for each NodeManager

-yarn.nodemanager.resource.memory-mb :
Indicates the physical memory size available to containers running on the NodeManager (in MB; the value must be smaller than the memory size of the NodeManager server)
-yarn.nodemanager.vmem-pmem-ratio :
Indicates the ratio of the maximum available virtual memory to the physical memory of a container (the virtual memory used by a container may not exceed its assigned physical memory multiplied by this ratio). For example, with a ratio of 2.1, a container assigned 1 GB of physical memory may use at most 2.1 GB of virtual memory.

-yarn.nodemanager.resource.cpu-vcores :
Indicates the number of CPU cores that can be allocated to containers (it is recommended to set this parameter to 1.5 to 2 times the number of physical CPU cores)

Resource Allocation Model in YARN :


In YARN, the Resource Scheduler organizes resources through hierarchical queues
Users can submit applications to one or more queues; the Resource Scheduler selects a queue based on certain algorithms, selects an application in that queue, and then attempts to allocate the requested resources to the application.
-Queues are classified into parent queues and leaf queues.
-A parent queue can have multiple leaf queues.
-Tasks run on leaf queues.
-If a queue is specified when a task is submitted, the scheduler allocates the task to that queue; if not, the task is allocated to the default queue.
-If resources fail to be allocated to an application due to the limit of some parameters, the Resource Scheduler selects the next application.
-The allocation unit is a container with a given amount of resources (memory & CPU).

Capacity Scheduler Overview :

-It allocates resources by queue.

-Administrators can restrict the resources used by a queue, user, or job.

Highlights of Capacity Scheduler :

Capacity assurance : administrators can set upper and lower limits for the resource usage of each queue (queues share resources)

Flexibility :
The remaining resources of a queue can be used by other queues that require resources.

Priority : priority queuing is supported (FIFO by default)

Multi-tenancy : multiple users can share a cluster, and multiple applications can run concurrently.
Administrators can add multiple restrictions to prevent cluster resources from being exclusively occupied by a single application, user, or queue.
Dynamic update of configuration files :
Administrators can dynamically modify configuration parameters to manage clusters online.

Task Selection by Capacity Scheduler :


The queue is selected based on the following policies :

-The queue with the lowest resource usage is allocated resources first.

-Resources are allocated to the queue with the minimum queue hierarchy first.
-Resources are allocated to queues with resource reclamation requests first.
-A task is then selected based on task priority and submission sequence, as well as the limits on user resources and memory.

User and Task Limitation in FusionInsight :

Tenant > Dynamic Resource Plan > Queue Config


There are two parameters :

-Minimum resource assurance of a user :


If tasks of multiple users are running at the same time in a queue, the resource usage of each user fluctuates between a minimum value and a maximum value; the maximum value is determined by the number of running tasks, while the minimum value is determined by the minimum-user-limit-percent parameter.

-Maximum resource usage of a user (multiple of queue capacity)


This parameter sets the maximum resources that can be obtained by one user, as a multiple of the queue capacity; the default is 1.
yarn.scheduler.capacity.root.QueueD.user-limit-factor=1

Task Limitation :

There are three parameters :


Maximum number of active tasks in a cluster (running and suspended tasks) : 10000 by default
If the number of task requests reaches the limit, new tasks will be rejected.

Maximum number of tasks in a queue:

1000 by default
If the number of submitted task requests in a queue reaches the limit, new tasks will be rejected.

Maximum number of tasks submitted by a user:


It depends on the "maximum number of tasks in a queue" parameter and is calculated as follows:
MaxTasksPerUser = MaxTasksInQueue * yarn.scheduler.capacity.root.QueueA.minimum-user-limit-percent * yarn.scheduler.capacity.root.QueueA.user-limit-factor
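
A worked example under assumed, illustrative values: if the queue allows at most 1000 tasks, minimum-user-limit-percent is 25%, and user-limit-factor is 1, then a single user can have at most 1000 * 0.25 * 1 = 250 submitted tasks.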

Enhanced Features of YARN in FusionInsight :

YARN Dynamic Memory Management:


Optimizes the memory usage of containers on a NodeManager
by queuing containers that overuse memory only when the overall memory usage of all containers on the NodeManager reaches a certain threshold.
This value is the memory threshold of the NodeManager, which is calculated as follows:

NM memory threshold (bytes) = yarn.nodemanager.resource.memory-mb * 1024 * 1024 * yarn.nodemanager.dynamic.memory.usage.threshold
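
A worked example under assumed, illustrative values: with yarn.nodemanager.resource.memory-mb = 8192 and yarn.nodemanager.dynamic.memory.usage.threshold = 0.8, the threshold is 8192 * 1024 * 1024 * 0.8 ≈ 6.87 * 10^9 bytes (about 6.4 GiB); container queuing starts only when overall usage exceeds this value.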

YARN Label-based Scheduling

Before this feature, it was impossible to control which nodes a task was submitted to.
With label-based scheduling, we can specify the nodes to which tasks are submitted, for example for:

Applications that have common resource requirements

Applications that have demanding memory requirements

Applications that have demanding I/O requirements

NOTES :
-ResourceManager is responsible for the unified management and allocation of all resources in the cluster
-NodeManager is the agent on each node
-ApplicationMaster is responsible for all the jobs within the application lifecycle
-Each Map Task has a circular memory buffer, which is 100 MB by default (the size can be configured by the user)
-The spill threshold of the buffer is 80%; when it is reached, the data in the buffer is written to local disks
-Shuffle : the data transfer process between the Map Tasks and Reduce Tasks, in which the Reduce Tasks copy MOF files from the Map Tasks and then sort and merge the MOF files
-Users can view queue information on the FusionInsight WebUI

Chapter 4: Spark2x In-Memory Distributed Computing Engine


Spark is a distributed computing engine based on memory; it stores intermediate processing data in memory.

Apache Spark is a one-stop solution that integrates batch processing, real-time stream processing, interactive query, graph computing, and machine learning.

Batch processing can be used for data ETL.

Machine learning can be used by shopping websites to judge whether customer reviews are positive or negative.

SQL query can be used for querying data in Hive.

Stream processing can be used for real-time business such as page-click stream analysis, recommendation systems, and public opinion analysis.

Spark Highlights :

Light : Spark is written in Scala, with about 30,000 lines of core code.

Fast : the delay for small data sets reaches the sub-second level.
Flexible : Spark offers flexibility at different levels; it supports new data operators, new data sources, new language bindings, etc.
Smart : it reuses existing big data components.

Spark Ecosystem :

Spark interacts with multiple

applications : Hive, Mahout, Flume
environments : Hadoop, Docker, Mesos
data sources : HBase, Elasticsearch, Kafka, MySQL

Spark vs MapReduce :

Intermediate data of Spark is stored in memory, whereas intermediate data of MapReduce is stored on local disks.
Spark thus improves computing efficiency and reduces the delay of iterative operations and batch processing.

Spark performance is over 100 times higher than that of MapReduce (for multiple iterations).

Spark System Architecture

Standalone : Spark's own resource management system; Spark also supports YARN and Mesos as resource management systems
Spark Core : distributed computing framework similar to MapReduce
Spark SQL : Spark component mainly for processing structured data and conducting SQL queries on data (formats: JSON, Parquet, ORC)
Structured Streaming : engine built on Spark SQL to process streaming data; it is programmed using Scala and is fault-tolerant
Spark Streaming : streaming engine for micro-batch processing; stream data is sliced and then processed in the Spark Core computation engine
MLlib : Spark's machine learning library
GraphX : graph-parallel computing
SparkR : R package that provides a lightweight frontend to use Apache Spark from R

Core Concepts of Spark - RDD:

RDD (Resilient Distributed Dataset) is the fundamental data structure of Spark; it is a read-only, partitioned collection of records.

-An RDD is stored in memory by default and overflows to disk in case of insufficient memory.

-Each RDD is divided into logical partitions, which may be computed on different nodes of the cluster (this improves performance through data locality).

-An RDD can be created from a Hadoop file system such as HDFS, or from any storage system that Hadoop is compatible with.

-RDDs support a lineage mechanism, which records the dependency chain.

-An RDD remembers how it evolved from another RDD through its lineage, so data can be recovered quickly when data loss occurs.

RDD Dependencies :

Two types :
Narrow dependencies:
Each partition of the parent RDD is used by at most one partition of the child RDD
Advantages :
supports multiple commands executed in pipeline mode
failure recovery of a narrow dependency is more effective (because only the lost parent partitions need to be recomputed, and recomputation can be performed concurrently on different nodes)
Wide dependencies:
Each partition of the parent RDD may be used by multiple partitions of the child RDD

NOTE :
The Spark scheduler reversely traverses the whole dependency chain from the end of the DAG (Directed Acyclic Graph)

Stage Division of RDD:

When a wide dependency is encountered, the dependency chain is broken and a new stage begins

When a narrow dependency is encountered, the RDD partition is added to the current stage

The number of tasks in a stage is determined by the number of RDD partitions at the end of the stage

RDD Operators :

Transformation :

Invoked to generate a new RDD from one or more existing RDDs (map, flatMap, filter, reduceByKey)

Action :

A job is started immediately when an action operator is invoked (take, count, saveAsTextFile)

NOTE :
-All transformations in Spark are lazy
-Transformations are only computed when an action requires results to be returned to the driver program (see the sketch below)
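
A minimal sketch of lazy transformations followed by an action, using Spark's Java API (the input path and app name are illustrative):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class RddOperatorExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-demo").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> lines = sc.textFile("/tmp/input.txt");    // nothing computed yet
        JavaPairRDD<String, Integer> counts = lines
            .flatMap(line -> Arrays.asList(line.split(" ")).iterator()) // transformation (lazy)
            .mapToPair(word -> new Tuple2<>(word, 1))                   // transformation (lazy)
            .reduceByKey(Integer::sum);                                 // wide dependency -> new stage
        System.out.println(counts.count());                       // action: triggers the job
        sc.stop();
    }
}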

Major Roles of Spark :

Driver : responsible for the application's business logic and operation planning (DAG)
ApplicationMaster : manages application resources and applies for resources based on applications
Client : submits applications
ResourceManager : responsible for scheduling and allocating resources in the whole cluster
NodeManager : responsible for resource management on its node
Executor : actual executor of tasks; an application is split across multiple Executors for computing

Spark on YARN - Client Operation Process :


Two modes :
YARN-client :
The Driver is deployed and runs on the client

-The client sends the Spark application request to ResourceManager, packages all the information required to start ApplicationMaster, and sends that information to ResourceManager
-ResourceManager returns the results to the client (application ID, upper limit and lower limit of available resources, etc.)
-ResourceManager finds a proper node for ApplicationMaster and starts it on that node
-ApplicationMaster is a role on YARN; its process name in Spark is ExecutorLauncher
-ApplicationMaster applies for a series of containers to run tasks
-After receiving the newly allocated container list from ResourceManager, ApplicationMaster sends information to the related NodeManagers to start the containers
-ResourceManager allocates containers to ApplicationMaster; ApplicationMaster then communicates with the related NodeManagers and starts Executors on the obtained containers
-After an Executor is started, it registers with the Driver and applies for tasks
-The Driver allocates tasks to the Executors
-The Executors run the tasks and report their operating status to the Driver

YARN-cluster :
The client generates the application information and sends it to ResourceManager
-ResourceManager allocates a container (for ApplicationMaster) to the Spark application
-The Driver is then started on the container node
-ApplicationMaster applies for resources from ResourceManager to run Executors
-ResourceManager allocates containers to ApplicationMaster
-ApplicationMaster communicates with the related NodeManagers and starts Executors on the obtained containers
-After an Executor is started, it registers with the Driver and applies for tasks
-The Driver allocates tasks to the Executors
-The Executors run the tasks and report their operating status to the Driver

Differences between YARN-client and YARN-cluster :

In YARN-cluster mode, the Driver runs in ApplicationMaster, which is responsible for applying for resources from YARN and monitoring the running status of a job.
The client can be closed and the job continues running on YARN.
-YARN-cluster mode is not suitable for running interactive jobs.

In YARN-client mode, ApplicationMaster only applies for Executors from YARN.

The client communicates with the obtained containers to schedule tasks; therefore, the client cannot be closed.
YARN-cluster mode is suitable for production because the output of the application can be quickly generated.
YARN-client mode is suitable for testing.
If the task submission node in YARN-client mode goes down, the entire task fails; the same situation in YARN-cluster mode does not affect the task.

SparkSQL :
A module in Spark for structured data processing, which can parse SQL statements into operations on RDDs and then use Spark Core to execute them

A DataFrame is a distributed collection in which data is organized into named columns

Spark SQL reuses the Hive frontend processing logic and metadata processing module
With SparkSQL you can directly query existing Hive data
SparkSQL also provides API, CLI, and JDBC interfaces, allowing diverse inputs from clients
A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations

Datasets are similar to RDDs, but use dedicated encoders rather than Java serialization or Kryo to serialize objects for processing or transmission over the network (objects are stored in encoded binary form)

A DataFrame can be created from a structured dataset, a Hive table, an external database, or an RDD (see the sketch below)
A DataFrame carries data structure information, namely the schema
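
A hedged sketch of creating a DataFrame and querying it with Spark SQL through the Java API (the JSON path, view name, and column names are illustrative):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("sql-demo").master("local[*]").getOrCreate();
        Dataset<Row> df = spark.read().json("/tmp/people.json"); // DataFrame = Dataset<Row>
        df.printSchema();                       // the schema carried by the DataFrame
        df.createOrReplaceTempView("people");   // expose the DataFrame to SQL
        spark.sql("SELECT name FROM people WHERE age > 21").show();
        spark.stop();
    }
}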

RDD, DataFrame, Dataset :

RDD :
Type-safe, object-oriented
Disadvantages : high performance overhead for serialization and deserialization; requires serialization and deserialization of both data and data structures

Dataset and DataFrame have exactly the same functions; they differ only in the type of the data in each row.
For a DataFrame, data in each row is of the Row type (use getAs, or pattern matching by column, to obtain a specific field)
DataFrame advantages :
schema information reduces the serialization and deserialization overhead
Disadvantages : not object-oriented; not type-safe at compile time

Dataset :
Fast : performance is superior to RDD
Encoders are better than Kryo or Java serialization
Type-safe : similar to RDD
A Dataset has the advantages of both RDD and DataFrame, and avoids their disadvantages.

NOTE :
Dataset, DataFrame, and RDD can be converted to each other
Spark SQL vs Hive :
Differences :
Spark SQL uses Spark Core as its execution engine, while Hive uses MapReduce
The execution speed of SparkSQL is 10 to 100 times faster than Hive
Spark SQL does not support buckets, but Hive does
Dependencies :
SparkSQL depends on the metadata of Hive and is compatible with most Hive syntax and functions
It can also use user-defined functions (UDFs) from Hive

Spark Structured Streaming :

Structured Streaming is an engine built on SparkSQL to process streaming data.
It is programmed using Scala and has fault-tolerance capability.

NOTE : Spark uses standard SQL statements to query and obtain data from the incrementally growing, unbounded table.
Consider the data input stream as the input table:
every data item arriving on the stream is like a new row being appended to the input table.
Each query operation generates a result table.
At each trigger interval (for example, every 1 second), updated data is synchronized to the result table.
Whenever the result table is updated,
the updated result is written to an external storage system.
Three output (storage) modes of Structured Streaming at the output phase (a sketch using complete mode follows the list):
Complete mode : a connector of the external system writes the entire updated result set into the external storage system.
Append mode : when an interval is triggered, only the data newly added to the result table is written to the external system (applicable only to result rows that already exist and will not be updated).
Update mode : when an interval is triggered, only the updated data in the result table is written to the external system.
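
A hedged Structured Streaming sketch that treats a socket stream as an unbounded input table and writes a running word count in complete mode (host and port are illustrative):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import static org.apache.spark.sql.functions.*;

public class StructuredStreamingExample {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("structured-streaming-demo").master("local[*]").getOrCreate();
        Dataset<Row> lines = spark.readStream()
                .format("socket").option("host", "localhost").option("port", 9999)
                .load();                                      // the unbounded input table
        Dataset<Row> counts = lines
                .select(explode(split(col("value"), " ")).as("word"))
                .groupBy("word").count();                     // the result table
        StreamingQuery query = counts.writeStream()
                .outputMode("complete")                       // write the full result each trigger
                .format("console").start();
        query.awaitTermination();
    }
}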

Spark Streaming :

Spark Streaming is a real-time computing framework built on Spark for processing massive streaming data (data sources: Kafka, HDFS).

Spark Streaming programming is based on DStreams.

Spark Streaming segments the input data by second or millisecond and periodically submits the segmented data, which decomposes streaming programming into a series of short batch jobs.

Workflow :
Spark Streaming receives live input data streams and divides the data into batches
These batches are then processed by the Spark engine to generate the final stream of results in batches (see the sketch below)
NOTE : Spark Streaming provides a high-level abstraction called a discretized stream, or DStream (a continuous stream of data)
DStreams can be created either from input data streams from sources such as Kafka and Flume, or by applying high-level operations (reduce, join, window, ...) on other DStreams
A DStream is represented as a sequence of RDDs
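
A hedged DStream sketch that slices a socket stream into 1-second batches (host and port are illustrative):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class DStreamExample {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("dstream-demo").setMaster("local[2]");
        // batch interval of 1 second: the input is segmented into a sequence of RDDs
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
        JavaDStream<Long> counts = lines.count();   // one count per 1-second batch
        counts.print();
        jssc.start();
        jssc.awaitTermination();
    }
}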

Spark Streaming fault tolerance mechanism :

Spark Streaming performs computing based on RDDs, so any partition encountering an error can be regenerated from its parent RDD using the RDD lineage mechanism.
If the parent RDD is also lost, look up its parent in turn until the original data on disk is found.

Spark WebUI :
Shows service status, roles, and open configuration items, and supports management operations (starting and stopping Spark, downloading the Spark client, synchronizing configurations, viewing the health status of running instances and the service overview).

Spark and other components :

HDFS : Spark reads and writes data in HDFS (mandatory)
YARN : YARN schedules and manages resources to support the running of Spark tasks (mandatory)
Hive : Spark SQL shares metadata and data files with Hive (mandatory)
ZooKeeper : HA of JDBCServer depends on the coordination of ZooKeeper (mandatory)
Kafka : Spark can receive data streams sent by Kafka (optional)
HBase : Spark can perform operations on HBase tables (optional)

NOTE : JDBCServer provides the JDBC interface for directly sending external JDBC requests to compute and parse structured data.

Chapter 5: HBase - Distributed NoSQL Database

HBase uses memstore and storefile to store update data for the table.
The secondary index implements the function of indexing according to the values of
some columns.
HBase is a highly reliable, high performance, column-oriented, scalable distributed
storage system.
The HBase cluster has two roles: HMaster and HRegionServer.

HBase is a column-based distributed database built on top of HDFS


HBase uses HDFS as it’s file storage system with high reliability, performance, and
scalability
It’s provides real time read or write access to data in HDFS
Also HBase uses Zookeeper as the collaboration service
HBase is no SQL not only SQL (Non relational database) like MongoDB and REDIS

HBase vs RDB :
HBase is column-oriented : data is stored, read, and computed by column
HBase supports dynamic extension of columns, whereas an RDB uses a predefined data structure
HBase supports common commercial hardware, whereas an RDB requires I/O-intensive, costly hardware

Application Scenarios of HBase


Massive data (TB and PB)
The ACID (standard database transaction) feature is not required
High throughput
Efficient random reading of massive data
High scalability
Simultaneous processing of structured and unstructured data

KeyValue Storage Model :


The data model in HBase is KeyValue, which means data is stored in the form of KeyValue pairs
The key is used to quickly query a data record
The value is used to store user data
Each key can correspond to multiple values
Data is read and written by block; different columns are not associated with each other, and neither are tables

A subregion is a basic distributed storage unit, stored in different modes

A KeyValue contains key information such as the timestamp and type, etc. (7 parts in total)
Each KeyValue has a column qualifier
There can be multiple KeyValues associated with the same rowkey and column qualifier
When data is updated in HBase, there are multiple versions of the same data record, distinguished by timestamp (see the sketch after this list)
A column family consists of one or more columns
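
A hedged sketch of writing and reading one KeyValue with the HBase Java client (table, column family, qualifier, and rowkey names are illustrative):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseKeyValueExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("user_info"))) {
            // rowkey -> column family -> column qualifier -> timestamped value
            Put put = new Put(Bytes.toBytes("row-001"));
            put.addColumn(Bytes.toBytes("base"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);
            Result r = table.get(new Get(Bytes.toBytes("row-001")));
            System.out.println(Bytes.toString(
                r.getValue(Bytes.toBytes("base"), Bytes.toBytes("name"))));
        }
    }
}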

HBase architecture :
HMaster : involves an active HMaster and a standby HMaster in HA mode
The active HMaster manages the RegionServers in HBase, including the creation, deletion, modification, and query of tables; balances the load of RegionServers; adjusts the distribution of regions; splits regions and distributes them after the split; and migrates regions after a RegionServer fails
The standby HMaster takes over services when the active one fails
The original active HMaster serves as the standby HMaster after the fault is rectified.
