HCIA Big Data
NOTES:
Panthera : Intel open-source project for instant analysis, document storage, and a SQL
engine
In-memory computing : Google PowerDrill
Streaming computing : IBM InfoSphere
HDFS features :
High fault tolerance
High throughput : supports applications with large amounts of data
Large file storage : TB- and PB-level data storage
NOTE :
-HDFS is built around the idea that the most efficient data processing pattern is the
write-once-read-many pattern.
-each file, directory, and data block occupies about 150 bytes of NameNode memory
-a file is divided into segments called blocks
-the default block size is 128 MB
-HDFS supports multiple client instances running concurrently
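The block-splitting rule above is simple arithmetic; a minimal sketch (the 128 MB default is from the notes, the helper name is my own):

```python
import math

BLOCK_SIZE_MB = 128  # HDFS default block size

def num_blocks(file_size_mb: float) -> int:
    """Number of HDFS blocks a file occupies; the last block may be partial."""
    return max(1, math.ceil(file_size_mb / BLOCK_SIZE_MB))

# A 300 MB file is split into 3 blocks: 128 + 128 + 44 MB.
print(num_blocks(300))  # -> 3
```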
HDFS is inapplicable to :
Storing massive numbers of small files : the number of files that can be stored in the system is
restricted by the NameNode memory capacity
Random writes : there is no support for multiple writers on a file
Low-latency reads
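The small-file limitation follows from the ~150 bytes of NameNode memory per file, directory, or block noted earlier; a rough back-of-the-envelope sketch (the function and constants are illustrative only):

```python
METADATA_BYTES = 150  # approx. NameNode memory per file, directory, or block

def namenode_memory_mb(n_files: int, blocks_per_file: int = 1) -> float:
    """Rough NameNode heap needed: one entry per file plus one per block."""
    objects = n_files * (1 + blocks_per_file)
    return objects * METADATA_BYTES / 1024 / 1024

# 10 million single-block small files cost far more NameNode memory
# than the same data packed into 10,000 large multi-block files.
print(namenode_memory_mb(10_000_000))
print(namenode_memory_mb(10_000, blocks_per_file=8))
```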
Key-Features of HDFS:
Federation storage:
-HDFS Federation allows a cluster to scale by adding NameNodes
-each NameNode manages a portion of the filesystem namespace (a namespace volume)
-each namespace has a block pool that contains all the blocks for the files in that
namespace
Data storage policy:
HA : High Availability :
-reflected in an active/standby NameNode pair, elected via ZooKeeper
-the active NameNode provides services
-it stores the EditLog, which records the user operations handled by the active NameNode
-the standby NameNode backs up metadata as a hot spare
-the standby NameNode loads the EditLog of the active NameNode to synchronize metadata
-ZooKeeper is used to store status files and active/standby status information
-ZKFC (ZooKeeper Failover Controller) is used for monitoring the active/standby status
Data organization:
Data is stored by block in HDFS
In HDFS, data can be accessed through Java APIs, HTTP, or shell commands
Space reclamation :
Data replication:
-HDFS replicates file blocks and stores the replicas on different DataNodes
-An application can specify the number of replicas, and that number can be changed later.
Replica policy on HDFS :
The first replica is placed on the same node as the client
The second replica is placed on a remote rack
The third replica is placed on another rack
Otherwise, the first and third replicas are placed on different nodes of the same rack
Other replicas are placed randomly across the cluster
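A minimal pure-Python sketch of one common variant of this placement policy (here the third replica shares the second replica's rack, as in stock Hadoop; the topology format and function name are my own, and the real NameNode logic is far more involved):

```python
import random

def place_replicas(client_node, topology):
    """Simplified 3-replica placement sketch.

    topology: {rack_name: [node, ...]}. Assumes at least two racks and
    at least two nodes per rack. Returns three distinct nodes.
    """
    rack_of = {n: r for r, nodes in topology.items() for n in nodes}
    first = client_node                                   # client's own node
    remote_rack = random.choice(
        [r for r in topology if r != rack_of[first]])     # a different rack
    second = random.choice(topology[remote_rack])         # node on remote rack
    third = random.choice(
        [n for n in topology[remote_rack] if n != second])  # same rack as 2nd
    return [first, second, third]

topo = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_replicas("n1", topo))
```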
metadata persistence:
-The standby NameNode notifies the active NameNode to generate a new file called
EditLog.new and to continue recording logs in that new file. Meanwhile, the standby
NameNode obtains the old EditLog and periodically (each hour) downloads the FSImage,
which stores the file-system image, from the active NameNode
-The standby NameNode merges the old EditLog with the FSImage and generates a new
metadata file called FSImage.ckpt, then uploads this file to the active NameNode
-The generated FSImage.ckpt is renamed FSImage, overwriting the original
FSImage file
-The EditLog.new file is also renamed EditLog
-This operation is triggered every hour or when the EditLog reaches 64 MB
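The merge step above can be sketched as replaying log entries over an image; this toy model uses a plain dict for the FSImage and a list of operations for the EditLog (both representations are my own simplification, not the real on-disk formats):

```python
def checkpoint(fsimage: dict, editlog: list) -> dict:
    """Toy checkpoint: replay the EditLog over the FSImage to produce
    the merged metadata (FSImage.ckpt in the notes above)."""
    ckpt = dict(fsimage)                 # start from the downloaded FSImage
    for op, path, value in editlog:      # replay each recorded user operation
        if op == "create":
            ckpt[path] = value
        elif op == "delete":
            ckpt.pop(path, None)
    return ckpt                          # uploaded back as the new FSImage

image = {"/a": "blk_1"}
log = [("create", "/b", "blk_2"), ("delete", "/a", None)]
print(checkpoint(image, log))  # {'/b': 'blk_2'}
```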
Robustness
HDFS Colocation :
file-level colocation :
The blocks of multiple associated files are distributed on the same storage node =>
facilitates quick file access, avoids the high cost of data migration, and reduces resource
consumption
DataNodes send heartbeat messages to the NameNode periodically to report the data
status
The NameNode checks whether the data blocks are completely reported
If data blocks are not reported because a DataNode disk is damaged, the
NameNode initiates replica reconstruction to recover the lost replicas
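The completeness check above boils down to a set difference between expected and reported blocks; a minimal sketch (data shapes and names are hypothetical, not the real heartbeat protocol):

```python
def blocks_to_rebuild(expected: dict, reports: dict) -> dict:
    """Compare the blocks each DataNode should hold (expected) with what
    its report contains, returning blocks whose replicas must be rebuilt."""
    missing = {}
    for node, blocks in expected.items():
        reported = set(reports.get(node, []))   # empty if the node/disk failed
        lost = set(blocks) - reported
        if lost:
            missing[node] = sorted(lost)
    return missing

expected = {"dn1": ["blk_1", "blk_2"], "dn2": ["blk_2"]}
reports = {"dn1": ["blk_1"], "dn2": ["blk_2"]}
print(blocks_to_rebuild(expected, reports))  # {'dn1': ['blk_2']}
```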
Data Balance Mechanism:
Metadata reliability :
Security Mode :
Chapter 3 : MapReduce
MapReduce :
A processing technique and a programming model for distributed parallel computing over
massive data sets
Highlights :
Easy to program
Capabilities can be improved by adding nodes
High fault tolerance using policies or data migration
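The programming model can be illustrated with the classic word count, written here in plain Python rather than the Hadoop Java API (function names and the in-memory "shuffle" are my own simplification):

```python
from collections import defaultdict
from itertools import chain

def map_phase(line: str):
    """Map task: emit a (word, 1) pair for every word in its input split."""
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    """Reduce task: after the shuffle groups pairs by key, sum the counts."""
    grouped = defaultdict(int)
    for key, value in pairs:
        grouped[key] += value
    return dict(grouped)

splits = ["big data big", "data"]
counts = reduce_phase(chain.from_iterable(map_phase(s) for s in splits))
print(counts)  # {'big': 2, 'data': 2}
```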
Architecture of YARN :
It consists of three components : ResourceManager, NodeManager, and ApplicationMaster
YARN HA Solution :
Redundant ResourceManager:
Similar to the HDFS HA solution : when the active ResourceManager fails, failover can be
triggered automatically or manually to switch the active and standby states.
-When automatic failover is not enabled, administrators must run a command to
manually switch one of the ResourceManager nodes to the active state.
-When automatic failover is enabled and the active ResourceManager goes down or becomes
unresponsive, another ResourceManager is automatically elected to be the active node.
If an ApplicationMaster goes down, ResourceManager will close all the containers that it
manages, including containers where tasks are running.
-Then ResourceManager will start a new ApplicationMaster on another computing node.
-YARN supports preserving the container status when starting a new
ApplicationMaster, so that the tasks in those containers can continue running without any
failures.
YARN supports management and allocation of two types of resources : Memory & CPU
-Users can configure the memory size and the number of CPU cores for each NodeManager
-yarn.nodemanager.resource.memory-mb :
Indicates the physical memory available to containers running on this
NodeManager (in MB; the value must be smaller than the memory size of the
NodeManager server)
-yarn.nodemanager.vmem-pmem-ratio :
Indicates the ratio of the maximum available virtual memory to the physical memory
of a container (a container's virtual memory utilization is not allowed to exceed this
ratio times its assigned physical memory)
-yarn.nodemanager.resource.cpu-vcores :
Indicates the number of CPU cores that can be allocated to containers (it is
recommended to set this parameter to 1.5 to 2 times the number of physical CPU
cores)
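These three parameters are set per NodeManager in yarn-site.xml; a sketch with illustrative values only (the values are assumptions for an example 8-core/8 GB node, not recommendations):

```xml
<!-- yarn-site.xml: illustrative values only -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value>  <!-- must be below the server's physical memory -->
</property>
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>2.1</value>   <!-- max virtual:physical memory ratio per container -->
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>8</value>     <!-- e.g. 1.5-2x the physical core count -->
</property>
```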
Capacity assurance : administrators can set upper and lower limits for the resource
usage of each queue (queues share resources)
Flexibility :
The remaining resources of a queue can be used by other queues that require
resources.
Multi-tenancy : multiple users can share a cluster, and multiple apps can run concurrently.
Administrators can add multiple restrictions to prevent cluster resources from being
exclusively occupied by a single application, user, or queue.
Dynamic update of configuration files :
Administrators can dynamically modify configuration parameters to manage clusters
online.
Task Limitation :
Overused containers are queued only when the overall memory usage of all containers on a
NodeManager reaches a certain threshold value.
This value is the memory threshold of the NodeManager, which can be calculated as
follows :
Label-based scheduling :
Before this feature, it was impossible to control which node a task is submitted to.
With label-based scheduling, we can specify the nodes to which tasks are submitted.
NOTES :
-ResourceManager is responsible for the unified management and allocation of all
resources in the cluster
-NodeManager is the agent on each node
-ApplicationMaster is responsible for all the jobs within an application's lifecycle
-Each Map task has a circular memory buffer, 100 MB by default (the size can be
configured by the user).
-The spill threshold of the buffer is 80%; when it is reached, the data in the buffer is
written to local disks.
-Shuffle : the data transfer process between Map tasks and Reduce tasks, in which each
Reduce task copies MOF files from the Map tasks and then sorts and merges the
MOF files
-Users can view queue information on the FusionInsight WebUI
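The map-side buffer behavior in the notes above can be sketched as follows; the 100 MB size and 80% threshold are from the notes, while the accounting model is a deliberate simplification (the real buffer tracks records and metadata separately and spills in the background):

```python
BUFFER_MB = 100          # default circular buffer size per Map task
SPILL_THRESHOLD = 0.8    # spill to local disk at 80% occupancy

def buffer_records(record_sizes_mb):
    """Accumulate record sizes in the buffer and flush to disk whenever
    occupancy crosses the spill threshold. Returns (spills, leftover)."""
    spills, used = [], 0.0
    for size in record_sizes_mb:
        used += size
        if used >= BUFFER_MB * SPILL_THRESHOLD:
            spills.append(used)   # write buffered data to a local spill file
            used = 0.0
    return spills, used

spills, remaining = buffer_records([30, 30, 30, 10])
print(spills, remaining)  # [90.0] 10.0
```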
Spark Highlights :
Spark EcoSystem :
Spark Vs MapReduce :
Spark performance is over 100 times higher than that of MapReduce (for multi-iteration
workloads)
Standalone : Spark's own resource management system; Spark also supports the resource
management systems YARN and Mesos
Spark Core : distributed computing framework similar to MapReduce
Spark SQL : Spark component mainly for processing structured data and running SQL
queries on data (formats : JSON, Parquet, ORC, ...)
Structured Streaming : engine built on Spark SQL to process streaming data; it is
programmed using Scala and is fault-tolerant
Spark Streaming : streaming engine for micro-batch processing; stream data is sliced
and then processed in the computation engine of Spark Core
MLlib : Spark's machine learning library
GraphX : graph-parallel computing
SparkR : R package that provides a lightweight frontend to use Apache Spark from R.
RDD (Resilient Distributed Dataset) is a fundamental data structure of Spark : a read-
only, partitioned collection of records
-An RDD is stored in memory by default and spills to disk in case of insufficient
memory
-Each RDD is divided into logical partitions, which may be computed on different nodes
of the cluster (this improves performance through data locality)
-An RDD can be created from a Hadoop file system such as HDFS, or from any storage
system that Hadoop is compatible with
-An RDD remembers how it evolved from another RDD through its lineage, so data can be
recovered quickly when data loss occurs
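The lineage-based recovery described above can be sketched with a toy RDD class, written in plain Python rather than Spark's Scala internals (class and method names mimic the real API but everything here is a simplified model):

```python
class RDD:
    """Toy RDD: remembers its parent and the function that derives it,
    so a lost partition can be recomputed from the lineage."""
    def __init__(self, data=None, parent=None, fn=None):
        self.data, self.parent, self.fn = data, parent, fn

    def map(self, fn):
        return RDD(parent=self, fn=fn)   # record lineage, compute nothing

    def compute(self):
        if self.data is not None:        # cached / source data available
            return self.data
        # Data lost or never materialized: walk the lineage back.
        return [self.fn(x) for x in self.parent.compute()]

source = RDD(data=[1, 2, 3])
doubled = source.map(lambda x: x * 2)
doubled.data = doubled.compute()   # materialize and cache the result
doubled.data = None                # simulate losing the cached partition
print(doubled.compute())           # recomputed from lineage: [2, 4, 6]
```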
RDD Dependencies :
Two types :
Narrow dependencies:
Each partition of the parent RDD is used by at most one partition of the child RDD
advantages :
support for executing multiple operations in pipeline mode
failure recovery with narrow dependencies is more efficient (only the lost parent
partitions need to be recomputed, and recomputation can be performed concurrently on
different nodes)
Wide dependencies:
Each partition of the parent RDD may be used by multiple partitions of the child RDD
NOTE :
The Spark scheduler reversely traverses the whole dependency chain from the end of the
DAG (Directed Acyclic Graph)
the number of tasks in a stage is determined by the number of RDD partitions at the end
of the stage
RDD Operators :
Transformation :
Invoked to generate a new RDD from one or more existing RDDs (map, flatMap, filter,
reduceByKey)
Action :
Triggers the actual computation and returns a result to the driver program or writes data
to external storage (collect, count, saveAsTextFile)
NOTE :
-all transformations in Spark are lazy
-transformations are only computed when an action requires results to be returned
to the driver program
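Lazy evaluation can be illustrated with a toy class in plain Python: transformations only record a plan, and the action executes the whole chain (the class is my own model, not Spark's implementation):

```python
class LazyRDD:
    """Toy lazy dataset: map/filter record a plan; collect() runs it."""
    def __init__(self, data, plan=()):
        self._data, self._plan = data, plan

    def map(self, fn):                    # transformation: no work yet
        return LazyRDD(self._data, self._plan + (("map", fn),))

    def filter(self, pred):               # transformation: no work yet
        return LazyRDD(self._data, self._plan + (("filter", pred),))

    def collect(self):                    # action: execute the plan now
        out = self._data
        for kind, fn in self._plan:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

rdd = LazyRDD([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
print(rdd.collect())  # [20, 30, 40] - nothing ran until collect()
```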
Driver : responsible for the app business logic and operation planning (DAG)
ApplicationMaster : manages application resources and applies for resources
based on the application's needs
Client : submits applications
ResourceManager : responsible for scheduling and allocating resources in the whole
cluster
NodeManager : responsible for resource management on its own node
Executor : actual executor of a task; an application is split across multiple executors for
computation.
YARN client mode :
-The client sends the Spark app request to ResourceManager, packages all the
information required to start ApplicationMaster, and sends that information to
ResourceManager
-ResourceManager returns the results to the client (application ID, upper and lower
limits of available resources)
-ResourceManager finds a proper node for ApplicationMaster and starts it on that node
-ApplicationMaster is a role on YARN; in Spark client mode its process name is
ExecutorLauncher
-ApplicationMaster applies for a series of containers to run tasks
-After receiving the newly allocated container list from ResourceManager,
ApplicationMaster sends information to the related NodeManagers to start the containers
-ResourceManager allocates containers to ApplicationMaster, then ApplicationMaster
communicates with the related NodeManager and starts Executor on the obtained container
-After an Executor is started, it registers with Driver and applies for tasks
-Driver allocates tasks to Executors
-Executors run the tasks and report their operating status to Driver
YARN cluster :
-The client generates the application information and sends it to
ResourceManager
-ResourceManager allocates a container (for ApplicationMaster) to the Spark application
-Then Driver is started on the container's node
-ApplicationMaster applies for resources from ResourceManager to run Executors
-ResourceManager allocates containers to ApplicationMaster
-ApplicationMaster communicates with the related NodeManager and starts Executor on
the obtained container
-After an Executor is started, it registers with Driver and applies for tasks
-Driver allocates tasks to Executors
-Executors run the tasks and report their operating status to Driver
SparkSQL :
Module in Spark for structured data processing, which can parse SQL statements into
RDDs and then use Spark Core to execute them
Datasets are similar to RDDs but use specialized encoders, rather than Java serialization
or Kryo, to serialize objects for processing or transmission over the network (objects
are stored in an encoded binary form)
RDD,Dataframe,Datasets :
RDD :
Type-safe, object-oriented
Disadvantages : high performance overhead for serialization and deserialization;
requires serialization and deserialization of both data and data structures
Dataset and DataFrame have exactly the same functions but differ in the type of the
data in each row.
For a DataFrame, the data in each row is of the Row type (use getAs, or pattern matching
on columns, to obtain a specific field)
DataFrame advantages :
Schema information reduces the serialization and deserialization overhead.
Disadvantages : not object-oriented, not type-safe at compile time.
Dataset :
Fast : performance is superior to RDD
Encoders are better than Kryo or Java serialization
Type-safe : similar to RDD
Dataset has the advantages of both RDD and DataFrame, and avoids their disadvantages.
NOTE :
Dataset, DataFrame, RDD can be converted to each other
Spark SQL vs Hive :
Differences :
Spark SQL uses Spark Core as its execution engine while Hive uses MapReduce
The execution speed of Spark SQL is 10 to 100 times faster than Hive
Spark SQL does not support buckets, but Hive does.
Dependencies:
Spark SQL depends on the metadata of Hive and is compatible with most Hive syntax and
functions
It can also use user-defined functions from Hive
NOTE : Structured Streaming uses standard SQL statements to query data from an
incremental, unbounded table
Consider the data input stream as an input table :
Every data item arriving on the stream is like a new row being appended to the
input table
Each query operation generates a result table
At each trigger interval (e.g. every 1 second), updated data is synchronized to the result
table
Whenever the result table is updated,
the updated result is written into an external storage system
3 output modes of Structured Streaming at the output phase :
Complete mode : the entire updated result table is written into the external storage
system (the connector of the external system handles the write).
Append mode : when an interval is triggered, only the rows newly added to the result table
are written into the external system (applicable only when existing rows of the result set
are never updated).
Update mode : when an interval is triggered, only the rows updated in the result table are
written into the external system.
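The three output modes above can be sketched by diffing the result table against its state at the previous trigger; this model with plain dicts is my own illustration, not the Spark sink API:

```python
def emit(result_table, previous, mode):
    """Decide which rows of the result table reach the sink at a trigger."""
    if mode == "complete":        # write the entire updated result table
        return dict(result_table)
    if mode == "append":          # only rows that did not exist before
        return {k: v for k, v in result_table.items() if k not in previous}
    if mode == "update":          # rows added or changed since last trigger
        return {k: v for k, v in result_table.items()
                if previous.get(k) != v}
    raise ValueError(mode)

prev = {"a": 1, "b": 2}
now = {"a": 3, "b": 2, "c": 1}
print(emit(now, prev, "append"))  # {'c': 1}
print(emit(now, prev, "update"))  # {'a': 3, 'c': 1}
```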
Spark Streaming :
Spark Streaming is a real-time computing framework built on Spark for processing
massive streaming data (data sources : Kafka, HDFS)
Workflow :
Spark Streaming receives live input data streams and divides the data into batches
These are then processed by the Spark engine to generate the final stream of results in
batches
NOTE : Spark Streaming provides a high-level abstraction called a discretized stream, or
DStream (a continuous stream of data)
DStreams can be created either from input data streams from sources such as Kafka or
Flume, or by applying high-level operations (reduce, join, window, ...) on other DStreams
A DStream is represented as a sequence of RDDs
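The micro-batch model above can be sketched as slicing a stream into fixed-size batches, where each batch plays the role of one RDD in the DStream (count-based batching here is a simplification; real Spark Streaming slices by time interval):

```python
def micro_batches(stream, batch_size):
    """Slice an incoming stream into small batches; the sequence of
    batches is the toy equivalent of a DStream's sequence of RDDs."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch           # one "RDD" of the DStream
            batch = []
    if batch:
        yield batch               # final partial batch

dstream = list(micro_batches(range(7), 3))
print(dstream)  # [[0, 1, 2], [3, 4, 5], [6]]
```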
Spark WebUI :
Shows service status, roles, and configuration items, and supports management operations
(starting and stopping Spark, downloading the Spark client, synchronizing configurations,
viewing the running instances' health status and the service overview)
NOTE : JDBCServer provides the JDBC interface; external JDBC requests are sent directly
to it to compute and parse structured data.
HBase uses MemStore and StoreFile to store table updates.
The secondary index implements indexing according to the values of
some columns.
HBase is a highly reliable, high-performance, column-oriented, scalable distributed
storage system.
The HBase cluster has two roles: HMaster and HRegionServer.
HBase vs RDB
HBase is column-oriented : data is stored, read, and computed by column
HBase supports dynamic extension of columns; an RDB, on the other hand, uses a
predefined data structure
HBase runs on common commodity hardware, whereas an RDB requires I/O-intensive and
costly hardware
HBase architecture :
HMaster : involves an active HMaster and a standby HMaster in HA mode
The active HMaster manages RegionServers in HBase, including : creating, deleting,
modifying and querying tables; balancing the load of RegionServers; adjusting the
distribution of regions; splitting regions and distributing them after they are split; and
migrating regions after a RegionServer fails
The standby HMaster takes over services when the active one fails
The original active HMaster serves as the standby HMaster after the fault is rectified.