The document discusses the challenges and advantages of big data systems, particularly comparing Relational Database Management Systems (RDBMS) with NoSQL databases. It highlights the architecture and components of Hadoop, including its data storage and processing capabilities, as well as the evolution to YARN for resource management. Additionally, it introduces Apache Spark as a faster alternative to MapReduce, emphasizing its in-memory processing and support for various programming languages.
10 - Big Data Architecture and Tools
IE3062: Data and Operating Systems Security
Big Data System Architecture and Tools
Big Data Systems

Drawbacks of Relational Database Management Systems (RDBMS)
● RDBs are difficult to scale (the biggest disadvantage with respect to big data).
● RDBs are difficult to maintain and configure.
● Peak provisioning leads to unnecessary costs.
● The diversity of available systems complicates selection.
● NoSQL databases were designed to avoid these problems.

NoSQL Systems
● Advantages of NoSQL systems:
○ Elastic scaling (the main advantage for big data)
○ Less administration
○ Better economics
○ Flexible data models
● However, NoSQL drops a lot of RDBMS functionality:
○ Hard to access without explicit knowledge of the data model (semi-structured models provide the metadata).
○ Limited access control.
○ Limited indexing.
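A tiny illustration of this trade-off (plain Python, not any particular NoSQL product): two differently shaped JSON documents live in one collection, showing both the flexible data model and why access is hard without explicit knowledge of that model.

```python
import json

# Two records in one "collection" with different shapes: a flexible
# schema (NoSQL advantage), but reading code must know each record's
# structure, since there is no fixed schema to rely on (NoSQL drawback).
collection = [
    json.loads('{"id": 1, "name": "Alice", "email": "alice@example.com"}'),
    json.loads('{"id": 2, "name": "Bob", "signup": {"year": 2024}}'),
]

# Missing keys force defensive access patterns.
emails = [doc.get("email") for doc in collection]
print(emails)  # ['alice@example.com', None]
```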
● What is left in NoSQL is a massive amount of data in a clustered environment.

Distributed File Systems
● Designed for cluster-computing environments.
● The majority of analysis still runs on files:
○ Machine learning, statistics, and data mining methods usually access all available data.
○ Most data mining and statistics methods require well-defined input rather than semi-structured objects (data cleaning, transformation, ...).
● Scalable data analytics often suffices with a distributed file system.
● Analytics methods are parallelized on top of the distributed file system.

Challenges in Data Processing in a Cluster
● Data storage is challenging:
○ If one machine (node) fails, all of its files become unavailable.
○ How do we organize files?
○ How do we find files?
● Computations must be divided into tasks; a divide-and-conquer approach can be used.
● Parallelization challenges in processing:
○ How do we assign work units to workers?
○ What if we have more work units than workers?
○ What if workers need to share partial results?
○ How do we aggregate partial results?
○ How do we know all the workers have finished?
○ What if workers die?
● What is required? The developer specifies the computation to be performed, and an execution framework ("runtime") handles the actual execution, managing all of the challenges above.

What Hadoop Offers
● Data storage challenge – redundant, fault-tolerant data storage with the Hadoop Distributed File System (HDFS).
● Parallel processing challenges – the MapReduce parallel processing framework.
● Concurrency challenges – job coordination with YARN.
● Scale out, not scale up:
○ Avoids the limitations of symmetric multiprocessing and large shared-memory machines.
● Move processing to the data:
○ Clusters have limited bandwidth.
● Process data sequentially; avoid random access:
○ Seeks are expensive, while disk throughput is reasonable.
● Ability to scale on demand.
● Ability to use low-cost commodity hardware.

Hadoop Background
● A software framework written in Java for distributed processing of large datasets across a cluster-computing environment.
● Today, Hadoop is widely used as a general-purpose storage and analysis platform for big data.

Hadoop Cluster Architecture
● Compute nodes are mounted on racks (8-64 nodes per rack), and nodes within a rack are connected by a network (typically gigabit Ethernet).
● Racks are connected through a switch.
● The bandwidth of intra-rack communication is usually much greater than that of inter-rack communication.

Hadoop Cluster Components
● Hadoop clusters are composed of a network of master and worker nodes.
● The master nodes typically use higher-quality hardware and include a Name Node, a Secondary Name Node, and a Job Tracker.
● The workers are nodes running both Data Node and Task Tracker services on commodity hardware.
● The final component is the client nodes, which are responsible for loading data, fetching results, running client tools, etc. (client nodes do not have to be part of the cluster).

Hadoop Eco-System

MapReduce
● MapReduce layer composition: one Job Tracker runs on the master node, and many Task Trackers run on slave nodes.
● Job Tracker functionalities:
○ Coordinates all jobs running on the system.
○ Schedules tasks to run on Task Trackers.
○ Keeps a record of the overall progress of each job.
○ On task failure, reschedules the task on a different Task Tracker.
● Task Tracker functionalities:
○ Executes tasks.
○ Sends progress reports to the Job Tracker.
● A Task Tracker can run only one task at a time; other limitations of MapReduce 1.0:
○ No real-time or ad-hoc analysis (MapReduce tasks only).
○ Inability to use subsets of the data for an instant response.

MapReduce 2.0, or YARN
● The solution to the limitations of MapReduce 1.0 is YARN (Yet Another Resource Negotiator).
○ YARN allocates resources to applications effectively; in other words, it allows multiple applications to run simultaneously.

YARN: Yet Another Resource Negotiator
● Idea: split the two major functionalities of the Job Tracker into separate daemons:
○ Resource Manager
○ Application Master
● The Resource Manager runs on the master node, and a Node Manager runs on each slave node → the data-computation framework.
● The Application Master (which runs on a slave) works with the Node Manager to execute and monitor tasks.

MapReduce vs YARN
● In Hadoop 2.0, MapReduce is an application of YARN.
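The map/shuffle/reduce flow that both MapReduce 1.0 and the YARN-hosted version execute can be mimicked in a few lines of plain Python. This is a sketch of the programming model only; Hadoop runs the same phases distributed and fault-tolerant across a cluster.

```python
from collections import defaultdict

def map_phase(record):
    # Map: emit (key, value) pairs; here (word, 1) for a word count.
    return [(word, 1) for word in record.split()]

def shuffle(pairs):
    # Shuffle: group values by key (the framework does this between phases).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine all values observed for one key.
    return key, sum(values)

records = ["big data systems", "big data tools"]
mapped = [pair for record in records for pair in map_phase(record)]
reduced = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(reduced)  # {'big': 2, 'data': 2, 'systems': 1, 'tools': 1}
```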
YARN Containers
● A container represents a collection of resources (RAM, CPU cores, and disk) given to a task on a single node in a given cluster. In simple terms, a container is the place where a YARN application runs.
● An application/job runs on one or more containers.
● A container is supervised by the Node Manager and scheduled by the Resource Manager.

General Workflow of a YARN Application
1. The client submits an application to the Resource Manager.
2. The Resource Manager asks a Node Manager to create an Application Master, which registers with the Resource Manager.
3. The Application Master determines how many resources are needed and requests them from the Resource Manager.
4. The Resource Manager accepts the request and queues it up.
5. As the requested resources become available on slave nodes, the Resource Manager grants the Application Master containers on specific slave nodes.

Motivation
● In MapReduce, programs must be written as a chain of map and reduce steps, which is tedious to program.
● Furthermore, every map and reduce task reads from and writes to disk:
○ Many MapReduce applications can spend up to 90% of their time reading from and writing to disk.
○ This is inefficient for iterative tasks such as machine learning algorithms.

Apache Spark
● Apache Spark is an open-source unified analytics engine for large-scale data processing.
● Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
● Spark's specialty is that it keeps data between operations in memory (in-memory caching).

Features of Spark
● Spark is extremely fast (far faster than MapReduce).
● It provides many convenience functions (e.g., filter(), join(), flatMap(), distinct(), groupByKey(), reduceByKey(), sortByKey()).
● Native Scala, Java, Python, and R support.
● Spark provides libraries for machine learning, graphs, streaming data, and Spark SQL.
● Developed at AMPLab, UC Berkeley; now managed by Databricks.

Spark Requirements
● Apache Spark requires a cluster manager and a distributed storage system.
● Cluster management:
○ Spark supports standalone mode (a native Spark cluster); it is also possible to run the daemons on a single machine for testing.
○ Hadoop YARN, Apache Mesos, and Kubernetes are also supported.
● For distributed storage, Spark can interface with a wide variety of data sources:
○ Alluxio, HDFS, the MapR File System, Cassandra, OpenStack Swift, Amazon S3, Kudu, and the Lustre file system.

Spark Eco-System

Spark Architecture
● Spark has a master-worker architecture.

MapReduce vs. Spark
● Sorting 100 TB of data: Spark sorted the same data 3X faster using 10X fewer machines than Hadoop.

Spark - When Not to Use
● Even though Spark is versatile, its in-memory capabilities are not the best fit for all use cases:
○ For many simple use cases, Apache MapReduce and Hive might be a more appropriate choice.
○ Spark was not designed as a multi-user environment.
○ Spark users need to know whether their available memory is sufficient for a dataset.
○ Adding more users adds complications, since users must coordinate memory usage to run code.

Security of Big Data