BigData Unit-4 Complete
BIG DATA
HADOOP ECOSYSTEM COMPONENTS
• The Hadoop ecosystem is a collection of open-source software utilities,
frameworks, and tools designed to facilitate big data processing and analytics.
• The Hadoop ecosystem can be broadly categorized into several layers, each
serving a specific purpose in the big data processing and analytics pipeline.
Here are the main layers:
1. Storage Layer:
Hadoop Distributed File System (HDFS): This is the primary storage layer in Hadoop.
HDFS is a distributed file system that provides high-throughput access to application data
by storing it across multiple nodes in a Hadoop cluster.
2. Resource Management Layer:
YARN (Yet Another Resource Negotiator): YARN is a resource management layer in
Hadoop that manages resources in the cluster and schedules tasks. It enables multiple
data processing frameworks to run on the same cluster by decoupling resource
management from job scheduling and monitoring.
HADOOP ECOSYSTEM COMPONENTS
3. Data Processing Layer:
a) MapReduce: Programming model for distributed computing on large datasets.
b) Apache Spark: Cluster computing framework supporting in-memory
processing and various data processing tasks.
4. Query and Analytics Layer:
a) Apache Hive: Data warehousing framework providing a SQL-like interface
for querying and analyzing data stored in HDFS.
b) Apache Pig: High-level platform for creating MapReduce programs.
c) Apache Impala: SQL query engine for interactive analytics on data stored in
HDFS and Apache HBase.
d) Presto: Distributed SQL query engine for interactive analytics on large
datasets.
HADOOP ECOSYSTEM COMPONENTS
5. Data Ingestion and Integration Layer:
a) Apache Kafka: Distributed streaming platform for building real-time data
pipelines.
b) Apache Flume: Service for efficiently collecting and aggregating log data
from various sources.
c) Apache Sqoop: Tool for efficiently transferring bulk data between Hadoop
and relational databases.
SQL vs NoSQL
• SQL databases are best suited for complex queries.
• NoSQL databases are not as well suited for complex queries.
Common NoSQL data models:
• Graph database
• Document-oriented
• Column family
What is MongoDB?
MongoDB is an open source, document-oriented database designed with both
scalability and developer agility in mind.
Instead of storing your data in tables and rows as you would with a relational database,
in MongoDB you store JSON-like documents with dynamic schemas (schema-free / schemaless).
{
  "_id" : ObjectId("5114e0bd42…"),
  "FirstName" : "John",
  "LastName" : "Doe",
  "Age" : 39,
  "Interests" : [ "Reading", "Mountain Biking" ],
  "Favorites" : {
    "color" : "Blue",
    "sport" : "Soccer"
  }
}
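For illustration, a minimal sketch of inserting and querying such a document from Scala with the official MongoDB Scala driver; the connection string, database name "demo", and collection name "people" are assumptions:

  import org.mongodb.scala._
  import org.mongodb.scala.model.Filters
  import scala.concurrent.Await
  import scala.concurrent.duration._

  val client = MongoClient("mongodb://localhost:27017")      // assumed local server
  val people: MongoCollection[Document] =
    client.getDatabase("demo").getCollection("people")       // assumed names

  val doc = Document(
    "FirstName" -> "John", "LastName" -> "Doe", "Age" -> 39,
    "Interests" -> Seq("Reading", "Mountain Biking"),
    "Favorites" -> Document("color" -> "Blue", "sport" -> "Soccer"))

  // Driver calls return Observables; converting to Futures here forces execution.
  Await.result(people.insertOne(doc).toFuture(), 10.seconds)
  val johns = Await.result(people.find(Filters.equal("FirstName", "John")).toFuture(), 10.seconds)
  johns.foreach(d => println(d.toJson()))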
MongoDB is Easy to Use
Schema Free
• MongoDB does not need any pre-defined data schema.
• Every document could have different data, for example:
{name: "will", eyes: "blue", birthplace: "NY", aliases: ["bill", "ben"], loc: [32.7, 63.4], boss: "ben"}
{name: "jeff", eyes: "blue", loc: [40.7, 73.4], boss: "ben"}
{name: "brendan", boss: "will"}
{name: "matt", weight: 60, height: 72, loc: [44.6, 71.3]}
{name: "ben", age: 25}
RDBMS vs MongoDB
RDBMS        MongoDB
Database     Database
Table        Collection
Row          Document (JSON, BSON)
Column       Field
Index        Index
Join         Embedded Document
Partition    Shard
Features Of MongoDB
• Document-Oriented storage
• Full Index Support
• Replication & High Availability
• Auto-Sharding
• Aggregation
• MongoDB Atlas
• Various APIs
• JavaScript, Python, Ruby, Perl, Java, Scala, C#, C++, Haskell, Erlang
• Community
MongoDB
MongoDB is an open-source NoSQL database system designed for handling large volumes
of data. Unlike traditional relational databases, MongoDB uses a flexible, document-
oriented data model, making it particularly suitable for applications with evolving schemas
and complex data structures.
• Here's a brief introduction to some key concepts in MongoDB:
• Document: In MongoDB, data is stored in flexible, JSON-like documents. A document is a set of key-
value pairs, where keys are strings and values can be various data types, including strings, numbers,
arrays, or even nested documents.
• Collection: Collections are analogous to tables in relational databases. They are groups of MongoDB
documents, and each document within a collection can have a different structure. Collections do
not enforce a schema, allowing for flexibility in data representation.
• Database: A MongoDB database is a container for collections. It holds one or more collections of
documents.
• Document ID: Each document in a collection has a unique identifier called the "_id" field. This field
is automatically indexed and ensures the uniqueness of each document within a collection.
MongoDB
• Query Language: MongoDB provides a powerful query language that allows you to retrieve, filter, and
manipulate data stored in the database. Queries are expressed using JSON-like syntax.
• Indexes: MongoDB supports indexing to improve query performance. Indexes can be created on any field
in a document, including nested fields, and can significantly speed up data retrieval operations.
• Replication: MongoDB supports replica sets, which are groups of MongoDB instances that maintain the
same data set. Replica sets provide high availability and fault tolerance by automatically electing a
primary node to serve read and write operations.
• Sharding: Sharding is a method for distributing data across multiple machines to support horizontal
scalability. MongoDB can automatically partition data across shards based on a shard key, allowing for
high throughput and storage capacity.
• Aggregation Framework: MongoDB provides a powerful aggregation framework for performing data
aggregation operations, such as grouping, sorting, and filtering, similar to SQL's GROUP BY and ORDER BY
clauses (a short driver-based sketch follows this list).
• GridFS: MongoDB includes a specification called GridFS for storing and retrieving large files, such as
images, videos, and audio files, as separate documents.
• MongoDB is widely used in modern web applications, big data, real-time analytics, and IoT (Internet of Things)
applications due to its flexibility, scalability, and performance. It's a valuable tool for developers seeking to
manage and analyze large volumes of diverse data.
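Continuing the sketch above, the query language and aggregation framework can be exercised from the MongoDB Scala driver roughly like this; the field names follow the earlier sample document, and the connection and collection names remain assumptions:

  import org.mongodb.scala._
  import org.mongodb.scala.model.{Accumulators, Aggregates, Sorts}
  import scala.concurrent.Await
  import scala.concurrent.duration._

  val people: MongoCollection[Document] =
    MongoClient("mongodb://localhost:27017").getDatabase("demo").getCollection("people")

  // Group documents by favourite color and count them, most common first —
  // roughly the GROUP BY / ORDER BY analogy from the bullet above.
  val pipeline = Seq(
    Aggregates.group("$Favorites.color", Accumulators.sum("count", 1)),
    Aggregates.sort(Sorts.descending("count")))

  Await.result(people.aggregate(pipeline).toFuture(), 10.seconds)
    .foreach(d => println(d.toJson()))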
Replication
• Replication provides redundancy and increases data availability.
• With multiple copies of data on different database servers,
replication provides a level of fault tolerance against the loss of a
single database server.
Index Management:
• MongoDB provides commands to manage indexes, including creating, dropping, and listing
indexes. You can also view index usage statistics to optimize index performance.
• Indexes play a vital role in optimizing MongoDB performance, especially in scenarios with large
datasets and complex query requirements. Understanding how to create and use indexes
effectively can significantly improve the efficiency of MongoDB databases.
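A hedged sketch of these index-management commands via the Scala driver; the collection and field names are carried over from the earlier examples:

  import org.mongodb.scala._
  import org.mongodb.scala.model.Indexes
  import scala.concurrent.Await
  import scala.concurrent.duration._

  val people: MongoCollection[Document] =
    MongoClient("mongodb://localhost:27017").getDatabase("demo").getCollection("people")

  // Create a single-field and a compound index, list existing indexes, then drop one.
  Await.result(people.createIndex(Indexes.ascending("LastName")).toFuture(), 10.seconds)
  Await.result(people.createIndex(Indexes.ascending("LastName", "FirstName")).toFuture(), 10.seconds)
  Await.result(people.listIndexes().toFuture(), 10.seconds).foreach(d => println(d.toJson()))
  Await.result(people.dropIndex("LastName_1").toFuture(), 10.seconds)   // MongoDB's default index name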
CAPPED COLLECTIONS IN MONGODB
• Capped collections in MongoDB are special types of collections that have a fixed size and follow a FIFO (First In, First Out) storage mechanism. They are designed for use cases where data needs to be stored in a circular-buffer fashion, such as logging or caching systems. Here's an overview of capped collections in MongoDB:
1. Fixed Size: Capped collections have a predefined size limit specified during their creation. Once this limit is reached, MongoDB automatically removes older documents to accommodate new ones, ensuring that the collection never exceeds its size limit.
2. Insertion Order: Documents in a capped collection are stored in the order in which they were inserted. When the collection reaches its size limit and needs to make space for new documents, MongoDB removes the oldest documents first.
CAPPED COLLECTIONS IN MONGODB
3. Automatic Overwrite: In a capped collection, when the collection is
full and a new document is inserted, MongoDB automatically removes
the oldest document to make space for the new one. This behavior is
similar to a circular buffer.
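A minimal sketch of creating a capped collection from the Scala driver, assuming the same local database as above; the size and document limits are arbitrary:

  import org.mongodb.scala._
  import org.mongodb.scala.model.CreateCollectionOptions
  import scala.concurrent.Await
  import scala.concurrent.duration._

  val db = MongoClient("mongodb://localhost:27017").getDatabase("demo")

  // At most 1 MB / 1000 documents; once full, the oldest documents are overwritten first (FIFO).
  val options = new CreateCollectionOptions()
    .capped(true)
    .sizeInBytes(1024 * 1024)
    .maxDocuments(1000)

  Await.result(db.createCollection("eventLog", options).toFuture(), 10.seconds)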
ANATOMY OF A SPARK JOB RUN
• DAG Construction: A Spark job is split into multiple stages, and each stage runs a specific set of tasks.
• There are mainly two types of tasks: shuffle map tasks and result tasks.
• Shuffle map tasks: Each shuffle map task runs a computation on one RDD partition and writes its output to a new set of partitions, which are fetched in a later stage. Shuffle map tasks run in all stages except the final stage.
• Result tasks: Result tasks run in the final stage and return a result to the user's program. Each result task runs the computation on its RDD partition and sends the result back to the driver, which assembles the results from all partitions into a final result. Note that each task is given a placement preference by the DAG scheduler so that the task scheduler can take advantage of data locality. Once the DAG scheduler completes construction of the DAG of stages, it submits each stage's set of tasks to the task scheduler. Child stages are submitted only once their parents have completed successfully.
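For example, a minimal word-count job (the input path is illustrative) produces exactly these two kinds of tasks:

  import org.apache.spark.sql.SparkSession

  object StageDemo {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().appName("StageDemo").master("local[*]").getOrCreate()
      val counts = spark.sparkContext
        .textFile("input.txt")              // assumed input file
        .flatMap(_.split("\\s+"))
        .map(word => (word, 1))             // stage 1: shuffle map tasks write shuffle output
        .reduceByKey(_ + _)                 // shuffle boundary => a new stage
      counts.collect().foreach(println)     // final stage: result tasks return data to the driver
      spark.stop()
    }
  }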
ANATOMY OF A SPARK JOB RUN
• Task Scheduling: When the task scheduler receives a set of tasks, it uses its list of executors that are running for the application and constructs a mapping of tasks to executors on the basis of placement preference. For a given executor, the scheduler first assigns process-local tasks, then node-local tasks, then rack-local tasks, before assigning any arbitrary (random) task. Executors send status updates to the driver when a task is completed or has failed. In case of task failure, the task scheduler resubmits the task on another executor. It also launches speculative tasks for tasks that are running slowly, if this feature is enabled. Speculative tasks are duplicates of existing tasks, which the scheduler may run as a backup if a task is running more slowly than expected.
• Task Execution: The executor makes sure that the JAR and file dependencies are up to date, keeping a local cache of dependencies from previous tasks. It deserializes the task code (which consists of the user's functions) from the serialized bytes that were sent as part of the launch-task message. Finally, the task code is executed. A task can return a result to the driver; the result is serialized and sent to the executor backend, and finally to the driver as a status update message.
ANATOMY OF A SPARK JOB RUN
• The anatomy of a Spark job run typically involves several components and stages:
1. Job Submission: This is the initial stage where a user submits a Spark job. The submission
process can be done through various methods like the Spark shell, the spark-submit script, REST
APIs, or interactive notebooks like Jupyter or Zeppelin.
2. Job Scheduler: Once the job is submitted, it enters a queue managed by the Spark job
scheduler. The scheduler determines when and where the job will be executed based on
resource availability and scheduling policies. Popular schedulers include FIFO, Fair, and
Capacity schedulers.
3. Task Generation: When the job is scheduled to run, Spark's driver program translates the job
into a directed acyclic graph (DAG) of stages. Each stage consists of tasks, which are the
smallest unit of work in Spark. Tasks are created based on the transformations and actions
specified in the Spark application code (e.g., RDD transformations, DataFrame operations).
4. Stage Execution: The DAG scheduler divides the job into stages of tasks based on the
dependencies between RDDs (Resilient Distributed Datasets) or DataFrames. These stages are
then submitted to the task scheduler, which runs the tasks on executor nodes in the Spark
cluster. Executors are JVM processes that manage task execution and data storage.
ANATOMY OF A SPARK JOB RUN
5. Data Processing: During stage execution, tasks are performed on partitions of the input
data. Spark processes data in parallel across the executor nodes, leveraging the distributed
nature of the cluster. Intermediate results are cached in memory or spilled to disk if necessary.
6. Shuffle: If there are shuffle operations (e.g., groupByKey, reduceByKey), Spark
redistributes data across the cluster to ensure that records with the same key end up on the
same machine. This involves a data exchange phase between executor nodes.
7. Result Aggregation: After all tasks have completed, Spark aggregates the results of
individual tasks to produce the final output. Depending on the action performed in the Spark
job (e.g., collect, save), the result may be returned to the driver program, stored in external
storage, or displayed to the user.
8. Job Completion: Once the job has finished executing all its tasks and producing the final
output, the Spark application terminates, and resources are released. The driver program cleans
up any remaining resources and exits.
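The DAG built in steps 3–4 can be inspected from code via the RDD lineage; a small self-contained sketch (input values are arbitrary):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("DagInspect").master("local[*]").getOrCreate()
  val counts = spark.sparkContext
    .parallelize(Seq("a b", "b c", "a c"))
    .flatMap(_.split(" "))
    .map((_, 1))
    .reduceByKey(_ + _)          // shuffle dependency separates the two stages
  println(counts.toDebugString)  // prints the lineage (ShuffledRDD <- MapPartitionsRDD <- ...)
  spark.stop()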
SPARK ON YARN
• Spark on YARN refers to running Apache Spark applications on a Hadoop cluster managed by Yet
Another Resource Negotiator (YARN). YARN is the resource management layer in the Hadoop
ecosystem responsible for managing and allocating resources (CPU, memory) across applications
running on a Hadoop cluster.
• Here's how Spark integrates with YARN:
1. Resource Management: YARN provides resource management capabilities, allowing multiple
applications to run concurrently on the same Hadoop cluster without resource contention. Spark
leverages YARN to request and allocate resources (containers) for its executor nodes.
2. Job Execution: When a Spark application is submitted to run on YARN, it first communicates with
the ResourceManager to request resources for its driver and executor processes. The
ResourceManager allocates containers on various nodes in the cluster based on availability and
resource requirements.
3. Container Execution: Once resources are allocated, Spark launches its driver program in one of
the containers on the cluster. The driver program is responsible for orchestrating the execution of
the Spark application. Additionally, Spark launches executor processes in other containers to
perform the actual data processing tasks.
SPARK ON YARN
4. Task Execution: Within each executor, Spark executes tasks in parallel across the
available cores. Tasks process data partitions independently and communicate with each
other as needed, leveraging YARN for resource isolation and management.
5. Dynamic Resource Allocation: Spark on YARN supports dynamic resource allocation,
allowing it to adjust the number of executor containers based on workload demand. This
helps optimize resource utilization and improves cluster efficiency.
6. Fault Tolerance: YARN provides fault tolerance mechanisms to recover from failures
such as node failures or container crashes. Spark integrates with these mechanisms to
ensure that failed tasks are automatically re-executed on different nodes.
7. Integration with Hadoop Ecosystem: Running Spark on YARN allows seamless
integration with other Hadoop ecosystem components such as HDFS (Hadoop Distributed
File System), Hive, HBase, and others. This enables Spark applications to access data
stored in Hadoop and interact with other Hadoop-based services.
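As a hedged illustration, the YARN-related settings above map onto ordinary Spark configuration; in practice --master yarn is usually passed to spark-submit rather than hard-coded, and the application name and resource values below are arbitrary:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("YarnDemo")                                 // illustrative name
    .master("yarn")                                      // run on a YARN-managed cluster
    .config("spark.executor.instances", "4")             // initial executor containers
    .config("spark.executor.memory", "2g")               // memory per executor container
    .config("spark.dynamicAllocation.enabled", "true")   // grow/shrink executors with load
    // dynamic allocation on YARN also needs the external shuffle service
    // (or shuffle tracking in newer Spark versions)
    .getOrCreate()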
SCALA
• Scala is a powerful programming language that seamlessly combines
object-oriented and functional programming paradigms. It was
created by Martin Odersky, a professor at École Polytechnique
Fédérale de Lausanne (EPFL) in Switzerland and first released in 2003.
SCALA FEATURES
• Concise Syntax: Scala boasts a concise and expressive syntax that enables developers
to write clean and readable code. It draws inspiration from several programming
languages, including Java, Haskell, and Ruby.
• Object-Oriented: Scala is a fully object-oriented language, which means that every
value is an object. It supports classes, inheritance, and traits, allowing developers to
create robust and reusable code.
• Functional Programming: Scala also embraces functional programming principles.
Functions are first-class citizens, meaning they can be assigned to variables, passed as
arguments, and returned from other functions. Scala provides powerful higher-order
functions, pattern matching, and immutable data structures.
• Static Typing: Scala is statically typed, which means that the type of every expression
is known at compile-time. However, Scala's type inference system can often infer
types, reducing the need for explicit type annotations.
SCALA FEATURES
• Concurrency: Scala provides excellent support for concurrent and parallel
programming through features like actors and the Akka framework. These
features make it easier to write scalable and responsive applications.
• Interoperability: Scala runs on the Java Virtual Machine (JVM), which means it
seamlessly interoperates with Java. Scala code can call Java libraries and vice
versa, making it easy to leverage existing Java code and libraries.
• Tooling: Scala has a rich ecosystem of tools and libraries that facilitate
development. Popular build tools like sbt (Simple Build Tool) and dependency
management tools like Maven and Ivy are commonly used in Scala projects.
• Community and Adoption: Scala has a vibrant community and is widely adopted
in industry for a variety of applications, including web development, data analysis,
and distributed systems. It's used by companies like Twitter, LinkedIn, and Airbnb.
Popular libraries include Akka, Play Framework, Apache Spark, and ScalaTest.
SCALA FEATURES
• Type Inference: Scala has a sophisticated type inference system that can
often deduce the types of variables and expressions without explicit type
annotations. This reduces boilerplate code while still providing strong
static typing and compile-time safety.
• Immutable Collections: Scala offers a rich set of immutable collections
(e.g., List, Set, Map) that encourage functional programming practices
and make it easier to reason about code correctness and concurrency.
• Pattern Matching: Scala's pattern matching feature allows developers to
match values against patterns and destructure complex data structures. It
is often used in conjunction with case classes and sealed traits to write
expressive and type-safe code.
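A brief illustration of type inference, immutable collections, and pattern matching (values are arbitrary):

  val nums = List(1, 2, 3)                 // List[Int] is inferred
  val more = 0 :: nums                     // a new list; nums is unchanged
  val ages = Map("ann" -> 30, "bob" -> 25)
  val older = ages + ("cho" -> 41)         // a new Map; ages still has two entries
  val label = nums match {                 // pattern matching with destructuring
    case head :: _ => s"starts with $head"
    case Nil       => "empty"
  }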
SCALA FEATURES
• Lazy Evaluation: Scala supports lazy evaluation, allowing developers
to define values that are computed only when they are accessed for
the first time. This can be useful for optimizing performance and
handling infinite data structures.
• DSLs (Domain-Specific Languages): Scala's flexible syntax and
powerful features make it well-suited for creating internal DSLs.
Developers can use Scala to define custom syntax and abstractions
that closely match the problem domain, leading to more expressive
and maintainable code.
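For example, lazy values and the lazily evaluated LazyList (Scala 2.13+) behave like this:

  lazy val expensive: Int = { println("computing..."); 40 + 2 }
  println(expensive)                // prints "computing..." then 42
  println(expensive)                // cached: prints only 42

  val naturals = LazyList.from(0)   // a conceptually infinite sequence, computed on demand
  println(naturals.take(5).toList)  // List(0, 1, 2, 3, 4)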
CLASSES IN SCALA
• Definition: Classes define the properties (fields) and behaviors
(methods) of objects of a particular type.
CLASSES IN SCALA
• Fields:
• Fields can be mutable (var) or immutable (val).
• They can have default values, which are specified after the field's type.
• Constructors:
• Primary constructors are defined in the class declaration itself.
• Auxiliary constructors can be defined using the ‘this’ keyword.
• Methods:
• Methods define the behaviors of the class.
• They can take parameters and return values.
• Instantiation:
• val obj = new ClassName(arguments)
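An illustrative class (names and values are arbitrary) showing a primary constructor, a field with a default value, an auxiliary constructor, a method, and instantiation:

  class Person(val name: String, var age: Int) {
    var nickname: String = ""                       // field with a default value
    def this(name: String) = this(name, 18)         // auxiliary constructor delegates to the primary one
    def greet(): String = s"Hi, I am $name, age $age."
  }

  val p = new Person("Asha", 30)                    // instantiation: new ClassName(arguments)
  val q = new Person("Ravi")                        // uses the auxiliary constructor
  p.age = 31                                        // var field is mutable; the val field name is not
  println(p.greet())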
OBJECTS IN SCALA
• Definition: Objects are single instances of their own definitions. They
are similar to singletons in other languages.
• Syntax:
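A minimal sketch of the syntax (names are illustrative):

  object MathUtils {                  // a single instance, created lazily on first use
    val Pi = 3.14159                  // a constant held by the object
    def square(x: Double): Double = x * x
  }

  println(MathUtils.square(MathUtils.Pi))   // members are accessed through the object's name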
OBJECTS IN SCALA
• Fields and Methods:
• Objects can contain fields and methods, just like classes.
• Unlike class constructors, objects cannot take parameters, because they are never instantiated explicitly.
• Usage:
• Objects are typically used to encapsulate utility methods, hold constants, or serve as
entry points to an application.
• Companion Objects:
• When an object shares the same name with a class, it's called a companion object.
• Companion objects can access private members of the class, and vice versa.
• Singleton Pattern:
• Objects are often used to implement the singleton pattern, ensuring that only one
instance of the object exists.
EXAMPLE OF A CLASS AND OBJECT IN SCALA
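One possible example (the names are illustrative) of a class with a companion object acting as a factory:

  class Account private (val id: Int, private var balance: Double) {
    def deposit(amount: Double): Unit = balance += amount
    def current: Double = balance
  }

  object Account {                                  // companion object: same name, same source file
    private var nextId = 0
    def apply(opening: Double): Account = {         // factory method; may call the private constructor
      nextId += 1
      new Account(nextId, opening)
    }
  }

  val acct = Account(100.0)                         // no 'new' needed, thanks to apply()
  acct.deposit(25.0)
  println(s"${acct.id}: ${acct.current}")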
TYPES IN SCALA
• Integers:
• Byte: 8-bit signed integer (-128 to 127)
• Short: 16-bit signed integer (-32,768 to 32,767)
• Int: 32-bit signed integer (-2,147,483,648 to 2,147,483,647)
• Long: 64-bit signed integer (-9,223,372,036,854,775,808 to 9,223,372,036,854,775,807)
• Floating-Point Numbers:
• Float: 32-bit floating-point number
• Double: 64-bit floating-point number
• Boolean: Boolean type representing true or false.
• Characters: Char type representing a single 16-bit Unicode character.
• Strings: String type representing a sequence of characters.
• Unit: The Unit type represents the absence of a meaningful value; it roughly corresponds to void in Java and C.
• Null and Nothing: Null is a subtype of all reference types. Nothing is a subtype of all types and has no
instances.
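For illustration, literals of each type (values are arbitrary):

  val b: Byte     = 127
  val s: Short    = 32000
  val i: Int      = 2147483647
  val l: Long     = 9007199254740993L
  val f: Float    = 3.14f
  val d: Double   = 2.718281828
  val ok: Boolean = true
  val c: Char     = 'A'
  val str: String = "Scala"
  val u: Unit     = println("Unit carries no information")
  val maybe: String = null                      // Null: subtype of reference types (use is discouraged)
  def fail(msg: String): Nothing = throw new RuntimeException(msg)  // Nothing has no values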
OPERATORS IN SCALA
1. Arithmetic Operators:
• +, -, *, /, % (Addition, Subtraction, Multiplication, Division, Modulus)
2. Relational Operators:
• ==, !=, <, >, <=, >= (Equality, Inequality, Less Than, Greater Than, Less Than or Equal To, Greater Than or Equal To)
3. Logical Operators:
• &&, ||, ! (Logical AND, Logical OR, Logical NOT)
4. Assignment Operators:
• =, +=, -=, *=, /=, %=, etc. (Assignment, Compound Assignment)
5. Bitwise Operators:
• &, |, ^, <<, >>, >>>, ~ (Bitwise AND, Bitwise OR, Bitwise XOR, Left Shift, Right Shift, Unsigned Right Shift, Bitwise NOT)
6. String Concatenation:
• The + operator can be used to concatenate strings.
7. Type Casting:
• asInstanceOf is used for explicit type casting.
EXAMPLE
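One possible set of examples exercising each group of operators (values are arbitrary):

  val a = 7; val b = 3
  println(s"${a + b} ${a - b} ${a * b} ${a / b} ${a % b}")     // arithmetic: 10 4 21 2 1
  println(s"${a == b} ${a != b} ${a > b} ${a <= b}")           // relational
  println(s"${a > 0 && b > 0} ${a < 0 || b > 0} ${!(a == b)}") // logical
  var x = a
  x += b; x *= 2                                               // compound assignment: x is now 20
  println(s"${a & b} ${a | b} ${a ^ b} ${a << 1} ${~a}")       // bitwise
  println("a = " + a)                                          // string concatenation
  val any: Any = 42
  val n: Int = any.asInstanceOf[Int]                           // explicit type cast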
BUILT-IN CONTROL STRUCTURES OF SCALA
• Scala provides several built-in control structures for managing the
flow of execution in a program. Here are some of the key ones:
1. if-else Statements: Scala supports traditional if-else conditional statements for branching logic.
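For example (values are arbitrary):

  val temp = 28
  if (temp > 30) println("hot")
  else if (temp > 20) println("warm")
  else println("cool")

  // if-else is an expression in Scala, so it yields a value:
  val label = if (temp > 20) "warm" else "cool"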
BUILT-IN CONTROL STRUCTURES OF SCALA
• Pattern Matching: Pattern matching is a powerful feature in Scala
that allows you to match a value against a pattern and execute code
based on the match.
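A small illustration (the cases are arbitrary):

  def describe(x: Any): String = x match {
    case 0               => "zero"
    case n: Int if n > 0 => s"positive int $n"
    case s: String       => s"a string of length ${s.length}"
    case (a, b)          => s"a pair: $a and $b"
    case _               => "something else"
  }

  println(describe(0))            // zero
  println(describe(("x", 1)))     // a pair: x and 1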
BUILT-IN CONTROL STRUCTURES OF SCALA
• For Loops: Scala supports both traditional for loops and for-comprehensions.
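For example:

  for (i <- 1 to 3) println(i)                 // traditional-style for loop over a range

  val pairs = for {                            // for-comprehension with a guard; yields a collection
    x <- 1 to 3
    y <- 1 to 3
    if x < y
  } yield (x, y)
  println(pairs)                               // Vector((1,2), (1,3), (2,3))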
BUILT-IN CONTROL STRUCTURES OF SCALA
• While and Do-While Loops: Scala also supports while and do-while
loops for iterative execution.
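For example:

  var i = 0
  while (i < 3) {            // while loop
    println(i)
    i += 1
  }

  var n = 0
  do {                       // do-while (Scala 2 syntax; runs the body at least once)
    println(s"n = $n")
    n += 1
  } while (n < 3)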
BUILT-IN CONTROL STRUCTURES OF SCALA
• Try-Catch-Finally: Scala provides try-catch-finally blocks for
exception handling.
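For example (the function name is illustrative):

  def parsePort(s: String): Int =
    try {
      s.toInt
    } catch {
      case e: NumberFormatException =>        // catch clauses use pattern matching
        println(s"not a number: ${e.getMessage}")
        -1
    } finally {
      println("finished parsing")             // always runs
    }

  println(parsePort("8080"))   // 8080
  println(parsePort("oops"))   // -1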
BUILT-IN CONTROL STRUCTURES OF SCALA
• Option and Either: Scala encourages the use of Options and Eithers
for handling potentially absent or exceptional values.
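For example (the functions are illustrative):

  def lookup(id: Int): Option[String] =                      // absence without null
    if (id == 1) Some("Asha") else None

  def divide(a: Int, b: Int): Either[String, Int] =          // error value on the Left
    if (b == 0) Left("division by zero") else Right(a / b)

  println(lookup(1).map(_.toUpperCase))      // Some(ASHA)
  println(lookup(2).getOrElse("unknown"))    // unknown
  println(divide(10, 2))                     // Right(5)
  println(divide(1, 0))                      // Left(division by zero)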
FUNCTIONS IN SCALA
• In Scala, functions are first-class citizens, meaning you can treat
functions like any other value. You can assign functions to variables,
pass them as arguments to other functions, and return them from
functions.
• Defining Functions:
• Functions can be defined using the ‘def’ keyword.
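For example:

  def add(a: Int, b: Int): Int = a + b          // named function defined with 'def'
  def greet(name: String): Unit = println(s"Hello, $name")

  println(add(2, 3))   // 5
  greet("Scala")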
FUNCTIONS IN SCALA
• Anonymous Functions:
• Anonymous functions, also known as function literals or lambda
expressions, can be defined using the ‘=>’ syntax.
• Higher-order Functions:
• Higher-order functions are functions that take other functions as
parameters or return functions.
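For example:

  val double = (x: Int) => x * 2                        // anonymous function bound to a value
  println(List(1, 2, 3).map(double))                    // passed to a higher-order function: List(2, 4, 6)

  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))  // takes a function as a parameter
  println(applyTwice(_ + 3, 10))                        // 16

  def multiplier(k: Int): Int => Int = x => x * k       // returns a function
  println(multiplier(5)(4))                             // 20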
CLOSURES IN SCALA
• A closure is a function which captures the variables from its lexical
scope. In Scala, closures are created when you use variables from the
enclosing scope inside a function literal.
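For example (values are arbitrary):

  var bonus = 10                                  // free variable in the enclosing scope
  val addBonus = (salary: Int) => salary + bonus  // closure: captures 'bonus'
  println(addBonus(100))                          // 110
  bonus = 20
  println(addBonus(100))                          // 120 — the closure sees the updated value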