Cloud Computing Unit 5
Introduction to Hadoop
Hadoop is an open-source framework designed to process and store large sets of data in a
distributed computing environment. It provides a way to manage big data using a cluster of
computers and offers scalable storage and processing power.
Why Hadoop?
Traditional single-machine systems struggle to store and process data at terabyte and petabyte scale. Hadoop addresses this by spreading storage and computation across clusters of inexpensive commodity hardware, tolerating node failures through replication, and processing data in parallel close to where it is stored.
2. Hadoop Architecture
Hadoop's core architecture has three layers: HDFS for storage, MapReduce for processing, and YARN for resource management.
HDFS (Hadoop Distributed File System)
HDFS is the storage layer of Hadoop, designed to store very large files across a distributed
environment. It divides large files into smaller blocks (typically 128MB or 256MB) and
distributes them across multiple nodes in the cluster.
NameNode: The master server that manages the metadata (structure) of the file system,
including the location of data blocks.
DataNode: Worker nodes that store the actual data blocks. They periodically send heartbeats
and block reports to the NameNode.
Secondary NameNode: Periodically merges the NameNode's edit log into the file-system image
(a checkpoint), keeping the metadata compact. Despite its name, it is not a hot standby and
cannot take over if the NameNode fails.
HDFS Features:
Block Size: Large block size (128MB or 256MB) reduces the overhead of managing small files.
Replication: Each block is replicated multiple times (default is 3 replicas) across the cluster to
ensure fault tolerance.
High Throughput: Optimized for streaming data access rather than low-latency access.
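To make the block and replication numbers concrete, here is a small Python sketch (a toy illustration, not HDFS's actual placement logic, which is rack-aware and capacity-aware) that splits a file into 128 MB blocks and assigns each block three DataNodes:

# Toy illustration of HDFS block splitting and replica placement.
import itertools

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB default block size
REPLICATION = 3                  # default replication factor

def split_and_place(file_size_bytes, datanodes):
    """Yield (block_index, replica_nodes) for a file of the given size."""
    num_blocks = -(-file_size_bytes // BLOCK_SIZE)  # ceiling division
    nodes = itertools.cycle(datanodes)
    for block in range(num_blocks):
        # Pick REPLICATION nodes round-robin for each block.
        replicas = [next(nodes) for _ in range(REPLICATION)]
        yield block, replicas

if __name__ == "__main__":
    one_gb = 1024 ** 3
    for block, replicas in split_and_place(one_gb, ["dn1", "dn2", "dn3", "dn4"]):
        print(f"block {block} -> {replicas}")   # 8 blocks for a 1 GB file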
MapReduce
MapReduce is the computational layer of Hadoop, responsible for processing large volumes of
data in parallel across the cluster. A job runs in three phases:
1. Map Phase: The input data is divided into smaller chunks, and each chunk is processed by a
mapper. The mapper generates intermediate key-value pairs.
2. Shuffle and Sort Phase: The intermediate key-value pairs are shuffled and sorted by key before
being sent to the reducers.
3. Reduce Phase: The reducers process the sorted data, typically aggregating or summarizing the
results, and then write the output to HDFS.
Key Concepts in MapReduce:
Mapper: Processes the input data and generates intermediate key-value pairs.
Reducer: Aggregates the intermediate data from mappers based on the key and generates the
final output.
JobTracker (Hadoop 1.x): The master daemon that scheduled MapReduce jobs and monitored their progress.
TaskTracker (Hadoop 1.x): Worker daemons that executed the map and reduce tasks. In Hadoop 2.0 and later, the JobTracker/TaskTracker roles were replaced by YARN, described next.
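The classic word-count example below simulates all three phases in plain Python. It is a single-process teaching sketch, not Hadoop's actual Java API: the mapper emits (word, 1) pairs, the shuffle groups pairs by key, and the reducer sums each group.

# Word count: a single-process simulation of the MapReduce phases.
from collections import defaultdict

def mapper(line):
    # Map phase: emit an intermediate (key, value) pair per word.
    for word in line.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Shuffle-and-sort phase: group intermediate values by key, sorted by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(key, values):
    # Reduce phase: aggregate the values for one key.
    return key, sum(values)

if __name__ == "__main__":
    lines = ["the quick brown fox", "the lazy dog", "the fox"]
    intermediate = [pair for line in lines for pair in mapper(line)]
    for key, values in shuffle(intermediate):
        print(reducer(key, values))   # ('brown', 1), ('dog', 1), ..., ('the', 3)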
YARN (Yet Another Resource Negotiator)
YARN is the resource management layer of Hadoop, responsible for managing and scheduling
resources across the cluster. It was introduced in Hadoop 2.0 to separate resource management
and job scheduling from MapReduce.
YARN Components:
ResourceManager: The master daemon responsible for managing resources and scheduling
tasks.
NodeManager: The worker daemon running on each node in the cluster, which monitors
resource usage and reports to the ResourceManager.
ApplicationMaster: Manages the execution of a specific application, including task scheduling
and resource allocation.
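The toy sketch below loosely models the interaction among these components: an ApplicationMaster requests containers and the ResourceManager grants them from NodeManagers with spare capacity. Real YARN schedulers (Capacity, Fair) are far more sophisticated; this is illustration only.

# Toy model of YARN container allocation (illustrative only).
nodes = {"nm1": 8, "nm2": 8}   # NodeManager -> free memory in GB

def allocate(request_gb, count):
    """ResourceManager logic: grant containers on nodes with capacity."""
    granted = []
    for _ in range(count):
        # Place the container on any node that can still fit it.
        for node, free in nodes.items():
            if free >= request_gb:
                nodes[node] = free - request_gb
                granted.append((node, request_gb))
                break
    return granted

# ApplicationMaster requests 4 containers of 3 GB each.
print(allocate(3, 4))   # [('nm1', 3), ('nm1', 3), ('nm2', 3), ('nm2', 3)]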
YARN Benefits:
YARN decouples resource management from data processing, so multiple frameworks (e.g., MapReduce, Spark, Tez) can share one cluster; it improves cluster utilization and scalability; and it removes the single JobTracker bottleneck of Hadoop 1.x.
Advantages of Hadoop
Hadoop offers scalability (grow storage and compute simply by adding nodes), fault tolerance through block replication, cost effectiveness on commodity hardware, and the flexibility to handle structured, semi-structured, and unstructured data.
Introduction to VirtualBox
VirtualBox is a free and open-source virtualization platform developed by Oracle. It allows users
to run multiple operating systems on a single physical machine by creating and managing virtual
machines (VMs). In the context of cloud computing, VirtualBox is often used for creating test
environments, learning virtualization concepts, and building scalable infrastructure setups.
What is Virtualization?
Virtualization is the creation of virtual (rather than physical) versions of resources, such as
servers, storage devices, or networks. In the case of VirtualBox, it allows a single computer to
run multiple guest operating systems (OS) simultaneously.
3. VirtualBox Architecture
VirtualBox consists of several components that work together to create and manage virtual
machines: the core hypervisor engine (the virtual machine monitor), the graphical VirtualBox
Manager, the VBoxManage command-line tool for scripting, and the Guest Additions installed
inside guest operating systems to improve integration and performance.
What is a hypervisor?
A hypervisor is software that runs multiple virtual machines on a single physical machine. Every
virtual machine has its own operating system and applications; the hypervisor allocates the
underlying physical computing resources, such as CPU and memory, to individual virtual
machines as required.
There are two types of hypervisors, each differing in architecture and performance.
Type 1 hypervisor
A type 1 hypervisor runs directly on the physical server and has direct access to the hardware
resources, which is why it is also known as a bare-metal hypervisor (e.g., VMware ESXi, Xen,
Microsoft Hyper-V). In a bare-metal setup, the host machine has no separate operating system
installed; the hypervisor software itself acts as a lightweight operating system.
Type 2 hypervisor
A type 2 hypervisor is installed on top of a host operating system, which is why it is also known
as a hosted hypervisor (VirtualBox and VMware Workstation are examples). Like other software
applications, hosted hypervisors do not have complete control of the computer's resources; the
host operating system allocates resources to the hypervisor, which in turn distributes them to its
virtual machines.
4. VirtualBox for Cloud Computing
VirtualBox plays an important role in the development and management of cloud computing
infrastructure by enabling virtualization. Here's how it connects to cloud computing concepts:
VirtualBox enables the creation of isolated virtual machines, which is a key feature of cloud
environments. Virtualization is at the heart of cloud computing, as it allows for the efficient use
of hardware resources, isolation, and scalability.
On-Demand Provisioning: VirtualBox allows for quick creation and deletion of virtual
machines, a process similar to provisioning resources in a cloud environment (a VBoxManage
sketch after this list illustrates it).
Testing & Development: VirtualBox is often used for setting up cloud environments in
test and development scenarios, where developers need to simulate a cloud infrastructure
or experiment with different configurations.
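As a sketch of the on-demand provisioning point above, VirtualBox's bundled VBoxManage command-line tool can create, start, and destroy a VM from a script; the VM name and sizes below are arbitrary examples:

# Provision a VM, run it headless, then tear it down.
VBoxManage createvm --name node1 --ostype Ubuntu_64 --register
VBoxManage modifyvm node1 --memory 2048 --cpus 2
VBoxManage startvm node1 --type headless
# ... use the VM ...
VBoxManage controlvm node1 poweroff
VBoxManage unregistervm node1 --delete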
While VirtualBox itself is not a full cloud platform, it can be used to simulate basic cloud-like
environments. For example, it allows users to:
Create clusters of VMs for testing distributed applications (e.g., big data processing).
Deploy software like OpenStack, Kubernetes, or Docker within VMs to simulate cloud
environments.
This helps users understand how cloud platforms like AWS, Microsoft Azure, or Google Cloud
work without needing access to a public cloud provider.
While VirtualBox is not typically used for running production cloud environments, it integrates
well with cloud management platforms for educational or developmental purposes.
Google App Engine (GAE)
Google App Engine (GAE) is a Platform as a Service (PaaS) offering from Google Cloud that
allows developers to build and deploy web applications without worrying about the underlying
infrastructure. GAE abstracts much of the system administration and hardware management,
enabling developers to focus on writing code rather than managing servers.
GAE provides auto-scaling, load balancing, and built-in application services, making it an
attractive solution for developers seeking to deploy apps in the cloud.
Key Features:
Fully Managed Service: Google manages the infrastructure, so developers don’t need to
handle server management, patching, or scaling.
Auto-Scaling: GAE automatically adjusts the number of running instances based on the
app’s traffic. This eliminates the need for manual scaling and ensures that resources are
allocated dynamically.
Integrated with Google Cloud: GAE integrates seamlessly with other Google Cloud
services, such as Cloud Datastore, Google Cloud Storage, and Google Cloud Pub/Sub.
Multi-language Support: Supports several programming languages like Python, Java,
Go, Node.js, Ruby, PHP, and more, allowing developers to use their preferred language.
Serverless Architecture: Developers don't have to worry about provisioning or
managing servers, as App Engine automatically handles it for them.
Managed Security: Built-in security features, including SSL, identity and access
management (IAM), and firewalls, to protect applications from threats.
Development and Deployment Tools: Provides tools like the Google Cloud SDK, local
emulators, and continuous integration to make development and deployment easier.
How App Engine Works:
Deployment: Developers upload their code to App Engine, which automatically manages
the resources required to run the application, including scaling the app and handling
traffic distribution.
Scaling:
o Automatic Scaling: App Engine automatically adjusts the number of running
instances based on incoming traffic. For example, during peak traffic, App Engine
may create new instances to handle the load and then scale down during low
traffic.
o Manual Scaling: Developers can instead run a fixed number of instances regardless
of traffic; in the flexible environment they can also set minimum and maximum
instance counts when automatic scaling is used.
Routing: GAE uses load balancing to distribute incoming requests to the right instances.
It ensures that users’ requests are routed to the most appropriate version of the
application.
Storage: GAE integrates with various Google Cloud storage services such as:
o Cloud Datastore: A NoSQL database for storing structured data.
o Cloud Storage: Used for storing large files, like images and videos.
o Cloud SQL: Managed relational databases supporting MySQL and PostgreSQL.
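As a small illustration of the Datastore integration, the sketch below uses the google-cloud-datastore Python client to write and read one entity; it assumes the library is installed and default Google Cloud credentials are configured:

# Minimal Cloud Datastore usage from a Python app (illustrative).
from google.cloud import datastore

client = datastore.Client()

# Write: create an entity of kind "GuestbookEntry".
entity = datastore.Entity(key=client.key("GuestbookEntry"))
entity.update({"author": "alice", "message": "hello"})
client.put(entity)

# Read: query all entities of that kind.
for e in client.query(kind="GuestbookEntry").fetch():
    print(e["author"], e["message"])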
Billing Components:
1. Compute Resources: This includes the number of instance hours used by your
application.
2. Storage: Charges are applied for the data stored in services like Cloud Datastore, Cloud
SQL, and Cloud Storage.
3. Outbound Traffic: You are charged for outgoing data traffic from your app.
4. Additional Services: Google offers various services (e.g., email, messaging, monitoring)
that may incur additional costs.
Pricing Models:
Free Tier: Google App Engine offers a free tier with limited resources, which is suitable
for small applications and learning purposes.
Pay-as-You-Go: For larger applications, pricing is based on actual usage.
Use Cases:
1. Web Applications: GAE is ideal for building scalable web applications that need to
handle varying levels of traffic, such as social networking sites, news websites, and
blogs.
2. Mobile Backend: Many mobile applications use App Engine for managing user
authentication, storing data, and handling traffic to scale backend services dynamically.
3. Microservices: App Engine can be used to deploy microservices in a distributed system,
allowing each service to scale independently.
4. Real-Time Applications: GAE is useful for building real-time applications such as chat
apps, gaming apps, and collaboration tools.
5. Machine Learning APIs: Developers can deploy machine learning models as APIs to
serve predictions at scale using Google App Engine.
Deploying an Application:
1. Create a web application (for example, a Python Flask app or a Node.js Express app).
2. Include an app.yaml configuration file that defines the app’s environment and scaling
behavior.
3. Deploy with the Google Cloud SDK: running gcloud app deploy uploads the code and
configuration, and App Engine provisions instances automatically.
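A minimal sketch of steps 1 and 2, assuming the Python standard environment (by default App Engine serves the module-level app object in main.py, and Flask must be listed in requirements.txt):

# main.py - a tiny Flask app served by App Engine's default entrypoint.
from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello from App Engine!"

and the matching configuration file:

# app.yaml - selects the runtime; automatic scaling is the default.
runtime: python39

With automatic scaling, App Engine spins instances of this app up and down with traffic after gcloud app deploy is run from the project directory.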
OpenStack
OpenStack is a free, open-source cloud computing platform used to build and manage private and
public clouds, pooling compute, storage, and networking resources across a data center. Its key
characteristics include:
Open-Source: OpenStack is open-source and free to use, making it a cost-effective option for
private and public cloud deployments.
Modular Architecture: OpenStack has a modular architecture with various components that
work together to provide cloud services.
Scalability: It is highly scalable, meaning it can handle everything from small-scale deployments
to large-scale enterprise environments.
Vendor-Neutral: It is compatible with multiple hardware and software platforms, making it
adaptable to various cloud needs.
Multi-Tenancy: OpenStack supports multi-tenancy, allowing multiple organizations or
departments to share the same cloud infrastructure securely.
2. OpenStack Components
OpenStack is divided into several key components, each serving a specific function in the cloud
infrastructure. Here are the primary components:
a. Nova (Compute)
Purpose: Nova is responsible for provisioning and managing virtual machines (VMs) and
handling the computing resources in the cloud.
Functionality: It manages the lifecycle of virtual machines (from creation to termination) and
manages various hypervisors (like KVM, VMware, Hyper-V).
Key Features:
o Support for multiple hypervisors.
o VM orchestration and management.
o Integration with other OpenStack services like Neutron and Glance.
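For example, with credentials loaded, a single OpenStack CLI call asks Nova to boot a VM; the image, flavor, and network names below are placeholders for whatever exists in your cloud:

openstack server create --image cirros --flavor m1.small \
  --network private demo-vm
openstack server list          # shows demo-vm once it is ACTIVE
openstack server delete demo-vm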
d. Neutron (Networking)
Purpose: Neutron provides networking as a service, creating and managing the networks,
subnets, routers, floating IPs, and security groups that connect virtual machines.
e. Horizon (Dashboard)
Purpose: Horizon is OpenStack's web-based dashboard, giving users and administrators a
graphical interface for managing compute, storage, and networking resources.
h. Heat (Orchestration)
Purpose: Heat is used for orchestration, automating the deployment of resources and services.
Functionality: It allows users to define the infrastructure requirements in a template (often
written in YAML) and deploy them in an automated manner.
Key Features:
o Infrastructure-as-Code (IaC) for cloud services.
o Supports auto-scaling, load balancing, and resource provisioning.
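A minimal Heat Orchestration Template (HOT) for booting one server looks like the sketch below; the image, flavor, and network values are placeholders:

heat_template_version: 2018-08-31

description: Boot a single server (illustrative template)

resources:
  my_server:
    type: OS::Nova::Server
    properties:
      image: cirros
      flavor: m1.small
      networks:
        - network: private

# Deploy with: openstack stack create -t server.yaml demo-stack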
i. Ceilometer (Telemetry)
Purpose: Ceilometer collects usage and performance data (telemetry) from OpenStack services,
supporting monitoring, metering, and billing.
3. OpenStack Architecture
A typical OpenStack deployment distributes its services across several types of nodes:
Controller Node: Houses the services responsible for managing the cloud (e.g., Nova, Keystone,
Horizon).
Compute Nodes: Run the virtual machines and are managed by Nova.
Storage Nodes: Provide storage through Swift (object storage) and Cinder (block storage).
Network Nodes: Handle networking tasks and are typically configured with Neutron to manage
network resources.
4. Deployment of OpenStack
Several tools simplify installing and deploying OpenStack:
DevStack: A tool used to set up OpenStack on a single machine for testing and development
purposes.
Packstack: A deployment tool for OpenStack that simplifies multi-node deployments.
RDO: A community-supported distribution of OpenStack for Red Hat-based systems.
Mirantis OpenStack: A commercial version of OpenStack with enterprise support.
Kolla: A deployment tool that uses Docker containers to deploy OpenStack services.
OpenStack can be deployed on physical servers or virtual machines, and its services can run on
different nodes for redundancy and scalability.
5. Use Cases of OpenStack
Private Cloud: Organizations can use OpenStack to build their own private cloud to manage
internal resources securely.
Public Cloud: OpenStack is used by some public cloud providers to offer cloud services at scale.
Hybrid Cloud: OpenStack can be integrated with other cloud platforms (e.g., AWS, Google
Cloud) to provide a hybrid cloud environment, enabling workload migration.
Edge Computing: OpenStack is also used in edge computing scenarios where computing power
is needed at the network edge for faster processing.
6. Benefits of OpenStack
OpenStack's main benefits follow directly from its design: no licensing costs, freedom from
vendor lock-in, a large open-source community, full control over the infrastructure, and the
flexibility to customize every layer of the cloud stack.
7. Challenges of OpenStack
Complexity: OpenStack can be complex to deploy and manage, particularly for organizations
without prior experience in cloud infrastructure.
Compatibility Issues: Compatibility between different components and versions can cause
issues, especially in large environments.
Resource Intensive: OpenStack requires significant hardware resources for large-scale
deployments, making it potentially expensive in terms of infrastructure.
1. What is Federation?
Federation in the context of services and applications refers to the concept of linking multiple
independent systems, often from different organizations, so they can interact and share resources
while maintaining separate control. Federation typically involves:
Data Sharing: Enabling secure and seamless data exchange across systems.
Single Sign-On (SSO): Allowing users to authenticate once and access multiple
applications/services across different domains without needing to log in again.
2. Types of Federation
1. Identity Federation
Definition: This is the most common type of federation and refers to the ability to share
and manage user identity across multiple domains or organizations.
Key Concept: It enables Single Sign-On (SSO), allowing users to authenticate once and
gain access to services across different cloud environments or systems without needing to
log in multiple times.
How it Works:
o Identity providers (IdPs) like Microsoft Active Directory, Google Identity, or
other authentication services issue tokens that can be trusted by service providers
(SPs) across different platforms or organizations.
o Common protocols include SAML (Security Assertion Markup Language),
OAuth, and OpenID Connect.
Example: An organization federates its internal directory service (e.g., Active Directory)
with a cloud service provider (e.g., Google Cloud), so employees can access both internal
applications and cloud resources using a single set of credentials.
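The toy Python sketch below illustrates only the trust relationship; real deployments use SAML, OAuth, or OpenID Connect rather than hand-rolled tokens. The IdP signs an assertion with a key it shares with the SP, and the SP verifies the signature instead of re-authenticating the user:

# Toy federated-identity flow: IdP issues a signed token, SP verifies it.
# Illustration only - real systems use SAML, OAuth 2.0, or OpenID Connect.
import hmac, hashlib

SHARED_KEY = b"idp-and-sp-trust-key"   # established when federation is set up

def idp_issue_token(user):
    """Identity provider: sign an assertion about the user."""
    sig = hmac.new(SHARED_KEY, user.encode(), hashlib.sha256).hexdigest()
    return f"{user}:{sig}"

def sp_verify_token(token):
    """Service provider: trust the assertion if the signature checks out."""
    user, sig = token.rsplit(":", 1)
    expected = hmac.new(SHARED_KEY, user.encode(), hashlib.sha256).hexdigest()
    return user if hmac.compare_digest(sig, expected) else None

token = idp_issue_token("alice@example.com")
print(sp_verify_token(token))           # alice@example.com -> access granted
print(sp_verify_token(token + "x"))     # None -> tampered token rejected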
2. Resource Federation
Definition: Resource federation refers to the ability to share and manage computing
resources like storage, compute power, and networking across multiple clouds or data
centers.
Key Concept: It enables the pooling and sharing of resources across different cloud
environments, creating a unified infrastructure that can scale based on demand.
How it Works:
o Organizations can link their private cloud with public clouds, or multiple public
cloud services, to create a hybrid cloud or multi-cloud environment.
o Federation technologies allow workloads to move seamlessly between clouds to
optimize resource usage, reduce costs, and improve redundancy.
Example: A company running workloads in a private cloud may burst compute resources
to a public cloud during periods of high demand, utilizing cloud bursting for scalability.
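A toy placement rule captures the bursting decision; the capacity numbers are invented for illustration:

# Toy cloud-bursting decision: run on the private cloud until it is full,
# then overflow ("burst") new workloads to a public cloud.
PRIVATE_CAPACITY = 100   # arbitrary capacity units
private_load = 0

def place_workload(units):
    global private_load
    if private_load + units <= PRIVATE_CAPACITY:
        private_load += units
        return "private cloud"
    return "public cloud (burst)"

for units in [40, 40, 40]:
    print(units, "->", place_workload(units))
# 40 -> private cloud, 40 -> private cloud, 40 -> public cloud (burst)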
3. Service Federation
Definition: Service federation links independently operated services or applications across
organizations or clouds so they can be discovered, composed, and consumed as if they belonged
to a single system (for example, federated APIs or messaging services spanning multiple
providers).
4. Data Federation
Definition: Data federation refers to the process of integrating and accessing data
distributed across multiple cloud environments or data sources as though it is stored in a
single, unified system.
Key Concept: It allows organizations to access data from multiple disparate data stores
(e.g., databases, cloud storage) without needing to physically move the data.
How it Works:
o Data federation technologies create a virtual layer that unifies the view of data
from multiple locations, enabling queries across various cloud systems without
duplicating or transferring the actual data.
o Often used in data integration platforms or data lakes that aggregate data from
multiple cloud services or data silos.
Example: A business using multiple cloud storage solutions (e.g., AWS S3, Azure Blob
Storage, and Google Cloud Storage) may federate the data into a single virtual data
warehouse to enable unified querying and analytics.
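The Python sketch below mimics that virtual layer with in-memory dictionaries standing in for separate cloud stores; one federated query fans out to every store and merges results without copying data between them:

# Toy data federation: one query interface over several "stores".
aws_s3     = {"order-1": {"region": "us", "total": 120}}
azure_blob = {"order-2": {"region": "eu", "total": 80}}
gcs        = {"order-3": {"region": "us", "total": 200}}

FEDERATED_STORES = [aws_s3, azure_blob, gcs]

def federated_query(predicate):
    """Run one query across all stores; data never moves between them."""
    for store in FEDERATED_STORES:
        for key, record in store.items():
            if predicate(record):
                yield key, record

# Unified view: all US orders, regardless of which cloud holds them.
print(list(federated_query(lambda r: r["region"] == "us")))
# [('order-1', {...}), ('order-3', {...})]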
Benefits of Federation:
1. Decentralized Control: Different organizations or systems retain control over their own
data and resources while still collaborating.
2. Scalability: Allows for scaling across different domains and services without centralizing
everything.
3. Security: Federation can enable secure data sharing and authentication protocols,
reducing the need to replicate data across systems.
4. Cost Efficiency: By connecting existing systems, federation allows organizations to
collaborate without major infrastructure investments.
5. User Convenience: With federated identity management, users can access a wide range
of services and applications with a single login (SSO).
6. Interoperability: Federated services can bridge the gap between different systems,
enabling them to work together even if they were not originally designed to do so.
Challenges of Federation:
1. Data Privacy and Compliance: Federating services can make it difficult to ensure that
data privacy regulations (e.g., GDPR) are maintained, as data crosses organizational
boundaries.
2. Security Risks: Although federation offers centralized authentication, it still poses
security risks, as a compromised identity provider could jeopardize access to multiple
systems.
3. Complexity: Setting up and managing federated systems can be complex, requiring
protocols and standards to be adhered to across various platforms.
4. Latency: Data transfer and synchronization across federated systems may lead to
increased latency, affecting performance.
5. Compatibility Issues: Different federated systems may use different standards,
protocols, or technologies, requiring additional middleware or adapters for seamless
integration.