Cloud Computing Unit 2 Notes (2024-25)

Enterprise Data Storage
Enterprise data storage refers to the systems and technologies used by organizations to store, manage,
and secure large volumes of data critical to their operations. These storage solutions are typically
scalable, reliable, and provide high performance to meet the demands of enterprise-level applications.
Here's an overview of some key aspects of enterprise data storage:
● Scalability: Enterprise storage systems need to scale to handle increasing data volumes as
businesses grow. Cloud-based solutions and SANs offer high scalability.
● Performance: High-performance storage is critical for handling large amounts of data at high
speeds, especially for applications like databases and analytics.
● Reliability and Redundancy: Enterprise storage systems often implement redundant
components, ensuring that if one part fails, the system continues to operate without data loss.
● Security: Enterprises need robust data protection features such as encryption, secure access
controls, and backup strategies to prevent data breaches or loss.
● Data Management: Features like automated backup, data archiving, and efficient data
retrieval are key for managing enterprise data.
● Disaster Recovery: Ensuring that data can be restored quickly in the event of a failure is
crucial, with solutions like replication and off-site backups.
Storage Protocols
1. iSCSI (Internet Small Computer Systems Interface): A protocol used for connecting
storage devices over IP networks, commonly in SAN environments.
2. Fibre Channel: A high-speed network protocol for connecting storage devices and servers in
a SAN.
3. NFS (Network File System): A protocol commonly used in NAS solutions for file-level
access over the network.
4. SMB (Server Message Block): Another file-sharing protocol used primarily for
Windows-based NAS environments.
Challenges in Enterprise Data Storage
● Data Growth: As the volume of data continues to increase, organizations must ensure their storage solutions can scale efficiently without compromising performance.
● Data Integrity: Ensuring that data is accurate and intact, and recovering it from corrupt or
lost states.
● Cost Management: While storage costs have reduced significantly over time, managing
large-scale storage solutions still represents a significant investment in terms of hardware,
software, and operational expenses.
● Compliance: Meeting legal and regulatory requirements for data retention, privacy, and
security (e.g., GDPR, HIPAA).
Emerging Trends
● Software-Defined Storage (SDS): SDS decouples the storage hardware from the software
that manages it, offering flexibility in terms of scalability and the ability to choose storage
hardware independently.
● Edge Storage: Storing and processing data closer to where it's generated (e.g., IoT devices),
reducing latency and bandwidth costs.
● AI & Machine Learning Integration: Leveraging AI to optimize storage systems for better
performance and automatic data management, including predictive maintenance and
automated scaling.
Direct-Attached Storage (DAS) is storage connected directly to a single computer or server, with no storage network in between.
Types of DAS
● Internal DAS: Drives installed inside the host machine, attached over interfaces such as SATA, SAS, or NVMe.
● External DAS: Drives or enclosures plugged directly into a single host over USB, eSATA, or Thunderbolt.
Characteristics of DAS
● No Network Dependency: DAS does not require a network to operate, making it faster than
network storage options for local access.
● Simple Setup: Setting up DAS is straightforward as it’s directly connected to a machine.
● Limited Sharing: The storage is not easily shared across multiple systems (unless additional
software or hardware is used).
● Data Access: Data is accessed via local file systems like NTFS, HFS+, or ext4, depending on
the operating system.
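Because DAS is reached through the local file system, ordinary file I/O is all that is needed. A minimal Python sketch (the path and file name are hypothetical):

import shutil

# Inspect the capacity of the locally attached disk backing this path.
total, used, free = shutil.disk_usage("/")
print(f"total={total // 2**30} GiB  used={used // 2**30} GiB  free={free // 2**30} GiB")

# Reads and writes go straight to the local file system (ext4, NTFS, HFS+, ...);
# no network protocol is involved.
with open("local_das_file.txt", "w") as f:
    f.write("stored on directly attached storage\n")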
Diagram of DAS
+---------------------+      SATA/SAS/USB       +---------------------+
|     Server / PC     |<----------------------->|   Storage Device    |
|    (Local Host)     |   (direct connection)   |     (HDD/SSD)       |
+---------------------+                         +---------------------+
Storage Area Network (SAN)
A SAN is a dedicated, high-speed network that gives multiple servers block-level access to centralized storage.
Characteristics of a SAN
● High Performance: SANs typically use high-speed protocols like Fibre Channel or iSCSI to ensure low latency and high throughput.
● Centralized Storage: Data is stored in centralized storage arrays that can be accessed by
multiple servers.
● Scalability: SANs are highly scalable, supporting the addition of more storage devices or
servers without significant performance degradation.
● Block-Level Storage: Unlike file-level systems such as NAS, SANs provide block-level access, meaning data is managed in fixed-size blocks rather than files (see the sketch after this list).
● Reliability and Redundancy: SANs typically include redundancy features such as multiple
network paths, mirrored data, and RAID configurations to prevent data loss and ensure high
availability.
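To illustrate block-level access, the sketch below reads a raw 4 KiB block by offset from a block device, the way a SAN LUN is addressed. The device path is hypothetical and reading it requires root privileges:

import os

BLOCK_SIZE = 4096  # bytes per block (assumed)

# A SAN LUN typically appears to the server as a raw block device.
fd = os.open("/dev/sdb", os.O_RDONLY)  # hypothetical device
try:
    # Read block #7 directly by byte offset - no files or directories involved.
    block = os.pread(fd, BLOCK_SIZE, 7 * BLOCK_SIZE)
    print(f"read {len(block)} bytes from block 7")
finally:
    os.close(fd)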
Components of a SAN
1. Storage Devices: These are the physical devices that store the data, such as disk arrays, tape
libraries, or SSDs.
2. Switches: Network switches used to connect the servers to storage devices. These switches
support high-speed communication between servers and storage arrays.
3. Host Bus Adapters (HBAs): Hardware interfaces in servers that connect to the SAN
network. They allow servers to communicate with storage devices over the SAN.
4. Cabling: Fibre Channel cables or Ethernet cables (carrying iSCSI or FCoE traffic) physically connect the components of the SAN.
SAN Protocols
● Fibre Channel (FC): A high-speed network technology commonly used in SANs. It provides
a dedicated, low-latency network for storage access, offering high performance for large-scale
environments.
● iSCSI (Internet Small Computer Systems Interface): A protocol that enables block-level
access over TCP/IP networks, typically used in IP-based SANs. It is less expensive than Fibre
Channel but may have slightly higher latency.
● Fibre Channel over Ethernet (FCoE): A protocol that allows Fibre Channel to be run over
Ethernet networks, enabling the use of standard Ethernet infrastructure while maintaining high
performance.
Use Cases of a SAN
● Data Centers: Large-scale data centers rely on SANs to provide high-speed, reliable storage for numerous applications and databases.
● Virtualization: SANs provide the shared storage necessary for virtualization environments
where multiple virtual machines (VMs) need fast and efficient access to data.
● High-Performance Applications: Applications that require high throughput and low latency,
such as databases, video editing, and scientific computing, benefit from SAN architecture.
● Disaster Recovery: SANs allow for data replication across multiple locations, ensuring that
data can be recovered in case of a failure or disaster.
Diagram of a SAN
+----------+     +----------+     +----------+
| Server 1 |     | Server 2 |     | Server 3 |
|  (HBA)   |     |  (HBA)   |     |  (HBA)   |
+----------+     +----------+     +----------+
      \               |               /
       +--------------+--------------+
                      |
        +---------------------------+
        |    SAN Switch Network     |
        |  (Fibre Channel / iSCSI)  |
        +---------------------------+
              /               \
+-----------------+     +-----------------+
| Storage Array 1 |     | Storage Array 2 |
+-----------------+     +-----------------+
1. Servers (Server 1, Server 2, Server 3): These are the servers that need access to the shared storage. Each server is equipped with a Host Bus Adapter (HBA) that connects to the SAN network.
2. SAN Switch Network: The SAN switch network enables the communication between the
servers and storage devices. It connects all servers to the storage arrays through high-speed,
dedicated links (e.g., Fibre Channel or iSCSI).
3. Storage Arrays (Storage Array 1 and Storage Array 2): These are the centralized storage
systems that provide the actual storage capacity. They are connected to the SAN network and
store the data accessed by the servers.
4. High-Speed Communication: The connection between the servers and storage devices is
through high-speed, low-latency connections (typically Fibre Channel or iSCSI).
Benefits of a SAN
● Improved Storage Utilization: With centralized storage, data can be easily shared and
accessed by multiple servers, ensuring that storage is used efficiently.
● High Availability: The SAN can be configured with redundancy and failover mechanisms to
ensure that data remains available even in the event of hardware failures.
● Flexibility: A SAN allows for the easy addition of storage capacity as business needs grow,
without disrupting existing operations.
Network-Attached Storage (NAS)
NAS is a dedicated storage device connected to a network that lets multiple users and devices store and retrieve files from a central location.
NAS typically provides file-level storage (as opposed to block-level storage in systems like SAN) and is ideal for environments that require shared access to data, ease of management, and centralized backups.
Characteristics of NAS
1. File-Level Storage: NAS systems operate at the file level, meaning data is accessed and managed in terms of files and directories, making it easy for multiple users to access and modify shared files (a short sketch follows this list).
2. Centralized Storage: Data is stored in a single, centralized location, allowing easier
management, backup, and sharing among multiple users and devices.
3. Network Connectivity: NAS is connected to the network (either wired or wireless), which
means it can be accessed from any device on the same network.
4. User and Access Management: NAS typically comes with built-in security features to
control user access, such as password protection, file permissions, and sometimes encryption.
5. Scalability: Many NAS devices allow for easy expansion by adding additional hard drives or
connecting multiple NAS devices to scale up storage as needed.
6. Data Sharing: NAS allows multiple users to share files simultaneously, making it perfect for
collaborative environments and file sharing within a small office, home office, or enterprise.
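Once a NAS share is mounted, clients use it like any local directory; access is file-level over NFS or SMB. A minimal sketch, assuming a hypothetical mount point /mnt/nas:

from pathlib import Path

share = Path("/mnt/nas/team-docs")   # hypothetical NFS/SMB mount point
share.mkdir(parents=True, exist_ok=True)

# Any client with the share mounted sees the same file immediately.
report = share / "q3_report.txt"
report.write_text("Shared via NAS - visible to every device on the network.\n")
print(report.read_text())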
Use Cases of NAS
● Home Office: Storing and sharing family photos, videos, and media files for access across multiple devices like laptops, smartphones, and TVs.
● Small to Medium Businesses (SMBs): Centralized file sharing, backup, and collaboration for
a group of employees.
● Backup Solution: NAS is often used as a backup device for workstations and servers.
● Media Server: NAS is ideal for storing and streaming large media libraries, including
movies, music, and TV shows, to devices on the network.
Diagram of NAS
+---------------------+   +---------------------+   +---------------------+
|    User Device 1    |   |    User Device 2    |   |    User Device 3    |
| (Laptop/PC/Tablet)  |   |    (Smartphone)     |   |    (Workstation)    |
+---------------------+   +---------------------+   +---------------------+
           \                        |                        /
            +-----------------------+-----------------------+
                                    |
                       (Network - Ethernet/Wi-Fi)
                                    |
                        +---------------------+
                        |  NAS (File Server)  |
                        +---------------------+
                                    |
                        +---------------------+
                        |    Storage Disks    |
                        | (Hard Drives/SSDs)  |
                        +---------------------+
1. User Devices (User Device 1, 2, 3): These are the devices that need access to the data stored
on the NAS. They could be laptops, desktop computers, smartphones, or tablets connected to
the same network (via Ethernet or Wi-Fi).
2. NAS (Network-Attached Storage): This is the dedicated storage device connected to the network. It acts as a centralized file server where data can be stored, accessed, and shared by multiple users.
3. Storage Disks: Inside the NAS are hard drives or solid-state drives (SSDs) where all the files
and data are stored. The NAS organizes the data in a file system, allowing for file access and
management by users on the network.
4. Network Connectivity: The NAS is connected to a network (through Ethernet or Wi-Fi),
allowing multiple devices to communicate with it and access data from anywhere on the same
network.
Benefits of NAS
1. Ease of Setup and Management: NAS devices typically come with user-friendly
management interfaces, making it easy to set up and manage file sharing, security settings,
and backups.
2. File Sharing: Ideal for scenarios where multiple users need to access and work on shared
files, such as in a small business or home office.
3. Centralized Backup: Centralizing your data storage with NAS simplifies backup procedures,
and many NAS systems have automated backup features to external drives or the cloud.
4. Cost-Effective: NAS is more affordable compared to high-performance storage solutions like
SAN, making it a cost-effective option for smaller businesses and home use.
5. Expandability: You can add more storage to a NAS system by adding additional hard drives
or upgrading existing ones.
Storage Management
Effective storage management is crucial for enterprises and individuals to ensure that data is accessible when needed, backed up correctly, and protected from unauthorized access or loss.
● Primary Storage: High-speed storage used for active data, such as SSD (Solid-State Drives)
or HDD (Hard Disk Drives).
● Secondary Storage: For less frequently accessed data, such as Network-Attached Storage
(NAS) or Direct-Attached Storage (DAS).
● Tertiary Storage: Archival storage, often using tape drives or cloud storage for long-term,
low-cost storage.
● Cloud Storage: A flexible and scalable storage solution offered by cloud service providers
(e.g., AWS, Google Cloud, Azure).
+---------------------------+
|      Primary Storage      | <--- Active, frequently accessed data (e.g., SSD, HDD)
+---------------------------+
             |
             v
+---------------------------+
|     Data Organization     | <--- Classifying, tagging, indexing data for easier retrieval
+---------------------------+
             |
             v
+---------------------------+
|  Performance Management   | <--- Ensuring efficient storage performance
+---------------------------+
             |
             v
+---------------------------+
| Backup & Data Protection  | <--- Full/incremental backups, snapshots, replication
+---------------------------+
             |
             v
+---------------------------+
|   Data Security & Access  | <--- Ensuring security, encryption, and access control
+---------------------------+
             |
             v
+---------------------------+
|     Secondary Storage     | <--- Less frequently accessed data (e.g., NAS, DAS)
+---------------------------+
             |
             v
+---------------------------+
|     Tertiary Storage      | <--- Archival data (e.g., tape, cold cloud storage)
+---------------------------+
             |
             v
+---------------------------+
|       Cloud Storage       | <--- Scalable, flexible storage option for remote access
+---------------------------+
1. Primary Storage: This is where active data resides, stored in high-performance systems
(SSDs or HDDs). This data is used frequently and requires fast access.
2. Data Organization: Once data is stored, it is organized using metadata and indexing. This
ensures quick access to data when needed and supports optimal data lifecycle management.
3. Performance Management: The storage system is continuously monitored to ensure it
performs well and meets the required throughput and latency requirements.
4. Backup & Data Protection: Data is regularly backed up using full, incremental, or
differential backups. Snapshots and replication mechanisms are also employed for disaster
recovery.
5. Data Security & Access: Encryption, authentication, and access control are employed to
protect data from unauthorized access or tampering.
6. Secondary Storage: This storage holds data that is not accessed as often but still needs to be
available when required (e.g., NAS, DAS, or traditional HDD storage).
7. Tertiary Storage: Data that is seldom accessed and is often archived. This can be stored on
tape drives or cloud storage for cost-effective long-term retention.
8. Cloud Storage: Cloud storage is an ideal option for off-site backups, disaster recovery, and
scalability. It provides easy access and can grow as your data requirements increase.
1. Data Tiering: Organize and place data in the most appropriate storage tier based on its access
frequency and importance. For example, active data should be stored on SSDs, while archival
data can go on cheaper cloud storage or tape.
2. Data Deduplication: Removing redundant copies of data to save storage space and optimize backup processes; commonly used in backup management (see the sketch after this list).
3. Automated Backup: Implement automated backup strategies to reduce human error and
ensure that backups are performed on a regular schedule.
4. Data Encryption: Use encryption methods to secure sensitive data both at rest and in transit,
ensuring compliance with security policies and regulations.
5. Monitoring and Reporting: Continuously monitor storage performance, health, and security.
Regular audits and reports will help in proactive maintenance and identifying potential issues
early.
6. Compliance and Retention: Ensure that storage management practices align with regulatory
requirements regarding data retention and compliance (e.g., GDPR, HIPAA).
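As a toy illustration of deduplication (practice 2 above), the sketch below detects redundant copies by content hash; the backup directory is hypothetical:

import hashlib
from pathlib import Path

def dedupe(paths):
    """Keep one path per unique content hash; report the duplicates."""
    seen = {}
    for path in paths:
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        if digest in seen:
            print(f"{path} duplicates {seen[digest]} - could store a reference instead")
        else:
            seen[digest] = path
    return seen

dedupe(sorted(Path("/backups").rglob("*.bak")))  # hypothetical backup directory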
File System
A File System is a way of organizing and storing files on a storage device, such as a hard disk drive
(HDD), solid-state drive (SSD), or network storage. It provides the structure that allows data to be
stored, accessed, modified, and deleted. The file system defines how data is named, stored, and
managed, and it ensures that files are properly indexed and retrieved.
1. Superblock:
○ The superblock contains important information about the file system, such as the file
system type, size, and block size. It's critical for file system integrity.
2. Inode:
○ An inode is a data structure that stores metadata about a file, including its location on
disk, file size, permissions, and ownership. It does not store the file name but
associates it with the file's location.
3. File Allocation Table (FAT):
○ In file systems like FAT, the File Allocation Table tracks which disk clusters belong to each file, chaining a file's clusters together as a linked list.
4. Data Blocks:
○ Data blocks are the actual locations where the file's content is stored on the disk. The
file system allocates blocks to store the contents of a file.
5. Directory Table:
○ A directory table contains entries for each file stored in a directory. It includes the file
name and a reference to the corresponding inode.
File System Structure
+--------------------------+
|       File System        |
+--------------------------+
             |
+-------------------+
|    Superblock     | <-- High-level file-system metadata (type, size, block size).
+-------------------+
             |
+-------------------+
|      Inodes       | <-- Store metadata for each file (e.g., file size, permissions, location).
+-------------------+
             |
+-------------------+
|  Directory Table  | <-- Maps file names to inodes.
+-------------------+
             |
+-------------------+
|    Data Blocks    | <-- Hold the actual file contents.
+-------------------+
1. Superblock: Contains high-level information about the file system (type, size, block size,
etc.).
2. Inodes: Each file has an associated inode that holds its metadata, but not the file name or the actual data (see the sketch after this list).
3. Directory Table: A directory contains entries that link file names to their respective inodes. It
provides the path to access a file.
4. Data Blocks: The actual data for the file is stored in blocks, which are referenced by the
inode.
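The metadata an inode holds can be inspected with a stat call. A minimal Python sketch (example.txt is a hypothetical file that must exist):

import os
import stat

info = os.stat("example.txt")  # hypothetical file

print("inode number:", info.st_ino)
print("size (bytes):", info.st_size)
print("permissions :", stat.filemode(info.st_mode))
print("owner uid   :", info.st_uid)

Note that the file name appears nowhere in this output: it lives in the directory table, which maps names to inodes.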
Comparison of common file systems:
File System | Strengths                                                  | Limitations
FAT32       | Simple, widely compatible (cross-platform support)         | Limited file size (max 4 GB); inefficient for large disks
exFAT       | Cross-platform support; larger file sizes than FAT32       | Less robust than NTFS or ext4
ZFS         | High data integrity, self-healing, supports large volumes  | Requires more resources; primarily used in enterprise environments
Cloud Data Stores
Cloud data stores are typically used to store large amounts of unstructured or structured data, with a focus on reliability, scalability, and high availability. They are suitable for a wide range of applications, from simple file storage to complex big data applications.
1. Scalability: Cloud data stores can grow or shrink according to your storage needs without
requiring significant upfront investment or infrastructure changes.
2. Cost Efficiency: With pay-as-you-go pricing models, cloud storage minimizes capital
expenditures and allows businesses to pay only for the storage they use.
3. High Availability and Durability: Cloud providers replicate data across multiple data centers
to ensure uptime and data protection against failures.
4. Security: Cloud providers implement robust security features like encryption, identity
management, and access controls to protect data.
5. Global Accessibility: Data can be accessed from anywhere around the world with an internet
connection, making it easy for teams to collaborate and share data.
6. Disaster Recovery: Cloud storage offers automated backup and disaster recovery solutions,
which ensures data can be restored in case of a failure.
Here’s a simplified diagram showing how Cloud Data Stores interact with applications and users:
+--------------------------+
| User Devices |
| (Laptops, Phones, etc.) |
+--------------------------+
|
v
+--------------------------+
| Cloud Application | <-- Data-driven applications, web services
+--------------------------+
|
+-----------------------------------------------+
| |
+------------------+ +-------------------+
| Object Storage | | Database Storage|
| (e.g., S3, Blob) | | (e.g., RDS, SQL) |
+------------------+ +-------------------+
| |
+--------------------+ +---------------------+
| File Storage | | Data Warehouses |
| (e.g., EFS, Files) | | (e.g., Redshift) |
+--------------------+ +---------------------+
|
+-------------------+
| Block Storage |
| (e.g., EBS, Disk)| <-- Fast access for VMs, databases, and high-performance applications
+-------------------+
1. User Devices: Users or applications can access data stored in the cloud from any device
connected to the internet, such as laptops, phones, or desktops.
2. Cloud Applications: These are software systems (web services, mobile apps, enterprise
applications) that interact with cloud storage to read, write, and manage data.
3. Object Storage: Used to store unstructured data (e.g., images, videos, backups). Services like Amazon S3, Azure Blob Storage, and Google Cloud Storage are popular object stores (see the sketch after this list).
4. File Storage: Offers shared access to files over the cloud, using systems similar to traditional
file systems. Examples include Amazon EFS and Google Filestore.
5. Database Storage: Cloud databases store structured data, typically in relational or NoSQL
formats. Examples include Amazon RDS (Relational) and Amazon DynamoDB (NoSQL).
6. Data Warehouses: For analytics and large-scale data processing, services like Amazon
Redshift, Google BigQuery, and Azure Synapse Analytics store and query large datasets.
7. Block Storage: Used for high-performance storage requirements, such as databases or virtual
machines, where low-latency and fast access are essential. Examples include Amazon EBS
and Google Persistent Disk.
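A minimal sketch of working with object storage using boto3 and Amazon S3; the bucket and key names are hypothetical, and credentials are assumed to be configured:

import boto3

s3 = boto3.client("s3")

# Store an object: a blob addressed by bucket + key, not by a file-system path.
with open("logo.png", "rb") as f:
    s3.put_object(Bucket="example-media-bucket", Key="images/logo.png", Body=f)

# Retrieve it later from anywhere with connectivity and credentials.
obj = s3.get_object(Bucket="example-media-bucket", Key="images/logo.png")
data = obj["Body"].read()
print(f"fetched {len(data)} bytes")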
Grid Storage Systems
A Grid Storage System allows data to be stored across multiple machines in a distributed fashion while keeping it accessible from various locations, enabling enhanced collaboration, fault tolerance, and load balancing.
Grid storage refers to the distributed storage of data across a grid of interconnected computers.
Unlike traditional data storage systems, which might use a central server or single storage device, grid
storage utilizes resources from multiple devices, such as servers or storage nodes, to store and
manage data. This enables better resource utilization, redundancy, and scalability, making it suitable
for big data processing and high-performance computing (HPC).
Components of a Grid Storage System
● Data Nodes: The physical or virtual machines that contribute storage capacity to the grid.
● Metadata Server: Responsible for managing and indexing data across the grid.
● Data Replication: Ensures that copies of data are stored in different nodes to increase fault
tolerance and availability.
● Grid Software: Software that manages the distribution and access to the data across the grid
nodes.
+-------------------------------+ +-------------------------------+
| | | |
| Application/User Requests |<----->| Metadata Server |
| Data (from anywhere) | | (Manages data locations, |
| | | replication info) |
+-------------------------------+ +-------------------------------+
| |
v v
+-------------------+ +-------------------+
| Data Node 1 | | Data Node 2 |
| (Storage Block A) | | (Storage Block B) |
+-------------------+ +-------------------+
| |
v v
+-------------------+ +-------------------+
| Data Node N | | Data Node N+1 |
| (Storage Block C) | | (Storage Block D) |
+-------------------+ +-------------------+
\ /
\ /
\ /
+------------+
| Grid |
| Software |
+------------+
Benefits of Grid Storage
1. Scalability:
As storage needs grow, more nodes can be added to the grid to expand the storage capacity.
2. Cost-Effectiveness:
Grid storage uses commodity hardware across multiple nodes, reducing the cost of storing
large amounts of data compared to centralized systems.
3. High Availability:
Data is replicated across multiple nodes, ensuring that it remains available even if one or more
nodes fail.
4. Improved Performance:
Data can be accessed in parallel from different nodes, improving the speed of data retrieval.
5. Fault Tolerance:
Since data is replicated, a node failure will not result in data loss, and the system can continue
to operate smoothly.
Challenges of Grid Storage
1. Complexity:
Managing a grid storage system requires specialized knowledge in distributed systems, metadata management, and fault tolerance.
2. Latency:
Accessing data across multiple nodes may introduce latency, especially if the nodes are
geographically distributed.
3. Security:
Ensuring the security of distributed data across multiple nodes can be challenging. Data must
be encrypted and access must be tightly controlled to prevent unauthorized access.
4. Data Consistency:
Maintaining data consistency across distributed nodes, especially in cases where data is
replicated, can be complex.
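A toy sketch of grid-style placement: a "metadata server" maps each block to a set of data nodes for replication (all node and block names are hypothetical):

import hashlib

NODES = ["node1", "node2", "node3", "node4"]
REPLICATION_FACTOR = 2  # copies kept per block

def place(block_id: str) -> list[str]:
    """Deterministically pick REPLICATION_FACTOR nodes for a block."""
    start = int(hashlib.md5(block_id.encode()).hexdigest(), 16) % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]

# The metadata server's core job is just this mapping of block -> replica locations.
metadata = {b: place(b) for b in ["block-A", "block-B", "block-C"]}
print(metadata)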
Cloud Storage Data Management
Effective cloud storage data management is crucial for businesses and organizations that rely on the cloud to store large volumes of data while ensuring security, compliance, accessibility, and cost management.
The data management lifecycle in cloud storage typically involves the stages shown in the simplified diagram below:
+---------------------------+
| Data Creation |
| (Upload, Transfer, Create)|
+---------------------------+
|
v
+---------------------------+
| Data Storage | <-- Store in Cloud Storage (Object, Block, File)
+---------------------------+
|
v
+---------------------------+
| Data Access & Sharing | <-- Users/Applications Access Data
+---------------------------+
|
v
+---------------------------+
| Data Archiving | <-- Move Infrequently Accessed Data to Cheap Storage
+---------------------------+
|
v
+---------------------------+
| Data Backup | <-- Automated Backups for Redundancy
+---------------------------+
|
v
+---------------------------+
| Data Deletion | <-- Delete Unnecessary Data after Retention Period
+---------------------------+
|
v
+---------------------------+
| Data Recovery & Restore | <-- Recover Lost or Corrupted Data
+---------------------------+
Cloud Storage Provisioning
The process of provisioning cloud storage varies depending on the cloud service provider (such as AWS, Google Cloud, or Azure), but the core principles remain the same. Cloud storage provisioning helps businesses ensure that the right amount of storage is available to users or applications, while also optimizing performance, cost, and data security.
Here's an example of provisioning cloud storage using AWS S3 (Simple Storage Service):
● Sign in to the AWS Management Console and create a new S3 bucket, choosing a globally unique name and a region.
● Set up permissions using IAM policies or ACLs to control who can access the bucket and what actions they can perform.
● Enable bucket versioning to keep track of changes to the objects stored in the bucket.
● Configure S3 Lifecycle policies to automate data movement between storage classes (e.g.,
move data to S3 Glacier after 30 days).
● Review the settings, and once satisfied, click Create to provision the S3 storage.
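The same provisioning steps can be scripted. A minimal sketch with boto3, using a hypothetical bucket name and assuming the us-east-1 region and configured credentials:

import boto3

s3 = boto3.client("s3", region_name="us-east-1")
bucket = "example-notes-bucket"  # hypothetical; names must be globally unique

# Step 1: create the bucket.
s3.create_bucket(Bucket=bucket)

# Step 2: enable versioning to track changes to objects.
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)

# Step 3: lifecycle rule - move objects to S3 Glacier after 30 days.
s3.put_bucket_lifecycle_configuration(
    Bucket=bucket,
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-after-30-days",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
        }]
    },
)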
Data-Intensive Technologies in Cloud Computing
Data-intensive technologies in cloud computing are particularly useful for applications that require processing, storing, and analyzing large volumes of data at high speed, such as data analytics, machine learning, and IoT systems.
1. Scalability:
○ Cloud technologies can scale horizontally, meaning they can handle an increasing
volume of data without impacting performance. As data grows, cloud services can
dynamically allocate resources to accommodate the load.
2. Cost Efficiency:
○ Cloud platforms offer pay-as-you-go models, meaning that businesses only pay for the
resources they use. This is ideal for data-intensive applications, which often
experience fluctuating workloads.
3. Real-Time Insights:
○ Data-intensive technologies, such as stream processing and real-time analytics, allow
businesses to gain immediate insights from data, which is crucial for applications like
fraud detection, IoT, and customer behavior analysis.
4. Automation and Ease of Use:
○ Managed services and serverless computing automate much of the infrastructure
management, making it easier for businesses to deploy and scale data-intensive
applications without the need for deep technical expertise.
5. Global Availability:
○ Cloud providers have data centers around the world, which allows businesses to store
and process data closer to end users, improving access speed and compliance with data
sovereignty laws.
6. Enhanced Security:
○ Cloud providers offer strong security measures, including encryption, identity
management, and access controls, to ensure that large datasets remain secure in the
cloud.
Google File System (GFS)
GFS is a distributed file system designed by Google to meet the specific needs of its large-scale data processing and storage requirements. It is optimized for large files and designed to work with Google's infrastructure, which is spread across many machines.
Limitations of GFS:
● GFS is proprietary and was designed for Google's specific needs. It is not open-source, and
thus, it is not directly available to the public.
● Its design is focused on handling very large files, so it may not be the best choice for
applications that require frequent, small random access to data.
Hadoop Distributed File System (HDFS)
HDFS is an open-source distributed file system inspired by the Google File System (GFS). It was developed as part of the Apache Hadoop project, which is widely used in the big data ecosystem for processing large datasets using tools like MapReduce and Spark.
Limitations of HDFS:
● Not Suitable for Small Files: HDFS is optimized for large files, so managing a large number
of small files (which is common in traditional applications) can lead to performance
bottlenecks.
● Latency: While it’s excellent for large, sequential reads and writes, HDFS is not designed for
low-latency operations or frequent small random reads/writes.
● Lack of Native File Locking: HDFS does not support file locking, which can be an issue for
some types of applications that require it.
Comparison of GFS and HDFS:
Aspect          | GFS                                                              | HDFS
Purpose         | Proprietary system designed for Google's large-scale data processing needs | Open-source system for distributed data processing, part of the Hadoop ecosystem
File Size       | Optimized for very large files                                   | Optimized for very large files (typically large blocks)
Fault Tolerance | Handles hardware failures with data replication                  | Handles hardware failures with data replication
Data Access     | Primarily designed for sequential access                         | Primarily designed for sequential access
Use Cases       | Large-scale web crawling, log analysis, and big data processing  | Big data analytics, batch processing, data warehousing
Limitations     | Not available for public use; specialized design                 | Less suited for low-latency or small-file processing
Use Cases of GFS
● Large-Scale Data Processing: GFS is optimized for large-scale data processing in the context of web search, indexing, and analytics at Google.
● Internal Google Applications: GFS was specifically developed for Google's proprietary data
processing tools like MapReduce.
Use Cases of HDFS
● Big Data Analytics: HDFS is commonly used in the cloud for running Hadoop clusters that process large volumes of data. It's widely used by data scientists and businesses for analytics.
● Data Lakes: HDFS is often used to store raw, unstructured data in a data lake for analytics
processing using frameworks like Apache Spark and Hive.
● Batch Processing: HDFS is well-suited for scenarios that require processing large batches of
data, such as log file processing, data mining, and machine learning.
● IoT Data Storage: As IoT devices generate huge amounts of data, HDFS can be used to store
and analyze IoT data.
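A minimal sketch of reading and writing HDFS from Python with the hdfs (WebHDFS) client library; the NameNode URL, user, and paths are hypothetical:

from hdfs import InsecureClient  # pip install hdfs

client = InsecureClient("http://namenode:9870", user="hadoop")  # hypothetical endpoint

# Write a file into HDFS (stored as large replicated blocks under the hood).
client.write("/data/logs/events.txt", data=b"event-1\nevent-2\n", overwrite=True)

# Read it back; large sequential reads are where HDFS performs best.
with client.read("/data/logs/events.txt") as reader:
    print(reader.read().decode())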