IBM GPFS

Sep 3, 2015

GPFS (General Parallel File System) is a high-performance clustered file system developed by IBM that can be deployed in shared disk or shared-nothing distributed parallel modes. It was created to address the growing imbalance between increasing CPU, memory, and network speeds, and the relatively slower growth of disk drive speeds. GPFS provides high scalability, availability, and advanced data management features like snapshots and replication. It is used extensively by large companies and supercomputers due to its ability to handle large volumes of data and high input/output workloads in distributed, parallel environments.

GPFS: General Parallel File System
Why is it needed?
What is GPFS and what are its features?
Where is it being used?

Growth Rate of Components
• CPU speed has increased 8 to 10 times.
• DRAM speed has increased 7 to 9 times.
• Network speed has increased 100 times.
• Bus speed has increased 20 times.
• But hard disk drive (HDD) speed has increased only 1.2 times.
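
To make the imbalance concrete, the figures above can be turned into a rough ratio of each component's improvement to that of the hard disk. This is a minimal sketch that uses only the numbers on the slide. Where the slide gives a range, the midpoint is used, which is an assumption made here purely for illustration.

```python
# Improvement factors as quoted on the slide; ranges are kept as (low, high).
growth = {
    "CPU": (8, 10),
    "DRAM": (7, 9),
    "Network": (100, 100),
    "Bus": (20, 20),
    "HDD": (1.2, 1.2),
}


def midpoint(bounds):
    """Midpoint of a (low, high) range; an assumption, not a slide figure."""
    low, high = bounds
    return (low + high) / 2


def gap_versus_hdd(component):
    """How many times more a component improved than the HDD did."""
    return midpoint(growth[component]) / midpoint(growth["HDD"])


for name in growth:
    print(f"{name:8s} improved {gap_versus_hdd(name):5.1f}x relative to HDD")
```

Under these assumptions every other component has improved several times more than the disk, which is the gap a parallel file system tries to close by spreading I/O over many disks.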

Three Important Functions of Enterprise Storage
• Store data
• Protect data from being lost
• Feed data to the computer’s processors (so they can keep doing work)

Limitations of Existing Solutions
• DAS, NAS, and SAN are not sufficient on their own
• Many data centers have become victims of “filer-sprawl”
• The costs of data administration and management (such as migration, backups, archiving) skyrocket
• I/O performance and application workflow suffer

What is GPFS
• The General Parallel File System (GPFS) is a high-performance clustered file system. It can be deployed in shared-disk or shared-nothing distributed parallel modes.
• Developer(s): IBM
• Operating system: AIX / Linux / Windows Server
• License: Proprietary
• System Introduced: 1998 (AIX)
• Max. volume size: 8 YB
• Max. file size: 8 EB
• Max. number of files: 2^64 per file system
• File system permissions: POSIX
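
The size limits are easier to compare as powers of two. The sketch below converts them, assuming binary prefixes (1 EB = 2^60 bytes, 1 YB = 2^80 bytes) and reading the file-count limit as 2^64. The slide does not say whether binary or decimal units are meant, so the exponents are illustrative.

```python
# Assumption: binary prefixes. The slide does not state which convention it uses.
EB = 2**60
YB = 2**80

max_file_size = 8 * EB    # "Max. file size: 8 EB"
max_volume_size = 8 * YB  # "Max. volume size: 8 YB"
max_files = 2**64         # "Max. number of files", read as 2^64


def exponent_of_two(n):
    """Return k such that n == 2**k; raise if n is not a power of two."""
    if n <= 0 or n & (n - 1):
        raise ValueError("not a power of two")
    return n.bit_length() - 1


print(f"max file size   = 2^{exponent_of_two(max_file_size)} bytes")
print(f"max volume size = 2^{exponent_of_two(max_volume_size)} bytes")
print(f"max file count  = 2^{exponent_of_two(max_files)}")
print(f"maximum-size files per volume = {max_volume_size // max_file_size:,}")
```

With these assumptions a single volume could hold 2^20 (about one million) files of the maximum size.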

GPFS Current Usage
• It is used by many of the world's largest commercial companies, as well as some of the supercomputers on the TOP500 list.
• For example, GPFS was the file system of the ASC Purple supercomputer, which comprised more than 12,000 processors and 2 petabytes of total disk storage spanning more than 11,000 disks.
• IBM's GPFS is used extensively across industries such as government, oil and gas, life sciences, media and entertainment, and financial services.

GPFS Features
• Standard file system interface with POSIX semantics
– Metadata on shared storage
– Distributed locking for read/write semantics
• Highly scalable
– High capacity (up to 2^99 bytes file system size, up to 2^63 files per file system)
– High throughput (TB/s)
– Wide striping
– Large block size (up to 16 MB)
– Multiple nodes write in parallel
• Advanced data management
– ILM (storage pools), snapshots
– Backup, HSM (DMAPI)
– Remote replication, WAN caching
• High availability
– Fault tolerance (node, disk failures)
– Online system management (add/remove nodes, disks, ...)
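
Wide striping can be pictured as spreading the blocks of a file over all available disks in round-robin order. The sketch below is an illustration of that idea only, not GPFS's actual allocation algorithm. The function name `stripe_layout`, the four-disk setup, and the 100 MB file are hypothetical. The 16 MB block size is the maximum quoted in the feature list.

```python
BLOCK_SIZE = 16 * 1024 * 1024  # 16 MB, the largest block size quoted above


def stripe_layout(file_size, num_disks, block_size=BLOCK_SIZE):
    """Map each block of a file to a disk in round-robin order.

    Returns one list per disk containing the block indices stored on it.
    Illustrative only: a real file system's allocator is more involved.
    """
    num_blocks = -(-file_size // block_size)  # ceiling division
    layout = [[] for _ in range(num_disks)]
    for block in range(num_blocks):
        layout[block % num_disks].append(block)
    return layout


# A hypothetical 100 MB file over 4 disks: 7 blocks (6 full, 1 partial).
layout = stripe_layout(100 * 1024 * 1024, num_disks=4)
for disk, blocks in enumerate(layout):
    print(f"disk {disk}: blocks {blocks}")
```

Because consecutive blocks land on different disks, a large sequential read or write can be serviced by several disks at once, and different nodes can work on different blocks in parallel.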


