Article
Performance Evaluations of Distributed File Systems for
Scientific Big Data in FUSE Environment
Jun-Yeong Lee 1, Moon-Hyun Kim 1, Syed Asif Raza Shah 2, Sang-Un Ahn 3, Heejun Yoon 3 and Seo-Young Noh 1,*
Abstract: Data are important and ever growing in data-intensive scientific environments. Such
research data growth requires data storage systems that play pivotal roles in data management and
analysis for scientific discoveries. Redundant Array of Independent Disks (RAID), a well-known
storage technology combining multiple disks into a single large logical volume, has been widely
used for the purpose of data redundancy and performance improvement. However, this requires
RAID-capable hardware or software to build up a RAID-enabled disk array. In addition, it is difficult
to scale up the RAID-based storage. In order to mitigate such a problem, many distributed file
systems have been developed and are being actively used in various environments, especially in
data-intensive computing facilities, where a tremendous amount of data have to be handled. In this
study, we investigated and benchmarked various distributed file systems, such as Ceph, GlusterFS,
Lustre and EOS for data-intensive environments. In our experiment, we configured the distributed
file systems under a Reliable Array of Independent Nodes (RAIN) structure and a Filesystem in
Userspace (FUSE) environment. Our results identify the characteristics of each file system that affect
the read and write performance depending on the features of data, which have to be considered in
data-intensive computing environments.
Keywords: data-intensive computing; distributed file system; RAIN; FUSE; Ceph; EOS; GlusterFS; Lustre
distributed file system provides horizontal scalability compared to RAID, which uses
vertical scalability. Additionally, some distributed file systems provide geo-replication,
allowing data to be geographically replicated throughout the sites. Due to these features,
distributed file systems provide more redundancy than RAID storage systems. Distributed
file systems are widely deployed at many data-intensive computing facilities. EOS, one
of the distributed file systems, was developed by CERN in 2010. It is currently deployed
for storing approximately 340 petabytes, consisting of 6 billion files [3]. Many national
laboratories and supercomputing centers, like Oak Ridge National Laboratory, use Lustre
for their storage for high-performance computing [4]. In this study, we deployed and
evaluated several distributed file systems using a small cluster with inexpensive server
hardware and analyzed the performance characteristics for each file system. We configured
both a RAID 6-like RAIN data storage layout and a distributed data storage layout and measured
the performance of file systems by accessing data using a FUSE client rather than using
vendor-specific APIs and benchmarking tools. Our approach allows us to distinguish the main
performance differences of distributed file systems in userspace, which directly affect the user
experience. Our experimental results show that the performance impacts
depend on the scientific data analysis scenarios. Therefore, it is expected that the outcomes
of our research can provide valuable insights which can help scientists when deploying
distributed file systems in their data-intensive computing environments, considering the
characteristics of their data.
The rest of this paper is organized as follows: In Section 2, we describe which technologies
and distributed file systems were used for our evaluation. In Section 3, we describe
previous studies relevant to our research. In Section 4, we describe our evaluation environment
and configuration of hardware and software which were used for our evaluation. In
Sections 5 and 6, we cover the results of our evaluation. Finally, Section 7 describes our
conclusions about the results and future plans.
2. Background
In this section, we discuss the background knowledge related to our work, including RAIN, FUSE and the distributed file systems evaluated in this study.
2.1. RAIN
Reliable Array of Independent Nodes (RAIN) is a collaboration project from the
Caltech Parallel and Distributed Computing Group and Jet Propulsion Laboratory from
NASA [5]. The purpose of this project was to create a reliable parallel computing cluster
using commodity hardware and storage with multiple network connections. RAIN implements
redundancy using multiple computing nodes and storage nodes which consist
of heterogeneous clusters. RAIN features scalability, dynamic reconfiguration and high
availability. RAIN can handle failures using four main techniques, described below:
• Multiple network interfaces at each node.
• Single point of failure prevention using network monitoring.
• Cluster monitoring using grouping.
• Storage node redundancy using error-correcting codes such as RAID.
2.2. FUSE
Filesystem in Userspace (FUSE) [6] is an interface, included in most Linux distributions, that
allows a userspace program to provide a file system to the Linux kernel. Implementing a file
system directly in the Linux kernel is very difficult, but the FUSE library allows a file system
to be implemented without modifying the kernel directly. FUSE provides high-level and
low-level APIs and supports various platforms such as Linux, BSD
and macOS. Due to these characteristics, hundreds of file systems have been implemented
using the FUSE library [7].
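To make the FUSE workflow concrete, the sketch below implements a minimal read-only file system in userspace. It is a simplified illustration, assuming the third-party fusepy bindings rather than the libfuse C API cited above; the file name, its contents and the mount invocation are hypothetical.

```python
# Minimal read-only FUSE file system sketch using the third-party fusepy bindings.
# Assumes fusepy is installed (pip install fusepy); names and contents are illustrative.
import errno
import stat
import sys
import time

from fuse import FUSE, FuseOSError, Operations

DATA = b"Hello from a userspace file system\n"  # contents of the single exposed file


class HelloFS(Operations):
    """Exposes one read-only file, /hello, backed by an in-memory buffer."""

    def getattr(self, path, fh=None):
        now = time.time()
        if path == "/":
            return dict(st_mode=(stat.S_IFDIR | 0o755), st_nlink=2,
                        st_ctime=now, st_mtime=now, st_atime=now)
        if path == "/hello":
            return dict(st_mode=(stat.S_IFREG | 0o444), st_nlink=1,
                        st_size=len(DATA), st_ctime=now, st_mtime=now, st_atime=now)
        raise FuseOSError(errno.ENOENT)

    def readdir(self, path, fh):
        # Directory listing for the root directory only.
        return [".", "..", "hello"]

    def read(self, path, size, offset, fh):
        # Serve the requested byte range from the in-memory buffer.
        return DATA[offset:offset + size]


if __name__ == "__main__":
    # Usage (hypothetical mount point): python hellofs.py /mnt/hellofs
    FUSE(HelloFS(), sys.argv[1], foreground=True, ro=True)
```

Once such a file system is mounted, standard tools operate on the mount point exactly as they would on a kernel file system, which is the property our FUSE-based evaluation relies on.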
2.3. Ceph
Ceph [8] is an open-source distributed file system originally developed at the University of
California, Santa Cruz and now maintained by the Ceph Foundation. This file system provides object,
block and file storage in a unified system. In addition, it uses the Reliable Autonomous
Distributed Object Store (RADOS) to provide reliable and high-performance storage that
can scale up from petabyte to exabyte capacity. RADOS consists of a monitor (MON),
manager (MGR), object storage daemon (OSD) and metadata server (MDS). MON maintains
a master copy of the cluster map, which contains the topology of the cluster. MGR runs
with MON, which provides an additional monitoring interface for external monitoring and
management systems. OSDs interact with logical disks and handle data read, write and
replication operations on the actual physical disk drives. MDS provides metadata to CephFS for
serving file services. Ceph stores data using the Controlled Replication Under Scalable
Hashing (CRUSH) algorithm, which can place and locate data using the hash algorithm [9].
Figure 1 shows the structure of Ceph.
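As a rough illustration of the idea behind hash-based placement, the sketch below maps an object name to a placement group (PG) and a PG to a set of OSDs using deterministic hashing. This is a simplified sketch, not the actual CRUSH algorithm or any Ceph code; real CRUSH additionally walks a weighted cluster hierarchy and respects failure domains. The PG count, replica count and OSD names are hypothetical.

```python
# Simplified, hypothetical illustration of hash-based data placement in the spirit of
# CRUSH. Not the actual CRUSH algorithm: real CRUSH uses a weighted cluster hierarchy.
import hashlib


def stable_hash(value: str) -> int:
    """Deterministic hash so every client computes the same placement."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


def object_to_pg(obj_name: str, pg_num: int) -> int:
    """Map an object name to a placement group (PG) index."""
    return stable_hash(obj_name) % pg_num


def pg_to_osds(pg, osds, replicas):
    """Pick `replicas` distinct OSDs for a PG by scoring each OSD deterministically."""
    scored = sorted(osds, key=lambda osd: stable_hash(f"{pg}:{osd}"), reverse=True)
    return scored[:replicas]


if __name__ == "__main__":
    osds = ["osd.0", "osd.1", "osd.2", "osd.3", "osd.4"]  # hypothetical cluster
    pg = object_to_pg("experiment/run42/event-000123", pg_num=128)
    print("PG:", pg, "-> OSDs:", pg_to_osds(pg, osds, replicas=3))
```

Because placement is computed rather than looked up, any client can locate an object directly without consulting a central table, which is what allows RADOS to scale.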
2.4. EOS
EOS [10] is an open-source disk-based distributed storage developed by CERN. It
is used to store LHC experiment data and user data at CERN. EOS natively supports
the XRootD protocol, but also supports various other protocols, such as HTTP, WebDAV,
GridFTP and FUSE. EOS consists of three components: MGM, FST and MQ. Figure 2 shows
the structure of EOS. MGM is a management server that manages the namespace, file
system quota and file placement and location. FST is a file storage server that stores data
and metadata. MQ is a message queue that provides asynchronous messaging between
MGM and FST. EOS uses a “layout” to store data [11]. The layout determines how data
are stored on the file system. Some layouts support replication or erasure coding,
which can protect against data loss in the event of failures or accidents. The layouts are shown in Table 1.
2.5. GlusterFS
GlusterFS is an open-source distributed file system that is developed and supported
by RedHat [12]. This file system binds multiple server disk resources into a single global
namespace using the network. GlusterFS can be scaled up to several petabytes and can be
used with commodity hardware to create storage. GlusterFS provides replication, quotas,
geo-replication, snapshots and bit rot detection. Unlike other distributed file systems,
GlusterFS has no central management node or metadata node. GlusterFS can be accessed
not only with the GlusterFS native client, but also with various protocols,
such as the Network File System (NFS), Server Message Block (SMB) and Common Internet File
System (CIFS). Figure 3 shows the architecture of GlusterFS.
GlusterFS stores data in a place called a volume, which consists of multiple bricks; a brick
can be a single disk or a just-a-bunch-of-disks (JBOD) enclosure [13]. Volumes can be created
in different types, and some volume types support replication or erasure
coding. Table 2 shows the types of volume used in GlusterFS.
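To illustrate how a file system without a central metadata server can still place files deterministically, the sketch below hashes a file name into a 32-bit range and assigns each brick a contiguous slice of that range, loosely following the idea behind GlusterFS's elastic hashing. It is a simplified assumption-laden example: the hash function, range split and brick names are illustrative and not GlusterFS's actual implementation.

```python
# Hypothetical sketch of range-based name hashing, loosely inspired by GlusterFS's
# distributed (elastic) hashing; not actual GlusterFS code or its hash function.
import zlib

BRICKS = ["server1:/brick1", "server2:/brick1", "server3:/brick1"]  # hypothetical
RANGE_MAX = 2 ** 32


def brick_for(filename: str) -> str:
    """Hash the file name into [0, 2^32) and pick the brick owning that slice."""
    h = zlib.crc32(filename.encode()) % RANGE_MAX
    slice_size = RANGE_MAX // len(BRICKS)
    index = min(h // slice_size, len(BRICKS) - 1)  # clamp the final partial slice
    return BRICKS[index]


if __name__ == "__main__":
    for name in ["run-001.root", "run-002.root", "calib/2021-05.dat"]:
        print(name, "->", brick_for(name))
```

Since placement depends only on the file name and the brick layout, every client resolves the same brick without contacting a metadata node, which is why GlusterFS can operate without one.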
2.6. Lustre
Lustre is an open-source distributed file system designed for high-performance computing [4].
Lustre started as a project at Carnegie Mellon University and is currently
used in many high-performance computing clusters. It uses a distributed object storage
architecture [14], which consists of a management server, metadata server and object storage
server. The management server (MGS) manages all Lustre servers and clients. In addition,
it stores the server configuration. The metadata server (MDS) stores metadata information.
Multiple metadata servers can be deployed to scale up metadata storage and provide more
redundancy. The object storage server (OSS) provides the storage for data. It uses striping
to maximize performance and storage capacity. Figure 4 shows the architecture of Lustre.
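To make the striping idea concrete, the sketch below computes, for a given file byte offset, which storage target and object-local offset the data fall on under simple round-robin striping. The stripe size and stripe count are hypothetical example values, and the calculation is a generic illustration of striping rather than Lustre's internal layout code.

```python
# Generic round-robin striping illustration with hypothetical parameters;
# not Lustre internals, but the same layout idea Lustre applies across its targets.
STRIPE_SIZE = 1 * 1024 * 1024   # 1 MiB per stripe (example value)
STRIPE_COUNT = 4                # file striped across 4 targets (example value)


def locate(offset):
    """Return (target index, offset within that target's object) for a file offset."""
    stripe_index = offset // STRIPE_SIZE          # which stripe of the file
    target = stripe_index % STRIPE_COUNT          # round-robin across targets
    round_number = stripe_index // STRIPE_COUNT   # full rounds completed so far
    local_offset = round_number * STRIPE_SIZE + offset % STRIPE_SIZE
    return target, local_offset


if __name__ == "__main__":
    for off in [0, 512 * 1024, 5 * 1024 * 1024, 17 * 1024 * 1024 + 3]:
        target, local = locate(off)
        print(f"file offset {off:>10} -> target {target}, object offset {local}")
```

Because consecutive stripes land on different targets, large sequential transfers are served by several storage servers in parallel, which is how striping raises aggregate throughput and capacity.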
3. Related Work
There are several studies which have been conducted to evaluate the performance of
distributed file systems.
Gudu et al. [15] implemented Ceph using commodity servers to provide multi-use,
highly available and performance-efficient file storage for a variety of applications, from
shared home directories to the scratch directories of high-performance computing. They
evaluated the scalability of Ceph by increasing the number of object storage servers, the number of
clients and the object size to understand which factors affect file system performance. They benchmarked
Ceph using rados bench and RADOS block device with fio. In their experiment, cluster
network performance was measured using netcat and iperf while the individual data disk
performance was measured using osd tell to make a baseline for file system performance.
Zhang et al. [16] virtually deployed a Ceph cluster in an OpenStack environment to
evaluate the performance of Ceph deployed in the virtual environment. They benchmarked
the cluster’s network performance using netcat and iperf. They used rados bench and
RADOS block device with bonnie++ to measure the performance.
Kumar [17] configured GlusterFS in a software-defined network (SDN) environment
with six remote servers and analyzed how GlusterFS performs in large-scale scientific
applications. Through this environment, they evaluated GlusterFS and network performance.
Based on the evaluation results, they proposed which kind of quality of service (QoS) policy should
be provided to certain users when servicing GlusterFS in federated cloud environments.
Acquaviva et al. [18] presented different distributed file systems used in modern cloud
services, including HDFS, Ceph, GlusterFS and XtremeFS. They focused on write performance,
fault tolerance and re-balancing ability for each file system. They also evaluated
the deployment time for each distributed file system.
In addition, several benchmark tools designed to evaluate distributed file systems have
also been introduced. Li et al. [19] developed the LZpack benchmark for distributed
file systems, which can test metadata and file I/O performance, and evaluated file system
performance using Lustre and NFS.
Lee et al. [20] proposed a large-scale object storage benchmark based on
the Yahoo! Cloud Serving Benchmark (YCSB) [21]. They developed a YCSB client for Ceph
RADOS, which can communicate between Ceph and YCSB to evaluate the performance of
Ceph storage using the YCSB.
Although there are many methods to evaluate the distributed file system performance,
two approaches are mainly used when evaluating the performance of file systems. The first
is using the file system’s own tool, for example, rados bench of Ceph [22] and TestDFSIO of
Hadoop [23]. The second is mounting the file system using a FUSE client and benchmarking
it with various tools such as dd [24], bonnie++ [25], iozone [26] and fio [27]. The first
method can verify the performance of a specific file system, but its file system-specific
(and potentially biased) API and tools cannot be applied to the other file systems to find
performance differences. However, if we use the FUSE clients provided by each file system, we can
mount the file systems in the same Linux userspace and verify performance using the same
tools with the same parameters. Therefore, it is possible to evaluate the distributed file
systems with the same conditions, resulting in fair performance comparisons, which can
give valuable insights to scientists when adopting distributed file systems in their data-
intensive computing environments.
We can find various studies [15–17] which have measured the performance of file sys-
tems using various tools. Other papers [19,20] also describe their own tools to benchmark
the file system. However, it is not easy to find research papers that use FUSE clients to
evaluate the performance of distributed file systems.
In this study, we evaluated the storage performance using FUSE clients provided by
each distributed file system with the FIO benchmark. We selected this method because
FUSE clients allow the file systems to be evaluated with identical parameters, which is important
for a fair comparison among file systems.
4.1. Hardware
To benchmark the distributed file systems, we configured a small cluster environment
to simulate a small distributed system. Our testing cluster environment had four servers
for deploying and testing the distributed file system. We configured the master server to
act as a management node to test the file systems. We also set up three slave servers to
act as storage for the distributed file systems. The detailed specifications for all servers
are listed in Table 3. All slave servers were configured identically to minimize variables
during the evaluation. For the OS, we installed CentOS 7.8.2003 Minimal on all servers. We then
separated the boot disk and the data disk on every server to prevent interference between the OS
and the storage used by the distributed file systems.
4.2. Network
Figure 5 shows the overall network configuration for our cluster. On the master server, we
used network bonding (configured with nmtui) to combine three 1 Gbit network interfaces into a
single 3 Gbit logical interface. In this way, we could minimize the bottleneck between the master
server and the three slave servers.
Slave servers were configured with a single 1 Gbit network interface and connected
to the same router as the master server. To evaluate the network configuration, the iperf
benchmark was performed simultaneously between the slave servers and the master server.
Table 4 shows the iperf results from our cluster.
The measured bandwidth of each slave server was about 910 Mbit/s due to the router's
internal hardware traffic limit, for a combined total of about 2730 Mbit/s.
The FIO benchmark options and parameters were as follows:
• Block Size: 4 K, 128 K, 1024 K
• Job Size: 1 GB
• Number of Threads: 1, 2, 4, 8
• IODepth: 32
• Evaluated I/O Patterns: Sequential Read, Sequential Write, Random Read, Random Write
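As an illustration of how the benchmark matrix above can be driven, the sketch below loops over the listed parameters and invokes fio for each combination. It assumes fio is installed and the file system under test is already mounted through its FUSE client at a hypothetical mount point; the job names, mount path and the libaio I/O engine are assumptions, since only the parameters listed above are specified by our configuration.

```python
# Sketch: run fio over the parameter matrix listed above.
# Assumes fio is installed and a FUSE-mounted file system at MOUNT (hypothetical path).
import itertools
import subprocess

MOUNT = "/mnt/dfs-under-test"          # hypothetical FUSE mount point
BLOCK_SIZES = ["4k", "128k", "1024k"]
THREADS = [1, 2, 4, 8]
PATTERNS = ["read", "write", "randread", "randwrite"]  # fio --rw values

for bs, jobs, rw in itertools.product(BLOCK_SIZES, THREADS, PATTERNS):
    cmd = [
        "fio",
        f"--name={rw}-{bs}-{jobs}t",
        f"--directory={MOUNT}",
        f"--rw={rw}",
        f"--bs={bs}",
        "--size=1G",              # job size; per-thread scaling is not specified above
        f"--numjobs={jobs}",
        "--iodepth=32",
        "--ioengine=libaio",      # assumed engine; not stated in the parameter list
        "--direct=1",             # bypass the page cache where the FUSE mount allows it
        "--group_reporting",
        "--output-format=json",
        f"--output=result-{rw}-{bs}-{jobs}t.json",
    ]
    subprocess.run(cmd, check=True)  # one JSON result file per parameter combination
```

Because every file system is exercised through its FUSE mount with the same fio invocation, the results remain directly comparable across file systems.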
5. Results
In this section, we discuss the evaluation results from our experiments. The measured
results are grouped by layout; in the case of the RAID 6-like RAIN layout, the benchmark
results of only three distributed file systems are shown as graphs because Lustre does not
support the corresponding layout.
in throughput due to increasing threads, but the increase in performance with increasing
threads was minimal. EOS increases throughput without deteriorating as the number
of threads increases at all block sizes. Unlike the other file systems, GlusterFS shows
similar performance between one and two threads at all block sizes. For more than two
threads, throughput was found to increase like any other file system. Lustre’s performance
was enhanced in a similar way to EOS. The 4 K and 128 K block evaluations showed
low performance compared to the other distributed file systems. However, at a 1024 K
block with eight threads, Lustre showed the highest performance compared to the other
file systems.
were very similar and showed higher throughput than the other file systems. However,
it should be noted that the increase in threads had a marginal effect on the increase in
throughput. In the 4 K block benchmark, the throughput of EOS increases, but decreases
when the number of threads is 8, resulting in lower throughput than single-threaded
performance. As shown in the 128 K block results, the increase in throughput from two
threads or more was insignificant. The result of the 1024 K block shows that throughput
increases by approximately 70 MB/s as the number of threads increases. In the case of GlusterFS,
the results showed poor throughput performance compared to the other file systems at
a 4 K block. In other block sizes, the throughput increases as the number of threads is
increased. At this point, we can see that the performance pattern is very similar to the
sequential read graph shown in Figure 10. Lustre showed increased throughput when
increasing the block size and the number of threads.
6. Discussion
In this section, we discuss our evaluation results.
While the RAID 6 layout provides lower disk utilization than the distributed layout due to
parity data, the parity data provide redundancy when part of the storage fails: the failed
storage can be recovered from the parity, ensuring the data remain intact. In addition, the
RAID 6 layout has lower throughput than the distributed layout as a result of parity
calculations, which can be seen in our results.
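As a concrete illustration of both points, the sketch below compares the usable capacity of a replica-based layout with a RAID 6-like (k data + m parity) layout and reconstructs a lost chunk from single XOR parity. Only one of the two RAID 6 parities is shown; real RAID 6 adds a second, Reed-Solomon-style parity so that any two simultaneous failures can be tolerated. The stripe width and chunk contents are example values.

```python
# Illustration of parity trade-offs: usable capacity and single-parity recovery.
# RAID 6 keeps two parities (P and Q); only the XOR (P) parity is shown here.
from functools import reduce


def usable_fraction(data_chunks, parity_chunks):
    """Fraction of raw capacity available for data in a k+m erasure layout."""
    return data_chunks / (data_chunks + parity_chunks)


def xor_chunks(chunks):
    """XOR a list of equal-length byte chunks together."""
    return bytes(reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chunks))


# Capacity comparison (example widths): 3-way replication vs. a 4+2 RAID 6-like layout.
print("replica-3 usable fraction:", 1 / 3)                        # ~0.33 of raw capacity
print("RAID 6-like 4+2 usable fraction:", usable_fraction(4, 2))  # ~0.67 of raw capacity

# Single-parity (XOR) recovery of one lost chunk in a stripe.
stripe = [b"\x01\x02", b"\x10\x20", b"\xaa\x55", b"\x0f\xf0"]  # example data chunks
parity = xor_chunks(stripe)

lost_index = 2
survivors = [c for i, c in enumerate(stripe) if i != lost_index] + [parity]
recovered = xor_chunks(survivors)
assert recovered == stripe[lost_index]  # the missing chunk is rebuilt from XOR parity
print("recovered chunk:", recovered.hex())
```

The extra parity computation on every write is also why the RAID 6-like layout trails the distributed layout in throughput in our measurements.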
Author Contributions: Conceptualization, J.-Y.L. and M.-H.K.; methodology, J.-Y.L., M.-H.K. and
S.-Y.N.; software, J.-Y.L.; validation, J.-Y.L. and S.-Y.N.; formal analysis, J.-Y.L. and M.-H.K.; resources,
S.-Y.N. and H.Y.; data curation, J.-Y.L.; writing—original draft preparation, J.-Y.L. and M.-H.K.;
writing—review and editing, S.A.R.S., S.-U.A., H.Y. and S.-Y.N.; visualization, J.-Y.L.; supervision,
S.-Y.N.; funding acquisition, S.-Y.N. All authors have read and agreed to the published version of
the manuscript.
Funding: This work was supported by the National Research Foundation of Korea (NRF) grant
funded by the Korean government (MSIT) (No. NRF-2008-00458).
Acknowledgments: The authors would like to extend their sincere thanks to the Global Science
Experimental Data Hub Center (GSDC) at the Korea Institute of Science and Technology Information
(KISTI) for their support of our research.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Rydning, D.R.J.G.J. The Digitization of the World from Edge to Core. Available online: https://www.seagate.com/files/www-
content/our-story/trends/files/idc-seagate-dataage-whitepaper.pdf (accessed on 4 January 2021).
2. CERN. Storage|CERN. Available online: https://home.cern/science/computing/storage (accessed on 4 January 2021).
3. Mascetti, L.; Rios, M.A.; Bocchi, E.; Vicente, J.C.; Cheong, B.C.K.; Castro, D.; Collet, J.; Contescu, C.; Labrador, H.G.; Iven, J.; et al.
CERN Disk Storage Services: Report from last data taking, evolution and future outlook towards Exabyte-scale storage. EPJ Web
Conf. 2020, 245, 04038. [CrossRef]
4. OpenSFS. About the Lustre® File System|Lustre. Available online: https://www.lustre.org/about/ (accessed on 4 January 2021).
5. Bohossian, V.; Fan, C.C.; LeMahieu, P.S.; Riedel, M.D.; Xu, L.; Bruck, J. Computing in the RAIN: A reliable array of independent
nodes. IEEE Trans. Parallel Distrib. Syst. 2001, 12, 99–114. [CrossRef]
6. Szeredi, M. Libfuse: Libfuse API Documentation. Available online: http://libfuse.github.io/doxygen/ (accessed on 4 January 2021).
7. Tarasov, V.; Gupta, A.; Sourav, K.; Trehan, S.; Zadok, E. Terra Incognita: On the Practicality of User-Space File Systems. In
Proceedings of the 7th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 15), Santa Clara, CA, USA,
6–7 July 2015. [CrossRef]
8. Ceph Foundation. Architecture—Ceph Documentation. Available online: https://docs.ceph.com/en/latest/architecture/
(accessed on 4 January 2021).
9. Weil, S.A.; Brandt, S.A.; Miller, E.L.; Maltzahn, C. CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data. In
Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, Tampa, FL, USA, 11–17 November 2006; Association for
Computing Machinery: New York, NY, USA, 2006; SC ’06, p. 122-es. [CrossRef]
10. CERN. Introduction—EOS CITRINE Documentation. Available online: https://eos-docs.web.cern.ch/intro.html (accessed on
4 January 2021).
11. CERN. RAIN—EOS CITRINE Documentation. Available online: https://eos-docs.web.cern.ch/using/rain.html (accessed on
4 January 2021).
12. Red Hat. Introduction—Gluster Docs. Available online: https://docs.gluster.org/en/latest/Administrator-Guide/GlusterFS-
Introduction/ (accessed on 4 January 2021).
13. Red Hat. Architecture—Gluster Docs. Available online: https://docs.gluster.org/en/latest/Quick-Start-Guide/Architecture/
(accessed on 4 January 2021).
14. OpenSFS. Introduction to Lustre—Lustre Wiki. Available online: https://wiki.lustre.org/Introduction_to_Lustre#Lustre_
Architecture (accessed on 4 January 2021).
15. Gudu, D.; Hardt, M.; Streit, A. Evaluating the performance and scalability of the Ceph distributed storage system. In Proceedings
of the 2014 IEEE International Conference on Big Data (Big Data), Washington, DC, USA, 27–30 October 2014; pp. 177–182.
[CrossRef]
16. Zhang, X.; Gaddam, S.; Chronopoulos, A.T. Ceph Distributed File System Benchmarks on an Openstack Cloud. In Pro-
ceedings of the 2015 IEEE International Conference on Cloud Computing in Emerging Markets (CCEM), Bangalore, India,
25–27 November 2015; pp. 113–120. [CrossRef]
17. Kumar, M. Characterizing the GlusterFS Distributed File System for Software Defined Networks Research. Ph.D. Thesis, Rutgers,
The State University of New Jersey, New Brunswick, NJ, USA, 2015. [CrossRef]
18. Acquaviva, L.; Bellavista, P.; Corradi, A.; Foschini, L.; Gioia, L.; Picone, P.C.M. Cloud Distributed File Systems: A Benchmark of
HDFS, Ceph, GlusterFS, and XtremeFS. In Proceedings of the 2018 IEEE Global Communications Conference (GLOBECOM),
Abu Dhabi, United Arab Emirates, 9–13 December 2018; pp. 1–6. [CrossRef]
19. Li, X.; Li, Z.; Zhang, X.; Wang, L. LZpack: A Cluster File System Benchmark. In Proceedings of the 2010 International Conference
on Cyber-Enabled Distributed Computing and Knowledge Discovery, Huangshan, China, 10–12 October 2010; IEEE Computer
Society: Washington, DC, USA, 2010; CYBERC ’10, pp. 444–447. [CrossRef]
20. Lee, J.; Song, C.; Kang, K. Benchmarking Large-Scale Object Storage Servers. In Proceedings of the 2016 IEEE 40th Annual
Computer Software and Applications Conference (COMPSAC), Atlanta, GA, USA, 10–14 June 2016; Volume 2, pp. 594–595.
[CrossRef]
21. Cooper, B.F.; Silberstein, A.; Tam, E.; Ramakrishnan, R.; Sears, R. Benchmarking Cloud Serving Systems with YCSB. In Proceedings
of the 1st ACM Symposium on Cloud Computing, Indianapolis, IN, USA, 10–11 June 2010; Association for Computing Machinery:
New York, NY, USA, 2010; SoCC ’10, pp. 143–154. [CrossRef]
22. Red Hat. Chapter 9. Benchmarking Performance Red Hat Ceph Storage 1.3|Red Hat Customer Portal. Available online:
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/1.3/html/administration_guide/benchmarking_
performance (accessed on 4 January 2021).
23. Li, J.; Wang, Q.; Jayasinghe, D.; Park, J.; Zhu, T.; Pu, C. Performance Overhead among Three Hypervisors: An Experimental
Study Using Hadoop Benchmarks. In Proceedings of the 2013 IEEE International Congress on Big Data, Santa Clara, CA, USA,
27 June–2 July 2013; IEEE Computer Society: Washington, DC, USA, 2013; BIGDATACONGRESS ’13, pp. 9–16. [CrossRef]
24. IEEE Standard for Information Technology–Portable Operating System Interface (POSIX(TM)) Base Specifications, Issue 7.
IEEE Std 1003.1-2017 (Revision of IEEE Std 1003.1-2008); 2018; pp. 2641–2649. Available online: https://ieeexplore.ieee.org/
document/8277153/ (accessed on 4 January 2021). [CrossRef]
25. Coker, R. Bonnie++—Russell Coker's Documents. Available online: https://doc.coker.com.au/projects/bonnie/ (accessed on
4 January 2021).
26. Capps, D. Iozone Filesystem Benchmark. Available online: http://iozone.org (accessed on 4 January 2021).
27. Axboe, J. GitHub—axboe/fio: Flexible I/O Tester. Available online: https://github.com/axboe/fio (accessed on 4 January 2021).
28. OpenSFS. Lustre Roadmap|Lustre. Available online: https://www.lustre.org/roadmap/ (accessed on 4 January 2021).