FlinkCL: An OpenCL-Based In-Memory Computing Architecture on Heterogeneous CPU-GPU Clusters for Big Data
Abstract:
Research on in-memory big data management and processing has been prompted by the increase in main
memory capacity and the explosion of big data. By offering an efficient in-memory distributed execution
model, existing in-memory cluster computing platforms such as Flink and Spark have proven to be
outstanding at processing big data. This paper proposes FlinkCL, an in-memory computing architecture
on heterogeneous CPU-GPU clusters based on OpenCL that enables Flink to utilize the massive parallel
processing ability of GPUs. Our proposed architecture employs four techniques: a heterogeneous distributed
abstract model (HDST), a Just-In-Time (JIT) compiling scheme, a hierarchical partial reduction (HPR) scheme
and a heterogeneous task management strategy. Using FlinkCL, programmers only need to write Java
code against simple interfaces; the Java code is compiled to OpenCL kernels and executed on CPUs
and GPUs automatically. In the HDST, a novel memory mapping scheme avoids serialization and
deserialization between Java Virtual Machine (JVM) objects and OpenCL structs. We have comprehensively
evaluated FlinkCL with a set of representative workloads to show its effectiveness. Our results show that
FlinkCL improves performance by up to 11× for some computationally heavy algorithms while retaining
minor performance improvements for an I/O-bound algorithm.
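The abstract does not spell out the programming interfaces; the sketch below only illustrates the described style of writing plain Java functions that a JIT compiler could translate into OpenCL kernels. The hMap/hReduce operator names appear later in the paper, but the interface and class names here are assumptions made for illustration.

// Hypothetical FlinkCL-style interfaces, assumed for illustration only.
interface HMapFunction<IN, OUT> {
    OUT hMap(IN value);
}

interface HReduceFunction<T> {
    T hReduce(T left, T right);
}

public class SumOfSquaresExample {
    // User-defined functions built from primitive arithmetic: the kind of
    // restricted Java a Java-to-OpenCL JIT compiler can translate into kernels.
    static class Square implements HMapFunction<Float, Float> {
        @Override
        public Float hMap(Float x) {
            return x * x;
        }
    }

    static class Sum implements HReduceFunction<Float> {
        @Override
        public Float hReduce(Float a, Float b) {
            return a + b;
        }
    }

    public static void main(String[] args) {
        float[] data = {1f, 2f, 3f, 4f};
        HMapFunction<Float, Float> map = new Square();
        HReduceFunction<Float> reduce = new Sum();

        // CPU fallback path: a sequential loop. On a GPU, the same functions
        // would instead be compiled to OpenCL kernels and executed in parallel.
        float acc = 0f;
        for (float x : data) {
            acc = reduce.hReduce(acc, map.hMap(x));
        }
        System.out.println("sum of squares = " + acc);   // 30.0
    }
}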
Existing System:
By offering an efficient in-memory distributed execution model, existing in-memory cluster computing
platforms such as Flink and Spark have proven to be outstanding at processing big data. However, these
platforms execute user functions on CPUs only and cannot exploit the massive parallel processing ability
of the GPUs present in heterogeneous CPU-GPU clusters.
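For concreteness, the kind of CPU-only pipeline that the existing Flink DataSet API executes looks like the following minimal sketch (an illustrative word-length sum, not an example from the paper).

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class FlinkWordLengthSum {
    public static void main(String[] args) throws Exception {
        // Standard Flink entry point for batch (DataSet) programs.
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<String> words = env.fromElements("flink", "spark", "opencl", "gpu");

        // Both operators run on CPU task slots; the engine offers no GPU offloading.
        DataSet<Integer> totalLength = words
                .map(new MapFunction<String, Integer>() {
                    @Override
                    public Integer map(String word) {
                        return word.length();
                    }
                })
                .reduce(new ReduceFunction<Integer>() {
                    @Override
                    public Integer reduce(Integer a, Integer b) {
                        return a + b;
                    }
                });

        // print() triggers execution and prints the result.
        totalLength.print();
    }
}

FlinkCL keeps this Flink-style programming model while adding GPU execution behind it.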
Proposed System:
Our proposed architecture employs four techniques: a heterogeneous distributed abstract model (HDST), a
Just-In-Time (JIT) compiling scheme, a hierarchical partial reduction (HPR) scheme and a heterogeneous
task management strategy. Using FlinkCL, programmers only need to write Java code against simple
interfaces; the Java code is compiled to OpenCL kernels and executed on CPUs and GPUs automatically.
In the HDST, a novel memory mapping scheme avoids serialization and deserialization between Java
Virtual Machine (JVM) objects and OpenCL structs. We have comprehensively evaluated FlinkCL with a
set of representative workloads to show its effectiveness. Our results show that FlinkCL improves
performance by up to 11× for some computationally heavy algorithms while retaining minor performance
improvements for an I/O-bound algorithm.
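The section above does not detail the HDST mapping scheme; as a rough illustration of the idea of keeping JVM data in a layout that OpenCL can consume directly, a record such as struct Point { float x; float y; } could be backed by an off-heap buffer in native byte order instead of serialized Java objects. The class and method names below are assumptions, not the paper's actual API.

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch: a collection of points stored in an off-heap buffer
// whose byte layout matches the OpenCL struct
//     struct Point { float x; float y; };
// so the buffer can be handed to the OpenCL host API as-is, without Java
// object serialization or deserialization.
public class PointBuffer {
    private static final int FLOATS_PER_POINT = 2;                      // x and y
    private static final int BYTES_PER_POINT = FLOATS_PER_POINT * Float.BYTES;

    private final ByteBuffer buffer;

    public PointBuffer(int capacity) {
        // Direct buffer in native byte order: its bytes are already laid out
        // the way the device side expects.
        this.buffer = ByteBuffer.allocateDirect(capacity * BYTES_PER_POINT)
                                .order(ByteOrder.nativeOrder());
    }

    public void set(int index, float x, float y) {
        int offset = index * BYTES_PER_POINT;
        buffer.putFloat(offset, x);
        buffer.putFloat(offset + Float.BYTES, y);
    }

    public float getX(int index) {
        return buffer.getFloat(index * BYTES_PER_POINT);
    }

    public float getY(int index) {
        return buffer.getFloat(index * BYTES_PER_POINT + Float.BYTES);
    }

    public ByteBuffer raw() {
        return buffer;   // passed directly to the OpenCL host API
    }
}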
Conclusion:
GPUs have become efficient accelerators for HPC. This paper has proposed FlinkCL, which harnesses the
high computational power of GPUs to accelerate in-memory cluster computing with an easy programming
model. FlinkCL is based on four proposed core techniques: an HDST, a JIT compiling scheme, an HPR
scheme and a heterogeneous task management strategy. By using these techniques, FlinkCL remains
compatible with both the compile time and the runtime of the original Flink. To further improve the
scalability of FlinkCL, a pipeline scheme similar to those introduced in related work could be considered.
Such a pipeline would overlap the communication between cluster nodes with the computation within a
node. In addition, by using an asynchronous execution model, PCIe transfers and GPU executions can
also be overlapped. In the current implementation, data in GPU memory must be moved into host memory
before it can be sent over the network; a future research direction could involve enabling GPU-to-GPU
communication via GPUDirect RDMA to further improve performance. Another optimization could be a
software cache scheme that caches intermediate data in GPUs to avoid unnecessary data transfers over
PCIe. In the current design, the hMap and hReduce functions are compiled to separate kernels. Where
possible, our JIT compiler could fuse these kernels; by adopting this scheme, kernel invocation overhead
can be decreased and some PCIe data transfers can be avoided, as sketched below.
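As a rough host-side illustration of the intended kernel fusion (the actual JIT output would be OpenCL C, and the method names below are placeholders), fusing hMap into hReduce replaces two passes and an intermediate buffer with a single pass.

public class FusionSketch {
    // Unfused: one pass materializes the mapped values (extra memory traffic,
    // and on a GPU an extra kernel launch plus possible PCIe transfers).
    static float mapThenReduce(float[] input) {
        float[] mapped = new float[input.length];
        for (int i = 0; i < input.length; i++) {
            mapped[i] = input[i] * input[i];        // hMap kernel
        }
        float acc = 0f;
        for (float v : mapped) {
            acc += v;                               // hReduce kernel
        }
        return acc;
    }

    // Fused: the map body is inlined into the reduction loop, so the
    // intermediate buffer and the second kernel invocation disappear.
    static float fusedMapReduce(float[] input) {
        float acc = 0f;
        for (float x : input) {
            acc += x * x;                           // hMap fused into hReduce
        }
        return acc;
    }

    public static void main(String[] args) {
        float[] data = {1f, 2f, 3f, 4f};
        System.out.println(mapThenReduce(data));    // 30.0
        System.out.println(fusedMapReduce(data));   // 30.0
    }
}

On a GPU, the fused version saves one kernel launch and avoids materializing the intermediate array, which is where the reduced PCIe traffic would come from.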