FFT, FMM, and Multigrid on the Road to Exascale: performance challenges and opportunities Ibeid, H., Olson, L., and Gropp, W. Journal of Parallel and Distributed Computing
FFT, FMM, and multigrid methods are widely used fast and highly scalable solvers for elliptic PDEs. However, emerging large-scale computing systems are introducing challenges in comparison to current petascale computers. Recent efforts have identified several constraints in the design of exascale software, including massive concurrency, resilience management, exploiting the high performance of heterogeneous systems, energy efficiency, and utilizing the deeper and more complex memory hierarchy expected at exascale. In this paper, we perform a model-based comparison of the FFT, FMM, and multigrid methods in the context of these projected constraints. In addition, we use performance models to offer predictions about the expected performance on upcoming exascale system configurations based on current technology trends.
Learning with Analytical Models Ibeid, H., Meng, S., Dobon, O., Olson, L., and Gropp, W. IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
To understand and predict the performance of parallel and distributed programs, several analytical and machine learning approaches have been proposed, each having its advantages and disadvantages. In this paper, we propose and validate a hybrid approach exploiting both analytical and machine learning models. The hybrid model is able to learn and correct the analytical models to better match the actual performance. Furthermore, the proposed hybrid model improves the prediction accuracy in comparison to pure machine learning techniques while using small training datasets, thus making it suitable for hardware and workload changes.
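The hybrid idea in the abstract above can be illustrated with a minimal Python sketch: an analytical model gives a first prediction, and a small learned correction is fitted to a handful of measurements. Everything here is an assumption for illustration, not the method of the paper: the analytical form in `analytical_time`, its default parameters, the linear correction, and the function names are all hypothetical.

```python
import math


def analytical_time(n, p, t_flop=1e-9, t_msg=1e-6):
    """Hypothetical analytical model: perfectly parallel work on p
    processes plus a log-depth communication term."""
    return n * t_flop / p + t_msg * math.log2(p)


def fit_linear_correction(predicted, measured):
    """Ordinary least squares for measured ~ c0 + c1 * predicted."""
    m = len(predicted)
    mean_x = sum(predicted) / m
    mean_y = sum(measured) / m
    sxx = sum((x - mean_x) ** 2 for x in predicted)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(predicted, measured))
    c1 = sxy / sxx
    c0 = mean_y - c1 * mean_x
    return c0, c1


def hybrid_time(n, p, c0, c1):
    """Analytical prediction adjusted by the learned correction."""
    return c0 + c1 * analytical_time(n, p)


# Synthetic "measurements" (made up for this sketch): the real machine
# is assumed to be 1.5x slower than the model, plus a fixed overhead.
N = 10 ** 8
train_p = [2, 4, 8, 16]
train_pred = [analytical_time(N, p) for p in train_p]
train_meas = [1.5 * t + 2e-4 for t in train_pred]

c0, c1 = fit_linear_correction(train_pred, train_meas)
```

With only four training points the correction already recovers the synthetic slowdown and overhead, which is the appeal of starting from an analytical model; a real study would replace the linear fit with whatever learner suits the data.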
Fast Multipole Preconditioners for Sparse Matrices Arising from Elliptic Equations Ibeid, H., Yokota, R., Pestana, J., and Keyes, D. Computing and Visualization in Science
Among optimal hierarchical algorithms for the computational solution of elliptic problems, the fast multipole method (FMM) stands out for its adaptability to emerging architectures, having high arithmetic intensity, tunable accuracy, and relaxable global synchronization requirements. We demonstrate that, beyond its traditional use as a solver in problems for which explicit free-space kernel representations are available, the FMM has applicability as a preconditioner in finite domain elliptic boundary value problems, by equipping it with boundary integral capability for satisfying conditions at finite boundaries and by wrapping it in a Krylov method for extensibility to more general operators. Here, we do not discuss the well developed applications of FMM to implement matrix-vector multiplications within Krylov solvers of boundary element methods. Instead, we propose using FMM for the volume-to-volume contribution of inhomogeneous Poisson-like problems, where the boundary integral is a small part of the overall computation. Our method may be used to precondition sparse matrices arising from finite difference/element discretizations, and can handle a broader range of scientific applications. It is capable of algebraic convergence rates down to the truncation error of the discretized PDE comparable to those of multigrid methods, and it offers potentially superior multicore and distributed memory scalability properties on commodity architecture supercomputers. Compared with other methods exploiting the low-rank character of off-diagonal blocks of the dense resolvent operator, FMM-preconditioned Krylov iteration may reduce the amount of communication because it is matrix-free and exploits the tree structure of FMM. We describe our tests in reproducible detail with freely available codes and outline directions for further extensibility.
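As a minimal sketch of what wrapping a preconditioner in a Krylov method looks like, the following Python code implements preconditioned conjugate gradients with the preconditioner passed in as a function. All names (`pcg`, `apply_a`, `apply_m_inv`, `poisson_1d`) are hypothetical, the operator is a small 1D Poisson matrix, and a Jacobi (diagonal) preconditioner stands in at the place where an FMM-based approximate inverse would be applied. It is not the code used in the paper.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def poisson_1d(n):
    """Dense 1D Poisson matrix (2 on the diagonal, -1 off-diagonal)."""
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        a[i][i] = 2.0
        if i > 0:
            a[i][i - 1] = -1.0
        if i < n - 1:
            a[i][i + 1] = -1.0
    return a


def matvec(a, x):
    return [dot(row, x) for row in a]


def pcg(apply_a, b, apply_m_inv, tol=1e-10, max_iter=200):
    """Preconditioned conjugate gradients for a symmetric positive-definite
    operator. apply_m_inv(r) applies the preconditioner to a residual."""
    x = [0.0] * len(b)
    r = list(b)
    z = apply_m_inv(r)
    p = list(z)
    rz = dot(r, z)
    b_norm = dot(b, b) ** 0.5 or 1.0
    for it in range(max_iter):
        if dot(r, r) ** 0.5 <= tol * b_norm:
            return x, it
        ap = apply_a(p)
        alpha = rz / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        z = apply_m_inv(r)
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


n = 8
A = poisson_1d(n)
b = [1.0] * n
# Stand-in preconditioner (assumption): Jacobi, i.e. divide by the diagonal.
jacobi = lambda r: [ri / A[i][i] for i, ri in enumerate(r)]
x, iterations = pcg(lambda v: matvec(A, v), b, jacobi)
```

Because the solver only ever calls `apply_a` and `apply_m_inv`, neither the operator nor the preconditioner has to be stored as a matrix, which is the property the abstract refers to as matrix-free.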
Communication Reducing Algorithms for Distributed Hierarchical N-Body Problems with Boundary Distributions Abduljabbar, M., Markomanolis, G., Ibeid, H., Yokota, R., and Keyes, D. International Supercomputing Conference (ISC)
Reduction of communication and efficient partitioning are key issues for achieving scalability in hierarchical N-body algorithms such as the fast multipole method (FMM). In the present work, we propose three independent strategies to improve partitioning and reduce communication. First, we show that the conventional wisdom of using space-filling curve partitioning may not work well for boundary integral problems, which constitute a significant portion of FMM’s application user base. We propose an alternative method that modifies orthogonal recursive bisection to relieve the cell-partition misalignment that has kept it from scaling previously. Second, we optimize the granularity of communication to find the optimal balance between a bulk-synchronous collective communication of the local essential tree and an RDMA per task per cell. Finally, we take the dynamic sparse data exchange proposed by Hoefler et al. and extend it to a hierarchical sparse data exchange, which is demonstrated at scale to be faster than the commonly used MPI_Alltoallv of the MPI library.
A Performance Model for the Communication in Fast Multipole Methods on High-performance Computing Platforms Ibeid, H., Yokota, R., and Keyes, D. International Journal of High Performance Computing Applications (IJHPCA)
Exascale systems are predicted to have approximately 1 billion cores, assuming gigahertz cores. Limitations on affordable network topologies for distributed memory systems of such massive scale bring new challenges to the currently dominant parallel programming model. Currently, there are many efforts to evaluate the hardware and software bottlenecks of exascale designs. It is therefore of interest to model application performance and to understand what changes need to be made to ensure extrapolated scalability. The fast multipole method (FMM) was originally developed for accelerating N-body problems in astrophysics and molecular dynamics but has recently been extended to a wider range of problems. Its high arithmetic intensity, combined with its linear complexity and asynchronous communication patterns, makes it a promising algorithm for exascale systems. In this paper, we discuss the challenges for FMM on current parallel computers and future exascale architectures, with a focus on internode communication. We focus on the communication part only; the efficiency of the computational kernels is beyond the scope of the present study. We develop a performance model that considers the communication patterns of the FMM and observe a good match between our model and the actual communication time on four high-performance computing (HPC) systems, when latency, bandwidth, network topology, and multicore penalties are all taken into account. To our knowledge, this is the first formal characterization of internode communication in FMM that validates the model against actual measurements of communication time. The ultimate communication model is predictive in an absolute sense; however, on complex systems, this objective is often out of reach or of a difficulty out of proportion to its benefit when there exists a simpler model that is inexpensive and sufficient to guide coding decisions leading to improved scaling. The current model provides such guidance.
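To make the ingredients named above (latency, bandwidth, and topology) concrete, here is a generic latency-bandwidth cost estimate in Python. It is a textbook-style sketch under stated assumptions, not the model developed in the paper: the function names, the per-hop term, and the example numbers are all hypothetical, and multicore penalties are not modeled.

```python
def message_time(nbytes, latency, bandwidth, hops=1, hop_latency=0.0):
    """Estimated time to send one message.

    latency      fixed per-message startup cost in seconds
    bandwidth    sustained link bandwidth in bytes per second
    hops         number of network hops (a crude stand-in for topology)
    hop_latency  extra delay per hop in seconds (assumed parameter)
    """
    return latency + hops * hop_latency + nbytes / bandwidth


def exchange_time(messages, latency, bandwidth, hop_latency=0.0):
    """Total time for a list of (nbytes, hops) messages sent one after
    another by a single process, with no overlap assumed."""
    return sum(
        message_time(nbytes, latency, bandwidth, hops, hop_latency)
        for nbytes, hops in messages
    )


# Example with made-up numbers: 1 microsecond latency, 1 GB/s links.
single = message_time(1000, latency=1e-6, bandwidth=1e9)
total = exchange_time([(1000, 1), (4000, 2)], latency=1e-6,
                      bandwidth=1e9, hop_latency=1e-7)
```

Even this simple form shows why message granularity matters: for small messages the fixed latency term dominates, while for large ones the bandwidth term does.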
A Generic Buffer Occupancy Expression for Stop-and-Wait Hybrid Automatic Repeat Request Protocol over Unstable Channels Darabkh, K., Ibeid, H., Jafar, I., and Al-Zubi, R. Telecommunication Systems
Recent rapid progress in wireless networks and mobile communications has made the constraints on the underlying links increasingly apparent. Wireless links are characterized by limited bandwidth and high latencies. Moreover, the bit-error rate (BER) is very high in such environments for various reasons, including weather conditions, cross-link interference, and mobility. High BER corrupts the data being transmitted over these channels. Convolutional encoding was therefore introduced as an effective means of communication over noisy environments. Sequential decoding, a class of decoding algorithms for convolutional codes, represents an efficient error detection and correction mechanism that has attracted the attention of many researchers because its complexity depends on the channel condition. In this paper, we propose a new queuing study of networking systems that use sequential decoders, in which flow and error control follow the stop-and-wait hybrid automatic repeat request protocol. This study extends our prior work, in which the lowest decoding complexity was fixed and did not account for the channel state. In other words, our proposed closed-form expression for the average buffer occupancy is fully generic: it is parameterized not only by the channel condition and incoming packet rate, but also by parameters that adapt automatically to the channel conditions, namely the lower and upper decoding limits.
Fast Multipole Method as a Matrix-Free Hierarchical Low-Rank Approximation Yokota, R., Ibeid, H., and Keyes, D. International Workshop on Eigenvalue Problems: Algorithms, Software and Applications in Petascale Computing (EPASA)
There has been a large increase in the amount of work on hierarchical low-rank approximation methods, where the interest is shared by multiple communities that previously did not intersect. The objective of this article is two-fold: to provide a thorough review of the recent advancements in this field from both analytical and algebraic perspectives, and to present a comparative benchmark of two highly optimized implementations of contrasting methods for some simple yet representative test cases. The first half of this paper takes the form of a survey, to achieve the former objective. We categorize the recent advances in this field from the perspective of compute-memory tradeoff, which has not been considered in much detail in this area. Benchmark tests reveal that there is a large difference in the memory consumption and performance between the different methods.
Petascale Molecular Dynamics Simulation using the Fast Multipole Method on K Computer Ohno, Y., Yokota, R., Koyama, H., Morimoto, G., Hasegawa, A., Masumoto, G., Okimoto, N., Hirano, Y., Ibeid, H., Narumi, T., and Taiji, M. Computer Physics Communications
In this paper, we report all-atom simulations of molecular crowding, a result of the full-node simulation on the “K computer”, a 10-PFLOPS supercomputer in Japan. The capability of this machine enables us to simulate crowded cellular environments, which are more realistic than conventional MD simulations in which proteins are simulated in isolation. Living cells are “crowded” because macromolecules comprise ∼30% of their molecular weight. Recently, the effects of crowded cellular environments on protein stability have been revealed through in-cell NMR spectroscopy. To measure the performance of the “K computer”, we performed all-atom classical molecular dynamics simulations of two systems: target proteins in a solvent, and target proteins in an environment of molecular crowders that mimic the conditions of a living cell. Using the full system, we achieved 4.4 PFLOPS during a 520-million-atom simulation with a cutoff of 28 Å. Furthermore, we discuss the performance and scaling of fast multipole methods for molecular dynamics simulations on the “K computer”, as well as comparisons with Ewald summation methods.
Toward Accelerating the Matrix Inversion Computation of Symmetric Positive-definite Matrices on Heterogeneous GPU-based Systems Ibeid, H., Kaushik, D., Keyes, D., and Ltaief, H. IEEE International Conference on High Performance Computing (HiPC)
The goal of this paper is to implement an efficient matrix inversion of symmetric positive-definite matrices on heterogeneous GPU-based systems. The matrix inversion procedure can be split into three stages: computing the Cholesky factorization, inverting the Cholesky factor, and calculating the product of the inverted Cholesky factor with its transpose to get the final inverted matrix. Using a high-performance data layout, which represents the matrix in the system memory with an optimized cache-aware format, the computation of the three stages is decomposed into fine-grained computational tasks. The data flow programming model can then be represented as a directed acyclic graph, where nodes represent tasks and edges the dependencies between them. Standard implementations of matrix inversions as well as other numerical algorithms (e.g., linear and eigenvalue solvers), available in state-of-the-art numerical libraries (e.g., LAPACK), rely on the expensive fork-join paradigm to achieve parallel performance and are characterized by artifactual synchronization points, which have to be removed to fully exploit the underlying hardware capabilities. Our tile algorithmic approach allows us to remove those bottlenecks and to flawlessly execute the tasks as soon as the data dependencies are satisfied. A hybrid runtime environment system becomes paramount to dynamically schedule the numerical kernels on the available processing units, whether it is a hardware accelerator (i.e., GPU) or a homogeneous multicore (i.e., x86), and this is carried out transparently to the user. Preliminary results are shown on a dual-socket quad-core Intel Xeon 2.67 GHz workstation with two NVIDIA Fermi C2070 GPU cards.
Our implementation (448 Gflop/s) results in up to 5- and 6-fold improvement compared to the equivalent routines from MAGMA V1.0 and PLASMA V2.4, respectively, and 10-fold improvement compared to LAPACK V3.2 linked with multithreaded Intel MKL BLAS V10.2, with a matrix size of 24960 × 24960.
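The three stages named in the abstract above (Cholesky factorization, inversion of the triangular factor, and the product of the inverted factor with its transpose) can be sketched sequentially in plain Python. This is only a small dense reference version for illustration, with hypothetical function names; it contains none of the tiling, task scheduling, or GPU offload that the paper is about.

```python
import math


def cholesky(a):
    """Stage 1: lower-triangular L with A = L * L^T."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        s = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        if s <= 0.0:
            raise ValueError("matrix is not positive definite")
        l[j][j] = math.sqrt(s)
        for i in range(j + 1, n):
            t = sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = (a[i][j] - t) / l[j][j]
    return l


def invert_lower(l):
    """Stage 2: inverse of a lower-triangular matrix by forward
    substitution, one column at a time."""
    n = len(l)
    inv = [[0.0] * n for _ in range(n)]
    for j in range(n):
        inv[j][j] = 1.0 / l[j][j]
        for i in range(j + 1, n):
            t = sum(l[i][k] * inv[k][j] for k in range(j, i))
            inv[i][j] = -t / l[i][i]
    return inv


def spd_inverse(a):
    """Stage 3: A^{-1} = (L^{-1})^T * L^{-1}."""
    li = invert_lower(cholesky(a))
    n = len(a)
    # li[k][i] is zero for k < i, so the sum can start at max(i, j).
    return [
        [sum(li[k][i] * li[k][j] for k in range(max(i, j), n))
         for j in range(n)]
        for i in range(n)
    ]


A = [[4.0, 2.0], [2.0, 3.0]]
A_inv = spd_inverse(A)
```

Each stage only touches triangular data, which is what makes it natural to split the work into small tile-level tasks whose dependencies form a directed acyclic graph.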
Fast Multipole-Based Elliptic PDE Solver and Preconditioner.
Exascale systems are predicted to have approximately one billion cores, assuming gigahertz cores. Limitations on affordable network topologies for distributed memory systems of such massive scale bring new challenges to the currently dominant parallel programming model. Currently, there are many efforts to evaluate the hardware and software bottlenecks of exascale designs. It is therefore of interest to model application performance and to understand what changes need to be made to ensure extrapolated scalability. Fast multipole methods (FMM) were originally developed for accelerating N-body problems for particle-based methods in astrophysics and molecular dynamics. FMM is more than an N-body solver, however. Recent efforts to view the FMM as an elliptic PDE solver have opened the possibility of using it as a preconditioner for an even broader range of applications. In this thesis, we (i) discuss the challenges for FMM on current parallel computers and future exascale architectures, with a focus on inter-node communication, and develop a performance model that considers the communication patterns of the FMM for spatially quasi-uniform distributions, (ii) employ this performance model to guide performance and scaling improvement of FMM for all-atom molecular dynamics simulations of uniformly distributed particles, and (iii) demonstrate that, beyond its traditional use as a solver in problems for which explicit free-space kernel representations are available, the FMM has applicability as a preconditioner in finite domain elliptic boundary value problems, by equipping it with boundary integral capability for satisfying conditions at finite boundaries and by wrapping it in a Krylov method for extensibility to more general operators.
Compared with multilevel methods, FMM is capable of comparable algebraic convergence rates down to the truncation error of the discretized PDE, and it has superior multicore and distributed memory scalability properties on commodity architecture supercomputers. Compared with other methods exploiting the low-rank character of off-diagonal blocks of the dense resolvent operator, FMM-preconditioned Krylov iteration may reduce the amount of communication because it is matrix-free and exploits the tree structure of FMM. Fast multipole-based solvers and preconditioners are demonstrably poised to play a leading role in exascale computing.