Empirical Autotuning of Two-level Parallel Linear Algebra Routines on Large cc-NUMA Systems
In large cc-NUMA systems, making efficient use of the different levels of the memory hierarchy is not an easy task, and the performance of multithreaded implementations of linear algebra libraries degrades as the number of cores used increases, producing a significant loss of efficiency. To alleviate this problem, routines with multilevel parallelism can be developed by combining OpenMP and BLAS parallelism. In this way, higher performance can be achieved, but an autotuning technique is needed to select the appropriate number of threads to use at each level. The selection can be made through theoretical models of the execution time or through some installation methodology. This…