Multiple-precision matrix-vector multiplication on graphics processing units


We consider a parallel implementation of matrix-vector multiplication (GEMV, Level 2 of the BLAS) for graphics processing units (GPUs) using multiple-precision arithmetic based on the residue number system. In our GEMV implementation, element-wise operations on multiple-precision vectors and matrices are split into several parts, each computed by a separate CUDA kernel. This design eliminates branch divergence when executing the sequential parts of multiple-precision operations and allows full utilization of the GPU's resources. An efficient data structure for storing arrays with multiple-precision entries provides a coalesced access pattern to GPU global memory. We perform a rounding error analysis and derive error bounds for the proposed GEMV implementation. Experimental results show the high efficiency of the proposed solution compared to existing high-precision packages for GPUs.
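The key property of the residue number system (RNS) exploited here is that addition and multiplication decompose into independent, branch-free channels, one per modulus, which is what makes the per-kernel split of GEMV natural. The following minimal sketch illustrates the idea; the small moduli set, the `to_rns`/`crt` helpers, and the list-based layout are illustrative assumptions, not the paper's actual moduli or data structures.

```python
from functools import reduce

# Illustrative pairwise-coprime moduli; real RNS arithmetic uses a
# larger set sized to cover the required dynamic range.
MODULI = (1021, 1019, 1013, 1009)
M = reduce(lambda a, b: a * b, MODULI)  # dynamic range [0, M)

def to_rns(x):
    """Represent an integer 0 <= x < M by its residues modulo each modulus."""
    return tuple(x % m for m in MODULI)

def crt(residues):
    """Reconstruct the integer from its residues (Chinese remainder theorem)."""
    x = 0
    for r, m in zip(residues, MODULI):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)  # pow(..., -1, m): modular inverse
    return x % M

def rns_gemv(A, v):
    """y = A*v, computed independently in each modulus channel.

    A: list of rows, each a list of RNS tuples; v: list of RNS tuples.
    Each channel k only ever touches index k -- no cross-channel carries,
    hence no data-dependent branching inside the inner loop.
    """
    y = []
    for row in A:
        acc = [0] * len(MODULI)
        for a, x in zip(row, v):
            for k, m in enumerate(MODULI):
                acc[k] = (acc[k] + a[k] * x[k]) % m
        y.append(tuple(acc))
    return y
```

For example, multiplying the 2x2 matrix [[1, 2], [3, 4]] by the vector [5, 6] in RNS form and reconstructing with `crt` yields [17, 39]. On a GPU, each modulus channel (and the final reconstruction step) would map to its own kernel, as the abstract describes.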


Multiple-precision computations, BLAS, GEMV, parallel algorithms, CUDA, GPU, residue number system

Short address: https://sciup.org/143172953

IDR: 143172953   |   DOI: 10.25209/2079-3316-2020-11-3-61-84

References

  • M. Courbariaux, Y. Bengio, J. David. Training deep neural networks with low precision multiplications, 2014.
  • D. H. Bailey, J. M. Borwein. “High-precision arithmetic in mathematical physics”, Mathematics, 3:2 (2015), pp. 337-367. DOI: 10.3390/math3020337
  • J. Daněk, J. Pospíšil. “Numerical aspects of integration in semi-closed option pricing formulas for stochastic volatility jump diffusion models”, International Journal of Computer Mathematics, 97:6 (2020), pp. 1268-1292. DOI: 10.1080/00207160.2019.1614174
  • Y. Feng, J. Chen, W. Wu. “The PSLQ algorithm for empirical data”, Math. Comp., 88:317 (2019), pp. 1479-1501. DOI: 10.1090/mcom/3356
  • S. Leweke, E. von Lieres. “Fast arbitrary order moments and arbitrary precision solution of the general rate model of column liquid chromatography with linear isotherm”, Comput. Chem. Eng., 84 (2016), pp. 350-362. DOI: 10.1016/j.compchemeng.2015.09.009
  • M. Kyung, E. Sacks, V. Milenkovic. “Robust polyhedral Minkowski sums with GPU implementation”, Comput. Aided Des., 67-68 (2015), pp. 48-57. DOI: 10.1016/j.cad.2015.04.012
  • B. Pan, Y. Wang, S. Tian. “A high-precision single shooting method for solving hypersensitive optimal control problems”, Mathematical Problems in Engineering, 2018 (2018), 7908378, 11 pp. DOI: 10.1155/2018/7908378
  • Y. Xuan, D. Li, W. Han. “Efficient optimization approach for fast GPU computation of Zernike moments”, Journal of Parallel and Distributed Computing, 111 (2018), pp. 104-114. DOI: 10.1016/j.jpdc.2017.07.008
  • C. L. Lawson, R. J. Hanson, D. R. Kincaid, F. T. Krogh. “Basic linear algebra subprograms for Fortran usage”, ACM Trans. Math. Softw., 5:3 (1979), pp. 308-323. DOI: 10.1145/355841.355847
  • R. Nath, S. Tomov, T. Dong, J. Dongarra. “Optimizing symmetric dense matrix-vector multiplication on GPUs”, ACM, New York, NY, USA, 2011, pp. 1-10. DOI: 10.1145/2063384.2063392
  • K. Isupov, V. Knyazkov, A. Kuvaev. “Design and implementation of multiple-precision BLAS Level 1 functions for graphics processing units”, Journal of Parallel and Distributed Computing, 140 (2020), pp. 25-36. DOI: 10.1016/j.jpdc.2020.02.006
  • A. Omondi, B. Premkumar. Residue number systems: theory and implementation, Imperial College Press, London, UK, 2007.
  • K. Bigou, A. Tisserand. “Single base modular multiplication for efficient hardware RNS implementations of ECC”, Cryptographic Hardware and Embedded Systems — CHES 2015, eds. T. Güneysu, H. Handschuh, Springer Berlin Heidelberg, Berlin, Heidelberg, 2015, pp. 123-140.
  • A. Abdelfattah, D. Keyes, H. Ltaief. “KBLAS: an optimized library for dense matrix-vector multiplication on GPU accelerators”, ACM Trans. Math. Softw., 42:3 (2016), 18. DOI: 10.1145/2818311
  • G. He, J. Gao, J. Wang. “Efficient dense matrix-vector multiplication on GPU”, Concurrency and Computation: Practice and Experience, 30:19 (2018), e4705. DOI: 10.1002/cpe.4705
  • T. Inoue, H. Tokura, K. Nakano, Y. Ito. “Efficient triangular matrix vector multiplication on the GPU”, Lecture Notes in Computer Science, vol. 12043, eds. R. Wyrzykowski, E. Deelman, J. Dongarra, K. Karczewski, Springer International Publishing, Cham, 2020, pp. 493-504. DOI: 10.1007/978-3-030-43229-4_42
  • Quadruple precision BLAS routines for GPU: QPBLAS-GPU ver.1.0. User's manual, 2013, 58 pp. (accessed 19 May 2019).
  • R. Iakymchuk, S. Collange, D. Defour, S. Graillat. “ExBLAS: reproducible and accurate BLAS library”, Numerical Reproducibility at Exascale (NRE2015) workshop held as part of the Supercomputing Conference (SC15) (November 20, 2015, Austin, TX, USA) URL https://hal.archives-ouvertes.fr/hal-01202396/file/exblas.pdf.
  • D. Mukunoki, T. Ogita. “Performance and energy consumption of accurate and mixed-precision linear algebra kernels on GPUs”, Journal of Computational and Applied Mathematics, 372 (2020), 112701. DOI: 10.1016/j.cam.2019.112701
  • Y. Hida, X. S. Li, D. H. Bailey. “Algorithms for quad-double precision floating point arithmetic”, ARITH-15 (11-13 June 2001, Vail, CO, USA), pp. 155-162. DOI: 10.1109/ARITH.2001.930115
  • D. E. Knuth. The art of computer programming, vol. 2: Seminumerical algorithms, 3rd ed., Addison-Wesley Longman Publishing Co., Inc., USA, 1997. ISBN: 978-0-201-89684-8
  • J. R. Shewchuk. “Adaptive precision floating-point arithmetic and fast robust geometric predicates”, Discrete & Computational Geometry, 18:3 (1997), pp. 305-363. DOI: 10.1007/PL00009321
  • T. Ogita, S. M. Rump, S. Oishi. “Accurate sum and dot product”, SIAM J. Sci. Comput., 26:6 (2005), pp. 1955-1988. DOI: 10.1137/030601818
  • M. Lu, B. He, Q. Luo. “Supporting extended precision on graphics processors”, DaMoN'10: Proceedings of the Sixth International Workshop on Data Management on New Hardware (2010, Indianapolis, Indiana, USA), pp. 19-26. DOI: 10.1145/1869389.1869392
  • M. Joldes, J. Muller, V. Popescu. “Implementation and performance evaluation of an extended precision floating-point arithmetic library for high-accuracy semidefinite programming”, 2017 IEEE 24th Symposium on Computer Arithmetic (ARITH) (24-26 July 2017, London, UK), pp. 27-34. DOI: 10.1109/ARITH.2017.18
  • T. Nakayama, D. Takahashi. “Implementation of multiple-precision floating-point arithmetic library for GPU computing”, The 23rd IASTED International Conference on Parallel and Distributed Computing and Systems PDCS 2011 (December 14-16, 2011, Dallas, USA), pp. 343-349. DOI: 10.2316/P.2011.757-041
  • K. Isupov. “Using floating-point intervals for non-modular computations in residue number system”, IEEE Access, 8 (2020), pp. 58603-58619. DOI: 10.1109/ACCESS.2020.2982365
  • N. J. Higham. Accuracy and stability of numerical algorithms, 2nd ed., SIAM, Philadelphia, PA, USA, 2002, xxvii+663 pp. DOI: 10.1137/1.9780898718027 ISBN: 978-0-89871-521-7
  • J. Muller, N. Brunie, F. de Dinechin, C. Jeannerod, M. Joldes, V. Lefèvre, G. Melquiond, N. Revol, S. Torres. Handbook of floating-point arithmetic, 2nd ed., Birkhäuser, Basel, 2018. DOI: 10.1007/978-3-319-76526-6 ISBN: 978-3-319-76525-9
Research article