Optimization Methods for Generalized Tensor Contractions


Tensor contraction is one of the core operations of tensor calculus, a branch of mathematics that has become the principal language for describing the fundamental laws of fields such as relativity theory, mechanics, electrodynamics, and solid-state physics. Efficient execution of tensor contraction and its generalizations is of substantial practical importance in areas such as mathematical physics, machine learning, spectral methods, quantum chemistry, data mining, and high-performance computing on multiprocessor systems. Over the last twenty years, the number of optimization methods for tensor contractions has grown considerably and continues to grow. This article surveys the approaches to optimizing tensor contraction that are actively used for solving applied problems on single-processor systems and on multiprocessor systems with distributed memory. The article also covers optimization methods for important special cases of tensor contraction, matrix-matrix and matrix-vector multiplication, on which most tensor contraction optimizations are based. The described optimizations can be applied during program compilation by production compilers. The presented information may help systematize already existing knowledge in this field.
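To make the survey's central object concrete, the following sketch (illustrative shapes and index names, not taken from the article) shows a generalized tensor contraction and its reduction to a single matrix multiplication, the GEMM-based strategy on which, as the abstract notes, most contraction optimizations are built:

```python
import numpy as np

# A generalized tensor contraction: C[i, j, k] = sum_l A[i, j, l] * B[l, k].
rng = np.random.default_rng(0)
A = rng.random((3, 4, 5))   # free indices i, j; contracted index l
B = rng.random((5, 6))      # contracted index l; free index k

# Direct evaluation of the contraction.
C_direct = np.einsum('ijl,lk->ijk', A, B)

# The same contraction expressed as one matrix product (GEMM):
# flatten the free indices (i, j) of A into a single row index,
# multiply, then restore the original multi-index shape.
C_gemm = (A.reshape(3 * 4, 5) @ B).reshape(3, 4, 6)

assert np.allclose(C_direct, C_gemm)
```

Because the flattened form is an ordinary matrix product, any optimized GEMM implementation (blocking, vectorization, parallelization) immediately applies to the contraction.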


Tensor contraction, linear algebra, high-performance computing

Short address: https://sciup.org/147234270

IDR: 147234270   |   DOI: 10.14529/cmse200202

References for "Optimization Methods for Generalized Tensor Contractions"

  • Korenev G.V. Tensor Calculus. Moscow: MIPT Publishing, 2000. 240 p.
  • TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. URL: https://www.tensorflow.org/ (accessed: 02.02.2020).
  • Pekurovsky D. P3DFFT: A Framework for Parallel Computations of Fourier Transforms in Three Dimensions // SIAM Journal on Scientific Computing. 2012. Vol. 34, no. 4. P. 192-209. DOI: 10.1137/11082748X.
  • Deville M.O., Fischer P.F., Mund E.H. High-Order Methods for Incompressible Fluid Flow. Cambridge University Press, 2002. 489 p.
  • Jayachandran S., Venkatachalam T. A Secure Scheme for Privacy Preserving Data Mining Using Matrix Encoding // Australian Journal of Basic and Applied Sciences (AJBAS). 2015. URL: http://www.ajbasweb.com/old/ajbas/2015/July/741-744.pdf (accessed: 17.03.2020).
  • Cong J., Xiao B. Minimizing Computation in Convolutional Neural Networks // Artificial Neural Networks and Machine Learning - ICANN. Springer International Publishing, Cham. 2014. P. 281-290. DOI: 10.1007/978-3-319-11179-7_36.
  • Watson M., Olivares-Amaya R., Edgar R.G., Aspuru-Guzik A. Accelerating Correlated Quantum Chemistry Calculations Using Graphical Processing Units // Computing in Science Engineering. 2010. Vol. 12, no. 4. P. 40-51. DOI: 10.1021/ct900543q.
  • Akimova E.N., Gareev R.A. Analytical modeling of matrix-vector multiplication on multicore processors // Bulletin of the South Ural State University. Series: Computational Mathematics and Software Engineering. 2020. Vol. 9, no. 1. P. 69-82. DOI: 10.14529/cmse200105.
  • Goto K., Geijn R. Anatomy of High-performance Matrix Multiplication // ACM Transactions on Mathematical Software. 2008. Vol. 34, no. 3. P. 1-25. DOI: 10.1145/1356052.1356053.
  • Yu J., Lukefahr A., Palframan D., Dasika G., Das R., Mahlke S. Scalpel: Customizing DNN pruning to the underlying hardware parallelism // 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (Ontario, Canada, June, 24-28, 2017). New York, ACM. 2017. P. 548-560. DOI: 10.1145/3079856.3080215.
  • Yang X., Parthasarathy S., Sadayappan P. Fast Sparse Matrix-Vector Multiplication on GPUs: Implications for Graph Mining // Proceedings of the VLDB Endowment. 2011. Vol. 4, no. 4. P. 231-242. DOI: 10.14778/1938545.1938548.
  • Kaushik D., Gropp W., Minkoff M., Smith B. Improving the Performance of Tensor Matrix Vector Multiplication in Cumulative Reaction Probability Based Quantum Chemistry Codes // High Performance Computing - HiPC 2008 (Bangalore, India, December, 17-20, 2008). Heidelberg, Springer-Verlag. 2008. P. 120-130. DOI: 10.1007/978-3-540-89894-8_14.
  • Lawson C.L., Hanson R.J., Kincaid D.R., Krogh F.T. Basic Linear Algebra Subprograms for Fortran Usage // ACM Transactions on Mathematical Software. 1979. Vol. 5. no. 3. P. 308-323. DOI: 10.1145/355841.355847.
  • Dongarra J.J., Du Croz J., Hammarling S., Hanson R.J. An Extended Set of FORTRAN Basic Linear Algebra Subprograms // ACM Transactions on Mathematical Software. 1988. Vol. 14, no. 1. P. 1-17. DOI: 10.1145/42288.42291.
  • Dongarra J.J., Du Croz J., Hammarling S., Duff I.S. A Set of Level 3 Basic Linear Algebra Subprograms // ACM Transactions on Mathematical Software. 1990. Vol. 16, no. 1. P. 1-17. DOI: 10.1145/77626.79170.
  • Sedukhin S.G., Paprzycki M. Generalizing Matrix Multiplication for Efficient Computations on Modern Computers // Proceedings of the 9th International Conference on Parallel Processing and Applied Mathematics - Volume Part I (Reykjavik, Iceland, September, 9-13, 2006). Heidelberg, Springer-Verlag. 2006. P. 225-234. DOI: 10.1007/978-3-642-31464-3_23.
  • Low T.M., Igual F.D., Smith T.M., Quintana-Orti E.S. Analytical Modeling Is Enough for High-Performance BLIS // ACM Transactions on Mathematical Software. 2016. Vol. 43, no. 2. P. 12:1-12:18. DOI: 10.1145/2925987.
  • Gates M., Luszczek P., Abdelfattah A., Kurzak J., Dongarra J., Arturov K., Cecka C., Freitag C. C++ API for BLAS and LAPACK. SLATE Working Notes. 2017. Vol. 2. P. 1-45. URL: https://www.icl.utk.edu/files/publications/2017/icl-utk-1031-2017.pdf (accessed: 17.03.2020).
  • Goto K., Van De Geijn R. High-performance Implementation of the Level-3 BLAS // ACM Transactions on Mathematical Software. 2008. Vol. 35, no. 1. P. 4:1-4:14. DOI: 10.1145/1377603.1377607.
  • Xianyi Z., Qian W., Yunquan Z. Model-driven level 3 BLAS performance optimization on Loongson 3A processor // 2012 IEEE 18th International Conference on Parallel and Distributed Systems (Singapore, December, 17-19, 2012). Massachusetts Ave. 2012. P. 684-691. DOI: 10.1109/ICPADS.2012.97.
  • Van Zee F., Van De Geijn R. BLIS: A Framework for Rapidly Instantiating BLAS Functionality // ACM Transactions on Mathematical Software. 2015. Vol. 41, no. 3. P. 14:1-14:33. DOI: 10.1145/2764454.
  • Intel Math Kernel Library (Intel MKL). URL: https://software.intel.com/ru-ru/intel-mkl/?cid=sem43700011401059448&intel_term=intel+mkl&gclid=CIjbtvqaqM8CFSoNcwodDPUAbw&gclsrc=aw.ds (accessed: 02.02.2020).
  • IBM Engineering and Scientific Subroutine Library. URL: https://www.ibm.com/support/knowledgecenter/SSFHY8_6.1/reference/essl_reference_pdf.pdf (accessed: 02.02.2020).
  • ARM Performance Libraries Reference Manual. URL: https://ds.arm.com/media/uploads/arm_performance_libraries_reference_manual_arm-ecm-0493690.pdf (accessed: 02.02.2020).
  • Whaley R.C., Dongarra J.J. Automatically Tuned Linear Algebra Software // Proceedings of the 1998 ACM/IEEE Conference on Supercomputing (Montreal, Canada, November, 7, 1998). New York, ACM. 1998. P. 1-27. DOI: 10.5555/509058.509096.
  • Bilmes J., Asanovic K., Chin C., Demmel J. Optimizing Matrix Multiply Using PHiPAC: A Portable, High-performance, ANSI C Coding Methodology // Proceedings of the 11th International Conference on Supercomputing (Vienna, Austria, July, 7-11, 1997). New York, ACM. 1997. P. 340-347. DOI: 10.1145/2591635.2667174.
  • Su X., Liao X., Xue J. Automatic Generation of Fast BLAS3-GEMM: A Portable Compiler Approach // Proceedings of the 2017 International Symposium on Code Generation and Optimization (Austin, TX, USA, February, 4-8, 2017). IEEE Press. 2017. P. 122-133. DOI: 10.1109/CGO.2017.7863734.
  • Gareev R., Grosser T., Michael K. High-Performance Generalized Tensor Operations: A Compiler-Oriented Approach // ACM Transactions on Architecture and Code Optimization. 2018. Vol. 15, no. 3. P. 34:1-34:27. DOI: 10.1145/3235029.
  • Spampinato D.G., Markus P. A Basic Linear Algebra Compiler // International Symposium on Code Generation and Optimization (Orlando, FL, USA, February, 15-19, 2014). New York, ACM. 2014. P. 23-32. DOI: 10.1145/2581122.2544155.
  • Belter G., Jessup E.R., Karlin I., Siek J.G. Automating the Generation of Composed Linear Algebra Kernels // Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (Portland, OR, USA, November, 14-20, 2009). New York, ACM. 2009. P. 59:1-59:12. DOI: 10.1145/1654059.1654119.
  • Wang Q., Zhang X., Zhang Y., Yi Q. AUGEM: Automatically Generate High Performance Dense Linear Algebra Kernels on x86 CPUs // Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (Denver, Colorado, USA, November, 17-22, 2013). New York, ACM. 2013. P. 25:1-25:12. DOI: 10.1145/2503210.2503219.
  • Yi Q., Wang Q., Cui H. Specializing Compiler Optimizations Through Programmable Composition for Dense Matrix Computations // Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (Cambridge, UK, December, 13-17, 2014). Washington, IEEE Computer Society. 2014. P. 596-608. DOI: 10.1109/MICRO.2014.14.
  • Knijnenburg P.M., Kisuki T., O’Boyle M.F.P. Iterative Compilation // Embedded Processor Design Challenges: Systems, Architectures, Modeling, and Simulation (SAMOS). Springer Berlin Heidelberg, 2002. P. 171-187. DOI: 10.1007/3-540-45874-3_10.
  • Yotov K., Xiaoming L., Gang R., Garzaran M.J.S., Padua D., Pingali K., Stodghill P. Is Search Really Necessary to Generate High-Performance BLAS? // Proceedings of the IEEE. 2005. Vol. 93, no. 2. P. 358-386. DOI: 10.1109/JPROC.2004.840444.
  • Akimova E.N., Gareev R.A. Algorithm of Automatic Parallelization of Generalized Matrix Multiplication // CEUR Workshop Proceedings. 2017. Vol. 2005. P. 1-10.
  • Demmel J. Communication Avoiding Algorithms // Proceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis (Salt Lake City, UT, USA, November, 10-16, 2012). Massachusetts Ave., IEEE Computer Society. 2012. P. 1942-2000. DOI: 10.1109/SC.Companion.2012.351.
  • Ballard G., Carson E., Demmel J., Hoemmen M., Knight N., Schwartz O. Communication lower bounds and optimal algorithms for numerical linear algebra // Acta Numerica. 2014. Vol. 23. P. 1-155. DOI: 10.1017/S0962492914000038.
  • Frigo M., Leiserson C.E., Prokop H., Ramachandran S. Cache-Oblivious Algorithms // Proceedings of the 40th Annual Symposium on Foundations of Computer Science (New York City, New York, USA, October, 17-19, 1999). Massachusetts Ave., IEEE Computer Society. 1999. P. 285-298. DOI: 10.5555/795665.796479.
  • Yotov K., Roeder T., Pingali K., Gunnels J., Gustavson F. An Experimental Comparison of Cache-oblivious and Cache-conscious Programs // Proceedings of the Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures (San Diego, California, USA, June, 9-11, 2007). New York, ACM. 2007. P. 93-104. DOI: 10.1145/1248377.1248394.
  • Liang J., Zhang Y. Optimization of GEMV on Intel AVX processor // International Journal of Database Theory and Application. 2016. Vol. 9. P. 47-60. DOI: 10.14257/ijdta.2016.9.2.06.
  • Hassan S.A., Mahmoud M.M.M., Hemeida A.M., Saber M.A. Effective Implementation of Matrix Vector Multiplication on Intel's AVX Multicore Processor // Computer Languages, Systems & Structures. 2018. Vol. 51. P. 158-175. DOI: 10.1016/j.cl.2017.06.003.
  • Akimova E.N., Gareev R.A., Misilov V.E. Analytical Modeling of Matrix-Vector Multiplication on Multicore Processors: Solving Inverse Gravimetry Problem // 2019 International Multi-Conference on Engineering, Computer and Information Sciences (Novosibirsk, Russia, October, 21-27, 2019). Massachusetts, IEEE Xplore Digital Library. 2019. P. 0823-0827. DOI: 10.1109/SIBIRCON48586.2019.8958103.
  • Heinecke A., Pabst H., Henry G. LIBXSMM: A high performance library for small matrix multiplications // Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (Austin, TX, USA, November, 15-20, 2015). New York, ACM. 2015. P. 1-1. DOI: 10.1109/SC.2016.83.
  • Napoli E.D., Fabregat-Traver D., Quintana-Ortí G., Bientinesi P. Towards an efficient use of the BLAS library for multilinear tensor contractions // Applied Mathematics and Computation. 2014. Vol. 235. P. 454-468. DOI: 10.1016/j.amc.2014.02.051.
  • Matthews D. High-Performance Tensor Contraction without BLAS // SIAM Journal on Scientific Computing. 2018. Vol. 40, no. 1. P. 1-24. DOI: 10.1137/16ml08968x.
  • Springer P., Bientinesi P. Design of a High-Performance GEMM-like Tensor-Tensor Multiplication // ACM Transactions on Mathematical Software. 2018. Vol. 44, no. 3. P. 28:1-28:29. DOI: 10.1145/3157733.
  • Hirata S. Tensor Contraction Engine: Abstraction and Automated Parallel Implementation of Configuration-Interaction, Coupled-Cluster, and Many-Body Perturbation Theories // The Journal of Physical Chemistry A. 2003. Vol. 107, no. 46. P. 9887-9897. DOI: 10.1021/jp034596z.
  • Bader B.W., Kolda G.T. Algorithm 862: MATLAB tensor classes for fast algorithm prototyping // ACM Transactions on Mathematical Software. 2006. Vol. 32. P. 635-653. DOI: 10.1145/1186785.1186794.
  • Walt S., Colbert C., Varoquaux G. The NumPy Array: A Structure for Efficient Numerical Computation // Computing in Science & Engineering. 2011. Vol. 13. P. 22-30. DOI: 10.1109/MCSE.2011.37.
  • Eigen v3. URL: http://eigen.tuxfamily.org (accessed: 02.02.2020).
  • Solomonik E., Matthews D., Hammond J., Stanton J.F., Demmel J. A massively parallel tensor contraction framework for coupled-cluster computations // Journal of Parallel and Distributed Computing. 2014. Vol. 74, no. 12. P. 3176-3190. DOI: 10.1016/j.jpdc.2014.06.002.
  • Epifanovsky E., Wormit M., Ku T., Landau A., Zuev D., Khistyaev K., Manohar P., Kaliman I., Dreuw A., Krylov A. New implementation of high-level correlated methods using a general block tensor library for high-performance electronic structure calculations // Journal of computational chemistry. 2013. Vol. 34. P. 2293-2309. DOI: 10.1002/jcc.23377.
  • Calvin J., Lewis C.A., Valeev E.F. Scalable Task-Based Algorithm for Multiplication of Block-Rank-Sparse Matrices // Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms (Austin, Texas, USA, November, 15, 2015). New York, ACM. 2015. P. 4:1-4:8. DOI: 10.1145/2833179.2833186.
  • Calvin J., Valeev E.F. Task-Based Algorithm for Matrix Multiplication: A Step Towards Block-Sparse Tensor Computing // Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms (Austin, Texas, USA, November, 15, 2015). New York, ACM. 2015. P. 1-9.
  • Lattner C. LLVM: An Infrastructure for Multi-Stage Optimization // Master's Thesis. 2002. URL: http://llvm.cs.uiuc.edu (accessed: 27.10.2019).
  • Grosser T., Größlinger A., Lengauer C. Polly - Performing polyhedral optimizations on a low-level intermediate representation // Parallel Processing Letters. 2012. Vol. 22, no. 4. DOI: 10.1142/S0129626412500107.
  • Aprà E., Klemm M., Kowalski K. Efficient Implementation of Many-Body Quantum Chemical Methods on the Intel Xeon Phi Coprocessor // International Conference for High Performance Computing, Networking, Storage and Analysis (New Orleans, LA, USA, November, 16-21, 2014). Massachusetts Ave., IEEE Computer Society. 2014. P. 674-684. DOI: 10.1109/SC.2014.60.
  • Stock K., Henretty T., Murugandi I., Sadayappan P., Harrison R. Model-Driven SIMD Code Generation for a Multi-resolution Tensor Kernel // Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium (New Orleans, LA, USA, May, 16-20, 2011). Massachusetts Ave., IEEE Computer Society. 2011. P. 1058-1067. DOI: 10.1109/IPDPS.2011.101.
  • Ma W., Krishnamoorthy S., Villa O., Kowalski K. GPU-Based Implementations of the Noniterative Regularized-CCSD(T) Corrections: Applications to Strongly Correlated Systems // Journal of Chemical Theory and Computation. 2011. Vol. 7, no. 5. P. 1316-1327. DOI: 10.1021/ct1007247.
Research article