留言板

尊敬的读者、作者、审稿人, 关于本刊的投稿、审稿、编辑和出版的任何问题, 您可以本页添加留言。我们将尽快给您答复。谢谢您的支持!

姓名
邮箱
手机号码
标题
留言内容
验证码

面向流体力学仿真的大型稀疏矩阵混合精度GMRES加速算法

郑森炜 寇家庆 张伟伟

郑森炜, 寇家庆, 张伟伟. 面向流体力学仿真的大型稀疏矩阵混合精度GMRES加速算法[J]. 应用数学和力学, 2025, 46(1): 40-54. doi: 10.21656/1000-0887.450167
引用本文: 郑森炜, 寇家庆, 张伟伟. 面向流体力学仿真的大型稀疏矩阵混合精度GMRES加速算法[J]. 应用数学和力学, 2025, 46(1): 40-54. doi: 10.21656/1000-0887.450167
ZHENG Senwei, KOU Jiaqing, ZHANG Weiwei. A Mixed-Precision GMRES Acceleration Algorithm for Large Sparse Matrices in Fluid Dynamics Simulation[J]. Applied Mathematics and Mechanics, 2025, 46(1): 40-54. doi: 10.21656/1000-0887.450167
Citation: ZHENG Senwei, KOU Jiaqing, ZHANG Weiwei. A Mixed-Precision GMRES Acceleration Algorithm for Large Sparse Matrices in Fluid Dynamics Simulation[J]. Applied Mathematics and Mechanics, 2025, 46(1): 40-54. doi: 10.21656/1000-0887.450167

面向流体力学仿真的大型稀疏矩阵混合精度GMRES加速算法

doi: 10.21656/1000-0887.450167
(我刊青年编委寇家庆、编委张伟伟来稿)
详细信息
    作者简介:

    郑森炜(2001—),男,硕士生(E-mail: senweiz@mail.nwpu.edu.cn)

    寇家庆(1993—),男,教授,博士生导师(E-mail: jqkou@nwpu.edu.cn)

    通讯作者:

    张伟伟(1979—),男,教授,博士生导师(通讯作者. E-mail: aeroelastic@nwpu.edu.cn)

  • 中图分类号: O35

A Mixed-Precision GMRES Acceleration Algorithm for Large Sparse Matrices in Fluid Dynamics Simulation

(Contributed by KOU Jiaqing, M.AMM Youth Editorial Board & ZHANG Weiwei, M.AMM Editorial Board)
  • 摘要: 由于计算能耗低、效率高,以单精度/半精度计算单元为主的GPU/TPU/NPU等算力已成为人工智能计算的主要模式,但无法直接应用于浮点精度需求高的微分方程求解,不能直接替代双精度算力. 通过结合单/双精度各自的优势,提出了兼顾效率和精度的大型稀疏线性方程组的混合精度求解格式. 发展了面向稀疏大型矩阵的GMRES细化迭代算法(sparse GMRES-IR). 首先分析了流体力学仿真问题中的矩阵数据分布特点,通过双精度做预处理,单精度细化迭代,使单精度计算应用于算法主要耗时部分,发挥了计算效率优势. 通过求解开源数据集提供的33个线性方程组验证了所提出方法的精度和效率. 结果表明,在单核CPU上,相同精度要求下,提出的单双混合精度算法可以实现最高2.5倍的加速效果,且在大规模矩阵下效果更突出.
    1)  (我刊青年编委寇家庆、编委张伟伟来稿)
  • 图  1  Sparse GMRES-IR流程图

    Figure  1.  The sparse GMRES-IR program flowchart

    图  2  稀疏矩阵非零元素分布

      为了解释图中的颜色,读者可以参考本文的电子网页版本,后同.

    Figure  2.  Distribution of non-zero elements in sparse matrices

    图  3  混合精度算法和双精度算法RMSE比较

    Figure  3.  Comparison of RMSE between mixed-precision and double-precision algorithms

    图  4  混合精度方法的加速效果

    Figure  4.  Acceleration ratios of mixed-precision algorithms

    图  5  算例求解时间

    Figure  5.  Computation time costs for test cases

    图  6  求解时间与加速比关系散点图

    Figure  6.  The scatter plot of the computation time vs. the acceleration ratio

    图  7  矩阵维度同求解时间关系图

    Figure  7.  The relationship between the matrix dimension and the solution time

    图  8  Pearson相关系数分析结果

    Figure  8.  Pearson correlation coefficient analysis results

    图  9  条件数、矩阵维度对加速比影响

    Figure  9.  The influences of the condition number and the matrix dimension on the acceleration ratio

    图  10  混合精度及双精度求解器的误差与条件数关系图

    Figure  10.  The error vs. the condition number for mixed-precision and double-precision solvers

    表  1  IEEE754双精度和单精度的有效位数和取值范围

    Table  1.   IEEE754 FP64/FP32 significant digit and value ranges

    floating-point type significant digit maximum positive value minimum positive normal
    FP64 16 1.80E+308 4.94E-324
    FP32 7 3.40E+38 1.40E-45
    下载: 导出CSV
    算法1  经典GMRES-IR算法  
    input: matrix An×n; vector b  
    output: solution vector of linear equations Ax = b  
    1. LU factors of A : AM = LfUf lower precision
    2. solve Mx0= b, store x0 as working precision lower precision
    3. store Lf, Uf as working precision Lw, Uw  
    4. for i=0, …, imax-1 do  
    5.        ri=bAxi higher precision
    6.    solve AM-1 Mdi+1= ri by GMRES higher precision
    7.    if ‖ ri2= ‖ bAxi2 < εtol then  
    8.        xi+1= xi+ di+1 higher precision
    9.    else  
    10.        break  
    11.    end if  
    12. end for.  
    下载: 导出CSV
    算法2  IJK格式的ILU分解
    input: sparse matrix An×n;nonzero element position P
    output: ILU factorization L
    1. for i=2, …, n do
    2. for k=1, …, i-1
    3.    if (i, k) in P
    4.        aik=aik/akk
    5.       for j=k+1, …, n
    6.            if (i, k) in P
    7.                 aij=aijaikakj
    8.            end if
    9.       end for
    10.   end if
    11. end for.
    下载: 导出CSV

    表  2  测试算例中使用的矩阵参数信息

    Table  2.   The distribution of all parameters of the matrices used in the test case

      matrix dimension non-zero elements count condition number sparsity ratio/% absolute value range
    maximum 109 460 2 374 949 5.63E12 1.979 938 3.57E10
    minimum 1 080 14 133 885.246 1 0.004 111 -1.74E-10
    下载: 导出CSV

    表  3  部分算例在预处理花费时间同总求解时间比较及精度比较

    Table  3.   Comparison of preprocessing time to total computation time and accuracy for selected test cases

    name ILU time /s double precision time /s mixed precision time /s double precision RMSE mixed precision RMSE
    Poisson3Db 0.695 55.039 40.5 1.91E-11 4.59E-12
    Venkat01 0.189 34.801 13.726 9.60E-14 3.54E-14
    Raefsky1 0.092 2.526 1.757 6.91E-12 4.62E-12
    Poisson3Da 0.09 7.855 4.426 1.85E-12 8.76E-13
    Goodwin_013 0.013 1.133 0.772 3.20E-11 2.92E-13
    shallow_water1 0.012 40.603 19.853 3.51E-13 6.14E-13
    shallow_water2 0.012 40.087 19.937 2.30E-12 1.27E-13
    下载: 导出CSV

    表  4  未收敛算例不同迭代参数下求解精度和加速比结果

    Table  4.   Solutions precisions and speedup ratios of non-convergent numerical examples with different iteration parameters

    case solver precision maximum outer iteration maximum inner iteration precision RMSE iteration CPU time/s
    case 1 double 2(restart count) 300 8.18E-12 433 1.688
    case 2 mixed 10 100 3.51E-4 1 000 2.203
    case 3 mixed 10 200 8.49E-12 1 554 5.484
    case 4 mixed 20 100 9.36E-12 1 600 3.859
    case 5 mixed 40 50 9.289 2 000 3.248
    case 6 mixed 4 500 2.415E-5 1 988 14.812 5
    下载: 导出CSV
  • [1] JIMÉNEZ J. Computing high-Reynolds-number turbulence: will simulations ever replace experiments?[J]. Journal of Turbulence, 2003, 4. DOI: 10.1088/1468-5248/4/1/022.
    [2] CHOQUETTE J, GANDHI W, GIROUX O, et al. NVIDIA A100 tensor core GPU: performance and innovation[J]. IEEE Micro, 2021, 41 (2): 29-35. doi: 10.1109/MM.2021.3061394
    [3] RAVIKUMAR A, SRIRAMAN H. A novel mixed precision distributed TPU GAN for accelerated learning curve[J]. Computer Systems Science and Engineering, 2023, 46 (1): 563-578. doi: 10.32604/csse.2023.034710
    [4] NOVITSKIY I M, KUTATELADZE A G. DU8ML: machine learning-augmented density functional theory nuclear magnetic resonance computations for high-throughput in silico solution structure validation and revision of complex alkaloids[J]. Journal of Organic Chemistry, 2022, 87 (7): 4818-4828. doi: 10.1021/acs.joc.2c00169
    [5] HAIDAR A, TOMOV S, DONGARRA J, et al. Harnessing GPU tensor cores for fast FP16 arithmetic to speed up mixed-precision iterative refinement solvers[C]//SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. Dallas, TX, USA: IEEE, 2018: 603-613.
    [6] DU S, BHATTACHARYA C B, SEN S. Maximizing business returns to corporate social responsibility (CSR): the role of CSR communication[J]. International Journal of Management Reviews, 2010, 12 (1): 8-19. doi: 10.1111/j.1468-2370.2009.00276.x
    [7] DENG L, LI G, HAN S, et al. Model compression and hardware acceleration for neural networks: a comprehensive survey[J]. Proceedings of the IEEE, 2020, 108 (4): 485-532. doi: 10.1109/JPROC.2020.2976475
    [8] BAI Y, WANG Y X, LIBERTY E. ProxQuant: quantized neural networksvia proximal operators[J/OL]. 2018[2024-07-10]. https://arxiv.org/abs/1810.00861v3.
    [9] BUTTARI A, DONGARRA J, KURZAK J, et al. Using mixed precision for sparse matrix computations to enhance the performance while achieving 64-bit accuracy[J]. ACM Transactions on Mathematical Software, 2008, 34 (4): 1-22. http://www.xueshufan.com/publication/2111593426
    [10] 陈逸, 刘博生, 徐永祺, 等. 混合精度频域卷积神经网络FPGA加速器设计[J]. 计算机工程, 2023, 49 (12): 1-9. doi: 10.3778/j.issn.1002-8331.2210-0108

    CHEN Yi, LIU Bosheng, XU Yongqi, et al. FPGA accelerator design for hybrid precision frequency domain convolutional neural network[J]. Computer Engineering, 2023, 49 (12): 1-9. (in Chinese) doi: 10.3778/j.issn.1002-8331.2210-0108
    [11] AMESTOY P R, DUFF I S, L'EXCELLENT J Y. Multifrontal parallel distributed symmetric and unsymmetric solvers[J]. Computer Methods in Applied Mechanics and Engineering, 2000, 184 (2/3/4): 501-520. http://pdfs.semanticscholar.org/2c70/86e4e8d476154d20b271898db23f6bb8a9a3.pdf
    [12] LI X S, DEMMEL J W. SuperLU_DIST: a scalable distributed-memory sparse direct solver for unsymmetric linear systems[J]. ACM Transactions on Mathematical Software, 2003, 29 (2): 110-140. doi: 10.1145/779359.779361
    [13] HOGG J D, SCOTT J A. A fast and robust mixed-precision solver for the solution of sparse symmetric linear systems[J]. ACM Transactions on Mathematical Software, 2010, 37 (2): 1-24. http://pdfs.semanticscholar.org/e001/343705203a8126a2a01310585458971a7158.pdf
    [14] CARSON E, HIGHAM N J. A new analysis of iterative refinement and its application to accurate solution of ill-conditioned sparse linear systems[J]. SIAM Journal on Scientific Computing, 2017, 39 (6): A2834-A2856. doi: 10.1137/17M1122918
    [15] HIGHAM N J, PRANESH S. Exploiting lower precision arithmetic in solving symmetric positive definite linear systems and least squares problems[J]. SIAM Journal on Scientific Computing, 2021, 43 (1): A258-A277. doi: 10.1137/19M1298263
    [16] LOE J A, GLUSA C A, YAMAZAKI I, et al. A study of mixed precision strategies for GMRES on GPUs[J/OL]. 2021[2024-07-10]. https://arxiv.org/abs/2109.01232v1.
    [17] AMESTOY P, BUTTARI A, HIGHAM N J, et al. Five-precision GMRES-based iterative refinement[J]. SIAM Journal on Matrix Analysis and Applications, 2024, 45 (1): 529-552. doi: 10.1137/23M1549079
    [18] HAIDAR A, BAYRAKTAR H, TOMOV S, et al. Mixed-precision iterative refinement using tensor cores on GPUs to accelerate solution of linear systems[J]. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 2020, 476 (2243): 20200110. doi: 10.1098/rspa.2020.0110
    [19] ZOUNON M, HIGHAM N J, LUCAS C, et al. Performance impact of precision reduction in sparse linear systems solvers[J]. PeerJ Computer Science, 2022, 8 : e778. doi: 10.7717/peerj-cs.778
    [20] GRATTON S, SIMON E, TITLEY-PELOQUIN D, et al. Exploiting variable precision in GMRES[EB/OL]. 2019[2024-07-10]. https://arxiv.org/abs/1907.10550v2.
    [21] GIRAUD L, HAIDAR A, WATSON L T. Mixed-precision preconditioners in parallel domain decomposition solvers[M]//Lecture Notes in Computational Science and Engineering. Berlin: Springer, 2008: 357-364.
    [22] GÖBEL F, GRVTZMACHER T, RIBIZEL T, et al. Mixed precision incomplete and factorized sparse approximate inverse preconditioning on GPUs[M]//Lecture Notes in Computer Science. Cham: Springer International Publishing, 2021: 550-564.
    [23] 陈华, 史悦戎. 基于GPU的重启PGMRES并行算法研究[J]. 计算机工程与应用, 2014, 50 (7): 35-40. doi: 10.3778/j.issn.1002-8331.1308-0008

    CHEN Hua, SHI Yuerong. Study on restarted PGMRES parallel algorithm with GPU[J]. Computer Engineering and Applications, 2014, 50 (7): 35-40. (in Chinese) doi: 10.3778/j.issn.1002-8331.1308-0008
    [24] 冯选燕, 燕振国, 朱华君, 等. 非精确Newton方法中线性迭代收敛判据研究[J]. 空气动力学学报, 2023, 41 (12): 28-36. doi: 10.7638/kqdlxxb-2023.0001

    FENG Xuanyan, YAN Zhenguo, ZHU Huajun, et al. Study on the convergence criterion of linear iteration in inexact Newton methods[J]. Acta Aerodynamica Sinica, 2023, 41 (12): 28-36. (in Chinese) doi: 10.7638/kqdlxxb-2023.0001
    [25] 贡伊明, 刘战合, 刘溢浪, 等. 时间谱方法中的高效GMRES算法[J]. 航空学报, 2017, 38 (7): 120894.

    GONG Yiming, LIU Zhanhe, LIU Yilang, et al. Efficient GMRES algorithm in time spectral method[J]. Acta Aeronautica et Astronautica Sinica, 2017, 38 (7): 120894. (in Chinese)
    [26] 伍康, 吕毅斌, 石允龙, 等. 有界多连通区域数值保角变换的GMRES(m)法[J]. 应用数学和力学, 2022, 43 (9): 1026-1033. doi: 10.21656/1000-0887.420305

    WU Kang, LÜ Yibin, SHI Yunlong, et al. The GMRES(m) method for numerical conformal mapping of bounded multi-connected domains[J]. Applied Mathematics and Mechanics, 2022, 43 (9): 1026-1033. (in Chinese) doi: 10.21656/1000-0887.420305
    [27] 肖文可, 陈星玎. 求解PageRank问题的重启GMRES修正的多分裂迭代法[J]. 应用数学和力学, 2022, 43 (3): 330-340. doi: 10.21656/1000-0887.420210

    XIAO Wenke, CHEN Xingding. A modified multi-splitting iterative method with the restarted GMRES to solve the PageRank problem[J]. Applied Mathematics and Mechanics, 2022, 43 (3): 330-340. (in Chinese) doi: 10.21656/1000-0887.420210
  • 加载中
图(10) / 表(6)
计量
  • 文章访问数:  31
  • HTML全文浏览量:  15
  • PDF下载量:  6
  • 被引次数: 0
出版历程
  • 收稿日期:  2024-06-05
  • 修回日期:  2024-07-10
  • 刊出日期:  2025-01-01

目录

    /

    返回文章
    返回