How to run HPL/HPCG/IO500 in WSL
4. Performance Tuning.
0. Tuning is a continual process of experimentation and requires patience.
1. Adjust the HPL.dat file
Refer to www.netlib.org/benchmark/h… for guidance on adjusting the HPL.dat file, or generate one directly from www.advancedclustering.com/act_kb/tune…. I used the following inputs:
| Input | Value |
|---|---|
| Nodes | 1 |
| Cores per Node | 1 |
| Memory per Node (MB) | 512 |
| Block Size (NB) | 192 |
Output:
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
7296 Ns
1 # of NBs
192 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
1 Ps
1 Qs
16.0 threshold
1 # of panel fact
2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
4 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
1 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 DEPTHs (>=0)
2 SWAP (0=bin-exch,1=long,2=mix)
64 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
##### This line (no. 32) is ignored (it serves as a separator). ######
0 Number of additional problem sizes for PTRANS
1200 10000 30000 values of N
0 number of additional blocking sizes for PTRANS
40 9 8 13 13 20 16 32 64 values of NB
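The problem size (Ns = 7296) in the generated file can be reproduced with a common rule of thumb: size the N x N matrix of 8-byte doubles to fill a fraction of total memory, then round N down to a multiple of NB. The sketch below is not the generator's actual code. The function name `hpl_problem_size` is hypothetical, and the 80% memory fraction is an assumption that happens to reproduce the value above for the inputs in the table.

```python
import math

def hpl_problem_size(mem_mb, nodes=1, nb=192, fraction=0.80):
    """Largest multiple of nb such that an N x N matrix of doubles
    fits in `fraction` of the total memory (assumed rule of thumb)."""
    total_bytes = mem_mb * 1024 * 1024 * nodes
    max_doubles = int(total_bytes * fraction / 8)  # 8 bytes per double
    n = math.isqrt(max_doubles)                    # side of the largest square matrix
    return (n // nb) * nb                          # round down to a multiple of NB

# Inputs from the table: 1 node, 512 MB per node, NB = 192
print(hpl_problem_size(512, nodes=1, nb=192))
```

Raising `mem_mb` toward the memory actually available in WSL gives a larger N, which is usually where most of the gain comes from.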
Replace the contents of HPL.dat with the generated file and run xhpl again:
mpirun -np 4 xhpl
Output (partial):
================================================================================
HPLinpack 2.3 -- High-Performance Linpack benchmark -- December 2, 2018
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 7296
NB : 192
PMAP : Row-major process mapping
P : 1
Q : 1
PFACT : Right
NBMIN : 4
NDIV : 2
RFACT : Crout
BCAST : 1ringM
DEPTH : 1
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR11C2R4 7296 192 1 1 1.82 1.4243e+02
HPL_pdgesv() start time Fri May 19 23:52:43 2023
HPL_pdgesv() end time Fri May 19 23:52:45 2023
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 3.98844764e-03 ...... PASSED
================================================================================
Finished 1 tests with the following results:
1 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------
End of Tests.
================================================================================
The score has improved by 10781% compared with the first run.
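As a sanity check, the Gflops column can be recomputed from N and Time. HPL counts (2/3)N^3 + (3/2)N^2 floating-point operations for the solve. The helper name `hpl_gflops` is hypothetical, and because the printed time is rounded to two decimals, the result only approximately matches the reported 1.4243e+02.

```python
def hpl_gflops(n, seconds):
    """Approximate HPL rate: operation count for solving an
    N x N system divided by wall time, in Gflops."""
    flops = (2.0 / 3.0) * n**3 + 1.5 * n**2
    return flops / seconds / 1e9

# N and Time taken from the result line above (WR11C2R4)
print(round(hpl_gflops(7296, 1.82), 2))
```

The same function makes it easy to estimate how long a larger N would take at a given rate before committing to a long run.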
Next, I modified the following lines in HPL.dat:
file device out (6=stdout,7=stderr,file)
2 # of problems sizes (N)
16384 20352 Ns
0 1 2 PFACTs (0=left, 1=Crout, 2=Right)
2 8 NBMINs (>= 1)
0 1 2 RFACTs (0=left, 1=Crout, 2=Right)
3 2 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
Run it again; the output is:
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR13L2L2 16384 192 1 1 10.85 2.7028e+02
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 2.48156500e-03 ...... PASSED
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR13L2L2 20352 192 1 1 20.03 2.8064e+02
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 1.90299920e-03 ...... PASSED
================================================================================
The score has now improved by 22253% compared with the first run. A result of 2.8064e+02 Gflops would have been enough to enter the June 2003 Top500 list.
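Once HPL.dat lists several problem sizes and algorithm variants, picking the best line by eye gets tedious. The following is a minimal sketch that extracts the result lines (those starting with `WR` or `WC`) from HPL output text and reports the fastest one. The names `parse_hpl_results` and `best_run` are hypothetical, and the sample text is copied from the output above.

```python
import re

# Matches result lines such as:
# WR13L2L2 20352 192 1 1 20.03 2.8064e+02
RESULT_LINE = re.compile(
    r"^(W[RC]\w+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+([\d.]+)\s+([\d.eE+-]+)$"
)

def parse_hpl_results(text):
    """Return one dict per HPL result line found in `text`."""
    results = []
    for line in text.splitlines():
        m = RESULT_LINE.match(line.strip())
        if m:
            tv, n, nb, p, q, seconds, gflops = m.groups()
            results.append({
                "tv": tv, "n": int(n), "nb": int(nb),
                "p": int(p), "q": int(q),
                "time": float(seconds), "gflops": float(gflops),
            })
    return results

def best_run(results):
    """Result with the highest Gflops."""
    return max(results, key=lambda r: r["gflops"])

sample = """\
T/V N NB P Q Time Gflops
WR13L2L2 16384 192 1 1 10.85 2.7028e+02
WR13L2L2 20352 192 1 1 20.03 2.8064e+02
"""
best = best_run(parse_hpl_results(sample))
print(best["tv"], best["n"], best["gflops"])
```

To use it on a real run, pass the contents of the output file (or captured stdout) to `parse_hpl_results` instead of `sample`.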