The most expensive procedures in Krylov methods are matrix-vector products (y ← Ax) and vector operations (dot products, vector norms, y ← αx + βy). Therefore, they directly affect the efficiency of the methods provided by Krylov.jl. In this section, we present various optimizations based on multithreading to speed up these procedures. Multithreading is a form of shared-memory parallelism that uses multiple cores to perform tasks concurrently.
Multi-threaded BLAS and LAPACK operations
The number of BLAS threads can be set with an environment variable before launching Julia:
OPENBLAS_NUM_THREADS=N julia # Set the number of OpenBLAS threads to N
MKL_NUM_THREADS=N julia # Set the number of MKL threads to N
If you don't know the maximum number of threads available on your computer, you can obtain it with
NMAX = Sys.CPU_THREADS
and define the number of OpenBLAS/MKL threads at runtime with
BLAS.set_num_threads(N) # 1 ≤ N ≤ NMAX
BLAS.get_num_threads() # Check the number of BLAS threads currently in use
The recommended number of BLAS threads is the number of physical cores, not the number of logical cores; it is generally N = NMAX / 2 if your CPU supports simultaneous multithreading (SMT).
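For example, here is a small sketch that follows this recommendation; it assumes that SMT is enabled (without SMT, NMAX is already the number of physical cores):
using LinearAlgebra
NMAX = Sys.CPU_THREADS          # number of logical cores
BLAS.set_num_threads(NMAX ÷ 2)  # one BLAS thread per physical core (SMT assumed)
BLAS.get_num_threads()          # check the new value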
By default, Julia ships with OpenBLAS, but it's also possible to use Intel MKL BLAS and LAPACK with MKL.jl. If your operating system is macOS 13.4 or later, it's recommended to use Accelerate BLAS and LAPACK with AppleAccelerate.jl. You can check which backend is currently loaded with
using LinearAlgebra
BLAS.get_config() # BLAS.vendor() for Julia 1.6
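As a minimal sketch, assuming MKL.jl (or AppleAccelerate.jl on macOS) is installed, loading the package is enough to switch the backend used by subsequent BLAS and LAPACK calls:
using MKL             # or: using AppleAccelerate on macOS
using LinearAlgebra
BLAS.get_config()     # the new backend should now be listed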
Multi-threaded sparse matrix-vector products
For sparse matrices, the Julia implementation of mul! in the SparseArrays library is not parallelized. A significant speed-up can be observed with the multithreaded mul! provided by MKLSparse.jl or ThreadedSparseCSR.jl, as in the sketch below.
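For instance, here is a sketch with MKLSparse.jl; it assumes the package is installed and that the element types are supported by MKL. Simply loading it makes mul! dispatch to MKL's multithreaded sparse kernels:
using LinearAlgebra, SparseArrays
using MKLSparse
A = sprand(10_000, 10_000, 1.0e-3)  # hypothetical sparse matrix
x = rand(10_000)
y = similar(x)
mul!(y, A, x)  # multithreaded sparse product through MKL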
It's also possible to implement a generic multithreaded version in Julia. For instance, the following function traverses A column by column and computes y = Aᵀx, which equals Ax when A is symmetric, so it can be used for symmetric matrices:
using Base.Threads
using SparseArrays  # for SparseMatrixCSC

function threaded_mul!(y::Vector{T}, A::SparseMatrixCSC{T}, x::Vector{T}) where T <: Number
  A.m == A.n || error("A is not a square matrix!")
  @threads for i = 1 : A.n
    tmp = zero(T)
    @inbounds for j = A.colptr[i] : (A.colptr[i+1] - 1)
      tmp += A.nzval[j] * x[A.rowval[j]]
    end
    @inbounds y[i] = tmp
  end
  return y
end
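A quick sanity check on random data (the matrix and vectors below are purely illustrative):
using LinearAlgebra, SparseArrays
B = sprandn(1000, 1000, 5.0e-3)
B = B + B'                       # make B symmetric
v = randn(1000)
w = zeros(1000)
threaded_mul!(w, B, v) ≈ B * v   # should return true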
The function can then be wrapped inside a linear operator to solve symmetric linear systems:
using LinearOperators
n, m = size(A)
sym = herm = true
T = eltype(A)
opA = LinearOperator(T, n, m, sym, herm, (y, v) -> threaded_mul!(y, A, v))
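As a hypothetical end-to-end example, the operator can be passed directly to a Krylov solver; here b is an assumed right-hand side of size n:
using Krylov
b = rand(T, n)             # hypothetical right-hand side
x, stats = minres(opA, b)  # solve the symmetric system Ax = b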
To enable multi-threading in Julia, start julia with the environment variable JULIA_NUM_THREADS or with the option -t (alias --threads):
julia -t auto # alternative: --threads auto
julia -t N # alternative: --threads N
JULIA_NUM_THREADS=N julia
You can then check the number of threads available to Julia with
using Base.Threads
nthreads()
The following benchmarks illustrate the time required in seconds to compute 1000 sparse matrix-vector products with symmetric matrices of the SuiteSparse Matrix Collection. The computer used for the benchmarks has 2 physical cores and Julia was launched with JULIA_NUM_THREADS=2.
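For reference, here is a sketch of how one such timing could be reproduced; the matrix A and the vector x are assumed to be loaded beforehand, for example with SuiteSparseMatrixCollection.jl or MatrixDepot.jl:
using LinearAlgebra
function bench_products(A, x; nprod = 1000)
  y = similar(x)
  t = @elapsed for _ = 1 : nprod
    mul!(y, A, x)
  end
  return t  # elapsed time in seconds for nprod products
end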