The most expensive procedures in Krylov methods are matrix-vector products (y ← Ax) and vector operations (dot products, vector norms, y ← αx + βy). Therefore they directly affect the efficiency of the methods provided by `Krylov.jl`

. In this section, we present various optimizations based on multithreading to speed up these procedures. Multithreading is a form of shared memory parallelism that makes use of multiple cores to perform tasks.

## Multi-threaded BLAS and LAPACK operations

```
OPENBLAS_NUM_THREADS=N julia # Set the number of OpenBLAS threads to N
MKL_NUM_THREADS=N julia # Set the number of MKL threads to N
```

If you don't know the maximum number of threads available on your computer, you can obtain it with

`NMAX = Sys.CPU_THREADS`

and define the number of OpenBLAS/MKL threads at runtime with

```
BLAS.set_num_threads(N) # 1 ≤ N ≤ NMAX
BLAS.get_num_threads()
```

The recommended number of BLAS threads is the number of physical and not logical cores, which is in general `N = NMAX / 2`

if your CPU supports simultaneous multithreading (SMT).

By default Julia ships with OpenBLAS but it's also possible to use Intel MKL BLAS and LAPACK with MKL.jl. If your operating system is MacOS 13.4 or later, it's recommended to use Accelerate BLAS and LAPACK with AppleAccelerate.jl.

```
using LinearAlgebra
BLAS.get_config() # BLAS.vendor() for Julia 1.6
```

## Multi-threaded sparse matrix-vector products

For sparse matrices, the Julia implementation of `mul!`

of SparseArrays library is not parallelized. A significant speed-up can be observed with the multhreaded `mul!`

of MKLSparse.jl or ThreadedSparseCSR.jl.

It's also possible to implement a generic multithreaded julia version. For instance, the following function can be used for symmetric matrices

```
using Base.Threads
function threaded_mul!(y::Vector{T}, A::SparseMatrixCSC{T}, x::Vector{T}) where T <: Number
A.m == A.n || error("A is not a square matrix!")
@threads for i = 1 : A.n
tmp = zero(T)
@inbounds for j = A.colptr[i] : (A.colptr[i+1] - 1)
tmp += A.nzval[j] * x[A.rowval[j]]
end
@inbounds y[i] = tmp
end
return y
end
```

and wrapped inside a linear operator to solve symmetric linear systems

```
using LinearOperators
n, m = size(A)
sym = herm = true
T = eltype(A)
opA = LinearOperator(T, n, m, sym, herm, (y, v) -> threaded_mul!(y, A, v))
```

To enable multi-threading with Julia, you can start julia with the environment variable `JULIA_NUM_THREADS`

or the options `-t`

and `--threads`

```
julia -t auto # alternative: --threads auto
julia -t N # alternative: --threads N
JULIA_NUM_THREADS=N julia
```

Thereafter, you can verify the number of threads usable by Julia

```
using Base.Threads
nthreads()
```

The following benchmarks illustrate the time required in seconds to compute 1000 sparse matrix-vector products with symmetric matrices of the SuiteSparse Matrix Collection. The computer used for the benchmarks has 2 physical cores and Julia was launched with `JULIA_NUM_THREADS=2`

.