PartitionedStructures.jl: Tutorial
This tutorial shows how PartitionedStructures.jl can define partitioned quasi-Newton approximations, by using partitioned vectors and partitioned matrices. Those partitioned structures are strongly related to partially-separable functions.
What are the partially-separable structure and the partitioned quasi-Newton approximations
The partitioned quasi-Newton methods exploit the partially-separable structure of $f:\R^n \to \R$
\[ f(x) = \sum_{i=1}^N f_i (U_i x) , \; f_i : \R^{n_i} \to \R, \; U_i \in \R^{n_i \times n},\; n_i \ll n,\]
as the sum of element function $f_i$. The gradient $\nabla f$ and the Hessian $\nabla^2 f$
\[\nabla f(x) = \sum_{i=1}^N U_i^\top f_i (U_i x), \quad \nabla^2 f(x) = \sum_{i=1}^N U_i^\top f_i (U_i x) U_i,\]
accumulate the element derivatives $\nabla f_i$ and $\nabla^2 f_i$.
The partitioned structure of the Hessian let us the definition of partitioned quasi-Newton approximations $B \approx \nabla^2 f$, such that $B$ accumulates every element Hessian approximation $B_i \approx \nabla^2 f_i$
\[B = \sum_{i=1}^N U_i^\top B_i U_i\]
The partitioned quasi-Newton approximations respect sparsity structure of $\nabla^2 f$, which is not the case of classical quasi-Newton approximation (e.g. BFGS, SR1). Moreover, the rank of the partitioned updates may be proportional to the number of elements $N$, whereas classical quasi-Newton approximations are low-rank updates. A partitioned quasi-Newton update may update every element-Hessian $B_i$ at each step $s$. It requires $B_i$, $U_i s$ and $\nabla f_i (U_i (x+s)) - \nabla f_i (U_i x)$, and therefore we have to store such an approximation and vectors for every element.
Reference
- A. Griewank and Ph. L. Toint. Partitioned variable metric updates for large structured optimization problems. Numer. Math., 39:119–137, 1982, 10.1007/BF01399316
Example: the partitioned structure of a quadratic
Let's take the quadratic f
as an example
f(x) = x[1]^2 + x[1]*x[2] + x[2]^2 + x[3]^2 + 3x[2]*x[3]
f (generic function with 1 method)
f
can be considered as the sum of two element functions
f1(y) = y[1]^2 + y[1]*y[2]
f2(y) = y[1]^2 + y[2]^2 + 3y[1]*y[2]
f2 (generic function with 1 method)
Define $U_1$ and $U_2$ indicating which variables are required by each element function:
U1 = [1 0 0; 0 1 0]
U2 = [0 1 0; 0 0 1]
2×3 Matrix{Int64}: 0 1 0 0 0 1
However, dense matrices produce a memory issue for large problems. Instead, we use a linear operator only informing the indices of the variables
U1 = [1, 2]
U2 = [2, 3]
2-element Vector{Int64}: 2 3
By gathering the different $U_i$ together
U = [U1, U2]
2-element Vector{Vector{Int64}}: [1, 2] [2, 3]
we define the function exploiting the partially-separable structure f_pss
as
f_pss(x, U) = f1(x[U[1]]) + f2(x[U[2]])
x0 = [2., 3., 4.]
f(x0) == f_pss(x0, U)
true
Similarly, we can compute: the gradient, the element gradients and explicit how the gradient is partitioned
∇f(x) = [2x[1] + x[2], x[1] + 2x[2] + 3x[3], 2x[3] + 3x[2]]
∇f1(y) = [2y[1] + y[2], y[1]]
∇f2(y) = [2y[1] + 3y[2], 2y[2] + 3y[1]]
function ∇f_pss(x, U)
gradient = zeros(length(x))
gradient[U1] .+= ∇f1(x[U[1]])
gradient[U2] .+= ∇f2(x[U[2]])
return gradient
end
∇f(x0) == ∇f_pss(x0, U)
true
However, ∇f_pss
accumulates directly the element gradients and does not store the value of each element gradients ∇f1, ∇f2
. We would like to store every element gradient, so we can later update partitioned quasi-Newton approximations. Thus, we define a partitioned vector from U
to store each element gradient and form $\nabla f$ when required:
using PartitionedStructures
U = [U1, U2]
n = length(x0)
# create the partitioned vector
partitioned_gradient_x0 = create_epv(U)
Elemental_pv{Float64}(2, 3, Elemental_elt_vec{Float64}[Elemental_elt_vec{Float64}([0.6325983422646918, 0.15907213325892944], [1, 2], 2), Elemental_elt_vec{Float64}([0.10602901995509595, 0.4297342009350201], [2, 3], 2)], [0.0, 0.0, 0.0], [[1], [1, 2], [2]], [1, 2, 3])
We set the value of each element vector to the corresponding element gradient
# return every element gradient
vector_gradient_element(x, U) = [∇f1(x[U[1]]), ∇f2(x[U[2]])] :: Vector{Vector{Float64}}
# set each element vector to its corresponding element gradient
set_epv!(partitioned_gradient_x0, vector_gradient_element(x0, U))
# Build the gradient vector
build_v!(partitioned_gradient_x0)
# with the same value as the gradient
get_v(partitioned_gradient_x0) == ∇f(x0)
true
Approximate the Hessian $\nabla^2 f$
There are at least two ways to approximate $\nabla^2 f$: a classical quasi-Newton approximation (ex: BFGS) and a partitioned quasi-Newton approximation (ex: PBFGS). Both methods are presented, and we expose the sparse structure of the partitioned approximation.
Quasi-Newton approximation of the quadratic (BFGS)
In the case of the BFGS method, you want to approximate the Hessian from s = x1 - x0
x1 = [1., 2., 2.]
s = x1 .- x0
3-element Vector{Float64}: -1.0 -1.0 -2.0
the gradient difference y
y = ∇f(x1) .- ∇f(x0)
3-element Vector{Float64}: -3.0 -9.0 -7.0
and the approximation B
, initially set to the identity
B = [ i==j ? 1. : 0. for i in 1:n, j in 1:n]
3×3 Matrix{Float64}: 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0
By applying the BFGS update, you satisfy the secant equation Bs = y
# PartitionedStructures.jl exports a BFGS implementation
B_BFGS = BFGS(s,y,B)
using LinearAlgebra
# numerical verification of the secant equation
atol = sqrt(eps(eltype(s)))
norm(B_BFGS * s - y) < atol
true
but the approximation B_BFGS
is dense:
B_BFGS
3×3 Matrix{Float64}: 1.17949 0.871795 0.474359 0.871795 3.94872 2.08974 0.474359 2.08974 2.21795
which makes it impractical for large-scale problems.
Partitioned quasi-Newton approximation of the quadratic (PBFGS)
In order to make a sparse quasi-Newton approximation of $\nabla^2 f$, you may define a partitioned matrix with the same partially-separable structure as partitioned_gradient_x0
where each element matrix is set to the identity
partitioned_matrix = epm_from_epv(partitioned_gradient_x0)
Elemental_pm{Float64}(2, 3, Elemental_em{Float64}[Elemental_em{Float64}(2, [1, 2], [1.0 0.0; 0.0 1.0], Counter_elt_mat(0, 0, 0, 0, 0, 0), false, false, [6.95221969188094e-310, 6.9522196902944e-310]), Elemental_em{Float64}(2, [2, 3], [1.0 0.0; 0.0 1.0], Counter_elt_mat(0, 0, 0, 0, 0, 0), false, false, [0.0, 0.0])], sparse(Int64[], Int64[], Float64[], 3, 3), sparse(Int64[], Int64[], Float64[], 3, 3), [[1], [1, 2], [2]], [1, 2, 3])
It can be visualized with the Matrix
constructor
Matrix(partitioned_matrix)
3×3 Matrix{Float64}: 1.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 1.0
The second term on the diagonal accumulates two components of value 1.0 from the two initial element approximations.
Then you compute the partitioned gradient at x1
partitioned_gradient_x1 = similar(partitioned_gradient_x0)
set_epv!(partitioned_gradient_x1, vector_gradient_element(x1, U))
Elemental_pv{Float64}(2, 3, Elemental_elt_vec{Float64}[Elemental_elt_vec{Float64}([4.0, 1.0], [1, 2], 2), Elemental_elt_vec{Float64}([10.0, 10.0], [2, 3], 2)], [6.95220890896384e-310, 6.95220890896384e-310, 6.95220890896384e-310], [[1], [1, 2], [2]], [1, 2, 3])
and compute the elementwise difference of the partitioned gradients partitioned_gradient_x1 - partitioned_gradient_x0
# copy to avoid side effects on partitioned_gradient_x0
partitioned_gradient_difference = copy(partitioned_gradient_x0)
# apply in place an unary minus to every element gradient
minus_epv!(partitioned_gradient_difference)
# add in place the element vector of partitioned_gradient_x1
# to the corresponding element vector of partitioned_gradient_difference
add_epv!(partitioned_gradient_x1, partitioned_gradient_difference)
# compute the vector y
build_v!(partitioned_gradient_difference)
get_v(partitioned_gradient_difference) == y
true
Then you can define the partitioned quasi-Newton PBFGS update
# apply the partitioned PBFGS update to partitioned_matrix
# and return Matrix(partitioned_matrix)
B_PBFGS = update(partitioned_matrix, partitioned_gradient_difference, s; name=:pbfgs, verbose=true)
3×3 Matrix{Float64}: 2.75 0.25 0.0 0.25 4.45909 2.14545 0.0 2.14545 2.42727
which respects the sparsity structure of ∇²f.
In addition, update()
indicates the number of elements: updated, not updated or untouched, as long as the user doesn't set verbose=false
. The partitioned update verifies the secant equation
norm(B_PBFGS*s - y) < atol
true
which may also be calculated with
# compute the product partitioned matrix vector
Bs = mul_epm_vector(partitioned_matrix, s)
norm(Bs - y) < atol
true
Other partitioned quasi-Newton approximations
There exist two categories of partitioned quasi-Newton updates. In the first category, each element Hessian $\nabla^2 f_i$ is approximated with a dense matrix, for example: PBFGS. In the second category, each element Hessian $\nabla^2 f_i$ is approximated with a quasi-Newton linear-operator.
Partitioned quasi-Newton operators
Once the partitioned matrix is allocated,
partitioned_matrix_PBFGS = epm_from_epv(partitioned_gradient_x0)
partitioned_matrix_PSR1 = epm_from_epv(partitioned_gradient_x0)
partitioned_matrix_PSE = epm_from_epv(partitioned_gradient_x0)
Elemental_pm{Float64}(2, 3, Elemental_em{Float64}[Elemental_em{Float64}(2, [1, 2], [1.0 0.0; 0.0 1.0], Counter_elt_mat(0, 0, 0, 0, 0, 0), false, false, [6.9522196987263e-310, 6.95221974173494e-310]), Elemental_em{Float64}(2, [2, 3], [1.0 0.0; 0.0 1.0], Counter_elt_mat(0, 0, 0, 0, 0, 0), false, false, [6.95221259809826e-310, 6.95221259809826e-310])], sparse(Int64[], Int64[], Float64[], 3, 3), sparse(Int64[], Int64[], Float64[], 3, 3), [[1], [1, 2], [2]], [1, 2, 3])
you can apply on it any of the three partitioned updates : PBFGS, PSR1, PSE (by default) :
- PBFGS updates each element approximation with BFGS;
- PSR1 updates each element approximation with SR1;
- In PSE, each element approximation is updated with BFGS if it is possible or with SR1 otherwise.
B_PBFGS = update(partitioned_matrix_PBFGS, partitioned_gradient_difference, s; name=:pbfgs)
B_PSR1 = update(partitioned_matrix_PSR1, partitioned_gradient_difference, s; name=:psr1)
# name=:pse by default
B_PSE = update(partitioned_matrix_PSE, partitioned_gradient_difference, s)
3×3 Matrix{Float64}: 2.75 0.25 0.0 0.25 4.45909 2.14545 0.0 2.14545 2.42727
All these approximations satisfy the secant equation as long as every element approximation is updated
norm(mul_epm_vector(partitioned_matrix_PBFGS, s) - y) < atol
true
norm(mul_epm_vector(partitioned_matrix_PSR1, s) - y) < atol
true
norm(mul_epm_vector(partitioned_matrix_PSE, s) - y) < atol
true
Limited-memory partitioned quasi-Newton operators
These operators exist to apply the partitioned quasi-Newton methods to a partially-separable function with large $n_i$, whose element Hessian approximations can't be store as dense matrices. The limited-memory partitioned quasi-Newton operators allocate a quasi-Newton operator LBFGS or LSR1 defined in LinearOperators.jl for each element Hessian approximation. It defines three approximations:
- PLBFGS, each element approximation is a
LBFGSOperator
; - PLSR1, each element approximation is a
LSR1Operator
; - PLSE, each element approximation may be a
LBFGSOperator
orLSR1Operator
.
Contrary to the partitioned quasi-Newton operators, each partitioned limited-memory quasi-Newton operators has a different type
partitioned_linear_operator_PLBFGS = eplo_lbfgs_from_epv(partitioned_gradient_x0)
partitioned_linear_operator_PLSR1 = eplo_lsr1_from_epv(partitioned_gradient_x0)
partitioned_linear_operator_PLSE = eplo_lose_from_epv(partitioned_gradient_x0)
Elemental_plo{Float64}(2, 3, Union{Elemental_elo_bfgs{Float64}, Elemental_elo_sr1{Float64}}[Elemental_elo_bfgs{Float64}(2, [1, 2], Linear operator nrow: 2 ncol: 2 eltype: Float64 symmetric: true hermitian: true nprod: 0 ntprod: 0 nctprod: 0 , Counter_elt_mat(0, 0, 0, 0, 0, 0), false), Elemental_elo_bfgs{Float64}(2, [2, 3], Linear operator nrow: 2 ncol: 2 eltype: Float64 symmetric: true hermitian: true nprod: 0 ntprod: 0 nctprod: 0 , Counter_elt_mat(0, 0, 0, 0, 0, 0), false)], sparse(Int64[], Int64[], Float64[], 3, 3), sparse(Int64[], Int64[], Float64[], 3, 3), [[1], [1, 2], [2]], [1, 2, 3])
The different types simplify the update
method, since no argument name
is required to determine which update is applied
B_PLBFGS = update(partitioned_linear_operator_PLBFGS, partitioned_gradient_difference, s)
norm(B_PLBFGS * s - y) < atol
true
B_PLSE = update(partitioned_linear_operator_PLSE, partitioned_gradient_difference, s)
norm(B_PLSE * s - y) < atol
true
B_PLSR1 = update(partitioned_linear_operator_PLSR1, partitioned_gradient_difference, s)
norm(B_PLSR1 * s - y) < atol
true
That's it, you have all the tools to implement a partitioned quasi-Newton method. Enjoy!
Tips
The main issue about the definition of partitioned structures is informing the $f_i$ and $U_i$. To address this issue you may want to take a look at ExpressionTreeForge.jl which detect automatically the partially separable structure from an ADNLPModels or a JuMP model.