A regularized optimization problem

In this tutorial, we will show how to model and solve the nonconvex nonsmooth optimization problem

\[ \min_{x \in \mathbb{R}^2} (1 - x_1)^2 + 100(x_2 - x_1^2)^2 + |x_1| + |x_2|,\]

which can be seen as an $\ell_1$ regularization of the Rosenbrock function. It can be shown that the solution to the problem is

\[ x^* = \begin{pmatrix} 0.25\\ 0.0575 \end{pmatrix}.\]

Modelling the problem

We first formulate the objective function as the sum of a smooth function $f$ and a nonsmooth regularizer $h$:

\[ (1 - x_1)^2 + 100(x_2 - x_1^2)^2 + |x_1| + |x_2| = f(x_1, x_2) + h(x_1, x_2),\]

where

\[\begin{align*} f(x_1, x_2) &:= (1 - x_1)^2 + 100(x_2 - x_1^2)^2,\\ h(x_1, x_2) &:= \|x\|_1. \end{align*}\]
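
Indeed, the claimed minimizer can be checked against the first-order optimality condition $0 \in \nabla f(x^*) + \partial h(x^*)$: both components of $x^*$ are positive, so $\partial h(x^*) = \{(1, 1)\}$, while

\[ \nabla f(x^*) = \begin{pmatrix} -2(1 - x_1^*) - 400 x_1^* \bigl(x_2^* - (x_1^*)^2\bigr)\\ 200 \bigl(x_2^* - (x_1^*)^2\bigr) \end{pmatrix} = \begin{pmatrix} -1\\ -1 \end{pmatrix},\]

so the two terms cancel exactly.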

To model $f$, we use ADNLPModels.jl. For the nonsmooth regularizer $h$, we use ProximalOperators.jl. We then wrap the smooth model and the regularizer in a RegularizedNLPModel.

using ADNLPModels
using ProximalOperators
using RegularizedProblems

# Model the function
f_fun = x -> (1 - x[1])^2 + 100*(x[2] - x[1]^2)^2

# Choose a starting point for the optimization process; for the sake of this example, we pick
x0 = [-1.0, 2.0]

# Get an NLPModel corresponding to the smooth function f
f_model = ADNLPModel(f_fun, x0, name = "AD model of f")

# Get the regularizer from ProximalOperators
h = NormL1(1.0)

# Wrap into a RegularizedNLPModel
regularized_pb = RegularizedNLPModel(f_model, h)
Smooth model: ADNLPModel - Model with automatic differentiation backend ADModelBackend{
  ForwardDiffADGradient,
  ForwardDiffADHvprod,
  EmptyADbackend,
  EmptyADbackend,
  EmptyADbackend,
  SparseADHessian,
  EmptyADbackend,
}
  Problem name: AD model of f
   All variables: ████████████████████ 2      All constraints: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0     
            free: ████████████████████ 2                 free: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0     
           lower: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0                lower: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0     
           upper: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0                upper: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0     
         low/upp: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0              low/upp: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0     
           fixed: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0                fixed: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0     
          infeas: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0               infeas: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0     
            nnzh: (  0.00% sparsity)   3               linear: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0     
                                                    nonlinear: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0     
                                                         nnzj: (------% sparsity)         

  Counters:
             obj: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0                 grad: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0                 cons: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0     
        cons_lin: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0             cons_nln: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0                 jcon: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0     
           jgrad: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0                  jac: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0              jac_lin: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0     
         jac_nln: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0                jprod: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0            jprod_lin: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0     
       jprod_nln: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0               jtprod: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0           jtprod_lin: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0     
      jtprod_nln: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0                 hess: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0                hprod: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0     
           jhess: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0               jhprod: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0     

Regularizer: ProximalOperators.NormL1{Float64}(1.0)

Selected variables: 1:2
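
Although the solvers do not require it, it can be instructive to evaluate the two pieces of the objective separately before solving. The sketch below assumes the NLPModels.jl evaluation API (obj and grad) and the ProximalOperators.jl conventions that a regularizer is callable and supports prox; it is an illustration, not part of the model construction.

using NLPModels  # provides obj and grad for any NLPModel

obj(f_model, x0)   # f(x0) = (1 - (-1))^2 + 100*(2 - (-1)^2)^2 = 104
grad(f_model, x0)  # ∇f(x0), computed by automatic differentiation
h(x0)              # h(x0) = ‖x0‖₁ = 3 (ProximalOperators functions are callable)
prox(h, x0, 0.1)   # proximal point of h at x0 with step 0.1 (soft-thresholding), and h at that point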

Solving the problem

We can now choose one of the solvers presented here to solve the problem defined above. Please refer to the other sections of this documentation to pick the solver best suited to your particular problem: depending on the problem structure and on the user's requirements, some solvers are more appropriate than others. The following gives a quick overview of the choices one can make.

Suppose, for example, that we do not want to use a quasi-Newton approach and that we do not have access to the Hessian of $f$, or do not want to incur the cost of computing it. In this case, the most appropriate solver is R2. For this example, we also choose a tolerance by specifying the keyword arguments atol and rtol, which we pass to all solvers below.

using RegularizedOptimization

out = R2(regularized_pb, verbose = 100, atol = 1e-6, rtol = 1e-6)
println("R2 converged after $(out.iter) iterations to the solution x = $(out.solution)")
[ Info:   iter     f(x)     h(x)   √(ξ/ν)        ρ        σ      ‖x‖      ‖s‖ R2
[ Info:      0  1.0e+02  3.0e+00  4.4e+02  6.6e-01  1.4e+03  2.2e+00  3.3e-01 =
[ Info:    100  4.2e+00  2.2e+00  4.1e+00 -5.0e-01  1.5e+02  1.5e+00  2.7e-02 ↗
[ Info:    200  2.6e+00  9.7e-01  3.5e+00  9.9e-01  5.0e+01  7.1e-01  6.9e-02 ↘
[ Info:    300  5.8e-01  2.9e-01  3.4e-02  9.3e-01  1.5e+02  2.5e-01  2.2e-04 ↘
[ Info:    400  5.7e-01  3.1e-01  2.7e-03  5.2e-02  5.0e+01  2.6e-01  5.3e-05 =
[ Info:    465  5.7e-01  3.1e-01  4.3e-04  7.9e-01  1.5e+02  2.6e-01  2.8e-06
[ Info: R2: terminating with √(ξ/ν) = 0.0004256907271339337
R2 converged after 465 iterations to the solution x = [0.24988875496474894, 0.05744417515023585]
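
Besides iter and solution, the returned statistics object exposes other useful fields. The names below are assumed to follow SolverCore.jl's GenericExecutionStats, which the solvers in this package return; check your installed version if they differ.

out.status        # termination status, e.g. :first_order once the optimality measure meets the tolerance
out.objective     # final objective value reported by the solver
out.elapsed_time  # wall-clock time spent in the solver, in seconds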

Now, let us exploit second-order information on $f$. To do so, we use TR, a trust-region solver that can take advantage of the Hessian.

out = TR(regularized_pb, verbose = 100, atol = 1e-6, rtol = 1e-6)
println("TR converged after $(out.iter) iterations to the solution x = $(out.solution)")
[ Info:  outer  inner     f(x)     h(x)  √(ξ1/ν)        ρ        Δ      ‖x‖      ‖s‖      ‖B‖ TR
[ Info:      0     11  1.0e+02  3.0e+00  4.4e+02 -2.7e-02  1.0e+00  2.0e+00  1.0e+00  7.1e+02 ↘
[ Info:    100      0  5.7e-01  3.1e-01  3.1e-03  1.0e+00  2.0e-01  2.5e-01  1.1e-05  2.5e+02 ↗
[ Info:    200      0  5.7e-01  3.1e-01  8.6e-04  1.0e+00  2.0e-01  2.5e-01  3.1e-06  2.5e+02 ↗
[ Info:    252      0  5.7e-01  3.1e-01  4.4e-04  1.0e+00  2.0e-01  2.5e-01  1.6e-06  2.5e+02
[ Info: TR: terminating with √(ξ1/ν) = 0.0004409035495581565
TR converged after 252 iterations to the solution x = [0.24987677860484372, 0.0574374062666352]
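
To quantify the accuracy of the TR run, we can measure the distance to the minimizer stated at the beginning of this tutorial; this small check only needs LinearAlgebra for norm.

using LinearAlgebra

xstar = [0.25, 0.0575]
norm(out.solution .- xstar)  # distance between the TR solution and x*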

Suppose now that, for some reason, we cannot compute the Hessian. In this case, we can switch to a quasi-Newton approximation of $f$, which can be done with NLPModelsModifiers.jl. We could use TR again, but for the sake of this tutorial we solve the problem with R2N.

using NLPModelsModifiers

# Switch the model of the smooth function to a quasi-Newton approximation
f_model_lsr1 = LSR1Model(f_model)
regularized_pb_lsr1 = RegularizedNLPModel(f_model_lsr1, h)

# Solve with R2N
out = R2N(regularized_pb_lsr1, verbose = 100, atol = 1e-6, rtol = 1e-6)
println("R2N converged after $(out.iter) iterations to the solution x = $(out.solution)")
[ Info:  outer  inner     f(x)     h(x)  √(ξ1/ν)        ρ        σ      ‖x‖      ‖s‖      ‖B‖ R2N
[ Info:      0      1  1.0e+02  3.0e+00  4.4e+02 -2.5e+07  7.4e-04  2.2e+00  4.4e+02  1.0e+00 ↗
[ Info:     46      0  5.6e-01  3.1e-01  1.6e-04  9.5e-01  2.2e-03  2.6e-01  2.4e-04  2.5e+02
[ Info: R2N: terminating with √(ξ1/ν) = 0.000157451624378494
R2N converged after 46 iterations to the solution x = [0.250018802831165, 0.0575101888238507]
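
NLPModelsModifiers.jl also provides an LBFGS variant. The sketch below simply swaps LSR1Model for LBFGSModel with the same wrapping pattern and the same keyword arguments as above; which approximation works better is problem-dependent.

# Alternative quasi-Newton approximation (sketch): LBFGSModel is also provided by NLPModelsModifiers.jl
f_model_lbfgs = LBFGSModel(f_model)
regularized_pb_lbfgs = RegularizedNLPModel(f_model_lbfgs, h)
out = R2N(regularized_pb_lbfgs, verbose = 100, atol = 1e-6, rtol = 1e-6)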

Finally, TRDH and R2DH are specialized for diagonal quasi-Newton approximations and should be used instead of TR and R2N, respectively, when such a model is employed. We demonstrate R2DH below; a sketch of the analogous TRDH call follows the R2DH run.

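# Model the smooth function with a spectral gradient (diagonal quasi-Newton) approximation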
f_model_sg = SpectralGradientModel(f_model)
regularized_pb_sg = RegularizedNLPModel(f_model_sg, h)

# Solve with R2DH
out = R2DH(regularized_pb_sg, verbose = 100, atol = 1e-6, rtol = 1e-6)
println("R2DH converged after $(out.iter) iterations to the solution x = $(out.solution)")
[ Info:   iter     f(x)     h(x)   √(ξ/ν)        ρ        σ      ‖x‖      ‖s‖ R2DH
[ Info:      0  1.0e+02  3.0e+00  3.1e+02 -2.5e+07  7.4e-04  2.2e+00  4.4e+02 ↗
[ Info:    100  2.9e+00  1.1e+00  6.4e+00 -8.8e+00  1.5e+01  8.3e-01  5.0e-01 ↗
[ Info:    175  5.7e-01  3.1e-01  2.3e-04  1.2e+00  4.4e+01  2.6e-01  1.2e-06
[ Info: R2DH: terminating with √(ξ/ν) = 0.0002310767182191461
R2DH converged after 175 iterations to the solution x = [0.24994242234051245, 0.05747188777393554]
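
TRDH is called in the same way as the other solvers in this tutorial. The sketch below assumes it accepts the same keyword arguments; its output is not reproduced here.

# Solve with TRDH using the same diagonal quasi-Newton model (keyword arguments assumed to match the solvers above)
out = TRDH(regularized_pb_sg, verbose = 100, atol = 1e-6, rtol = 1e-6)
println("TRDH converged after $(out.iter) iterations to the solution x = $(out.solution)")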