Nonlinear programming: Theory and applications
Gradient-based line search optimization algorithms explained in detail and implemented from scratch in Python
Optimization problems are usually divided into two major categories: Linear and Nonlinear Programming, which is the title of the famous book by Luenberger & Ye (2008). These categories are distinguished by the presence or not of nonlinear functions in either the objective function or constraints and lead to very distinct solution methods. Throughout this article, the focus will be on the second category.
Before algorithm details, we will go through some general optimization concepts to understand how to formulate an optimization problem. Once these concepts are defined, we will dive into convex unconstrained problems, in which we will see the general theory of local minimum and implement four line search algorithms: steepest descent, conjugate gradient, newton’s method, and quasi-newton (BFGS and SR1). At last, we will introduce constraints and apply scipy’s implementation of SLSQP to solve a constrained example.
For those interested in details, the code used in this article is completely available in this GIT repository.
Structure of an optimization problem
To formulate an optimization problem, one must define an objective f that is a function of a vector decision variables x and might be subject to some equality and inequality constraints, which are functions of x as well. This objective is usually defined in a minimization sense, therefore the goal is to find its lowest value respecting the constraints. Boundaries for each component of x might be explicit in the formulation when stating that x ∈ Ω, in which Ω is a set that defines limits for decision variables.
A general formulation can then be stated as the following.
There are several applications for nonlinear programming. Some of the most common are engineering design, control, data fitting, and economic planning. These applications usually share some attributes regarding problem structure that make convex optimization algorithms very effective. Understanding both what are these attributes and how the algorithms will interpret the problem can be very helpful in performing optimization tasks, from formulating the problem to selecting the most appropriate method to solve it. Therefore, the first thing to do is to understand what makes a solution optimal. Let us understand it better in the next section.
As a chemical engineer, I have created Jupyter notebooks with examples of maximizing o-xylene production in a catalytic reactor and calculating the composition of species in chemical equilibrium, for those interested in seeing some applications. Furthermore, I included an example of how to perform logistic regression from scratch using the algorithms implemented throughout this article.
Unconstrained problems: conditions to local optimality
In local optimization problems, we are looking for a solution better than its neighbors. In an unconstrained minimization problem, as there are no infeasibility rules, we look for a solution with a lower objective value than its neighbors. There are two necessary conditions for defining such a solution (Nocedal & Wright, 2006):
- If **x*** is a local minimizer and f is continuously differentiable in an open neighborhood of **x***, ∇f must be equal to zero.
- If the Hessian matrix of the objective f with respect to x, exists and is continuous in an open neighborhood of **x***, then the matrix ∇²f must be positive semidefinite.
In simpler terms, the slope of the objective function with respect to x is zero in a local optimum, and, when it changes, it goes up in any search direction. Visually, it looks like the image below.
Throughout this article, in the example problem, the objective function will be characterized as a quadratic parabolic function in x, defined as.
And visually, it looks like this.
In Python code.
def obj_fun(x):
return (x[0] - 0.5) ** 2 + 0.7 * x[0] * x[1]
+ 1.2 * (x[1] + 0.7) ** 2
def gradient_fun(x):
return np.array([2 * (x[0] - 0.5) + 0.7 * x[1],
0.7 * x[0] + 2 * 1.2 * (x[1] + 0.7)])
A general framework
Gradient-based (or descent) algorithms are characterized by an iterative process in which one starts at an initial point; determines, according to a fixed rule, a direction of movement; and then moves in that direction to a (relative) minimum of the objective function on that line. At the new point, a new direction is determined, and the process is repeated (Luenberger & Ye, 2008).
In minimization problems, this is analogous to a downhill descent process, in which gravity pulls an object towards a local geographical minimum. In the algorithms discussed in this article, the gradient of the objective function plays the role of gravity (or at least analogous to it).
There are two fundamental strategies for performing these iterations: line search, and trust region. In a sense, the line search and trust-region approaches differ in the order in which they choose the direction and distance of the move to the next iterate. Line search starts by fixing the direction and then identifying an appropriate distance. In trust region, we first choose a maximum distance, the trust-region radius, and then seek a direction and step that attain the best improvement possible subject to this distance constraint. If this step proves to be unsatisfactory, we reduce the distance measure and try again (Nocedal & Wright, 2006). In this article, line search methods will be addressed, which are described next in mathematical terms.
Let us define x as a vector of optimization variables, p as a search direction defined by some unknown rule, and α as a relative step size that gives satisfactory advance towards relative optima. Then, at each iteration k, we must update the current solution recursively as in the equation below.
The algorithm stops when, between two consecutive iterations, the improvements in the objective function, position in x, or both are lesser than a user-specified tolerance.
If this description reminds you of Neural Networks optimizers, such as Stochastic Gradient Descent, Adam, and RMSProp, it makes much sense. These algorithms update model parameters throughout iterations (batches) using gradient information in a minimization sense of the loss function. However, there is no line search, so α is defined by some rule associated with the learning rate and convergence. Furthermore, there is a momentum term that combines information from previous search directions with the current gradient values of each iteration, which shares some similarities with Conjugate Gradient.
Throughout this article, we will implement algorithms based on a generic framework: DescentAlgorithm, represented in the following code.
import numpy as np
from scipy.optimize import minimize, line_search, BFGS, SR1
from numdifftools import Gradient
class DescentAlgorithm:
def __init__(self, fun, gradient=None, hess=None, nd={},
wolfe_c1=1e-4, wolfe_c2=0.1, x_tol=1e-6,
f_tol=1e-6, max_iter=50, save_history=False):
self.fun = fun
if gradient is None:
self.gradient = Gradient(fun, **nd)
else:
self.gradient = gradient
self.hess = hess
self.wolfe_coefs = wolfe_c1, wolfe_c2
self.x_tol = x_tol
self.f_tol = f_tol
self.max_iter = max_iter
self.save_history = save_history
self.history = []
def optimize(self, x0, *args, **kwargs):
x0 = np.atleast_1d(x0).astype(float)
self.history = []
xk = x0.copy()
fk = self.fun(x0, *args, **kwargs)
gradk = self.gradient(x0, *args, **kwargs)
fc, gc = 1, 1
pk = self.prepare_initial_step(xk, fk, gradk, *args,
**kwargs)
advance_x, advance_f, advance_max = True, True, True
k = 0
if self.save_history:
self.history.append({"x":xk, "f":fk, "grad":gradk})
while (advance_x or advance_f) and (k <= self.max_iter):
alpha, fc_, gc_, fnew, fk, gradnew
= line_search(self.fun, self.gradient,
xk, sk, gradk, fk, args=args,
c1=self.wolfe_coefs[0],
c2=self.wolfe_coefs[1],
maxiter=15)
if alpha is None:
alpha = 1
fnew = self.fun(xk + alpha * sk, *args, **kwargs)
gradnew = self.gradient(xk + alpha * sk, *args,
**kwargs)
xnew = xk + alpha * pk
fc = fc + fc_
gc = gc + gc_
if gradnew is None:
gradnew = self.gradient(xnew, *args, **kwargs)
advance_f = abs(fnew - fk) > self.f_tol
advance_x = np.linalg.norm(xnew - xk) > self.x_tol
xk, fk, gradk, pk =
self.prepare_next_step(xk, fk, gradk, pk,
xnew, fnew, gradnew,
*args, **kwargs)
k = k + 1
if self.save_history:
self.history.append({"x":xk, "f":fk, "grad":gradk})
if np.linalg.norm(pk) < np.sqrt(np.finfo(float).eps):
self.message = 'Negligible step'
self.success = True
break
if not (advance_x or advance_f):
self.success = True
self.message = 'Tolerance reached'
elif k > self.max_iter:
self.success = False
self.message = 'Max iterations reached'
self.x = xk
self.f = fk
self.grad = gradk
self.fc = fc
self.gc = gc
self.result = {"x":xk, "f":fk, "grad":gradk, "iter":k,
"message":self.message,
"success":self.success}
def prepare_next_step(self, xk, fk, gradk, pk, xnew, fnew,
gradnew, *args, **kwargs):
pass
def prepare_initial_step(self, xk, fk, gradk, *args, **kwargs):
pass
Notice, as a general framework, DescentAlgorithm is still missing the implementation of how to choose a search direction, which will be specific for each algorithm. Moreover, the class Gradient from numdifftools is being imported to compute numerical derivatives if the gradient function is not provided.
In the framework above, _scipy.optimize.lineseach is being used to compute the relative step size at each iteration. It might have been too extensive to implement it from scratch in this article, but in the next section, the major concepts of line search will be presented.
The line search
There is a tradeoff when computing α as, although it is desirable to obtain the best solution in the search direction, it can lead to a large number of function evaluations, which is usually undesirable, as these functions might be complex and computationally expensive. Therefore the most common strategy to define α is iteratively by bracketing and interpolating until some convergence conditions are satisfied.
The most common convergence conditions are the Wolfe conditions. First, let us define a univariate function of φ(α) that computes the objective functions along the search direction according to step size.
The Wolfe conditions state that the value of the objective function must be smaller than a function of the original point (controlled by the parameter c1) and so must be the curvature of the objective function (controlled by the parameter c2). The suggestion of Nocedal & Wright (2006) is to use 1e-4 for c1, and define c2 as equal to 0.9 for Newton and Quasi-Newton methods, while 0.1 for Conjugate Directions and Steepest Descent.
Visually it is represented below, with a starting point in black, points that violate Wolfe conditions in red, and adequate step size in green.
Using gradient information: Steepest Descent
The steepest descent direction -_∇_f is the most obvious choice for search direction for a line search method. It is intuitive: among all the directions we could move from, it is the one along which f decreases most rapidly (Nocedal & Wright, 2006).
So, let us define the SteepestDescent algorithm as one that chooses the negative of the gradient at one point as search direction.
class SteepestDescent(DescentAlgorithm):
def prepare_next_step(self, xk, fk, gradk, pk, xnew, fnew,
gradnew, *args, **kwargs):
return xnew, fnew, gradnew, -gradnew
def prepare_initial_step(self, xk, fk, gradk, *args, **kwargs):
return -gradk
And let us solve the problem using our framework.
steepest = SteepestDescent(obj_fun, gradient=gradient_fun,
save_history=True)
steepest.optimize(x0)
The solutions are stored in the property result of the instance steepest and we can access the history of steps in the property history. Visually, the steps toward optimum look like this.
Notice, not necessarily the gradient information of the current iteration is sufficient to lead solutions toward the local minimum. In this example, the algorithm took 13 iterations to achieve the tolerance of 1e-6. In the next example, we will make an algorithm that converges much faster by using information from the previous search directions to improve the search and reduce the number of iterations necessary.
Conjugate Gradient
Conjugate direction methods can be regarded as being somewhat intermediate between the method of steepest descent and Newton’s method. They are motivated by the desire to accelerate the typically slow convergence associated with steepest descent while avoiding the information requirements associated with the evaluation (Luenberger & Ye, 2008).
These methods were designed based on a purely quadratic function, therefore, in our optimization problem, the performance is expected to be great.
The rule for the search direction at iteration k using the Fletcher–Reeves method is the following.
In Python code, it can be done as below.
class ConjugateGradient(SteepestDescent):
def prepare_next_step(self, xk, fk, gradk, pk, xnew, fnew,
gradnew, *args, **kwargs):
return xnew, fnew, gradnew, -gradnew +
pk * gradnew.dot(gradnew) / gradk.dot(gradk)
Solving the same problem, ConjugateGradient led to amazing results, converging to the local optimum within only two iterations. This was expected, because, as I have mentioned, the algorithm was designed based on properties of quadratic functions, and, in these functions, it is expected to converge within n iterations, being n the number of decision variables.
In the following section, we will see things can improve even more by using the power of curvature information when defining the search directions.
Newton’s method
The idea behind Newton’s method is that the function f being minimized is approximated locally by a quadratic function, and this approximate function is minimized exactly. Thus near xk we can approximate f by the truncated Taylor series (Luenberger & Ye, 2008).
The truncated Taylor series is represented by the equation below.
Therefore, the search direction p, in the iteration of index k, is defined as follows.
Which, in the case of quadratic functions, leads to the exact optimizer of the objective function f.
Let us implement it in our framework.
class Newton(DescentAlgorithm):
def __init__(self, fun, gradient=None, hess=None, nd={},
wolfe_c1=1e-4, wolfe_c2=0.9,
x_tol=1e-6, f_tol=1e-6, max_iter=50,
save_history=False):
if hess is None:
raise TypeError("Must provide hessian")
super().__init__(fun, gradient=gradient, hess=hess, nd=nd,
wolfe_c1=wolfe_c1, wolfe_c2=wolfe_c2,
x_tol=x_tol, f_tol=f_tol,
max_iter=max_iter,
save_history=save_history)
def prepare_next_step(self, xk, fk, gradk, pk, xnew, fnew,
gradnew, *args, **kwargs):
H = self.hess(xnew, *args, **kwargs)
return xnew, fnew, gradnew, np.linalg.solve(H, -gradnew)
def prepare_initial_step(self, xk, fk, gradk, *args, **kwargs):
H = self.hess(xk, *args, **kwargs)
return np.linalg.solve(H, -gradk)
And now, before solving the problem, we must define a function of x that returns the Hessian matrix ∇²f(x).
def hess_fun(x):
return np.array([[2., 0.7],
[0.7, 2. * 1.2]])
And it really takes only one iteration…
Of course, other objective functions rather than quadratic would take more iterations to be optimized, but still, it is a powerful method, with beautiful math behind it.
In the next section, we will see some strategies to avoid explicit computations of the Hessian matrix.
Quasi-Newton
Quasi-Newton methods provide an attractive alternative to Newton’s method in that they do not require computation of the Hessian and yet still attain a superlinear rate of convergence. In place of the true Hessian, they use an approximation, which is updated after each step to take account of the additional knowledge gained during the step. The updates make use of the fact that changes in the gradient provide information about the second derivative of the objective along the search direction (Nocedal & Wright, 2006).
In these methods, the search direction is defined with almost the same rule as in Newton’s method. However, rather than using the true Hessian, we will use the approximate matrix B.
The matrix B is initialized by the identity matrix multiplied by some constant and then recursively updated at each iteration. Two of the most common methods for updates are BFGS and SR1. Both are implemented in scipy using the white label structure of the class HessianUpdateStrategy. In our implementations, we will use either BFGS, SR1, or a custom HessianUpdateStrategy in scipy’s structure. The algorithm goes as the following.
class QuasiNewton(Newton):
def __init__(self, fun, gradient=None, hess=None, nd={},
wolfe_c1=1e-4, wolfe_c2=0.9,
x_tol=1e-6, f_tol=1e-6, max_iter=50,
save_history=False):
if hess is None:
hess = BFGS(exception_strategy="damp_update",
min_curvature=0.2)
super().__init__(fun, gradient=gradient, hess=hess, nd=nd,
wolfe_c1=wolfe_c1, wolfe_c2=wolfe_c2,
x_tol=x_tol, f_tol=f_tol,
max_iter=max_iter,
save_history=save_history)
def prepare_next_step(self, xk, fk, gradk, pk, xnew, fnew,
gradnew, *args, **kwargs):
self.hess.update(xnew - xk, gradnew - gradk)
H = self.hess.get_matrix()
return xnew, fnew, gradnew, np.linalg.solve(H, -gradnew)
def prepare_initial_step(self, xk, fk, gradk, *args, **kwargs):
self.hess.initialize(xk.shape[0], "hess")
H = self.hess.get_matrix()
return np.linalg.solve(H, -gradk)
The equations for BFGS and SR1 updates are described below.
In which, y corresponds to the change of gradient between iteration k and k-1, whereas s corresponds to the change in x between the same two iterations.
In solutions using the BFGS update, an interesting thing happened. Using the recommended value for c2=0.9, a relatively small search step was accepted, and, although the search direction pointed toward the exact minimum, the solutions took one more iteration to reach it. However using the more restrictive value of c2=0.1, it converged in exactly two iterations, as expected. These results are presented below.
After understanding how unconstrained optimization works, in the next section, let us dive into Constrained Optimization.
Constrained optimization and Lagrange multipliers
In this section some of the theoretical fundamentals of constrained optimization are discussed, but, if you are interested just in hands-on, I recommend you to skip it and go straight to the implementation example.
As described when formulating a general optimization problem, there are two possible types of constraints: equality and inequality constraints.
A fundamental concept that provides a great deal of insight as well as simplifies the required theoretical development is that of an active constraint. An inequality constraint is said to be active at a feasible point x if it is equal to zero, and inactive if smaller than zero. By convention, we refer to any equality constraint as active at any feasible point (Luenberger & Ye, 2008).
The search space, in constrained optimization problems, is limited by the active constraints at a given point x. This leads to a first-order condition analogous to the unconstrained equivalent. The gradient of the objective function projected in the tangent hyperplane of the feasible search space must be equal to zero in a local optimum. In the equation below, consider h including both equality and active inequality constraints in the optimum, and y a search direction on the tangent hyperplane.
Therefore, in the local minimum, f(**x***) is a linear combination of the gradients of active constraints, which leads to the introduction of Lagrange multipliers, and so, the Lagrangian function.
In which, λ and μ are vectors of the corresponding Lagrange multipliers of equality and inequality constraints. In this formulation, inequalities are stated as g(x)≤0, which leads to the condition of their corresponding Lagrange multipliers μ≤0. Therefore, the condition denoted complementary slackness is achieved for inactive inequality constraints by setting μ=0, whereas for active constraints g(x)=0.
And the analogous conditions First-order and Second-order optimality necessary conditions for convex constrained optimization are:
- ∇𝓛(x***, λ, μ)** with respect to x _m_ust be equal to zero with complementary slackness respected.
- ∇²𝓛(x***, λ, μ)** with respect to x _m_ust be positive semidefinite.
For those interested in the theoretical aspects, I recommend reading the works of Nocedal & Wright (2006) and Luenberger & Ye (2008). But let us now dive into an application…
A constrained problem and SLSQP
In this section, we will introduce a nonlinear constraint to our optimization problem, that makes the unconstrained optimum infeasible. In the example, it will be defined by a geometric circle in x with a squared radius of three and centered in the origin.
In scipy.optimize minimize function, inequalities must be defined as greater than or equal to zero. Therefore, the example constraint must be implemented as below.
def cons_fun(x):
return (x ** 2).sum() - 3
def cons_grad(x):
return 2 * np.array(x)
And now our problem looks like this.
We will solve this problem using the scipy implementation of SLSQP, an algorithm that uses information of the Lagrange multipliers to define the search directions. For this, we will call the minimize function from scipy.optimize.
sol_cons = minimize(
obj_fun, x0, jac=gradient_fun,
constraints={"type":"ineq", "fun":cons_fun, "grad":cons_grad},
method="SLSQP",
)
And the solution is represented below.
For those interested in details, Sequential Least Squares Programming (SLSQP) is an algorithm proposed by Dieter Kraft (1988) using a primal-dual strategy that solves iteratively quadratic subproblems by a least-squares approach. It has many similarities with Quasi-Newton methods, however replacing the Hessian of the objective function with the Hessian of the Lagrangian function with respect to x and adding constraints that limit the feasible search space by linear approximation of the constraints. It uses an active set strategy in which new active constraints are either added based on blocking conditions or removed to the active set based on their Lagrange multipliers, and the step length is defined using a merit function. Details on sequential quadratic programming can also be found at Boggs & Tolle (1996) and Nocedal & Wright (2006).
Further reading
Some problems require different approaches rather than classical nonlinear optimization. If the objective and all constraints can be formulated as linear functions of the decision variables one should resort to Linear Programming.
Still, some nonlinear problems require different approaches. For instance, nonconvex, multi-modal, nondifferentiable, and multi-objective problems present some interesting challenges. Swarm and Evolutionary computing are usually effective alternatives for such problems. Therefore, for those interested in exploring these topics, I suggest doing some research on Particle Swarm Optimization, Genetic Algorithms, and Differential Evolution and their applications.
You can find an overview of Differential Evolution with some interesting applications in the article below.
Conclusions
In this article, the relevant theoretical aspects of convex nonlinear optimization have been explained in detail and illustrated with practical implementation examples. Unconstrained gradient-based algorithms have been implemented from scratch, while an established implementation of a constrained algorithm was applied to an example problem. The code used throughout the article is entirely available for readers in this GIT repository.
References
Boggs, P. T. & Tolle, J. W., 1996. Sequential Quadratic Programming. Acta Numerica, Volume 4, pp. 1–51.
Fogler, H. S., 1999. Elements Of Chemical Reaction Engineering. 3rd ed. Upper Saddle River(N.J.): Prentice Hall PTR.
Kraft, D., 1988. A Software for Sequential Quadratic Programming. s.l.:Wiss. Berichtswesen d. DFVLR.
Luenberger, D. G. & Ye, Y., 2008. Linear and Nonlinear Programming. 3rd ed. Stanford: Springer.
Nocedal, J. & Wright, S. J., 2006. Numerical Optimization. 2nd ed. New York: Springer New York.
Share This Article
Towards Data Science is a community publication. Submit your insights to reach our global audience and earn through the TDS Author Payment Program.
Write for TDS