Differential Evolution: An alternative to nonlinear convex optimization
Understand the basics of Differential Evolution with applications in Python
Optimization is a kind of decision making, or more specifically, is one of the major quantitative tools in the machinery of decision making, where decisions have to be taken to optimize one or more objectives under some prescribed set of circumstances (Bilal et al., 2020).
Differential Evolution (DE) (Storn & Price, 1997) is an Evolutionary Algorithm (EA) originally designed for solving optimization problems over continuous domains. It has a simple implementation yet a great problem-solving quality, which makes it one of the most popular population-based algorithms, with several successful applications reported.
From its original conception, DE was designed to fulfill some requirements that have made it particularly useful:
- Ability to handle non-differentiable, nonlinear, and multimodal cost functions.
- Parallelizability to cope with computationally intensive cost functions.
- Ease of use: few control variables to steer the minimization. These variables should also be robust and easy to choose.
- Good convergence properties: consistent convergence to the global minimum in consecutive independent trials.
Throughout this article, we will see the basics of DE with applications to single-objective optimization problems – although it has several extensions to multi-objective optimization as well. It will be compared to traditional convex gradient-based algorithms to assess when each is more appropriate. All of this will be performed using the scipy.optimize Python module. Those interested can find the implementation details in my example notebook.
If you are not familiar with nonlinear optimization, I suggest reading my previous article Nonlinear programming: Theory and applications.
The downhill descent process was a good analogy back then… However, search spaces might get a little more confusing in this article.
The algorithm explained
Just as in nature, Evolutionary Operators operate on an evolutionary algorithm population attempting to generate solutions with higher and higher fitness. The three major operators associated with these algorithms are mutation, recombination (crossover), and selection (survival) (Coello, et al., 2007).
The basic structure of a DE algorithm is represented below, of which the main operators and their respective control parameters will be described next.
The algorithm starts by initializing a population based on a user-specified number of individuals N and the boundaries of each decision variable of the problem. Each individual corresponds to a vector of optimization variables. A choice of N between 5 and 10 times the number of decision variables might be a good start.
Individuals are assigned a fitness value, based on their corresponding objective function values and possibly constraint values. Originally DE has no rule for constraint handling, which was later the focus of several articles. A useful approach was proposed by Lampinen (2002), as it has presented competitive performance and requires no additional control parameter when being implemented. It is the method adopted in the scipy DE implementation.
The population then iterates through successive generations until some stopping criteria are met. In each of these iterations, new trial vectors are produced by operations known as mutation and crossover. The trial vectors are then compared to their corresponding parents of the same index and the best of each pair is passed to the next generation. Stopping criteria are usually based on improvements in the objective function and the number of generations.
Several reproduction schemes have been proposed for DE. Usually, they are denoted DE/x/y/z, in which x corresponds to the mutation parent selection scheme, y to the number of difference vectors, and z to the crossover strategy.
Probably the most popular mutation scheme is the DE/rand/1, represented by the equation below.
In which, v corresponds to a mutant vector of index i, and r1, r2, and r3 are three indexes mutually different and different from i. The parameter F is a control parameter user-specified, denoted mutation parameter or scale factor.
Some strategies might enforce exploitation over exploration depending on how the parent vectors are selected. A usual strategy that reinforces exploitation is the best/1 strategy, in which the base vector corresponds to the individual with the best fitness value in a population. In my experience this strategy has usually led to premature convergence, the reason why I usually avoid it.
The trade-off between exploitation versus exploration is also present in the choice of the mutation parameter F. For instance, when the emphasis on exploration is needed, such as in problems with discontinuous decision spaces, using higher values can be useful. Conversely, to emphasize exploitation, using lower values for limits can improve results.
Other operations usually performed in this stage are dither and jitter. Dither randomizes F individually for each mutant vector created based on a range of user-specified values. A good choice can be [0.3, 1.0] for a start. Jitter multiplies each component of each difference vector by a random value, thus adding rotation.
The most usual crossover strategy is the binomial crossover or just bin, which is described in the equation below.
In which u corresponds to a trial vector created combining elements of the corresponding mutant vectors v and target x. The parameter CR controls the probability of inheriting one attribute from each and an additional rule states that at least one attribute of u must be inherited from v to avoid duplicates.
According to Price et al. (2005) objective functions that perform well with low CR values are decomposable – can be written as a sum of one-dimensional functions, while those that require values close to one are not. Zaharie (2009) made a detailed study on the impact of crossover operators in DE which I recommend as further reading for those interested in algorithm details.
In my experience, most nonlinear real-world problems are non-separable, in which high CR values, such as 0.7-0.9, are a good choice. Exceptions include problems with periodical terms and strong multimodality in which the search might be favored along the independent coordinate axis. In such problems, choices as 0.2–0.5 might produce better results.
Let us see some applications next. In our implementations the _differentialevoluton function from scipy.optimize will be used.
Convex problems
In the first example, let us try the same objective function used in the previous article.
In Python code.
# Defining the objective function
def obj_fun(x):
return (x[0] - 0.5) ** 2 + 0.7 * x[0] * x[1]
+ 1.2 * (x[1] + 0.7) ** 2
# Defining the gradient function
def gradient_fun(x):
return np.array([2 * (x[0] - 0.5) + 0.7 * x[1],
0.7 * x[0] + 2 * 1.2 * (x[1] + 0.7)])
Which visually looks like this.
We will solve this problem using both the Quasi-Newton BFGS method and DE.
# Optimization using BFGS (gradient-based method)
sol_cvx = minimize(obj_fun, [2.5, 0.0], jac=gradient_fun,
method="BFGS")
# Otimization using Differential Evolution
sol_de = differential_evolution(
obj_fun,
bounds=[(-5., 5.), (-5., 5.)],
popsize=50, strategy="rand1bin",
mutation=(0.3, 1.0),
recombination=0.7,
tol=1e-8,
maxiter=200,
updating="deferred",
polish=False
)
Notice the first difference in specifying, rather than an initial estimation x0, bounds for the decision variables. These bounds are used to create the initial population.
The arguments popsize correspond to the N population size; strategy to the DE/x/y/z strategy; mutation to the F parameter; recombination to CR; tol to the stopping criterion in terms of improvement; and maxiter to the maximum number of generations. By setting the uptading as "deferred", I chose to use the original strategy that only updates the current population once per generation. The argument polish defines whether or not a local optimizer is used at each iteration over the best current solution, which I chose not to.
The solutions are basically overlapping in the true optimum.
The main drawback is that DE took 56 generations to converge using a population size of 50, therefore it has made several objective function evaluations, which can be computationally expensive. Conversely, BFGS took 6 iterations and 7 function evaluations.
Understanding the problem is crucial so that the algorithm chosen is effective and does not lead to unnecessary function evaluations. In the first problem, convex gradient-based algorithms are evidently a much more efficient choice than DE, although both lead to the very same final results.
Let us see how it could change…
Multimodal problems
What if there are several local optima in the search space? Convex gradient-based algorithms are local search methods, therefore they are likely to get stuck in these points. When local optima are not enough, DE is a much more efficient tool.
Let us add some multimodal terms to our objective function.
# In the nonconvex problem we add some periodical terms
def obj_fun_2(x):
A = np.array([[1, 0.3],
[0.3, 0.7]])
xt = x.dot(A)
rugosity_1 = (xt[0] * np.sin(15 * xt[0]) - 0.5) ** 2
+ (xt[1] * np.sin(15 * xt[1]) - 2) ** 2
rugosity_2 = np.sin(15 * x[0]) ** 2 + np.sin(25 * x[1]) ** 2
convex = obj_fun(x)
return 5 * rugosity_2 + rugosity_1 + convex
I find this image beautiful, although quite confusing, and gradient-based search directions are definitely leading us nowhere. Let us try and solve the problem to see.
sol_cvx_2 = minimize(obj_fun_2, [2.5, 0.0], method="BFGS")
sol_de_2 = differential_evolution(
obj_fun_2,
bounds=[(-5., 5.), (-5., 5.)],
popsize=50, strategy="rand1bin",
mutation=(0.2, 1.0),
recombination=0.7,
tol=1e-8, maxiter=200,
updating="deferred", polish=False)
Using the pre-specified initial estimation, BFGS has achieved an objective function of 12.12, much worse than the 1.613 achieved by DE (which is likely to be the global optimum).
And we could try several different initial estimations, which, in a two-dimensional problem, might lead us to satisfactory results. In this problem, I’ve tried 500 different initial estimations and achieved a top result of 1.855. However, the sparsity of high dimensional search spaces makes such a procedure very ineffective.
Engineering pressure vessel design
The next problem was proposed by Sandgren (1990) and it minimizes the cost of a pressure vessel design. The problem has two discrete variables x1 and x2, due to material supply standards, therefore it is, in its original formulation, nondifferentiable. However, in this section, we will at first solve a continuous real-valued variant of the problem and then solve the original.
The objective function is described by the following equation.
And the functional and boundary constraints by the following equations.
In Python code.
def f_vessel(x):
return 0.6224 * x[0] * x[2] * x[3]
+ 1.7781 * x[1] * x[2] ** 2
+ 3.1611 * x[0] ** 2 * x[3]
+ 19.84 * x[0] ** 2 * x[2]
def c1_vessel(x):
return - (0.0193 * x[2] - x[0])
def c2_vessel(x):
return - (0.00954 * x[2] - x[1])
def c3_vessel(x):
return - (750 * 1728 -
np.pi * x[2] ** 2 * x[3]
- 4/3 * np.pi * x[2] ** 3)
The first attempt was to solve the problem using SLSQP. Notice I have referred to the gradient of the objective function, which I have implemented in the example notebook. Here I decided to omit such definitions for being too long.
sol_cvx_vessel = minimize(
f_vessel, [2, 2, 50, 50],
jac=grad_vessel,
bounds=bounds_vessel,
constraints=cons_vessel,
method="SLSQP"
)
Unfortunately, due to the nonlinear nature of the third constraint, SLSQP could not find an adequate solution.
message: 'Positive directional derivative for linesearch'
However, DE was not the only suitable algorithm for the problem over continuous variables… Trust-region methods are also an alternative to line search methods, both using the concepts of iteratively taking steps in search directions defined by some rule. And both can use gradient information for defining the search directions. It could have been done by changing just the method argument. It would lead to a feasible point with a great objective function value.
sol_tr_vessel = minimize(
f_vessel, [2, 2, 50, 50],
jac=grad_vessel,
bounds=bounds_vessel,
constraints=cons_vessel,
method="trust-constr"
)
And now, let us implement DE to see how it performs.
sol_de_vessel = differential_evolution(
f_vessel, bounds=bounds_vessel,
constraints=cons_vessel,
popsize=50, strategy="rand1bin",
recombination=0.7, mutation=(0.3, 1.0),
maxiter=300, seed=12,
init='latinhypercube', polish=False
)
Although finding a feasible solution, DE (without polishment) could not produce objective function values as low as trust-constr, and still took more function calls.
Therefore, It can still be useful to try other convex algorithms before DE, especially if there is no evidence of either multimodality or non-differentiable terms.
Adding discrete variables
Adding discrete variables makes this problem non-differentiable. Therefore, it becomes a clear situation in which DE is expected to be necessary. Situations in which the objective function is mapped using tree-based models are not unusual in practice. These are non-differentiable problems as well, and DE provides a useful alternative for solving them.
Let us modify our objective function in the sense that it interprets continuous real variables x1 and x2 as discrete multiples of 0.0625.
def integer_x(x):
x[0] = int(x[0] / 0.0625 + 1) * 0.0625
x[1] = int(x[1] / 0.0625 + 1) * 0.0625
return x
def f_vessel_minlp(x):
x = integer_x(x)
return f_vessel(x)
And the implementation…
sol_de_minlp = differential_evolution(
f_vessel_minlp, bounds=bounds_vessel,
constraints=cons_vessel,
popsize=50, strategy="rand1bin",
recombination=0.7, mutation=(0.3, 1.0),
maxiter=300, seed=12,
init='latinhypercube', tol=1e-8, polish=False,
)
Which has led to the very same results reported by Lampinen & Storn (2004), thus it can be considered successful.
constr_violation: 0.0
fun: 7197.731584692868
Real x: [ 1.125 0.625 58.29011889 43.69286662]
Further reading
Differential Evolution is one of many population-based algorithms that can be very effective in solving optimization problems. I suggest interested readers explore Particle Swarm Optimization and Genetic Algorithms as two powerful alternatives.
Moreover, in this article, we have seen some applications of DE to single-objective problems. But what if we were in pursuit of multiple conflicting objectives? There are several extensions of DE to handle multi-objective optimization that can be very useful in these situations. Those interested can find the python library pymoode quite helpful. There is an overview available in this other Medium article:
Conclusions
In this article, the basic theoretical aspects of Differential Evolution have been explained and illustrated with practical implementation examples using scipy implementation. Convex, multimodal, and non-differentiable problems have been presented, in which DE had its performance compared to other optimization methods. The code used is entirely available for readers in this GIT repository.
References
Bilal, et al., 2020. Differential Evolution: A review of more than two decades of research. Eng. Appl. Artif. Intell., Volume 90, p. 103479.
Lampinen, J., 2002. A Constraint Handling Approach for the Differential Evolution Algorithm. Proceedings of the 2002 Congress on Evolutionary Computation. CEC’02, Volume 2, pp. 1468–1473.
Lampinen, J. & Storn, R., 2004. Differential Evolution. In: New Optimization Techniques in Engineering. Berlin, Heidelberg: Springer, pp. 123–166.
Price, K. V., Storn, R. M. & Lampinen, J. A., 2005. Differential Evolution: A Practical Approach to Global Optimization. 1st ed. Springer: Berlin.
Sandgren, E., 1990. Nonlinear integer and discrete programming in mechanical design optimization. Journal of Mechanical Design, 112(2), pp. 223–229.
Storn, R. & Price, K., 1997. Differential evolution–a simple and efficient heuristic for global optimization over continuous spaces. J. Glob. Optim., 11(4), pp. 341–359.
Share This Article
Towards Data Science is a community publication. Submit your insights to reach our global audience and earn through the TDS Author Payment Program.
Write for TDS