Constr Opt by LSTMs MILCOM21 Submitted
Abstract—Many technical issues for communications and computer infrastructures, including resource sharing, network management and distributed analytics, can be formulated as optimization problems. Gradient-based iterative algorithms have been widely utilized to solve these problems. Much research focuses on improving the iteration convergence. However, when system parameters change, the iterative methods must be rerun to obtain a new solution. Therefore, it is helpful to develop machine-learning solution frameworks that can quickly produce solutions over a range of system parameters.

We propose here a learning approach to solve non-convex, constrained optimization problems. Two coupled Long Short-Term Memory (LSTM) networks are used to find the optimal solution. The advantages of this new framework include: (1) a near-optimal solution for a given problem instance can be obtained in very few iterations (time steps) during the inference process, (2) the learning approach allows selections of various hyper-parameters to achieve desirable tradeoffs between the training time and the solution quality, and (3) the coupled-LSTM networks can be trained using system parameters with distributions different from those used during inference to generate solutions, thus enhancing the robustness of the learning technique. Numerical experiments using a dataset from Alibaba reveal that the relative discrepancy between the generated solution and the optimum is less than 1% and 0.1% after 2 and 12 iterations, respectively.

Index Terms—Constrained optimization, LSTM, optimization, SDC, stochastic optimization

This research was partly sponsored by the U.S. Army Research Laboratory and the U.K. Ministry of Defence under Agreement Number W911NF-16-3-0001. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Army Research Laboratory, the U.S. Government, the U.K. Ministry of Defence or the U.K. Government. The U.S. and U.K. Governments are authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

I. INTRODUCTION

Using learning techniques to solve optimization problems has attracted great attention in recent years. In [1] and [2], supervised learning techniques are modified and enhanced to predict the optimal solutions for given optimization problem parameters. Because ground-truth labels are required for supervised learning, training a well-performing prediction model for optimization needs sufficient problem parameters, as well as the optimal solutions of these optimization problems for the training process, which requires extra effort to generate. Besides supervised learning, reinforcement-learning and meta-learning techniques are also used for solving optimization problems. For instance, the authors in [3] propose to model the optimization solving process as a Markov decision process and employ a deep neural network as the decision policy, while the authors in [6] propose to embed the optimization problem into a meta-learning problem and employ a Long Short-Term Memory (LSTM) network as the optimizer.

Motivated by the success of these prior works, researchers have started to focus on more specific types of optimization problems. For instance, the authors of [4] propose a framework, named Twin-L2O, consisting of two LSTMs for solving minimax optimization problems. Different from the Twin-L2O, we propose here to use two Coupled LSTM networks, referred to as CLSTMs, for solving non-convex, constrained optimization problems with user-defined objective and constraint functions. Since the CLSTMs and the Twin-L2O are derived for solving different types of problems, different considerations are required in the design of the loss functions and the overall workflows, although the proposed CLSTMs and the Twin-L2O both employ two LSTM networks. Unlike the minimax optimization problems solved by the Twin-L2O, the constrained optimization problem we consider here is given as follows:

    min_θ f(θ)
    s.t. h(θ) ≤ 0.   (1)

By introducing a Lagrange multiplier λ, a Lagrange function can be formed for the optimization problem in (1):

    J(θ, λ) = f(θ) + λh(θ).   (2)

The dual optimization problem of (1) is

    max_λ J(θ∗, λ)
    s.t. θ∗ = argmin_θ J(θ, λ), λ ≥ 0.   (3)

According to duality theory, the dual optimization problem (3) has the same optimal solution as the original (primal) problem (1) under the condition that the duality gap
is zero. Therefore, our objective is to find the optimal θ and λ for minimizing and maximizing the function J, respectively. Note that the dual optimization problem (3) can be regarded as a special minimax optimization problem. Thus, the proposed CLSTMs can also be applied to solve minimax optimization problems.

We also formulate here a resource-allocation problem in the cloud cluster as a constrained optimization problem and apply the proposed CLSTMs to solve it with practical data from Alibaba [7]. Our evaluation results demonstrate that the CLSTMs can achieve 99% optimality accuracy after 2 iterations, and the mean relative discrepancy between the generated solution and the optimum is less than 0.1% after 12 iterations. Furthermore, we explore and demonstrate the impact of various hyper-parameters of the CLSTMs on performance. Specifically, our numerical results show that selecting these hyper-parameters can serve as a mechanism to achieve desirable tradeoffs between the training time and the solution quality. Finally, we conduct an experiment to validate and show robustness, where the CLSTMs can be trained using system parameters with distributions different from those used during inference to generate solutions.

The rest of the paper is organized as follows. Section II presents the details of the coupled LSTM networks. Section III describes the resource-allocation problem under study and the formulated constrained optimization problem. Section IV presents the evaluation of the coupled LSTM networks using the cluster trace [7]. Finally, Section V discusses related research and Section VI concludes the paper.

Fig. 1: Computation graph of the coupled LSTMs for iteration k, where ∇θ,k = ∇θ J(θk, λk) and ∇λ,k = ∇λ J(θk+1, λk).

Algorithm 1: Training Process of the CLSTMs in a Frame
1: for iteration k = (i − 1)K, (i − 1)K + 1, . . . , iK do
2:   for (J(θ, λ), θ, λ) in the training data set do
3:     Calculate the gradient of function J w.r.t. θ: ∇θ J(θk, λk);
4:     Generate the update step size gk by: [gk; hk+1] = m(∇θ J(θk, λk), hk, φi);
5:     Update θ using (5);
6:     Calculate the gradient of function J w.r.t. λ: ∇λ J(θk+1, λk);
7:     Generate the update step size ĝk by: [ĝk; ĥk+1] = m̂(∇λ J(θk+1, λk), ĥk, φ̂i);
8:     Update λ using (7);
9:   end for
10: end for
11: Calculate the loss functions L(φi) and L̂(φ̂i) using (8) and (9), respectively;
12: Update the parameters φi and φ̂i using the gradients ∇φi L(φi) and ∇φ̂i L̂(φ̂i), respectively;

Algorithm 2: Training Procedure
1: for epoch = 1, 2, . . . do
2:   Randomly initialize the values of θ, λ for each optimization problem J(θ, λ);
3:   Randomly initialize the hidden states hk, ĥk for m and m̂, respectively;
4:   for frame i = 1, 2, . . . , I do
5:     Run Algorithm 1;
6:   end for
7: end for

II. PROPOSED CLSTMS

In this section, we propose the CLSTMs for solving constrained optimization problems.

First, we describe the inference process, in which the CLSTMs are used to find the optimal θ and λ for a given Lagrange function J by iterations. The overall workflow is shown in Fig. 1, where iterations indexed by k progress from the bottom upward. In each iteration k, the update step sizes of θ and λ are denoted by gk and ĝk, respectively. Specifically, θ is updated from iteration k to k + 1 by the following equations:

    [gk; hk+1] = m(∇θ J(θk, λk), hk, φ∗),   (4)
    θk+1 = θk + gk,   (5)

where φ∗ denotes the optimal parameters of m, ∇θ J(θk, λk) is the gradient of function J with respect to (w.r.t.) θ, and hk, hk+1 are the hidden states of m in iterations k and k + 1, respectively. Then λ is updated according to:

    [ĝk; ĥk+1] = m̂(∇λ J(θk+1, λk), ĥk, φ̂∗),   (6)
    λk+1 = λk + ĝk,   (7)

where φ̂∗ denotes the optimal parameters of m̂, ∇λ J(θk+1, λk) is the gradient of function J w.r.t. λ, and ĥk, ĥk+1 are the hidden states of m̂ in iterations k and k + 1, respectively.

We define K consecutive iterations as a frame. The training process is to update the parameters of m and m̂ at the end of
each frame. Specifically, at the end of frame i, the parameters φi are updated to minimize the loss function:

    L(φi) = E[ Σ_{k=(i−1)K+1}^{iK+1} wk J(θk, λk) ],   (8)

where wk are weighting factors and the sum of all wk equals 1. Meanwhile, we update the parameters φ̂i to minimize the loss function:

    L̂(φ̂i) = −E[ Σ_{k=(i−1)K+1}^{iK} ŵk J(θk+1, λk) + ŵiK+1 J(θiK+1, λiK+1) ],   (9)

where ŵk are weighting factors and the sum of all ŵk equals 1. The expectation is needed because the objective functions f(θ) used for training are sampled from a set of random functions, while the constraint functions h(θ) are chosen from another set of random functions. That is, the form of the random functions is fixed, while the function parameters are randomly chosen from some distributions.

The detailed training process in a frame is illustrated in Algorithm 1, which works as follows. At the beginning of a frame, all variables and Lagrange multipliers are updated over K iterations (Lines 1-10). In every iteration, for each sample (J(θ, λ), θ, λ), the variables θ are updated (Lines 3-5) before the Lagrange multipliers λ are updated (Lines 6-8). Finally, at the end of a frame, the parameters φi and φ̂i are updated (Lines 11-12). Furthermore, to reduce the computational complexity of computing the gradients ∇φi L(φi) and ∇φ̂i L̂(φ̂i), we assume that the gradients of J w.r.t. θ and λ are independent of φ and φ̂, respectively (i.e., ∂∇θJ/∂φi = 0, ∂∇λJ/∂φ̂i = 0).

A. Training Procedure

To obtain sufficient sampling and experience about how to optimize near and far from the optimum, a group of I consecutive frames is defined as an epoch, where the optimization variables (i.e., λ and θ) and the hidden states (i.e., hk and ĥk) are randomly initialized at the beginning of each epoch. The detailed training procedure is provided in Algorithm 2.

B. Parameters Shared in the CLSTMs

For each LSTM in the CLSTMs, directly feeding the vector of gradients w.r.t. every variable into the fully connected input layer of the LSTM would require a rather large LSTM network if there are thousands of variables. Consequently, two large LSTM networks would impose a tremendous burden on computation and storage. To reduce the computation and storage requirements, the coordinate-wise LSTM structure proposed in [6] is adopted here. Specifically, the gradients of function J w.r.t. every variable are fed into the LSTM successively so that they can share the parameters of one LSTM network. Thus, the number of LSTM parameters remains small even when there are thousands of variables. Meanwhile, the hidden states for each variable are independent, so that the LSTM can generate different update step sizes for different variables even when their gradients are equal.

C. Scale Function

Note that the Lagrange multiplier λ is required to be non-negative in the dual optimization problem (3). Thus, the scale function ψ(λ), as defined in (10), is used to ensure that the values of λ are larger than or equal to 0. Using the scale function ψ(λ) has the following two advantages. First, compared with the absolute value |λ| or the ReLU function max(0, λ), the derivative of the scale function ψ(λ) exists everywhere. Second, compared with the square function λ², the scale function ψ(λ) retains the linear property except in the range [−1, 1]. These two advantages help the CLSTMs converge and obtain better performance, as confirmed in our experiments.

    ψ(λ) = −2λ − 1, if λ < −1;  λ², if −1 ≤ λ ≤ 1;  2λ − 1, if λ > 1.   (10)

III. PROBLEM FORMULATION

A. System Model

The resource-allocation problem is to allocate cluster resources to competing tasks to maximize the sum of task utilities. Specifically, there are N tasks competing for a type of resource, and the amount of available resource is denoted by C. For each task n, let rn, Rn and un(rn) denote the amount of resource allocated to it, its resource requirement, and its utility function given the allocated resource rn, respectively. Moreover, each task n must be allocated a minimum amount of resource to provide good service, while it also cannot receive more than a maximum amount of resource in order to guard against a few tasks occupying a large amount of resources. By introducing two parameters α > 1 and β < 1, the maximum and minimum resource requirements of task n are denoted by αRn and βRn, respectively.

B. Optimization Problem

Using these notations, we can derive the optimization problem from the resource-allocation problem:

    max_{r1,...,rN} Σ_{n=1}^{N} un(rn)   (11a)
    s.t. Σ_{n=1}^{N} rn ≤ C   (11b)
         rn ≥ βRn, ∀n   (11c)
         rn ≤ αRn, ∀n.   (11d)

The first constraint (11b) ensures that the amount of allocated resources for all tasks does not exceed the amount of available resources, while the constraints (11c) guarantee that the minimum resource requirements of the tasks are satisfied. The constraints (11d) ensure that the amount of resource allocated to each task does not exceed its maximum resource requirement.
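Before specializing the CLSTMs to this problem, the alternating primal-dual iteration that the learned optimizers m and m̂ are trained to accelerate (cf. (4)-(7)) can be illustrated on a one-dimensional instance of (1). The sketch below is our own illustration, not the trained networks: plain fixed-step gradient updates stand in for m and m̂, and the toy objective, step size and iteration count are arbitrary choices.

```python
# Toy instance of (1): minimize f(theta) = theta^2 subject to
# h(theta) = 1 - theta <= 0, whose constrained optimum is theta* = 1
# with Lagrange multiplier lambda* = 2.
f = lambda th: th ** 2
h = lambda th: 1.0 - th

theta, lam = 0.0, 0.0
eta = 0.05                                # fixed step size standing in for m and m-hat
for _ in range(2000):
    grad_theta = 2.0 * theta - lam        # d/dtheta of J = f + lam * h, cf. (2)
    theta += -eta * grad_theta            # descent step in theta, cf. (4)-(5)
    grad_lam = h(theta)                   # d/dlambda of J, at the updated theta
    lam = max(0.0, lam + eta * grad_lam)  # ascent step with lambda >= 0, cf. (6)-(7)

print(round(theta, 3), round(lam, 3))     # converges near theta* = 1, lambda* = 2
```

Replacing the two fixed-step updates with the step sizes produced by the trained LSTMs m and m̂ yields the inference workflow of Fig. 1.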
C. Solving the Problem with the CLSTMs

Let θ denote the vector of variables [r1, . . . , rN] and define the objective function f(θ) and the constraint function h(θ) as:

    f(θ) = − Σ_{n=1}^{N} un(rn)

    h(θ) = [ Σ_{n=1}^{N} rn − C,  βR1 − r1, . . . , βRN − rN,  r1 − αR1, . . . , rN − αRN ].   (12)

Then we have an optimization problem whose form is similar to (1):

    min_θ f(θ)
    s.t. h(θ) ≤ 0.   (13)

The Lagrange function can be derived by introducing a Lagrange multiplier vector λ = [λ0, . . . , λ2N]:

    J(θ, λ) = f(θ) + λh(θ).   (14)

Finally, by substituting this Lagrange function J(θ, λ) and the vectors θ and λ into Algorithm 2, we can apply Algorithm 2 to train the CLSTMs for solving this constrained optimization problem. It is worth noting that we have not made specific assumptions about the forms of the functions f(θ) and h(θ) in (13), as long as the problem (13) satisfies the zero duality gap.

IV. EXPERIMENT

A. Setup

The Alibaba cluster trace [7] presents the resource utilization of 4,000 machines and the resource requirements of the batch workloads. In the Alibaba cluster, the batch workloads are described by the 'Job-Task-Instance' structure, where each job has multiple tasks and each task contains multiple instances. Furthermore, the resource requirements of all instances in a given task are identical. In our experiments, we allocate the available CPU in the cluster, in terms of utilization in percentage, to tasks, where the resource requirement of a task is the aggregate resource requirement of all its instances. We employ a cluster of 10 machines randomly selected from the Alibaba cluster trace to provide CPU resource to competing tasks in all optimization problem scenarios considered in the following experiments. In addition, for each problem scenario, we randomly select 10 tasks, and each task is allowed to have at most 100 instances. For each task n, its utility function given the allocated CPU utilization rn is given by

    un(rn) = 1 / (1 + e^{−µn(rn − Rn)}),   (15)

where µn is a constant randomly selected from the uniform distribution in the range [0.5, 1) and Rn is the CPU utilization requirement of task n. Moreover, α and β are set to 1.4 and 0.7, respectively, for all tasks.

In the experiments, our algorithm is implemented with Python and TensorFlow 2.1 and evaluated on an Ubuntu 20.04 LTS server with an NVIDIA TITAN Xp graphics card. Each LSTM of the CLSTMs has two layers, and each layer has 20 neural units. During the training process, the CLSTMs are trained with 10,240 optimization problem scenarios. The training process consists of 30 epochs, where each epoch has 50 frames (I = 50) and each frame consists of 10 iterations (K = 10). We set wk, ∀k, to 1 and the learning rate in frame i to 0.01 × 0.97^{(i−1)/300}. For the evaluation, the trained CLSTMs are used to solve 1,000 optimization problem scenarios with randomly selected parameters. For each problem scenario, the optimization (control) variables are updated iteratively using the trained CLSTMs, and the solutions are saved after 1,000 iterations.

We employ a gradient-based method to produce the optimal solutions, which serve as a baseline for comparison with the CLSTMs. Given the optimization problem (3), this method updates the variables θ and λ iteratively. In each iteration, this method first finds the optimal θ for the given λ through gradient descent and then updates λ with the found optimal θ by gradient ascent. After the iterations converge, the optimal solutions are saved.

To measure the performance of the proposed CLSTMs approach, we define the relative accuracy of a solution as:

    α = 1 − |x̂ − x| / |x|,   (16)

where x̂ and x are the optimal values of the objective function found by the CLSTMs and the gradient-based method, respectively. Moreover, we define the mean relative accuracy as the relative accuracy averaged over the 1,000 problem scenarios solved by the trained CLSTMs in the evaluation (inference) process.

B. Results and Analysis

1) Comparison with the baseline: In this experiment, we demonstrate that the CLSTMs can find near-optimal solutions in a few iterations and obtain extremely high relative accuracy in the end.

Fig. 2a shows the mean relative accuracy obtained by the CLSTMs and the baseline in the first 200 iterations during the evaluation process. Note that the mean relative accuracy after the first iteration is omitted in Fig. 2a so that the difference between the two curves can be made clear. Although not shown in the figure, the mean relative accuracy for the CLSTMs and the baseline after the first iteration is in fact 0.972 and 0.729, respectively. We can further observe that the mean relative accuracy for the CLSTMs achieves 0.9929 after 2 iterations and reaches 0.9993 after 12 iterations, while the baseline still presents obvious fluctuation. This confirms that the CLSTMs are much quicker in producing accurate and stable solutions than the conventional gradient-based method. Furthermore, we can see that the mean relative accuracy
achieves 0.9995 at the end of 1,000 iterations. Although it is possible to improve the solution quality (i.e., from 0.9993 to 0.9995) with more iterations in the evaluation process, the improvement is quite marginal.

Fig. 2: (a) The mean relative accuracy over iterations and (b) the complementary cumulative distribution function (CCDF) of relative accuracy.

Fig. 2b shows the complementary cumulative distribution function (CCDF) of the relative accuracy of the solutions generated by the CLSTMs after 1,000 iterations. From the figure, we can observe that 99% of the solutions have relative accuracy larger than 0.997 and that the minimal relative accuracy is 0.983.

Clearly, these numerical results validate that the CLSTMs can find a near-optimal solution quickly (e.g., achieving 0.993 and 0.9993 for the mean relative accuracy after 2 and 12 iterations, respectively) and that the relative accuracy of the solutions found by the CLSTMs after enough iterations is practically equal to 100% (e.g., achieving 0.9995 for the mean relative accuracy after 1,000 iterations).

2) Impact of neural network structures: The purpose of this aspect of our experiment is to study the impact of the number of neural units in each LSTM layer on the performance of the CLSTMs. Toward this goal, we set the number of neural units in each LSTM layer to 10, 15, 20, 25 and 30 as five different settings. Consequently, the corresponding numbers of parameters in the CLSTMs are 2662, 5792, 10122, 15652 and 22382 in the five settings, respectively. Comparing the optimal solutions generated by the CLSTMs of the five different settings after 1,000 iterations, we find that their mean relative accuracies are similar (i.e., the difference between them is less than 0.001) within the range of (0.998, 0.999). Therefore, this confirms that the CLSTMs can perform well with different neural network structures for the considered problem scenarios.

3) Impact of K: In this experiment, we show the impact of the parameter K in (8) and (9) on the performance of the CLSTMs. Specifically, we train five CLSTMs with the same training procedure except that the K value is set to 1, 3, 5, 7 and 10, respectively.

Fig. 3a presents the mean relative accuracy and the standard deviation of relative accuracy when K is set to the five different values. We can see from the figure that the mean relative accuracy improves and the standard deviation decreases with the increase of K. This is because the loss function with a larger value of K captures a longer future impact of a single iteration. Thus, the CLSTMs learning to minimize such a loss function can find better update step sizes for minimizing the objective functions. On the other hand, the training process for K equal to 1, 3, 5, 7 and 10 consumes 14, 31, 51, 69 and 113 minutes of CPU time, respectively. The reason for the longer training time with increased K is that the training procedure contains a fixed number of frames and the number of iterations in a frame increases as the K value grows.

Based on these observations from Fig. 3a, we can improve the CLSTMs performance by increasing the value of K, although it will consume more time for training.

4) Impact of wk: This experiment aims to present the impact of wk on the performance. We fix K to 10 and employ three different strategies to set the weights. Specifically, the random strategy chooses weights with random values sampled from a uniform distribution in [0, 1), the decay strategy sets the weights to be exponentially decayed with the decay factor equal to 0.9, and the uniform strategy sets all weights to 1. Then, each strategy normalizes the chosen weights so that the sum of all weights chosen by the strategy equals 1. For the training process, three CLSTMs with weights chosen by the three different strategies are trained for 30 epochs. Then, these trained CLSTMs are applied to solve an identical set of 1,000 problem scenarios, and the generated solutions are saved after 1,000 iterations.

Fig. 3b presents the value of the loss function (8) with the different weight-setting strategies throughout the training epochs. First of all, we can observe that the curve of the decay strategy becomes relatively "flat" by the 20th epoch, while the curve of the uniform strategy still shows obvious fluctuation until the last few epochs. We can further observe that the three curves in this figure reach a similar value of the loss function at the end of the training process after 30 epochs. This explains why the trained CLSTMs generate similar performance (i.e., a very narrow range of y-axis values) in Fig. 3c.

Specifically, Fig. 3c shows the mean relative accuracy and the standard deviation of the relative accuracy with the three different weight settings. We can see that the CLSTMs trained with randomly selected weights produce the highest standard deviation and the lowest mean relative accuracy, while the CLSTMs trained with weights selected by the uniform strategy generate the lowest standard deviation and the highest mean relative accuracy. Therefore, the strategy of uniform weights provides the most desirable performance. We can further observe that the mean relative accuracy with the decay strategy decreases by 0.0008 and the standard deviation increases by 0.0002 when compared with the uniform strategy. Since this difference is very small, the decay strategy and the uniform strategy are comparable with respect to performance.

Based on the observations of Fig. 3c and Fig. 3b, we find that the exponentially decayed weights can help the CLSTMs converge, despite slight performance degradation.

5) Robustness: This experiment aims to show the robustness of a trained CLSTM. We use two datasets that are generated from different distributions to train and evaluate the CLSTMs.
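As a sketch of this setup, the utility-function parameters µn for the two datasets might be drawn as follows (our own illustration; the discrete training set and the continuous evaluation range follow the experiment detailed below):

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily for the sketch
N = 10                           # tasks per problem scenario

# Training scenarios: mu_n sampled from a small discrete set.
mu_train = rng.choice([0.5, 0.6, 0.7, 0.9], size=N)
# Evaluation scenarios: mu_n sampled from the continuous range [0.5, 1.0),
# a distribution deliberately different from the training one.
mu_eval = rng.uniform(0.5, 1.0, size=N)

print(set(float(m) for m in mu_train) <= {0.5, 0.6, 0.7, 0.9})   # True
print(bool(np.all((mu_eval >= 0.5) & (mu_eval < 1.0))))          # True
```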
Fig. 3: (a) The mean relative accuracy and the standard deviation of relative accuracy for different values of K, (b) the value of the loss function with different weight-setting strategies for wk over epochs, (c) the mean relative accuracy and the standard deviation of relative accuracy with different weight-setting strategies, and (d) the CCDF of relative accuracy evaluated using system parameters with distributions different from those used during training.
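For reference, the three weight-setting strategies compared in Fig. 3b and Fig. 3c can be sketched as follows (our own illustration; we use K weights per frame for simplicity, and the decay direction is our reading of the description above):

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily
K = 10                           # iterations per frame, as in the experiments

# Weights w_k for the loss (8), chosen by one of three strategies
# and then normalized so that they sum to 1.
strategies = {
    "random":  rng.uniform(0.0, 1.0, size=K),   # i.i.d. uniform in [0, 1)
    "decay":   0.9 ** np.arange(K),             # exponential decay, factor 0.9
    "uniform": np.ones(K),                      # all weights equal
}
weights = {name: w / w.sum() for name, w in strategies.items()}

for name, w in weights.items():
    print(name, float(np.round(w.sum(), 6)))    # each normalized sum is 1.0
```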
Specifically, for each utility function (15), the parameters µn are randomly selected from the set {0.5, 0.6, 0.7, 0.9} in the training dataset, while they are randomly sampled from the uniform distribution in [0.5, 1.0) in the evaluation dataset.

Fig. 3d shows the CCDF of relative accuracy for the uniformly distributed dataset used in the CLSTMs evaluation. From this figure, we see that the minimal relative accuracy is 98.6% in Fig. 3d, which is higher than the result of 98.3% in Fig. 2b. On the other hand, the 99th percentile of the relative accuracy is 99.1%, which represents a small reduction of 0.6% when compared with the corresponding result of 99.7% accuracy in Fig. 2b. This slight degradation of the 99th percentile of the relative accuracy is intuitively consistent because the combinations of system parameters (i.e., µn, Rn, C) in the evaluation dataset are far more random than the combinations included in the training dataset. Nevertheless, these results show the robustness of the trained CLSTMs, as they can still find the optimal solutions for the problem scenarios even when the system parameters are drawn from distributions different from those used for training.

V. RELATED WORK

The term 'learning to optimize' can generally refer to using learning techniques to solve optimization problems. A possible approach is to predict optimal solutions for these optimization problems using supervised learning techniques. For instance, the authors in [1] propose to use a deep neural network to approximate the unknown nonlinear mapping between the parameters of signal processing problems and the optimal solutions. The authors in [2] propose a specific deep neural network structure to predict the optimal solution for constrained optimization problems based on supervised deep learning techniques and demonstrate that the average prediction error evaluated on a realistic system is as low as 0.2%. Besides the supervised learning techniques, other learning techniques are also applied to solve optimization problems, such as deep reinforcement learning [3] and meta-learning [6]. Meanwhile, learning techniques are used to solve other types of optimization problems. For example, the authors in [8] focus on solving the Bayesian swarm optimization problem, while the authors in [4] and [9] propose to solve minimax optimization problems by learning. However, none of the aforementioned research considers solving constrained optimization problems without the help of optimal labels.

VI. CONCLUSION

In this paper, we have proposed the CLSTMs to solve nonconvex, constrained optimization problems. Furthermore, we have formulated a resource-allocation problem and applied the new CLSTMs to solve it using practical data from Alibaba. Experiments have been conducted to study the performance of the proposed CLSTMs. Considering 1,000 scenarios of the resource-allocation problem, our numerical results have shown that (1) the trained CLSTMs can find near-optimal and stable solutions in very few iterations (e.g., achieving 99% and 99.9% optimality accuracy after 2 and 12 iterations, respectively), (2) the CLSTMs include a number of selectable hyper-parameters to trade off the training time for the solution quality, and (3) the proposed approach is robust, as the trained CLSTMs can produce excellent solutions for problems with system parameters drawn from distributions different from those used in the training process.

REFERENCES

[1] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu and N. D. Sidiropoulos, "Learning to Optimize: Training Deep Neural Networks for Interference Management," IEEE Transactions on Signal Processing, vol. 66, no. 20, pp. 5438-5453, 2018.
[2] F. Fioretto, T. W. K. Mak and P. Van Hentenryck, "Predicting AC Optimal Power Flows: Combining Deep Learning and Lagrangian Dual Methods," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34(01), pp. 630-637, 2020.
[3] K. Li and J. Malik, "Learning to optimize," Proceedings of the 5th International Conference on Learning Representations (ICLR), 2017.
[4] J. Shen, X. Chen, H. Heaton, T. Chen, J. Liu, W. Yin and Z. Wang, "Learning a minimax optimizer: A pilot study," Proceedings of the 9th ICLR, 2021.
[5] S. Nazemi, K. K. Leung and A. Swami, "Distributed Optimization Framework for In-Network Data Processing," IEEE/ACM Transactions on Networking, vol. 27, pp. 2432-2443, 2019.
[6] M. Andrychowicz et al., "Learning to learn by gradient descent by gradient descent," Advances in Neural Information Processing Systems, pp. 3981-3989, 2016.
[7] Alibaba Inc., "Alibaba production cluster data v2018," 2018. [Online]. Available: //github.com/alibaba/clusterdata/tree/v2018.
[8] Y. Cao, T. Chen, Z. Wang and Y. Shen, "Learning to optimize in swarms," Advances in Neural Information Processing Systems, pp. 15018-15028, 2019.
[9] Y. Xiong and C.-J. Hsieh, "Improved Adversarial Training via Learned Optimizer," European Conference on Computer Vision, pp. 85-100, Springer, 2020.