
13
Case Studies in Data Optimization Using Python

Jahangir Alam
AMU, India

CONTENTS
13.1  Introduction                                                        255
13.2  Optimization and Data Science                                       258
13.3  Literature Review                                                   259
13.4  Taxonomy of Tools Available for Optimization                        260
      13.4.1  Modeling Tools                                              261
      13.4.2  Solving Tools                                               261
      13.4.3  Justification for Selecting OR-Tools                        261
13.5  Case Studies Python Prerequisites                                   264
13.6  Case Studies: Solving Optimization Problems Through Python          264
      13.6.1  Case Study 1: Product Allocation Problem                    265
      13.6.2  Case Study 2: The Transportation Problem                    268
      13.6.3  Case Study 3: The Assignment Problem                        270
13.7  Conclusions                                                         274
References                                                                275

13.1 Introduction
Statistics, probability, and linear algebra are the topics that any newcomer to the field of
data science and machine learning (ML) is recommended to learn. High-performance computing
(HPC), machine learning, data science, and big data are buzzwords these days. In order to
provide companies with a competitive advantage in the modern virtual world, data scientists
are discovering new ways to leverage the big data available to them. These scientists are
generally equipped with a combination of skills, which include programming, soft skills,
and analytics (optimization, machine learning, and statistical techniques). It would not be
out of context to mention here that in the present digital world companies regard data
scientists as their wildcards, or perhaps gold miners who dig for chunks of gold underground.
For a successful career in these fields, the value of a strong foundation in these topics is
beyond argument. However, the topic of data optimization, though often underrated, is equally
important to anyone willing to pursue a successful career in these fields. The importance


of optimization as an essential step in every major social, economic, business, and personal
decision taken by an individual, a group of individuals, software decision agents, or
intelligent machines cannot be overstated.
The ingredients of AI, big data, and machine learning algorithms are incomplete without
optimization. The optimization process starts with formulating a cost function and finishes
with maximizing or minimizing the formulated function under the given constraints using one
optimization procedure or another. The selection of an appropriate optimization procedure is
what affects the accuracy of the results. The application area of optimization is so wide
that it is difficult to find a real-life situation where it cannot be applied. Due to such a
broad application area, optimization has been widely researched in academia as well as
industry.
An optimization problem is a problem in which we maximize or minimize a real-valued function
by carefully choosing input values from an allowed set and computing the value of the
function. In other words, when we consider optimization, we always strive to find the best
solution. Optimization is an essential step in modeling and problem solving related to AI
and allied fields such as machine learning and data science. A large number of data science
and machine learning problems ultimately reduce to optimization problems. As an example,
consider the approach of a data analyst who solves a machine learning problem for a large
dataset. First of all, the analyst expresses the problem using a suitable group of prototypes
(called models) and transforms the information into a format acceptable to the chosen group.
The next step is to train the model. This is done by optimizing the variables of the
prototype with regard to the selected regularization function or loss function, by solving a
core optimization problem. The process of selecting and validating the model requires the
core optimization problem to be solved several times. Through these core optimization
problems, research in machine learning, data science, and mathematical programming is
interrelated. At one end, mathematical programming provides the definition of optimality
conditions, which guide the analyst in deciding what constitutes an optimal solution; at the
other end, algorithms of mathematical programming provide data analysts with the procedures
required to train large groups of models.
The general form of an optimization problem is:

    min  f(z)
     z

    subject to: gi(z) ≤ 0,   i = 1, 2, 3, …, n
                hj(z) = 0,   j = 1, 2, 3, …, p
                z ∈ Δ

Where:

• Δ is an m-dimensional set of real numbers, integers, or positive semi-definite matrices.
• f : Δm → Δ is referred to as the objective function to be minimized over the m-variable
  vector z.
• gi(z) ≤ 0 are called inequality constraints and hj(z) = 0 are called equality constraints,
  and
• n ≥ 0, p ≥ 0.
• The problem is referred to as an unconstrained optimization problem if n = p = 0.
• The above general form defines what is known as a minimization problem. There also exists
  a maximization problem, which can be obtained by negating the objective function. Solving
  an optimization problem means determining z ∈ Δ that minimizes/maximizes the objective
  function f subject to the inequality constraints gi(z) ≤ 0 and the equality constraints
  hj(z) = 0.
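
To make the general form concrete, here is a minimal sketch (not part of the original
chapter; it assumes SciPy is installed, whereas the case studies below rely on OR-Tools)
that minimizes f(z) = (z1 − 1)² + (z2 − 2)² subject to the single inequality constraint
g(z) = z1 + z2 − 1 ≤ 0:

# Minimal illustration of min f(z) subject to g(z) <= 0 with SciPy's SLSQP method.
# SciPy expects inequality constraints as fun(z) >= 0, so g(z) <= 0 is passed as -g(z) >= 0.
from scipy.optimize import minimize

f = lambda z: (z[0] - 1) ** 2 + (z[1] - 2) ** 2   # objective function f(z)
g = lambda z: z[0] + z[1] - 1                     # inequality constraint g(z) <= 0

result = minimize(f, x0=[0.0, 0.0], method='SLSQP',
                  constraints=[{'type': 'ineq', 'fun': lambda z: -g(z)}])
print(result.x, result.fun)   # z close to (0, 1), objective value close to 2

The same problem can be turned into a maximization problem by negating f, exactly as
described in the list above.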

Any problem that starts with the question “What is best?” can almost always be formu-
lated as an optimization problem (Boyd and Vandenberghe, 2004). For example:

• Which is the best route from Aligarh to Prayagraj?
• Which method should be adopted to produce shoes so as to maximize the profit?
• Which is the finest college for my son?
• What is the best fuel for my car?
• How should oil fields be allocated to bidding companies to maximize the profit?

To help formulate the solutions to such problems, researchers have defined a framework into
which solvers fit the questions. This framework is referred to as a model. The essential
feature of a model is that it has constraints and a function, referred to as the objective,
which must be achieved under the given constraints. In other words, the constraints are
obstacles in the way of achieving the objective. If a solver is capable of clearly stating
the constraints and the objective function, he is close to a model. Figure 13.1 illustrates
the solution procedure for an optimization problem.
There are different classes of optimization problems. For a particular class of optimization
problem, the solution procedure refers to an algorithm that leads to a solution (up to some
desired accuracy) for a given problem (an instance of the class) of that class. Efforts at
developing viable algorithms for various classes of optimization problems, developing
software packages to solve them, and analyzing their properties have been under way since
the late 1940s. The viability of various algorithms depends significantly on factors like
the particular form of the constraint and objective functions, the number of constraints and
variables, and special features such as sparsity. If each constraint function of the problem
depends on only a small portion of the variables, the problem is referred to as sparse.
Surprisingly, the general optimization problem is hard to solve even when the constraint and
objective functions are smooth (e.g., polynomials) (Boyd and Vandenberghe, 2004). So attempts
to solve the general optimization problem inevitably involve certain kinds of compromise,
such as not finding the exact solution, long execution times, and so on. There are, however,
a few exceptions to this general rule. Efficient algorithms do exist for particular classes
of problems and can reliably solve sufficiently large problems with thousands of constraints
and variables. Linear programming and least squares problems belong to those
classes of optimization problems.

FIGURE 13.1
Steps to Solve an Optimization Problem

Convex optimization is also an exception to the general rule of solving optimization
problems (Agrawal et al., 2018). Efficient algorithms do exist
for convex optimization problems that can reliably solve the large optimization problems
efficiently. General non-convex optimization problems are proved to be NP-hard (Boyd
and Vandenberghe, 2004).
In the optimization field, the emphasis is shifting from model-based optimization to
data-driven optimization. In this approach, data is the main source around which the
optimization problem is formulated. The size of the optimization problem grows as the size
of the data to be processed through the problem grows. This leads to a significant increase
in the solution time and the complexity of the problem. In other words, large optimization
problems involving big datasets require substantial interaction with a human solver to find
a feasible solution and require a longer solution time.
The rest of the chapter is organized as follows. Section 13.2 discusses optimization in the
context of data science and machine learning. Section 13.3 presents a review of related
research. Section 13.4 presents various tools that are used to solve optimization problems
and justifies the author's choice of a specific tool (Google OR-Tools). Section 13.5
presents the Python prerequisites for running the models formulated in this chapter.
Section 13.6 presents detailed case studies along with the modeling process, code, and
results. Section 13.7 concludes the chapter.

13.2 Optimization and Data Science


The present era is driven by social media, big data, data analytics, AI, machine learning,
deep learning, and IoT (Internet of Things), along with high-performance computing. This is
the reason that almost every industry is on the big data adoption curve. The most valuable
asset for most businesses today is the data they capture and retain. In this highly
professional environment, the effective use of data can lead to better decision-making in
business and other fields. Another aspect of this scenario is that if businesses fail to
optimize their data, instead of gaining anything significant from their large data stores
they will only spend their precious resources and time digging through the data.
Organizations are becoming more data dependent by the day, and this dependence is rapidly
increasing. The data may come from various sources, may be in different formats, or may be
entirely unstructured. In most cases it is also inaccurate, inconsistent, and redundant.
These variances make the data difficult to handle for organizations, which then struggle to
get the relevant information in a suitable form. This indicates that there is a need to
optimize the data. Data optimization means collecting all the data at your disposal and
managing it in a way that maximizes the speed and comprehensiveness with which critical
information can be pulled out, analyzed, and utilized.
As far as machine learning (and deep learning) is concerned, all machine learning algorithms
can be viewed as solutions to optimization problems. Interestingly, even in cases where the
original machine learning technique has a basis derived from another field, for example
chemistry, physics, or biology, one can still interpret the machine learning algorithm as
the solution to some optimization problem (a minimal illustration follows the list below).
A basic understanding of optimization helps in:

• Deeply understand the working of machine learning algorithms.


• Rationalize the working of an algorithm. That means if we get a result that we want to
  interpret and we have a deep understanding of optimization, we will be able to see why we
  got that result.
• And, at an even higher level of understanding, we might be able to develop new algorithms
  ourselves.
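
As a minimal illustration of this point (this sketch is not part of the original chapter,
and unlike the Python 2.7 setup used for the case studies it assumes a recent Python 3 and
NumPy environment), ordinary least-squares regression can be read directly as an
optimization problem: training amounts to minimizing the squared-error loss over the weight
vector, here by plain gradient descent.

# "Training" a linear model = minimizing the loss L(w) = ||Xw - y||^2 over w.
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(0.0, 1.0, 50)])  # bias column + one feature
y = X @ np.array([1.0, 3.0]) + 0.1 * rng.normal(size=50)       # toy data: y ~ 1 + 3x + noise

w = np.zeros(2)
for _ in range(2000):
    grad = 2.0 * X.T @ (X @ w - y)   # gradient of the squared-error loss
    w -= 0.01 * grad                 # one gradient-descent step
print(w)                             # close to the generating coefficients [1, 3]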

13.3 Literature Review


Based on the objective and constraint functions, a taxonomy of optimization problems
exists. There are many classes of mathematical optimizations: linear, semi definite, qua-
dratic, integer, nonlinear, semi-infinite, geometric, fractional, goal, and so on. As an exam-
ple, a linear mathematical optimization has a linear constraint and objective functions. The
NEOS optimization guide (The NEOS, 2020) and the glossary of mathematical optimiza-
tion (MPG, 2020) provide complete descriptions of these problems. Each class of optimiza-
tion problems is a diverse research field with wide-ranging mathematical background and
procedures.
Surrogate-assisted optimization techniques are employed to provide solutions to single- or
multi-objective computationally expensive optimization problems (Chugh et al., 2017; Jin,
2011). Computationally expensive problems are those for which evaluating the constraint
and/or objective functions takes longer than a reasonable time during simulated experiments.
In optimization problems involving large data, the time complexity is high not because of
the evaluation of the objective and/or constraint functions but because of the large size of
the data. In surrogate-driven optimization, surrogate functions are trained using a small
set of expensive function evaluations. These surrogate functions produce approximate
solutions but are computationally inexpensive (Chugh et al., 2017; Jin, 2011).
For mathematical background and algorithms related to nonlinear optimization, several
good resources are available (Nocedal and Wright, 1999; Bertsekas, 2004; Bazaraa, Sherali,
and Shetty, 2006). Convex optimization including semi-definite optimization is covered in
(Boyd and Vandenberghe, 2004); Diamond and Boyd, 2016). Goberna and Lopez (1998)
provide semi-infinite algorithms and mathematical background. Nemhauser and Wolsey
(1999) provide information about integer and combinatorial optimization.
Researchers have noticed the connection between machine learning models and optimization
models, and this field is constantly advancing. The use of mathematical optimization in the
field of machine learning has led to advanced research in this area. In the area of neural
networks, researchers have gone from backpropagation (Hinton and Williams, 1986) to
exploiting unconstrained nonlinear optimization (Bishop, 1996). Backpropagation worked fine,
so researchers studied it as gradient descent to gain deeper insight into its properties
(Mangasarian and Solodov, 1994). Advances in kernel methods (Cortes and Vapnik, 1995) have
made mathematical optimization terms like duality, Lagrange multipliers, quadratic programs,
and so on more familiar to machine learning students. To exploit the mathematical
optimization tree in more depth, with special focus on convex optimization, machine
learning, and data science, researchers are working on novel methods and models. As a result
of the advances in mathematical optimization, a rich set of ML models is being explored
without much worry about the algorithms (Bergstra et al., 2015). Conversely, machine
learning and data science have inspired


advances in mathematical optimization. The optimization problems emerging from AI, data
mining, data science, machine learning, and deep learning are far larger than the problems
that had previously been reported in the literature.
The relationship between data science, machine learning, and optimization becomes
complicated when machine learning mixes two things, i.e., methods and modeling. In such
cases, machine learning is more like operations research (OR). Historically, mathematical
optimization is a branch of OR. OR is concerned with modeling a system, while optimization
analyzes and solves that model. Both kinds of analysts (i.e., OR or ML) formulate real-world
problems using a model, derive the core problem, and solve it using an optimization model.
So OR and machine learning analysts face the same kinds of issues, and it is no surprise
that both can explore the same set of tools (Radin, 1998).
As in machine learning, mathematical optimization also has a large number of problems for
benchmarking. Benchmark performance profiles measure the speed of algorithms, referred to as
the performance of an algorithm (Dolan and More, 2002). Karush-Kuhn-Tucker optimality
conditions (KKT, 2020) are applied to measure the quality of a solution. Quality of solution
is a function of the amount of deviation from the constraints and of the objective value.
These solution-quality metrics are generally not reported in the machine learning
literature. It has been observed that a small adjustment or tuning of the model can lead to
much better solutions (Sonnenburg, Schafer, and Scholkopf, 2006; Shalev-Shwartz and Singer,
2006). In Sonnenburg, Schafer, and Scholkopf (2006) and Shalev-Shwartz and Singer (2006),
the authors reformulated the models, and the new procedures decompose the problem into a
collection of simpler, familiar problems that can be solved easily.
In some cases, machine learning models are made convex. This is done by choosing an
appropriate definition of the system boundary within which certain parameters are treated as
fixed. For a fixed ridge parameter, ridge regression is a convex unconstrained quadratic
problem. The cross-validation procedure (Golub and von Matt, 1997) brings the ridge
parameter inside the boundary, and the problem becomes nonconvex.
Artificial intelligence and data science are broad fields that encompass miscellaneous
techniques, measures of success, and objectives. One branch of these fields is concerned
with finding viable solutions to some well-known problems in the field of optimization
(Blank and Deb, 2020). This chapter introduces the reader to the art of developing models
for some well-known optimization problems and to the science of implementing these models in
Python. As pointed out earlier, the intention of the author is not to help the reader become
a skillful theoretician but a skillful modeler. Therefore, little of the mathematical theory
of optimization is discussed. Instead, the subject is presented through some case studies in
the related field.

13.4 Taxonomy of Tools Available for Optimization


Over the years, a sizeable number of specialized languages have been designed by researchers
in the field of mathematical optimization. Figure 13.2 proposes a taxonomy of these
languages and lists the languages or software tools available in each category of the
taxonomy. As shown in Figure 13.2, optimization tools can be classified into two categories:
modeling tools and solving tools. Whereas modeling languages provide specific vocabulary,
formal constructions, and grammar for specifying models, solving languages take as input
models programmed in certain modeling languages and provide the solution.

FIGURE 13.2
Taxonomy of Optimization Tools

This section briefly introduces some prominent tools in each category and justifies why we
have chosen a Python-based approach over the other approaches available for modeling and
solving optimization problems.

13.4.1 Modeling Tools


Table 13.1 summarizes various modeling tools available for optimization problems.

13.4.2 Solving Tools


Table 13.2 summarizes various solving tools available for optimization problems.

13.4.3 Justification for Selecting OR-Tools


From subsection 13.4.2, it is clear that a model is formulated in a modeling language and
then fed to a solver (a different language) to get the results. This works because there
exists a parser between the modeler and solver languages, which translates the modeler
language code into a format known to the solver. Figure 13.3 shows a parser between a
modeler and a solver. If the parser doesn't translate into the format known to a specific
solver, that solver can't be used with the modeling language. This is a major drawback of
the modeler-and-solver approach. In addition, there are very few modelers and solvers that
integrate with high-level languages like Python or R. If one of these is

TABLE 13.1
Summary of Various Modeling Tools
S.No. Tool Description

1. AMPL To promote rapid development and reliable results, the AMPL supports the entire life
cycle of optimization modeling (AMPL, 2020). It supports a high-level algebraic
representation of optimization models, which is close to the ways people think about
the models. It provides special tools for modeling large scale optimization problems. A
command line language for analyzing and debugging of models and a debugging tool
for manipulation optimization strategies and data are also provided with the system.
AMPL provides APIs for C, C++, MATLAB, Java, R, Python, and so on to ensure easy
integration.
2. GAMS GAMS is a high-level language that supports both optimization and mathematical
programming (GAMS, 2020). It provides a language compiler for analyzing and
debugging the models and several associated solvers. Real-world optimization
problems can quickly be transformed into computer code using GAMS modeling
language. Its compiler puts the model in a format that can easily be understood by
associated solvers. As many solvers support the GAMS format, it provides users with
the flexibility of testing their models on various solvers.
3. MiniZinc MiniZinc is an open-source and free framework for modeling constraint
optimization problems (The MiniZinc, 2020). It can be used to model constraint optimization
problems in a solver-independent high-level language, taking advantage of a large library of
predefined constraints. Models formulated with MiniZinc are compiled into another language
referred to as FlatZinc. FlatZinc is a solver input language and is understood by a large
number of solvers.
4. GMPL It stands for GNU mathematical programming language. GMPL is a modeling language
intended for describing mathematical programming models (GLPK, 2020). To develop
a model in GMPL, a high-level language is provided to the user. The model consists of
data blocks and a set of statements defined by the user. A program referred to as model
translator analyzes the user-defined model and translates it into internal data
structures. This process is referred to as translation. The translated model is submitted
to the appropriate solver for getting the solution of the problem.
5. ZIMPL ZIMPL is a relatively small language (ZIMPL, 2020). ZIMPL facilitates formulating
the mathematical model of a problem as a (mixed-)integer mathematical or linear program. The
output is generated in .mps or .lp file format, which can be read and solved by an MIP or LP
solver.
6. OPL Optimization Programming Language (OPL) is an algebraic modeling language. It
facilitates an easier and shorter coding mechanism compared to a general-purpose
programming language (The IBM ILOG, 2020). A part of the CPLEX (IBM, 2020)
software package, it is well supported by IBM through its ILOG CPLEX and ILOG
CPLEX-CP optimizers. OPL supports integer/(mixed-)integer, constraint, and linear
programming.

compatible, the other may not be supported. These constraints limit the use of a
modeler-and-solver pair. Another aspect of this incompatibility is that a large number of
modelers support only some classes of mathematical optimization problems. From his past
experience, the author has learnt that the use of specialized modeler and solver languages
should be avoided, and one should instead use a high-level language, e.g., C, C++, Python,
or R, interfaced with a library that supports multiple solvers. Google's Operations Research
Tools (OR-Tools) fit this idea well. OR-Tools is a well-structured, comprehensive library
that offers a user-friendly interface. It effectively supports constraint programming and
has special routines for network flow problems. In this chapter the author will demonstrate
only a very small portion of this encyclopedia of optimization.


TABLE 13.2
Summary of Solving Tools
S.No. Tool Description

1. CLP CLP, also referred to as COIN-OR CLP, is part of COIN-OR (Computational
Infrastructure for Operations Research). It is an open-source mathematical programming
solver written in C++. COIN-OR is run by the educational, non-profit COIN-OR Foundation
(COIN-OR, 2020) and is hosted by the Institute for Operations Research and the Management
Sciences (INFORMS). Many peer-reviewed journals in the field of optimization use this tool
to cross-check the results claimed by researchers.
2. ECLiPSe ECLiPSe is open-source software. It is used especially for the cost-effective
development and deployment of applications related to scheduling, planning, resource
allocation, transportation, timetabling, and so on (constraint programming) (ECLiPSe, 2020).
It is an ideal tool for teaching combinatorial problem solving, e.g., constraint
programming, modeling, and mathematical programming. It supports several constraint solvers,
a high-level modeling and control language, and libraries, and it interfaces easily with
third-party software.
3. CPLEX Informally referred to as CPLEX, IBM ILOG CPLEX Optimization Studio is an
award-winning optimization studio. In 2004, work on CPLEX was awarded an impact prize by the
Institute for Operations Research and the Management Sciences (INFORMS). The name CPLEX
comes from the fact that it implements the well-known simplex method in the C programming
language. It now supports other types of mathematical optimization and interfaces with
languages other than C. CPLEX can solve very large linear programming problems using either
the dual or primal variant of the well-known simplex method or the barrier interior point
method.
4. GLOP This refers to Google's linear optimization solver. GLOP is the primary linear
optimization solver of the well-known OR-Tools (OR-Tools, 2020). According to Google, it is
memory efficient, fast, and numerically stable. The author has used OR-Tools from Python to
solve the case study scenarios presented in this chapter.
5. Gecode Gecode is a free and open-source C++ toolkit for solving constraint satisfaction
problems. It stands for Generic Constraint Development Environment (GECODE, 2020). It is in
fact a library that is extensible and modular, and it provides a state-of-the-art constraint
solver. According to the developers, Gecode is open, comprehensive, well-documented,
parallel, efficient, portable, and tested.
6. Gurobi Gurobi is a commercial solver whose developers claim it is the fastest solver on
Earth (GUROBI, 2020). It supports various optimization problems such as quadratically
constrained programming, linear programming, mixed integer linear programming, and so on.
7. GLPK It stands for GNU Linear Programming Kit (GLPK, 2020). It is a set of procedures
written in ANSI C and is used to solve large-scale linear programming, mixed-
integer programming, and other related problems. It is supported in the form of
a callable library.
8. SCIP SCIP stands for Solving Constraint Integer Programs (SCIP, 2020). Developed at the
Zuse Institute Berlin, Germany, it is currently one of the fastest free solvers for mixed
integer nonlinear programming (MINLP) and mixed integer programming (MIP). It is provided
as a C callable library.


FIGURE 13.3
Parser Between Modeler and Solver

OR-Tools have been awarded four gold medals in the 2019 MiniZinc Challenge, the
international constraint programming competition (OR-Tools, 2020). Some other impor-
tant features of OR-Tools are listed below (Bodnia, 2020):

• Stability and Continuous Development: OR-Tools are continuously updated, bugs are fixed,
  and new features are added by a dedicated team of programmers.
• High Performance: Complex calculations are performed using multithreaded algorithms
  optimized for the purpose. This makes it possible to get results faster without acquiring
  sophisticated hardware.
• Flexibility: Best results can be obtained with minimal expense on infrastructure.
• Resource Utilization: Guarantees the best use of available resources by exploiting
vacant space or idle time.

Looking at the above properties, the author has selected Google's OR-Tools for the proposed
case studies.

13.5 Case Studies Python Prerequisites


Python has become a popular choice of programmers for both optimization and data analytics
(Zegard and Paulino, 2015). This section briefly guides the reader through downloading and
installing the Python packages required to perform the case studies presented in this
chapter. The author assumes that Python is already installed on your system. Considering
that the required libraries are better supported by Python 2, the author has used Python
2.7.x, where x stands for any version released later than Python 2.7.10. Table 13.3 shows
the step-by-step process to install the required Python packages or libraries.

TABLE 13.3
Python Commands to Install Required Packages

Step 1: Upgrade pip
Launch the Windows command prompt as an administrator and type the following to upgrade pip:
    python -m pip install --upgrade pip
Step 2: Install Packages
To install any package using pip, launch the Windows command prompt as an administrator and
use the following command (general form):
    python -m pip install package_name
Step 3: OR-Tools Installation
The command to install OR-Tools is as follows:
    python -m pip install ortools
OR
    !pip install ortools
(for users working with Jupyter Notebook or the Google Colab platform)
Note: For installing packages, two options, namely conda and pip, are available. These case
studies use the Python Packaging Authority's (PyPA) recommended tool pip for installing
packages from the Python Package Index (PyPI). With pip, Python software packages are
installed as wheels or source distributions. pip is already installed with all versions of
Python after 2.7.9.
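
As a quick sanity check after installation (this snippet is not from the chapter), importing
the linear solver wrapper confirms that OR-Tools is available; the two-argument CreateSolver
call mirrors the form used in the chapter's listings and may differ in other OR-Tools
versions:

# Minimal check that the OR-Tools linear solver wrapper imports and a GLOP solver is created.
from ortools.linear_solver import pywraplp

solver = pywraplp.Solver.CreateSolver('Installation Check', 'GLOP')
print('OR-Tools ready:', solver is not None)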

13.6 Case Studies: Solving Optimization Problems Through Python


As stated earlier, the main goal of this chapter is to demonstrate to the reader how to
solve real-life optimization problems through Python and Google's OR-Tools. In this section
the author selects certain optimization problems and demonstrates how they can be solved
using OR-Tools and Python. The simplest problems are like those encountered in a first
course on optimization. The nature of these problems is algebraic, which means they


can be formulated and solved (though not always) by applying simple linear algebraic
techniques. In Case Study 1, the author considers one such problem and shows how to model
and solve it.

13.6.1 Case Study 1: Product Allocation Problem (Swarup, Gupta, and Mohan, 2009)
An electronics company has three operational subdivisions (Fabrication, Testing, and
Packing), with the capacity to produce three different types of components, namely E1, E2,
and E3, yielding a profit of Rs. 4, Rs. 3, and Rs. 5 per component respectively. Component
E1 requires 4 minutes in fabrication, 4 minutes in testing, and 12 minutes in packing.
Similarly, component E2 requires 12 minutes in fabrication, 4 minutes in testing, and 4
minutes in packing. Component E3 requires 8 minutes in each subdivision. In a week, the
total run time of each subdivision is 90, 60, and 100 hours for fabrication, testing, and
packing respectively. The goal is to model the problem and find the product mix that
maximizes the profit.

• Modeling the Problem

Table 13.4 summarizes the data of the problem.

TABLE 13.4
Product Allocation Problem Data

                          Fabrication      Testing          Packing          Profit/Component
                          (in minutes)     (in minutes)     (in minutes)     (in INR)
E1                        4                4                12               4
E2                        12               4                4                3
E3                        8                8                8                5
Availability (minutes)    90 × 60          60 × 60          100 × 60


The general steps to model a problem are as follows:
i. Read the problem carefully and precisely.
ii. Identify what is required to solve the problem and, based on that, identify the decision
variables.
iii. Define auxiliary variables to streamline the constraints or the objective function.
They may also help in the presentation of the solution and in its analysis.
iv. Derive algebraic equalities or inequalities involving the decision variables directly,
or the supplementary (auxiliary) variables indirectly. This is done by transforming each
constraint into an algebraic inequality or equality.



v. Formulate the objective function for the quantity to be maximized or minimized.
vi. Choose an appropriate solver and run the model.
vii. Display the solution in a proper manner.
viii. Validate the model and the results. Check whether the solution satisfies the
constraints. Is the solution implementable, and does it lead to a proper result? If not,
consider fine-tuning the model.

Figure 13.4 pictorially represents the complete process.


The above steps, as implemented for the Case Study 1 problem, are as follows:
Step 1: Let a, b, and c denote the weekly production units of components E1, E2, and E3
respectively. The key decision that is to be made is to determine the weekly rate of
production of these components so that the profit could be maximized.
Step 2: Since negative production of any component makes no sense, we have values of
a, b, and c bounded by inequalities shown in Equation 13.1:

a ≥ 0, b ≥ 0, and c ≥ 0 (13.1)

Step 3: The constraints are the limited weekly working hours of each subdivision.
Production of one unit of component E1 requires 4 minutes in fabrication. The quantity
being a units, the requirement for fabrication for component E1 alone will be 4a fabrication
minutes. Similarly, b units of product E2 and c units of product E3 will require 12b and 8c
fabrication minutes respectively. Thus the total weekly requirement of fabrication minutes
will be 4a + 12b + 8c, which should not exceed the available 5,400 minutes. So the first
constraint can be formulated as shown in Equation 13.2:

    4a + 12b + 8c ≤ 5400                                                (13.2)

Step 4: Similarly, the constraints for the testing and packing subdivisions can be
formulated as shown in Equations 13.3 and 13.4:

    4a + 4b + 8c ≤ 3600                                                 (13.3)

    12a + 4b + 8c ≤ 6000                                                (13.4)


FIGURE 13.4
Flowchart for Modeling and Solving an Optimization Problem

Step 5: The objective is to maximize the weekly total profit. Assuming that all components
produced are immediately sold in the market, the total profit is given by Equation 13.5:

    z = 4a + 3b + 5c                                                    (13.5)

Clearly, the mathematical model for the problem under consideration can be summarized as
shown in Table 13.5.

TABLE 13.5
Summarized Mathematical Model for Product Allocation Problem

Find a, b, and c so as to maximize:

    z = 4a + 3b + 5c

Subject to the constraints:

    4a + 12b + 8c ≤ 5400
    4a + 4b + 8c ≤ 3600
    12a + 4b + 8c ≤ 6000
    a ≥ 0, b ≥ 0, c ≥ 0
Step 6: The author selects Google's OR-Tools GLOP solver and the Python language to run the
above model. The Python code used to solve the model is listed in Table 13.6.
A model is coded in Python in the way shown in the solution code. The line
sol = pywraplp.Solver.CreateSolver('Product Allocation Problem', 'GLOP') invokes


Google's own GLOP solver (OR-Tools, 2020) and names it sol. OR-Tools can be interfaced with
a variety of solvers; switching to another solver, say CLP (COIN-OR, 2020) or GLPK from GNU
(GLPK, 2020), is just a matter of altering this line, as sketched below.
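
For instance, a hypothetical switch of the back end (assuming the installed OR-Tools build
ships the alternative solver; the exact CreateSolver signature may differ between OR-Tools
versions) would look like this:

# Same model, different back end: request CLP instead of GLOP.
sol = pywraplp.Solver.CreateSolver('Product Allocation Problem', 'CLP')
if sol is None:
    raise RuntimeError('Requested solver back end is not available in this build')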
Step 7: The optimal solution to the product allocation problem, therefore, is shown in
Table 13.7.
Hence the maximum profit that can be earned is 4(300) + 3(225) + 5(187) = Rs. 2,810 (the
solver's fractional value c = 187.5 is rounded down to a whole number of units).
Step 8: Model validation: If the solution obtained is correct, then it should satisfy every
constraint. Table 13.8 validates the model:

13.6.2 Case Study 2: The Transportation Problem (Swarup, Gupta, and Mohan, 2009)
XYZ makes trailers at plants in Frankfurt, Copenhagen, and Seoul, and ships these units to distri-
bution centers in London, Paris, New York, and Tokyo. In planning production for the next year,
XYZ estimates unit shipping cost (in US dollars) between any plant and distribution center, plant
capacities, and distribution center demands. These numbers are given in the Table 13.9.
XYZ faces the problem of determining how much to ship between each plant and distribution
center to minimize the total transportation cost, while not exceeding capacity and while meeting
demand.
(a) Formulate a mathematical model to minimize the total shipping cost.
(b) Set up and solve the problem on a spreadsheet. What is the optimal solution?
Steps 1 and 2: From the statement of the problem, it is clear that twelve decision variables
are required to make the decision stated in the problem. The decision variables can be
expressed as x11, x12, x13, x14, …, x32, x33, x34.
Step 3: Let cij be the cost of shipping one unit (trailer) from plant i to distribution
center j. Therefore the cost of shipping xij units can be expressed as cij·xij.
Steps 4 and 5: The objective function to be minimized and the applicable constraints
therefore could be formulated as shown in Table 13.10. Equation 13.6 expresses the objec-
tive function while Equations 13.7 to 13.11 express the constraints.
Steps 6 and 7: As is obvious from the mathematical model of the problem, the solution to the
model requires the use of two-dimensional subscripted variables and Python dictionaries. The
author selects Google's OR-Tools GLOP solver and the Python language to run the above model.
The Python code to solve the model and the output obtained are shown in Table 13.11.


TABLE 13.6
Python Code to Solve Product Allocation Problem
Python Code for Product Allocation Problem
#Install Required Package
!pip install ortools #Execute only once
#Import required functions
from __future__ import print_function
from ortools.linear_solver import pywraplp
# Invoke the solver with GLOP.
sol = pywraplp.Solver.CreateSolver('Product Allocation Problem', 'GLOP')
# Populate variables a,b,c
a = sol.NumVar(0, sol.infinity(), 'a') #Enables the constraint a>=0
b = sol.NumVar(0, sol.infinity(), 'b') #Enables the constraint b>=0
c = sol.NumVar(0, sol.infinity(), 'c') #Enables the constraint c>=0
print('Decision variables =', sol.NumVariables())
#Formulate First Constraint 4a + 12b + 8c <= 5400
cst1 = sol.Constraint(0, 5400, 'cst1')
cst1.SetCoefficient(a, 4)
cst1.SetCoefficient(b, 12)
cst1.SetCoefficient(c, 8)
#Formulate Second Constraint 4a + 4b + 8c <= 3600
cst2 = sol.Constraint(0, 3600, 'cst2')
cst2.SetCoefficient(a, 4)
cst2.SetCoefficient(b, 4)
cst2.SetCoefficient(c, 8)
#Formulate Third Constraint 12a + 4b + 8c <= 6000
cst3 = sol.Constraint(0, 6000, 'cst3')
cst3.SetCoefficient(a, 12)
cst3.SetCoefficient(b, 4)
cst3.SetCoefficient(c, 8)
print('Total constraints =', sol.NumConstraints())
# Formulate the objective function z = 4a + 3b + 5c
objf = sol.Objective()
objf.SetCoefficient(a, 4)
objf.SetCoefficient(b, 3)
objf.SetCoefficient(c, 5)
objf.SetMaximization()
sol.Solve()
print('Product Allocation Problem Solution:')
print('Objective value =', objf.Value())
print('a =', a.solution_value())
print('b =', b.solution_value())
print('c =', c.solution_value())
Output
Decision variables = 3
Total constraints = 3
Product Allocation Problem Solution:
Objective value = 2812.5
a = 300.00000000000006
b = 225.0
c = 187.49999999999997
Note: All Python code used in this chapter has been run using Jupyter Notebook and Google Colab web applica-
tions. These applications allow users to share and create documents that contain equations, live code, narrative
text, and visualizations (Jupyter, 2020). Code is available on author’s repository at Github. URL to access the code
is: https://github.com/jahangir-amu2020/ORCS.


TABLE 13.7
Optimal Production per Week
Product Units to be produced per week

E1 300
E2 225
E3 187

TABLE 13.8
Model Validation
Constraint Calculated Value Equality/Inequality Satisfied/Unsatisfied

4(300) + 12(225) + 8(187) = 5396 <= 5400 Satisfied


4(300) + 4(225) + 8(187) = 3596 <= 3600 Satisfied
12(300) + 4(225) + 8(187) = 5996 <= 6000 Satisfied
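
The checks in Table 13.8 can also be scripted; the following small sketch (not from the
chapter) recomputes the constraint left-hand sides and the profit from the reported
solution:

# Recompute the constraint left-hand sides for the reported solution (a, b, c) = (300, 225, 187).
a, b, c = 300, 225, 187
checks = [
    (4 * a + 12 * b + 8 * c, 5400),   # fabrication minutes
    (4 * a + 4 * b + 8 * c, 3600),    # testing minutes
    (12 * a + 4 * b + 8 * c, 6000),   # packing minutes
]
for used, available in checks:
    print(used, '<=', available, ':', used <= available)
print('profit =', 4 * a + 3 * b + 5 * c)   # 2810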

TABLE 13.9
Unit Shipping Cost (in US Dollars) Between Plants and Distribution Center, Plant Capacities, and
Distribution Center Demands
Distribution Center
Plant London Paris New York Tokyo Capacity

Frankfurt 35 40 60 120 12000


Copenhagen 30 30 45 130 8000
Seoul 60 65 50 100 5000
Demand 9000 3000 9500 1500

Step 8: Model Verification: This can be done in the same way as in the last step of Case
Study 1. We notice that all constraints are satisfied, so the model is valid, as the sketch
below also confirms.
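
A small scripted version of that check (not part of the chapter), run against the solution
reported in Table 13.11 with the x[i, j] values rounded to whole units:

# Capacity (row) and demand (column) checks for the reported transportation solution.
x = [
    [9000, 1000, 0, 0],       # Frankfurt
    [0, 2000, 6000, 0],       # Copenhagen
    [0, 0, 3500, 1500],       # Seoul
]
capacity = [12000, 8000, 5000]
demand = [9000, 3000, 9500, 1500]
cost = [[35, 40, 60, 120], [30, 30, 45, 130], [60, 65, 50, 100]]

for i, cap in enumerate(capacity):
    print('plant', i, sum(x[i]), '<=', cap, ':', sum(x[i]) <= cap)
for j, dem in enumerate(demand):
    shipped = sum(x[i][j] for i in range(3))
    print('center', j, shipped, '>=', dem, ':', shipped >= dem)
print('total cost =', sum(cost[i][j] * x[i][j] for i in range(3) for j in range(4)))  # 1010000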

13.6.3 Case Study 3: The Assignment Problem (Swarup, Gupta, and Mohan 2009)
As the last case study of this introductory chapter, the author presents a solution for
another important optimization problem, referred to as the assignment problem. After
carefully examining the assignment problem, it is easy to conclude that it is a special case
of the transportation problem in which the objective is to assign a certain number of
resources to an equal number of activities at maximum profit (or minimum cost). Another form
of the problem, referred to as an unbalanced assignment problem, arises when the number of
resources is greater than the number of activities to be performed.
Following is an example of an assignment problem:
A department head has four subordinates and four tasks to be performed. The subordinates differ
in efficiency, and the tasks differ in their intrinsic difficulty. His estimate of the time each subordi-
nate would take to perform each task is given in Table 13.12:
How should the head allocate the tasks to the subordinates (one task to each) so as to
minimize the total time to complete all the tasks?


TABLE 13.10
Mathematical Model for Transportation Problem

Find x11, x12, x13, x14, …, x32, x33, x34 so as to minimize:

    z = Σ(i=1..3) Σ(j=1..4) cij·xij                                         (13.6)

or

    z = 35x11 + 40x12 + 60x13 + 120x14 + … + 100x34

Subject to the following capacity and demand constraints:

    Σ(i=1..3) xi1 ≥ 9000,      Σ(j=1..4) x1j ≤ 12000                        (13.7)

    Σ(i=1..3) xi2 ≥ 3000,      Σ(j=1..4) x2j ≤ 8000                         (13.8)

    Σ(i=1..3) xi3 ≥ 9500,      Σ(j=1..4) x3j ≤ 5000                         (13.9)

    Σ(i=1..3) xi4 ≥ 1500                                                    (13.10)

    xij ≥ 0 for all i, j                                                    (13.11)

Step 1: To mathematically express the above assignment problem, we consider the generalized
form of an assignment problem in which n resources are to be assigned to n activities. The
cost of assigning resource i to activity j is denoted cij. Table 13.13 describes the cost
matrix for the problem.
The cost matrix is the same as in the transportation problem. However, this time the
requirement at each of the destinations and the availability at each of the resources is
unity (1), because the assignments are made on a one-to-one basis.
Step 2: Let xij denote the assignment of the ith resource to the jth activity, such that:

    xij = 1 if resource i is assigned to activity j, and 0 otherwise         (13.12)

Following the above notation, the generalized assignment problem can be mathematically
formulated as shown in Table 13.14.
Steps 3, 4, and 5: Based on above general mathematical formulation, the problem con-
sidered in the present case study could be formulated as shown in Table 13.15:
Steps 6 and 7: As is obvious from the mathematical model of the problem, the solution to the
model requires the use of two-dimensional subscripted variables and Python dictionaries. The
author selects Google's OR-Tools CBC solver (an MIP solver) and the Python language to run
the above model.


TABLE 13.11
Python Code to Solve Transportation Problem
Python Code for Transportation Problem
#Install Required Package
!pip install ortools #Execute only once
#Import required functions
from __future__ import print_function
from ortools.linear_solver import pywraplp
def transmodel():
    """Initialize problem data."""
    pd = {}
    pd['cbound'] = [12000, 8000, 5000]
    pd['dbound'] = [9000, 3000, 9500, 1500]
    pd['obcoeff'] = [
        [35, 40, 60, 120],
        [30, 30, 45, 130],
        [60, 65, 50, 100],
    ]
    pd['ncc'] = 3
    pd['ndc'] = 4
    return pd

pd = transmodel()
solver = pywraplp.Solver.CreateSolver('simple_mip_program', 'GLOP')
inf = solver.infinity()
x = {}
# Create variables and enforce the non-negativity (>= 0) constraints
for i in range(pd['ncc']):
    for j in range(pd['ndc']):
        x[i, j] = solver.NumVar(0, inf, '')
print('Number of variables =', solver.NumVariables())
# Enforce capacity constraints (one per plant)
for i in range(pd['ncc']):
    constraint = solver.RowConstraint(0, pd['cbound'][i], '')
    for j in range(pd['ndc']):
        constraint.SetCoefficient(x[i, j], 1)
# Enforce demand constraints (one per distribution center)
for i in range(pd['ndc']):
    constraint = solver.RowConstraint(pd['dbound'][i], inf, '')
    for j in range(pd['ncc']):
        constraint.SetCoefficient(x[j, i], 1)
print('Number of constraints =', solver.NumConstraints())
# Formulate the objective function
objf = solver.Objective()
for i in range(pd['ncc']):
    for j in range(pd['ndc']):
        objf.SetCoefficient(x[i, j], pd['obcoeff'][i][j])
objf.SetMinimization()
solver.Solve()
print('Transportation Problem Solution:')
print('Objective value =', objf.Value())
for i in range(pd['ncc']):
    for j in range(pd['ndc']):
        print('x[', i, j, ']', ' = ', x[i, j].solution_value())



Output
Number of variables = 12
Number of constraints = 7
Transportation Problem Solution:
Objective value = 1010000.0
x[ 0 0 ] = 9000.0
x[ 0 1 ] = 999.9999999999998
x[ 0 2 ] = 0.0
x[ 0 3 ] = 0.0
x[ 1 0 ] = 0.0
x[ 1 1 ] = 2000.0000000000002
x[ 1 2 ] = 5999.999999999999
x[ 1 3 ] = 0.0
x[ 2 0 ] = 0.0
x[ 2 1 ] = 0.0
x[ 2 2 ] = 3500.0
x[ 2 3 ] = 1500.0

TABLE 13.12
Time Required by Each Subordinate
to Perform Each Task
Subordinate
Tasks A B C D

T1 18 26 17 11
T2 13 28 14 26
T3 38 19 18 15
T4 19 26 24 10

TABLE 13.13
Cost Matrix for Assignment Problem

                          Activities
Resources      A1       A2       …        An       Available
R1             c11      c12      …        c1n      1
R2             c21      c22      …        c2n      1
R3             c31      c32      …        c3n      1
…              …        …        …        …        …
Rn             cn1      cn2      …        cnn      1
Required       1        1        …        1

Python code to solve the model is available at the author's repository and can be accessed
using the URL: https://github.com/jahangir-amu2020/ORCS/blob/master/Case-Study-3.pdf.
The code also illustrates how to solve the assignment problem using a mixed-integer
programming (MIP) solver.
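
Since the full listing lives in the repository above, only a minimal sketch is reproduced
here. It is not the author's code; it follows the same pywraplp style as Table 13.11,
assumes the CBC back end is available in the installed OR-Tools build, and uses the
two-argument CreateSolver form seen in the chapter's listings (the signature may differ in
other OR-Tools versions):

# Minimal sketch of the assignment model of Table 13.15 with 0-1 variables and the CBC MIP back end.
from ortools.linear_solver import pywraplp

cost = [[18, 26, 17, 11],
        [13, 28, 14, 26],
        [38, 19, 18, 15],
        [19, 26, 24, 10]]          # time matrix from Table 13.12 (rows = tasks, columns = subordinates)
n = len(cost)

solver = pywraplp.Solver.CreateSolver('Assignment Problem', 'CBC')
x = {(i, j): solver.IntVar(0, 1, '') for i in range(n) for j in range(n)}

for i in range(n):                                  # each task goes to exactly one subordinate
    solver.Add(sum(x[i, j] for j in range(n)) == 1)
for j in range(n):                                  # each subordinate gets exactly one task
    solver.Add(sum(x[i, j] for i in range(n)) == 1)

solver.Minimize(sum(cost[i][j] * x[i, j] for i in range(n) for j in range(n)))
solver.Solve()
print('Total time =', solver.Objective().Value())
for i in range(n):
    for j in range(n):
        if x[i, j].solution_value() > 0.5:
            print('Task', i + 1, '-> Subordinate', 'ABCD'[j])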


TABLE 13.14
Mathematical Model for Generalized Assignment Problem

Minimize:

    z = Σ(i=1..n) Σ(j=1..n) cij·xij                                         (13.13)

Subject to the constraints:

    Σ(i=1..n) xij = 1 for each j,   and   Σ(j=1..n) xij = 1 for each i;
    where xij = 0 or 1                                                      (13.14)

TABLE 13.15
Mathematical Model for the Case Study Assignment Problem

Minimize:

    z = Σ(i=1..4) Σ(j=1..4) cij·xij                                         (13.15)

Subject to the constraints:

    Σ(i=1..4) xij = 1 for each j,   and   Σ(j=1..4) xij = 1 for each i;
    where xij = 0 or 1                                                      (13.16)

13.7 Conclusions
In the past few years, research in mathematical optimization, machine learning, and data
science has become highly interrelated. Branches of mathematical optimization are being
fully exploited by machine learning researchers. With the help of the available mathematical
optimization modelers, algorithms, and robust solvers, data scientists have an ideal toolkit
for exploring new machine learning problems. The machine learning models so obtained require
highly efficient and accurate modelers and solvers. As pointed out earlier, not all modelers
support all solvers, so we should use the modeler and solver in combination with some
high-level language like C, C++, Java, or Python.
In this chapter the author has focused on Google's Operations Research Tools (OR-Tools) and
has shown how some well-known optimization problems can be solved using OR-Tools in
combination with the Python language. In each case, a mathematical model has first been
developed, which is the primary aim of the author. The model has then been coded and solved
using OR-Tools and Python. To keep the subject matter simple and easy for all those who are
entering the field of data science and optimization, the author has only


focused on simple optimization problems like transportation problem, assignment prob-


lem, and so on. The author has focused on the general steps of developing, solving, and
validating the mathematical model of an optimization problem and has demonstrated
these steps by developing, solving, and validating three important classes of optimization
problem using Python and OR-Tools. Other optimization problems can also be modeled
and solved using the concepts presented in this chapter and the author intends to take this
assignment in future. The chapter can be effectively used to create easy yet powerful and
efficient models for optimization problems.

References
Agrawal, A., Verschueren, R., Diamond, S., and S. Boyd. 2018. A rewriting system for convex optimi-
zation problems. Journal of Control Decision 5(1):42–60
AMPL. 2020. "AMPL streamlined modeling for real optimization." (accessed October 17, 2020)
https://ampl.com/
Bazaraa, M., Sherali, H., and C. Shetty. 2006. Nonlinear Programming Theory and Algorithms. Wiley
Bergstra J., Komer B., Eliasmith C., Yamins D., and D.D. Cox. 2015. Hyperopt: a Python library for
model selection and hyperparameter optimization. Computational Science & Discovery. 8(1)
Bertsekas, D.P. 2004. Nonlinear Programming. Athena Scientific, Cambridge
Bishop, C. 1996. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
Blank J., and K. Deb. 2020. pymoo: multi-objective optimization in python. IEEE Access 8:
89497–89509
Bodnia V. 2020. Google OR-Tools business value and potential. (accessed October, 2020) https://
freshcodeit.com/google-or tools#:~:text=The%20primary%20purpose%20of%20
using,%2C%20graph%20algorithms%2C%20and%20more.
Boyd S., and L. Vandenberghe. 2004. Convex Optimization, Cambridge University Press, The
Edinburgh Building, Cambridge
Chugh, T., Sindhya, K., Hakanen, J., and K. Miettinen. 2017. Handling computationally expensive
multiobjective optimization problems with evolutionary algorithms: a survey. Soft Computing
23: 3137–3166
COIN-OR. 2020. “Computational Infrastructure for Operations Research” (accessed October, 2020)
https://en.wikipedia.org/wiki/COIN-OR
Cortes, C., and V. Vapnik. 1995. Support-vector networks. Machine Learning 20(3): 273–297
Diamond, S., and S. Boyd. 2016. CVXPY: A Python-Embedded Modeling Language for Convex
Optimization. Journal of Machine Learning Research. 17(83): 1-5
Dolan, D. and J. More. 2002. Benchmarking optimization software with performance profiles.
Mathematical Programming 91(2):201–213.
ECLiPSe. 2020. “The ECLiPSe Constraint Programming System.” (accessed September, 2020) http://
eclipseclp.org/
GAMS. 2020. "GAMS System Overview" (accessed August, 2020). https://www.gams.com/prod-
ucts/gams/gams-language/
GECODE. 2020. “Generic constraint development environment.” (accessed August, 2020) https://
www.gecode.org/
GLPK. 2020. “GNU Linear Programming Kit.” (accessed August, 2020) https://www.gnu.org/soft-
ware/glpk/
Goberna, M.A., and M.A. Lopez. 1998. Linear Semi-Infinite Optimization. John Wiley, New York.
Golub, G.H., and U. von Matt. 1997. Generalized cross-validation for large scale problems. Journal of
Computational and Graphical Statistics 6(1):1–34.
GUROBI. 2020. “GUROBI Optimization.” (accessed August, 2020) https://www.gurobi.com/
