Fast Newton-Raphson Power Flow Analysis Based On Sparse Techniques and Parallel Processing
Abstract—Power flow (PF) calculation provides the basis for steady-state power system analysis and is the backbone of many power system applications ranging from operations to planning. The voltage and power values calculated by PF are essential to determining the system condition and ensuring the security and stability of the grid. The emergence of multicore processors provides an opportunity to accelerate PF computation and, consequently, improve the performance of applications that run PF within their processes. This paper introduces a fast Newton-Raphson power flow implementation on multicore CPUs that combines sparse matrix techniques, mathematical methods, and parallel processing. Experimental results validate the effectiveness of our approach by finding the power flow solution of a synthetic U.S. grid test case with 82,000 buses in just 1.8 seconds.

Index Terms—Parallel, multicore, sparse, power flow, Newton-Raphson, OpenMP, SIMD.

Manuscript received August 19, 2020; revised February 10, 2021, April 20, 2021, and August 12, 2021; accepted September 19, 2021. Date of publication September 28, 2021; date of current version April 19, 2022. This work was supported in part by the NSF-MRI Award 1725573. Paper no. TPWRS-01413-2020. (Corresponding author: Afshin Ahmadi.)

Afshin Ahmadi, Melissa C. Smith, and Edward R. Collins are with the Holcombe Department of Electrical and Computer Engineering, Clemson University, Clemson, SC 29634 USA (e-mail: [email protected]; [email protected]; [email protected]).

Shuangshuang Jin is with the School of Computing, Clemson University, North Charleston, SC 29405 USA (e-mail: [email protected]).

Vahid Dargahi is with the School of Engineering and Technology, University of Washington, Tacoma, WA 98402 USA (e-mail: [email protected]).

Color versions of one or more figures in this article are available at https://doi.org/10.1109/TPWRS.2021.3116182.

Digital Object Identifier 10.1109/TPWRS.2021.3116182

I. INTRODUCTION

Power systems throughout the world are undergoing a significant transformation, mainly driven by the increasing penetration of renewable energy, the integration of distributed energy resources, and advances in digital technologies. Future grids are powered by clean energy, enable the bidirectional flow of electricity, and have a centralized-decentralized control scheme. The resulting benefits are improved reliability and resiliency, more efficient supply and delivery of energy, reduced environmental impact, and cost-effective energy generation. However, this transformation will make the planning, operation, and control of the power grid more complicated and computationally demanding. The variability and uncertainty of decision variables result in complex models that must be solved more frequently and on a shorter time scale to handle the system dynamics. As a result, it is necessary to develop faster mathematical methods and utilize more computational power to ensure the adequacy and security of energy delivery. Fortunately, advancements in computer technology and the widespread availability of multicore processors open up new possibilities to address this problem.

Parallel computing has proven to be a viable solution for accelerating several computationally intensive applications in the power system [1]–[3]. These applications typically belong to real-time control and simulation, optimization, and probabilistic assessment. Many of these studies rely on the solution of power flow (PF) within their process. PF analysis aims to obtain voltage magnitudes and angles at load buses, real and reactive power flow through the transmission lines, and voltage angles and reactive power injection at generator buses. This information is essential to determine the steady-state condition of the network and ensure the security and stability of the grid. However, PF is a non-linear, computationally demanding problem, and its solution is only meaningful for a short time since the state of the power system continuously changes.

Over the years, researchers have developed various numerical and analytical methods to solve power flow, such as Newton-Raphson (N-R), Gauss-Seidel (G-S), Fast Decoupled (FD), Holomorphic Embedding, Continuation, and several others. To date, Newton-Raphson is still the most commonly used method because of its quadratic convergence property. On the other hand, this method requires more computational resources compared with others [4]. Researchers around the world have investigated the possibility of using parallel processing to boost Newton-Raphson Power Flow (NRPF) performance. These attempts can be classified based on the parallel programming model and processor architecture: shared memory, distributed memory, graphics processing unit (GPU), configurable chips, and hybrid approaches. Only a few studies have employed the shared and distributed memory models to accelerate the solution of the NRPF algorithm on multicore processors. In [5], a maximum speedup of 5.6× was achieved for a 1,944-bus system by employing OpenMP and eight CPU cores. Reference [6] reported a speedup of 3.3× using four threads and OpenMP for a power system with 1,354 buses. Guerra and Martinez-Velasco [7] integrated the high-performance Intel MKL PARDISO solver into the source code of the MATPOWER [8] library, and the solution of a 9,241-bus test system was found in 560 ms.
0885-8950 © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: UNIV OF ENGINEERING AND TECHNOLOGY LAHORE. Downloaded on November 28,2022 at 05:43:53 UTC from IEEE Xplore. Restrictions apply.
1696 IEEE TRANSACTIONS ON POWER SYSTEMS, VOL. 37, NO. 3, MAY 2022
Distributed computing for the solution of NRPF was investigated in [9] and [10], with an average speedup of 2× in both studies. The communication time delay between nodes limits the performance of the distributed model. Additionally, the authors in [11] examined the possibility of solving PF equations on field-programmable gate arrays (FPGAs) and achieved about 6× higher performance compared with the CPU implementation. However, using FPGA devices for accelerating power system applications has not received much attention from engineers and scientists since it requires significant programming effort and expertise.

Over the past decade, there have been enormous advancements in GPU hardware and programming tools, which have made these devices more powerful, popular, and accessible. As a result, a significant number of researchers have focused on GPU-based implementations of both PF algorithms and applications that are reliant on PF calculation, such as contingency analysis [12]–[15]. A hybrid GPU-CPU NRPF implementation based on vectorization parallelization and sparse techniques was proposed in [16], where the solution of a 3,012-bus system was calculated in 206 ms. Although the results are promising, the scalability and effectiveness of the proposed approach are unknown since the test systems are limited to 3,012 buses and less. A GPU version of three different PF algorithms was developed in [17] using the CUDA library and dense matrices. For a system with 974 buses, the run time was 19.6, 10.8, and 5.5 seconds for the G-S, N-R, and P-Q decoupled power flow methods, respectively. Moreover, the performance of the Gauss-Jacobi and N-R power flow algorithms on GPUs was studied in [18]. Results for a 4,766-bus system show an execution time of 3.98 seconds for N-R and 3.06 seconds for Gauss-Jacobi. Furthermore, the authors in [19] compared the parallel implementation of the N-R and G-S algorithms on both CPUs and GPUs using sparse techniques with the objective of accelerating the concurrent PF calculation of many cases of a single network. The reported speedup for a 2,383-bus system ranges from 6× to 13× depending on the target architecture and number of simultaneous runs. In the case of the G-S power flow method, although the reported speedups in [17]–[21] are better than the sequential G-S algorithm, this PF method cannot compete with the NRPF method since it is prone to divergence and requires many iterations to obtain a solution. Moreover, the nonlinear devices in the network cannot be modeled in the G-S method.

Additionally, references [22] and [23] sought to execute the fast decoupled power flow (FDPF) on GPUs. In [22], exploiting both a GPU-based preconditioned conjugate gradient solver and an inexact Newton method improved the performance up to 2.86 times for test systems up to 13,173 buses. Researchers in [23] compared the performance of running FDPF on GPUs with two different fill-in reduction algorithms, namely, reverse Cuthill-McKee (RCM) and approximate minimum degree (AMD). By utilizing CUDA libraries, sparse matrices, and the AMD algorithm, a 4.19× speedup was achieved for a 13,659-bus power system. Parallel implementation of the PF algorithms for an unbalanced distribution network has been investigated in several studies as well [25]–[28]. However, this topic is beyond the scope of this study since our focus is on balanced transmission networks.

Review of the literature shows that recent studies for accelerating PF calculations are mostly designed to exploit the computational power of GPUs. Although the reported speedups are encouraging, there are some disadvantages to this approach. First, the overhead time for data transfer between the host computer and GPU poses an obstacle for the performance of these implementations. Second, only a limited number of GPU-based sparse direct solver libraries are currently available, and they typically utilize the entire device to find the solution of a linear system, which poses a significant limitation when it comes to applications that are reliant on concurrent execution of many PF scenarios. Finally, GPUs are generally equipped with less memory compared with shared-memory computers, which makes them undesirable for memory-intensive applications such as contingency analysis.

II. CONTRIBUTIONS

This paper aims to maximize the performance of Newton-Raphson power flow on multicore CPUs by combining software techniques, mathematical methods, and parallel processing. The main contributions of this study are summarized as follows:

- Both SIMD (Single Instruction, Multiple Data) vectorization and multicore processing are targeted to accelerate power flow calculations.
- The implementation employs the Compressed Sparse Row (CSR) storage format to address two major constraints in computing the PF solution: memory and time.
- A parallel approach for forming and updating the sparse Jacobian arrays is introduced that significantly improves the execution time.
- The nested dissection algorithm is applied to reduce the fill-in and improve the computation time and memory usage in solving the system Ax = b.
- Various combinations of scenarios are benchmarked on power systems ranging from 1,354 to 82,000 buses.
- The proposed NRPF implementation is combined with contingency analysis, and results are presented to further demonstrate the significance of this research.

III. POWER FLOW PROBLEM

The PF study aims to calculate the bus voltages and power flow in the network given the nodal admittance matrix (Ybus), the known complex power at load buses (PQ), the known voltage magnitude and angle for the slack generator bus, and the known voltage magnitude and injected real power for the remaining generator buses (PV). Network equations in a power system are commonly formulated by the node-voltage method, which results in a complex linear system of equations in terms of injected bus currents,

I = Ybus V    (1)

However, the complex power values are generally known in a power system rather than the bus currents. Thus, Eq. (1) should be reformulated in terms of the known complex powers.
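As background for this reformulation (a standard power-system identity stated here for context, not transcribed from this paper), substituting the injected current from Eq. (1) into the definition of complex power gives, for each bus i,

```latex
S_i = P_i + jQ_i = V_i I_i^{*} = V_i \left( \sum_{k=1}^{n} Y_{ik} V_k \right)^{*}
```

Separating the real and imaginary parts yields the active and reactive power mismatch equations that the Newton-Raphson iteration drives toward zero.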
TABLE I: Pseudo-code of implemented Newton-Raphson power flow
TABLE II: Impact of matrix reordering on the number of nonzero elements

Fig. 3. Levels of hardware parallelism.
1) Exploiting SIMD Vectorization: SIMD performs the same instruction on multiple data elements in a clock cycle. Today's modern processors are equipped with up to 512-bit vector registers that offer higher data-level parallelism while using less power. For instance, a CPU with the AVX-512 instruction set can hold 8 × 64-bit double-precision floats in one register and execute a single instruction on them in parallel, which is 8 times faster than performing one operation at a time. Although modern CPUs provide direct access to SIMD-level vectorization, exploiting this feature is not automatic and requires advanced programming skills and software changes. Modern compilers allow access to SIMD features through intrinsic functions, similar to C-style function calls, instead of writing the code in assembly language. Fig. 4 describes the difference between the scalar and vectorized SIMD operation of a Fused Multiply Add (FMA) instruction for calculating combined addition and multiplication. Data types have the following naming convention,

__m<register_width_in_bits><data_type>

where register_width_in_bits is architecture dependent (e.g., 64 bits in MMX, 128 bits in SSE, 256 bits in AVX, etc.) and data_type is either d, i, or no letter for double precision, integer, or single precision floating point, respectively. For example, a __m512d vector contains eight 64-bit doubles. Intrinsic functions have the following format,

_mm<register_width_in_bits>_<operation>_<suffix>

where register_width_in_bits is similar to the data types, operation is the arithmetic, logical, or data operation to perform on the data stream, and suffix is one or two letters that indicate the data type the instruction operates upon. For example, _mm512_fmadd_pd performs fused multiply-add on three given vectors as demonstrated in Fig. 4. Notice that a single underscore prefix is used to call intrinsic functions and a double underscore is used to declare data types. In this study, the high-performance Intel MKL functions are used throughout to maximize the peak performance of sparse matrix operations since they are optimized to take advantage of both SIMD and multicore-level parallelism (i.e., mkl_sparse_spmm, mkl_sparse_z_add, mkl_sparse_z_mm, mkl_sparse_z_mv).

2) Shared-Memory Parallel Programming: There are a handful of extensions for developing shared-memory parallel programs, including pthreads [30], OpenMP [31], OpenACC [32], Intel TBB [33], etc. Among these extensions, OpenMP, which stands for Open Multi-Processing, is an Application Program Interface (API) that provides a powerful directive-based programming model for developing parallel applications on platforms ranging from multicore and shared-memory systems to coprocessors and embedded systems. The OpenMP API supports parallel programming in the C, C++, and Fortran languages. There are many advantages associated with OpenMP: the developed code is portable between standard compilers and operating systems, and existing sequential programs can conveniently be converted to parallel implementations. In OpenMP, the compiler handles the low-level implementation details for generating and terminating the threads. Most compilers support OpenMP.

The NRPF code is comprised of many for-loops that can effectively be parallelized with OpenMP directives. Without OpenMP, the for-loop iterations in Fig. 5 are processed serially by only one thread. On the other hand, the omp parallel for pragma provides a way to utilize multiple CPU cores and perform the calculations faster and more efficiently. This directive allows the user to specify several configuration parameters: the number of threads to use, shared and private variables, the workload scheduling mechanism, and the reduction operation. Workload scheduling is helpful when the amount of work varies across iterations. For example, if no scheduling mechanism is specified, all large jobs may be assigned to thread 0, and all small jobs may be processed by thread 1. Hence, thread 1 will finish faster and stay idle until thread 0 completes its task. As a result, the total execution time will significantly increase. OpenMP provides three scheduling options to balance the workload: static, dynamic, and guided. Interested readers are referred to the OpenMP manual to understand the difference between each work scheduling mechanism. Based on extensive testing and evaluation, guided scheduling was determined to be the best choice in terms of performance and accuracy of the results for this study.

Another OpenMP for-loop option that was widely used in our implementation of NRPF is the parallel reduction primitive. Reductions are used in loops with dependencies, where a series of variables must be processed to produce an output (e.g., sum, max, min) at the end of the parallel region. Unlike atomic operations, thread synchronization is not needed in the case of parallel reductions. Thus, the overall performance is much
higher. A parallel reduction can reduce the operations from N steps to log₂ N steps [19].

D. Initializing and Updating the Sparse Jacobian Matrix

Populating the Jacobian matrix in the dense implementation of NRPF is straightforward because the matrix has (npv + 2npq)² entries, and each entry can be directly accessed given the row and column index. Thus, memory allocation at the beginning and updating the values at each iteration can be performed quickly. However, this is not the case for the sparse implementation because the total number of nonzero elements in the matrix and the number of nonzero entries in each row are needed to allocate memory for the CSR arrays. Therefore, two problems must be addressed: forming the CSR arrays and updating the values at each iteration.

We propose a parallel workflow for generating the CSR arrays using the OpenMP lock mechanism and the binary search algorithm. The OpenMP lock mechanism is necessary to ensure that only one thread can modify a data element at a time. We also need to determine whether a bus number belongs to a PV or PQ bus during the process. Binary search is beneficial for this purpose. It is a fast algorithm to find the index of a value within a sorted array by repeatedly splitting the search space in half and comparing the target value with the middle element of the list until the value is found or the list is empty.

Creating the sparse Jacobian arrays is time-consuming, and repeating the same procedure at every iteration would undesirably affect the execution time. To address this issue, we develop two mapping arrays while populating the Jacobian CSR arrays in the first iteration and use them in successive iterations to update the Jacobian values. We take advantage of the fact that the Jacobian matrix's sparse row and column arrays remain the same, and only the values change. The first mapping array (Jtype) refers to one of the four partial derivative arrays J1 to J4, and the second mapping array (Jposition) refers to the position of that value in the respective partial derivative array. So, after the first iteration, the program only uses these two mapping arrays to locate and update the values of the Jacobian matrix. The proposed strategy is shown in Fig. 6. Examples are provided in this figure to better illustrate the output of each step in the process. These examples relate to the matrix A in Fig. 1 and are only for illustrative purposes. The key steps are as follows:

1) The function receives the sparse sub-matrices J1 to J4 that contain the partial derivative values, together with the list of PV and PQ buses.
2) The number of nonzero elements in each row of the Jacobian matrix is calculated by processing the J1 to J4 arrays. This step is parallelized by using the OpenMP for-loop pragma and a sum reduction. The number of nonzero elements is necessary to allocate memory and create the row pointer array Jrow. Binary search is used to quickly find the bus types.
3) Next, the sparse column (Jcolumn) and value (Jvalues) arrays are constructed by processing the input data. This step is parallelized with OpenMP locks so that multiple threads can work on values in each matrix row simultaneously. We also create the mapping arrays during this process.
4) Because of the parallel approach used in the previous step, the column indices in the Jcolumn array are not in incremental order. We use a parallel sorting algorithm to sort the column indices in increasing order. The mapping arrays are also sorted accordingly.

To further explain the parallel approach for constructing the sparse Jcolumn array, let us assume that the second row of the matrix A in Fig. 1 is being processed by two threads in parallel. There are two nonzero columns in this row. Suppose that the thread processing the column number of the value "9" has finished first, and the column number is to be saved in the Jcolumn array. This thread is aware that positions one and two in the column indices array are dedicated to the second row, but it is unaware of which position belongs to the column index of the "9" entry. Therefore, this thread begins by checking the status of position one in the Jcolumn array and locks the access to it
TABLE V: Average execution time of proposed Newton-Raphson power flow implementation with SIMD and multicore level parallelism (seconds, 100 trials, tolerance 1e-8)
of the 82,000-bus power system. It is also known that FD power flow may diverge for cases that are solvable by the N-R method.

VII. CONCLUSION

In this paper, we presented the development of a fast Newton-Raphson power flow on multicore CPUs by combining software techniques, mathematical methods, and parallel processing. We proposed a parallel approach for developing the Jacobian CSR arrays, as well as a set of mapping arrays that eliminates the need for creating the Jacobian arrays in successive iterations. The accuracy of the proposed algorithm was verified by comparing the error norm with the MATPOWER program. The code was precisely profiled to show the computation time for each step of the algorithm and provide a reference for future work in this area. Based on the results obtained, the performance of the proposed approach is better than GPU-based accelerated power flows. To the best of our knowledge, this is the first time in the literature that the performance of the NRPF algorithm on an 82,000-bus system has been measured, and the proposed implementation was able to solve the system in 1.87 seconds, which is 2.1 times faster than the high-performance MATPOWER library. This study opens the opportunity to improve the performance of other applications that are reliant on PF computation.

REFERENCES

[1] R. C. Green, L. Wang, and M. Alam, "Applications and trends of high performance computing for electric power systems: Focusing on smart grid," IEEE Trans. Smart Grid, vol. 4, no. 2, pp. 922–931, Jun. 2013.
[2] R. C. Green, L. Wang, and M. Alam, "High performance computing for electric power systems: Applications and trends," in Proc. IEEE Power Energy Soc. Gen. Meeting, 2011, pp. 1–8.
[3] D. M. Falcão, "High performance computing in power system applications," in Proc. Int. Conf. Vector Parallel Process., 1996, pp. 1–23.
[4] M. Čepin, Assessment of Power System Reliability: Methods and Applications. Berlin, Germany: Springer, 2011, pp. 147–154.
[5] H. Dağ and G. Soykan, "Power flow using thread programming," in Proc. IEEE Trondheim PowerTech, 2011, pp. 1–5.
[6] A. Ahmadi, S. Jin, M. C. Smith, E. R. Collins, and A. Goudarzi, "Parallel power flow based on OpenMP," in Proc. North Amer. Power Symp., 2018, pp. 1–6.
[7] G. Guerra and J. A. Martinez-Velasco, "Evaluation of MATPOWER and OpenDSS load flow calculations in power systems using parallel computing," J. Eng., vol. 2017, no. 6, pp. 195–204, 2017.
[8] R. D. Zimmerman, C. E. Murillo-Sánchez, and R. J. Thomas, "MATPOWER: Steady-state operations, planning, and analysis tools for power systems research and education," IEEE Trans. Power Syst., vol. 26, no. 1, pp. 12–19, Feb. 2011.
[9] R. S. Kumar and E. Chandrasekharan, "A parallel distributed computing framework for Newton-Raphson load flow analysis of large interconnected power systems," Int. J. Elect. Power Energy Syst., vol. 73, pp. 1–6, 2015.
[10] L. Ao, B. Cheng, and F. Li, "Research of power flow parallel computing based on MPI and P-Q decomposition method," in Proc. Int. Conf. Elect. Control Eng., 2010, pp. 2925–2928.
[11] X. Wang, S. G. Ziavras, C. Nwankpa, J. Johnson, and P. Nagvajara, "Parallel solution of Newton's power flow equations on configurable chips," Int. J. Elect. Power Energy Syst., vol. 29, no. 5, pp. 422–431, 2007.
[12] G. Zhou et al., "GPU-accelerated batch-ACPF solution for N-1 static security analysis," IEEE Trans. Smart Grid, vol. 8, no. 3, pp. 1406–1416, May 2017.
[13] I. Araújo, V. Tadaiesky, D. Cardoso, Y. Fukuyama, and Á. Santana, "Simultaneous parallel power flow calculations using hybrid CPU-GPU approach," Int. J. Elect. Power Energy Syst., vol. 105, pp. 229–236, 2019.
[14] S. Huang and V. Dinavahi, "Real-time contingency analysis on massively parallel architectures with compensation method," IEEE Access, vol. 6, pp. 44519–44530, 2018, doi: 10.1109/ACCESS.2018.2864757.
[15] G. Zhou et al., "A novel GPU-accelerated strategy for contingency screening of static security analysis," Int. J. Elect. Power Energy Syst., vol. 83, pp. 33–39, 2016.
[16] X. Su, C. He, T. Liu, and L. Wu, "Full parallel power flow solution: A GPU-CPU-based vectorization parallelization and sparse techniques for Newton-Raphson implementation," IEEE Trans. Smart Grid, vol. 11, no. 3, pp. 1833–1844, May 2020.
[17] C. Guo, B. Jiang, H. Yuan, Z. Yang, L. Wang, and S. Ren, "Performance comparisons of parallel power flow solvers on GPU system," in Proc. IEEE Int. Conf. Embedded Real-Time Comput. Syst. Appl., 2012, pp. 232–239.
[18] J. Singh and I. Aruni, "Accelerating power flow studies on graphics processing unit," in Proc. Annu. IEEE India Conf., 2010, pp. 1–5.
[19] V. Roberge, M. Tarbouchi, and F. Okou, "Parallel power flow on graphics processing units for concurrent evaluation of many networks," IEEE Trans. Smart Grid, vol. 8, no. 4, pp. 1639–1648, Jul. 2017.
[20] C. Vilacha, J. Moreira, E. Miguez, and A. F. Otero, "Massive Jacobi power flow based on SIMD-processor," in Proc. 10th Int. Conf. Environ. Elect. Eng., 2011, pp. 1–4.
[21] A. Ahmadi, F. Manganiello, A. Khademi, and M. C. Smith, "A parallel Jacobi-embedded Gauss-Seidel method," IEEE Trans. Parallel Distrib. Syst., vol. 32, no. 6, pp. 1452–1464, Jun. 2021.
[22] X. Li, F. Li, H. Yuan, H. Cui, and Q. Hu, "GPU-based fast decoupled power flow with preconditioned iterative solver and inexact Newton method," IEEE Trans. Power Syst., vol. 32, no. 4, pp. 2695–2703, Jul. 2017.
[23] S. Huang and V. Dinavahi, "Performance analysis of GPU-accelerated fast decoupled power flow using direct linear solver," in Proc. IEEE Elect. Power Energy Conf., 2017, pp. 1–6.
[24] M. M. A. Abdelaziz, "OpenCL-accelerated probabilistic power flow for active distribution networks," IEEE Trans. Sustain. Energy, vol. 9, no. 3, pp. 1255–1264, Jul. 2018.
[25] D. Ablakovic, I. Dzafic, and S. Kecici, "Parallelization of radial three-phase distribution power flow using GPU," in Proc. 3rd IEEE PES Innov. Smart Grid Technol. Europe, 2012, pp. 1–7.
[26] M. Abdelaziz, "GPU-OpenCL accelerated probabilistic power flow analysis using Monte-Carlo simulation," Electric Power Syst. Res., vol. 147, pp. 70–72, 2017.
[27] T. Cui and F. Franchetti, "A multi-core high performance computing framework for distribution power flow," in Proc. North Amer. Power Symp., 2011, pp. 1–5.
[29] Intel Math Kernel Library. [Online]. Available: https://software.intel.com/mkl
[30] B. Nichols, D. Buttlar, and J. P. Farrell, Pthreads Programming. Sebastopol, CA, USA: O'Reilly & Associates, 1996, pp. 1–29.
[31] OpenMP. [Online]. Available: https://www.openmp.org/
[32] OpenACC. [Online]. Available: https://www.openacc.org/
[33] Intel Threading Building Blocks. [Online]. Available: https://software.intel.com/tbb
[34] Matpower 7.0. [Online]. Available: https://matpower.org/download/
[35] A. B. Birchfield, T. Xu, K. M. Gegner, K. S. Shetye, and T. J. Overbye, "Grid structural characteristics as validation criteria for synthetic networks," IEEE Trans. Power Syst., vol. 32, no. 4, pp. 3258–3265, Jul. 2017.
[36] H. Jiang, D. Chen, Y. Li, and R. Zheng, "A fine-grained parallel power flow method for large scale grid based on lightweight GPU threads," in Proc. IEEE 22nd Int. Conf. Parallel Distrib. Syst., 2016, pp. 785–790.

Afshin Ahmadi (Member, IEEE) received the M.S. degree in electrical engineering from the University of the Philippines, Philippines, in 2012, and the Ph.D. degree in computer engineering from Clemson University, SC, United States, in 2020. He is currently a Power System Application Developer with the Electric Reliability Council of Texas (ERCOT). His main research interests include high-performance computing in power and energy systems, smart grid, and power system planning.
Melissa Crawley Smith (Senior Member, IEEE) received the B.S. and M.S. degrees in electrical engineering from Florida State University, Tallahassee, Florida, in 1993 and 1994, respectively, and the Ph.D. degree in electrical engineering from the University of Tennessee, TN, United States, in 2003. She is currently an Associate Professor of electrical and computer engineering with Clemson University. She has more than 25 years of experience developing and implementing scientific workloads and machine learning applications across multiple domains, including 12 years as a research associate at Oak Ridge National Laboratory. Her current research interests include the performance analysis and optimization of emerging heterogeneous computing architectures (GPGPU- and FPGA-based systems) for various application domains including machine learning, high-performance or real-time embedded applications, and image processing.

Vahid Dargahi (Member, IEEE) received the Ph.D. degree in electrical engineering, with an emphasis in power electronics and power systems, from Clemson University, Clemson, SC, USA, in 2017. From 2016 to 2017, he was a Graduate Research Assistant with the eGRID Center, CURI, North Charleston, SC, USA. From 2018 to 2019, he was a Postdoctoral Research Fellow with the Electrical and Computer Engineering Department, UC Santa Cruz, CA, USA. He is currently an Assistant Professor with the School of Engineering and Technology, University of Washington, Tacoma, WA, USA. His current research interests include power systems, parallel processing, power electronics circuits, novel converter topologies, wide-bandgap semiconductor devices, multilevel inverters, control of power electronic systems, grid-tied inverters, and active rectifiers.