
Volume 66, Issue 2, 2022

Journal of Scientific Research of The Banaras Hindu University

Improving System Performance in Homogeneous Multicore Systems

Savita Gautam1, Abdus Samad2 and M. Sarosh Umar3


1University Women’s Polytechnic
2 Z.H. College of Engg. & Technology
3 Aligarh Muslim University, Aligarh, India
[email protected]
[email protected]
[email protected]

Abstract. Allocation of parallel load in multicore systems has become a challenging task for high-performance computing systems. There are several parameters for evaluating the performance of a scheduling algorithm, such as task imbalance and execution time. This paper proposes a task scheduling approach that targets multiple cores connected through an appropriate interconnection network. The proposed approach utilizes the computing resources effectively by assigning tasks dynamically among the different cores of the system in realistic time. Each task has its own timeline, and multiple sequences of tasks are mapped onto different cores of the system. In particular, performance is evaluated on n x n Mesh, DMesh, ZMesh and Torus networks. Load imbalance and execution time are considered as metrics to evaluate the performance of the proposed algorithm. Simulation results are obtained and compared with the well-known minimum distance scheduling algorithm, showing a reduction in execution time while maintaining the load imbalance. An improvement of 20-30% in load imbalance is obtained for the considered multicore systems, with improved execution time. The simulation study reveals that the proposed algorithm is well suited to exploiting the architectural benefits of mesh-based multicore systems.

Keywords: Multicore, Scheduling Algorithm, Load Imbalance, Execution Time.

1 Introduction

Multicore systems are found in a variety of computing systems, from high-performance servers to special-purpose embedded systems. Industrial applications increasingly rely on embedded systems with more cores per processor [1]. The performance of these systems depends upon how extensively parallelism is exploited among the different cores of the system. To address the problem of parallelism in a multicore system, the load is partitioned into small independent tasks that are mapped onto the different available cores. The problem of efficiently allocating a group of tasks for parallel execution in multicore systems has drawn the attention of researchers.

Designing an efficient communication network and applying an efficient scheduling algorithm for utilizing computing resources are critical for achieving high performance in multiprocessor multicore systems. A number of studies relate to simulation on mesh-based topologies, emphasizing the modeling and analysis of on-chip interconnects [2][3][4]. Metrics such as packet delay and load imbalance factor have been used as a function of the communication load, speedup and utilization factor. Some networks are designed specifically for customized applications in order to achieve better performance; the main objective behind customization is to fit the requirements of specific applications under certain conditions [5]. However, a generalized task-based programming model is an inevitable solution for multicore architectures.

In this paper, we explore the interplay between architectures and algorithm design in the context of dynamic task allocation. A dynamic scheduling algorithm is designed and evaluated by mapping tasks onto a number of mesh-based multicore architectures. The proposed approach is based on the standard minimum distance scheduling approach, which has been used extensively for conventional parallel systems in a variety of ways [6]. For better analysis of the results, different data sets are applied to similar architectures for the performance evaluation of the proposed algorithm.

The rest of the paper is organized as follows. In section 2, various approaches related to scheduling of tasks on homogeneous/heterogeneous multicore systems are presented. Section 3 describes the problem formulation and the target systems considered for study. The proposed algorithm is explained in section 4. Based on the experimental results, the performance evaluation is carried out and presented in section 5. Concluding remarks are presented in section 6.

2 Related Work

A programming model schedules tasks dynamically according to the availability of computing resources. Mapping ready-to-execute tasks to the different cores of the system requires a critically task-aware scheduler [7]. The efficient scheduling problem has been extensively studied for asymmetric multicore systems. Some approaches are based on dividing the tasks into groups of critical and non-critical tasks and mapping each group to one core type; in this method, deciding which task is critical is a major issue [8]. Task prioritization is another

DOI: 10.37398/JSR.2022.660207

approach, which assigns priority to different tasks based on information discoverable at run-time [9].

A number of programming models have been developed for high-performance computing, such as task parallelism [10] and data parallelism (for example, OpenMP loops [11]), to exploit parallelism in multicore architectures. These models support both inter-task and intra-task parallelism. In general, a sequence of tasks is mapped as a group of parallel sub-tasks that are allowed to execute in parallel on multiple cores. The directed acyclic graph (DAG) is one of the most popular parallel task models used in multicore architectures [12]. A DAG consists of a set of nodes connected by directed edges, in which each node is a sequential sub-task that may execute on any core. Allowing sub-tasks to execute on different cores can significantly improve resource utilization. On a multicore system, meeting the deadlines of parallel tasks is more complex due to the possible interleaving of threads across the cores. Therefore, maximizing the utilization of parallel multicore architectures while meeting application deadlines remains a great challenge.

List scheduling has been used in a variety of ways to obtain optimal or sub-optimal solutions [13]. List scheduling assigns priorities to the tasks of a DAG and arranges the tasks in a list configured in descending order of priority; the task with the highest priority executes first. The algorithm performs better with a small heterogeneity factor for randomly generated applications. However, to reduce task execution time, a duplication approach that identifies heavily communicating tasks is applied.

In a heterogeneous computing system, the cost of executing a task may vary from one core to another. The priority of a task is not fixed; rather, it changes when the task migrates between different cores. To handle this problem, the Heterogeneous Earliest Finish Time (HEFT) scheduler [14] and the Heterogeneity through Limited Duplication (HLD) approach [15] are used to obtain a single computation cost for a task. However, the performance of these algorithms is limited by significant variations in the execution makespan.

System performance can also be improved by non-contiguous allocation of parallel jobs in multicomputer systems [16]. In this approach, the author claimed better performance in terms of execution time for different traffic patterns, particularly with a uniform-decreasing job size distribution. The algorithm, however, was not tested on Torus-type architectures.

3 Problem Formulation and Target Systems

3.1 Task Scheduling Model

The task scheduling problem has been widely studied for both homogeneous and heterogeneous multicore systems. The implementation of these algorithms acts on the state of tasks depending upon the architecture of the target system. The main objective is to map the ready tasks onto available cores until all the ready tasks are assigned evenly. Task dependency is another factor that affects the performance of the scheduling policy; however, for simplicity we consider all tasks to be independent. Tasks are submitted uniformly and assigned to a particular core depending upon the active load or when the core becomes idle. At any particular point of time, the system maintains a uniform distribution of tasks. Resource utilization and uniform allocation of tasks are carried out dynamically in parallel among the different available cores of the system. If the tasks in an application are unbalanced, the overloaded and underloaded cores are identified, and task migration takes place until the system obtains an even distribution of tasks. Therefore, in wide-range graphs such as ZMesh and higher-level meshes with a large number of cores, or with a large volume of tasks, the task scheduler reconfigures the tasks dynamically based on the values of the ideal load and the load imbalance factor (LIF).

Minimum distance scheduling (MDS) is considered suitable for parallel interconnection networks in traditional parallel systems [17]. The algorithm relies on the minimum distance property, in which only adjacent cores are allowed to migrate tasks. This is followed in order to reduce the makespan and the complexity of the scheduling algorithm. Several variations of MDS have been proposed and found suitable for particular classes of architectures, but the performance of these algorithms has not been studied for multicore systems. The proposed algorithm is an effort to extend the concept of the minimum distance property, with some alteration, to the considered multicore systems.

3.2 The Target Architectures

To evaluate the performance of the proposed scheduling algorithm, the topology of the target system is modeled as an undirected graph G(C, E), where C is a finite set of cores (vertices) and E is a finite set of connecting edges. A vertex Ci represents processor core i, and an edge Ei represents a bidirectional communication link between adjacent cores. The resource graph is a complete graph consisting of n fully connected cores. We assume contention-free communication between cores.

For the purpose of simulation, four similar topologies, namely Mesh, DMesh, ZMesh and Torus networks, are considered [18]. The system consists of a set of homogeneous cores, and all considered topologies are modeled as 4 x 4 networks, shown in Fig. 1. Task-to-core assignment is identical in all the considered topologies.

(a) 4 x 4 Mesh network (b) 4 x 4 DMesh network
(c) 4 x 4 ZMesh network (d) 4 x 4 Torus network

Fig. 1. Target Systems

Institute of Science, BHU Varanasi, India

4 Proposed Algorithm

As discussed in section 3, we propose a dynamic task scheduling algorithm that detects the load imbalance among the available cores and maps the tasks accordingly at runtime. In most models, the tasks are first created and then made ready after a certain level of input. In the proposed approach, we assume that the ready tasks are available and that, at a given point of time, tasks are assigned to different cores based on the scheduling policy. Underloaded cores receive tasks from overloaded cores based on the values of the ideal load and the LIF.

The load imbalance factor (LIF) at a particular stage k of the task structure is calculated as

LIF = [max{loadk(Ci)} - (ideal_load)k] / (ideal_load)k    (1)

The ideal load is the ratio of the total number of tasks to the number of available cores (N):

(ideal_load)k = [loadk(C0) + loadk(C1) + ... + loadk(CN-1)] / N    (2)

The maximum load, denoted max{loadk(Ci)}, is the largest load on any core Ci, where 0 ≤ i ≤ N-1. For the same stage of the task structure, the execution time is evaluated; this is the total time the scheduling algorithm takes to produce the LIF after the balancing process is complete. However, task migration is allowed only after examining the connectivity of the cores, because the choice of communicating core directly affects the complexity of the algorithm and can lead to larger execution times. The five steps of the proposed algorithm are as follows.
1. A valid task set is generated to map onto the available number of cores connected through bidirectional links.
2. The adjacency matrix is scanned to examine the direct connectivity of cores.
3. The connected cores are identified, and tasks are reassigned from one core to another until the load reaches the ideal value.
4. The LIF is evaluated and allocation of tasks continues. To maintain the integrity of MDS, only directly connected cores are allowed to migrate tasks.
5. For optimum results, step 4 is repeated for non-adjacent cores to migrate tasks between overloaded and underloaded cores.

The outline of the algorithm is illustrated in Fig. 2. Allocation of tasks always succeeds if underloaded cores with direct connectivity exist. The proposed algorithm and the well-known MDS algorithm were implemented in Java on Windows 10, on a 2.60 GHz Intel(R) Core(TM) i7 x64 processor with 16.0 GB of RAM. Many different graphs were drawn by varying the task structure used as input to the proposed algorithm; these are discussed in the next section.

void TaskMigration(int overloaded_Nodes[], int n_over,
                   int underloaded_Nodes[], int n_under,
                   int load[], int ideal_load)
{
    /* Move tasks away from each overloaded core until it
       reaches the ideal load. */
    for (int p = 0; p < n_over; p++)
        while (load[overloaded_Nodes[p]] > ideal_load)
            for (int q = 0; q < n_under; q++)
                if (load[underloaded_Nodes[q]] < ideal_load)
                    shift(overloaded_Nodes[p], underloaded_Nodes[q]);
}

int TaskAllocation(int n)
{
    /* Generate n random tasks and return the total task count. */
    int total_task = 0;
    Generate_Random_Task(n);
    for (int i = 1; i <= n; i++)
        total_task += Task(i);
    return total_task;
}

int maximum(int nodes, int load[])
{
    /* Maximum load over all cores. */
    int max_value = load[0];
    for (int i = 1; i < nodes; i++)
        if (load[i] > max_value)
            max_value = load[i];
    return max_value;
}

float lif(int max, int ideal_load)
{
    /* Load imbalance factor, eq. (1), as a percentage. */
    return (float)(max - ideal_load) * 100 / ideal_load;
}

Fig. 2. Pseudo-code for task allocation and migration

5 Simulation Experiment Results

In this section, we evaluate the performance of the proposed algorithm by carrying out experiments on different multicore architectures over a wide spectrum of input types. We have measured load imbalance and execution time for three sets of task structures. To show the contribution of the proposed algorithm, the results obtained are compared with the standard minimum distance scheduling (MDS) algorithm in terms of LIF and execution time, by implementing both algorithms on the same architectures under the same environment. Fig. 3 shows the performance of the proposed algorithm compared against the MDS algorithm when applied on a 16-core mesh network.


The results show that the initial value of LIF in the case of the MDS algorithm is much larger than the value obtained by the proposed algorithm. The best-case performance for average load is improved by 20% throughout the generation of tasks. The changing behavior of LIF, however, is similar for both scheduling algorithms.

Fig. 3. Performance of proposed scheduling on 16-core mesh network (LIF)

Another important parameter for evaluating performance is execution time. The total time to make the network fully balanced after generating a finite number of tasks is evaluated and shown in Fig. 4. The effect of enhancing task migration by considering non-adjacent cores in the proposed algorithm is clearly depicted in Fig. 4, which shows an increasing trend compared to when MDS is applied on the same network. This is because the proportion of task migrations to cores other than adjacent cores increases in order to obtain the desirable value of LIF. If we consider the conventionally acceptable value of average LIF, between 30-40%, the increase in execution time is insignificant.

Fig. 4. Performance of proposed scheduling on 16-core mesh network (execution time)

Motivated by the results obtained for the 4 x 4 mesh network, and to test the actual performance of the proposed algorithm, the same algorithm is also applied on 16-core DMesh, ZMesh and Torus networks with three data sets. Each data set consists of a finite range of task structures. The first set of experiments is carried out with a data set having from 1,000 to 400,000 tasks, the second data set covers up to 1,600,000 tasks, and the third data set goes up to 5,000,000 tasks. The simulation results obtained for all four considered networks, using LIF and execution time as metrics, are shown as graphs in Fig. 5 to Fig. 7.

Fig. 5 (a). Performance of proposed algorithm with low task structure (LIF)
Fig. 5 (b). Performance of proposed algorithm with low task structure (Execution Time)
Fig. 6 (a). Performance of proposed algorithm with medium task structure (LIF)
Fig. 6 (b). Performance of proposed algorithm with medium task structure (Execution Time)
Fig. 7 (a). Performance of proposed algorithm with high task structure (LIF)


Fig. 7 (b). Performance of proposed algorithm with high task structure (Execution Time)

In all the graphs shown in Fig. 5 to Fig. 7, it is clearly observed that with the proposed algorithm the initial value of LIF is improved for all the considered topologies. In particular, there is an improvement of approximately 20% for the DMesh and Torus networks. This is because both DMesh and Torus have extra links that constitute alternative paths for task migration; for the same reason, the execution time shows no significant increase for these networks. Performance with the other networks is also comparable. The figures also show that the performance of the proposed algorithm is approximately the same for different types of task structures, because most of the tasks have similar execution times. The scheduler dynamically selects the best available path when the application has a large number of tasks.

In Fig. 5(b), Fig. 6(b) and Fig. 7(b), the execution times for the minimum value of LIF obtained by the proposed algorithm are plotted for all the considered networks with different volumes of system load. The results reveal a small increase in execution time with the proposed algorithm; however, the performance is similar for all the considered networks. This is because we considered networks in which each core is connected by bidirectional communication links to its neighboring cores, as depicted in Fig. 1. The minor increase in execution time is tolerable given the highly reduced value of LIF, which ultimately improves system utilization. The main attraction of the proposed algorithm is that the independence of the load has no impact on the efficacy of the selected architecture.

6 Conclusion and Future Work

In this paper, we have incorporated an enhancement to the minimum distance scheduling (MDS) algorithm to obtain a suboptimal solution for task scheduling on multicore systems. The performance of the proposed algorithm is tested on 4 x 4 mesh-based networks, i.e., Mesh, DMesh, ZMesh and Torus networks. The main objective is to schedule independent tasks on 16-core systems uniformly with minimum execution makespan.

The performance is measured by considering homogeneous cores of the system. Load imbalance and execution time are considered as metrics to evaluate the performance of the proposed algorithm. The makespan is minimized by exploiting a duplication approach in which non-adjacent cores of the system are effectively utilized for computations. Curves are drawn and a comparative analysis is carried out. Simulation results show an improvement of 20-30% in load imbalance while maintaining the overall execution makespan.

A promising future direction in this area is to consider the performance of the proposed scheduling approach on heterogeneous computing systems. We plan to extend the presented algorithm to a dynamic environment where process load, computing resources and network conditions vary during the execution of input applications. Apart from LIF and execution time, other performance metrics such as Computation-To-Communication Ratio (CCR), Normalized Schedule Length (NSL) and Speedup Rate (SR) will be considered for performance evaluation.

References

[1] Geer, D.: Chip makers turn to multicore processors. Computer, 38, pp. 11-13 (2005).
[2] Al-daloo, M., Soltan, A., Yakovlev, A.: Overview study of on-chip interconnect modelling approaches and its trend. In: Proceedings of the 7th International Conference on Modern Circuits and Systems Technologies (MOCAST), pp. 1-5 (2018).
[3] Alimi, I., Patel, R., Aboderin, O., Abdalla, A.: Network-On-Chip Topologies: Potentials, Technical Challenges, Recent Advances and Research Direction. Reviewed Chapter (2021). doi:10.5772/intechopen.97262.
[4] Ghosh, A., Sinha, A., Nancy, Chatterjee, A.: Exploring Network on Chip Architectures Using GEM5. In: International Conference on Information Technology (ICIT), pp. 50-55 (2017).
[5] Augonnet, C., Thibault, S., Namyst, R., Wacrenier, P.: StarPU: A unified platform for task scheduling on heterogeneous multicore architectures. Concurrency and Computation: Practice and Experience, 23(2), pp. 187-198 (2011).
[6] Lakshmanan, K., Kato, S., Rajkumar, R.: Scheduling Parallel Real-Time Tasks on Multi-core Processors. In: Proceedings of the 31st IEEE Real-Time Systems Symposium, pp. 259-268 (2010).
[7] Chronaki, K., Rico, A., Casas, M., Moretó, M., Badia, R.M., Ayguadé, E., Labarta, J., Valero, M.: Task Scheduling Techniques for Asymmetric Multi-Core Systems. IEEE Transactions on Parallel and Distributed Systems, 28(7), pp. 2074-2087 (2017).
[8] Chronaki, K., Rico, A., Badia, R.M., Ayguadé, E., Labarta, J., Valero, M.: Criticality-Aware Dynamic Task Scheduling for Heterogeneous Architectures. In: Proceedings of the 29th ACM International Conference on Supercomputing, pp. 329-338 (2015).
[9] Yao, X., Geng, P., Du, X.: A Task Scheduling Algorithm for Multi-core Processors. In: Proceedings of the International Conference on Parallel and Distributed Computing, Applications and Technologies, pp. 259-264 (2013).
[10] Zheng, Z., Chen, X., Wang, Z., Shen, L., Li, J.: Performance model for OpenMP parallelized loops. In: Proceedings of the 2011 International Conference on Transportation, Mechanical, and Electrical Engineering (TMEE), pp. 383-387 (2011).
[11] Yuan, L., Jia, P., Yang, Y.: Efficient scheduling of DAG tasks on multi-core processor based parallel systems. In: Proceedings of the IEEE Region 10 Conference, pp. 1-6 (2015).
[12] Wunderlich, S., Cabrera, J., Fitzek, F., Reisslein, M.: Network Coding in Heterogeneous Multicore IoT Nodes with DAG Scheduling of Parallel Matrix Block Operations. IEEE Internet of Things Journal, 4(4), pp. 917-933 (2017).
[13] Tang, X., Li, K., Liao, G., Li, R.: List scheduling with duplication for heterogeneous computing systems. Journal of Parallel and Distributed Computing, 70(4), pp. 322-329 (2010).


[14] Topcuoglu, H., Hariri, S., Wu, M.: Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Transactions on Parallel and Distributed Systems, 13(3), pp. 260-274 (2002).
[15] Bansal, S., Kumar, P., Singh, K.: Dealing with heterogeneity through limited duplication for scheduling precedence constrained task graphs. Journal of Parallel and Distributed Computing, 65(4), pp. 479-491 (2005).
[16] Bani-Mohammad, S., Ababneh, I.: Improving system performance in non-contiguous processor allocation for mesh interconnection networks. Simulation Modelling Practice and Theory, 80, pp. 19-31 (2018).
[17] Manaullah: Minimum Distance Scheduling Scheme on Linearly Extensible Multiprocessor Network. International Journal of Emerging Technology and Advanced Engineering, 3(10), pp. 536-541 (2013).
[18] Prasad, N., Mukherjee, P., Chattopadhyay, S., Chakrabarti, I.: Design and Evaluation of ZMesh Topology for On-Chip Interconnection Networks. Journal of Parallel and Distributed Computing, pp. 17-36 (2018).
***

