
DEPARTMENT OF COMPUTER

SCIENCE
FEDERAL POLYTECHNIC IDAH
ASSIGNMENT ON

COM 314 – COMPUTER ARCHITECTURE

BY

NAME: OLAWUMI EMMANUEL OMOTAYO


LEVEL: HND1
DEPARTMENT: COMPUTER SCIENCE

QUESTION:
Write about the application of vector processors in image processing, in comparison with scalar processors.

INTRODUCTION
The development and design of modern microprocessor solutions requires considerable effort, so to increase the efficiency of development it is necessary to use complex tools that allow efficiency to be evaluated on a test sample. This makes it possible to rapidly compare alternative approaches and to make well-founded choices of optimal solutions for the development and modification of new microprocessor architectures.
The relevance of this work lies in designing such approaches, and in choosing the methods and tools for their practical application, so as to enhance the design quality of high-performance vector processors at the design stage. The goal of the work is therefore to create optimization approaches for use in the design process, based on new and standard architectures, and to elaborate technologies that significantly improve the efficiency of hardware and software development. The tasks that lead toward this goal amount to investigating the instruction execution flow, monitoring memory accesses, and empirically evaluating the collected data. To assess the productivity and optimality of software solutions, it is convenient to use statistical methods applicable to various metrics of the object under study, for example time, cyclomatic complexity, deviation error, and others.

Scientific and research workloads involve many computations that require extensive, high-power computers; run on a conventional computer, they may take days or weeks to complete. Science and engineering problems can be expressed in terms of vectors and matrices and solved using vector processing.
A vector processor is a central processing unit that can operate on an entire vector with a single instruction. It is a complete unit of hardware resources that processes a sequential set of similar data elements in memory using a single instruction.

FEATURES OF VECTOR PROCESSING


There are various features of vector processing, which are as follows:
1. A vector is a structured set of elements. The elements in a vector are scalar quantities. A vector operand consists of an ordered set of n elements, where n is known as the length of the vector.
2. Each clock period processes two successive pairs of elements. During a single clock period, the dual vector pipes and the dual sets of vector functional units allow two pairs of elements to be processed. As each pair of operations completes, the results are delivered to the appropriate elements of the result register. The operation continues until the number of elements processed equals the count specified by the vector length register.
3. In parallel vector processing, more than two results are generated per clock cycle. Parallel vector operations are started automatically under the following two circumstances:
   - when successive vector instructions use different functional units and different vector registers;
   - when successive vector instructions use the result stream from one vector register as the operand of another operation in a different functional unit. This technique is known as chaining.
4. A vector processor performs better with longer vectors, because the fixed startup delay of the pipeline is amortized over more elements.
5. Vector processing decreases the overhead of maintaining loop-control variables, which makes it more efficient than scalar processing.

PROFILING WITH THE USE OF QEMU


As part of this work the processor architecture is analyzed with the QEMU emulator [4, 5], which allows self-contained user applications written for one architecture to be emulated on a different one. Since the program's source code is open, the metrics needed for this research can be implemented directly in the emulated model.

Fig. 1 shows a diagram illustrating model profiling in the tool set based on the QEMU virtual machine. The simulator interprets guest program instructions; it is in effect a model of the microprocessor and of the parts that form the structure of the computer system. The simulator itself is an application program that runs on a host machine under the host operating system.
For more convenient processing, the data first need to be laid out appropriately. The most effective way is to convert the data for vector processing into a one-dimensional array. Loop unrolling is applicable to loops with a small body. It is similar to manual vectorization and lets each iteration be used more efficiently: the loop body is duplicated several times, depending on the number of execution units. In vector architectures, this optimization can be replaced by SIMD instructions. Such optimization can, however, introduce data dependences; to remove them, additional variables are introduced. The number of iterations and the unroll step must also be considered: their greatest common divisor should equal the unroll step, that is, the iteration count should divide evenly by the unroll factor. When this condition does not hold, the remaining block of elements is processed outside the loop.
Not every algorithm can be vectorized, so loop-optimization methods applied to scalar architectures must also be used. Reordering basic blocks places the code of frequently executed commands close together and shortens the computation of branch-target addresses. Frequently executed command blocks that have many incoming and outgoing edges most likely indicate non-optimal memory use; restructuring them may avoid unnecessary data loads and speed up program execution. Inlining functions used inside loops avoids the stack overhead of calling simple functions, which in some cases increases the performance of algorithms.
A good option is to reorder conditional branches based on their logic and frequency of execution, in order to minimize the cost of branch prediction. It is recommended to place the most probable branches at the beginning of the branching structure. Some logical conditions can also be replaced by arithmetic expressions. This allows fewer conditions to be tested, and fewer conditional jumps to be made; such jumps are among the most resource-intensive operations.
One of the common difficulties in vector programming is this transformation of branching into arithmetic expressions. Code with many branches is difficult to vectorize, and vectorizing it may even degrade performance because of the new operations that replace the branches.
These techniques can be tested on the following image-processing kernels:
 Image filtering by convolution with a window.
 Color space conversion (RGB–YUV).
 Pre- and post-processing for the FDCT and IDCT (forward/inverse discrete cosine transform).
 Quantization and dequantization.
 Motion estimation.
 Intra-prediction.
The proposed solutions make it possible to estimate the parameters of algorithms for a vector processor and to determine the set of commands that contribute significantly to performance and are suitable for implementation on the developed architecture. To estimate the time spent, a high-precision timer from the C++11 <chrono> library was used, together with a test image containing a rainbow gradient, which provides the maximum color gamut. The best version of each algorithm was chosen so as to minimize the time spent.
Measuring the running time and memory use of the algorithms while keeping the error within the permissible level makes it possible to estimate the distortions introduced when converting algorithms to their integer counterparts. Using a standard-deviation estimate takes the image size into account and reduces the individual-perception factor:

σ = sqrt( (1 / (W·H)) · Σ (aᵢ − bᵢ)² )

where W and H are the image sizes in pixels, a is the value in the reference algorithm, and b is the value in its integer version.
An example of the data obtained for the color space transformation algorithm is shown in Fig. 3.

The obtained data indicate that the memory allocated for the temporary variables can be reduced from 16 bits to 7 bits without loss of conversion quality. The estimates described are objective criteria for accuracy, since they depend solely on numerical data. Nevertheless, these criteria do not always correspond to subjective estimates. Images are intended for human perception, so the only thing that can be said is that poor indicators of the objective criteria usually correspond to low subjective estimates, while good indicators of the objective criteria do not guarantee high subjective estimates.

CONCLUSION
This research suggests an area for further work: improving methods of evaluating compiler performance so as to ensure the speed and reliability of the results, depending on the level of optimization. A possible solution is to use the statistical information obtained on a set of test tasks for graphics processing.
A separate problem that requires careful study is the choice of a representative class of tasks (image processing, computer graphics, and computational tasks) for performance analysis. For example, from computer graphics we can take the rendering of a large number of objects using one of the libraries, compiled with different keys. From computational problems we can take the calculation of 100,000 integrals by some complicated method, again varying the build keys. From image processing we can take one of the filters and compile the program with different keys.
Comparing the performance of executable files produced by different compilers is of particular interest. However, it should be remembered that an important step in profiling is the selection of criteria. For example, we can count the execution time, the amount of memory required, the number of operations, and so on. When selecting criteria, it is necessary to study the task and its requirements carefully, and then select the most appropriate profiling method.

REFERENCES

J. Holewinski, R. Ramamurthi, M. Ravishankar, N. Fauzia, L.-N. Pouchet, A. Rountev, and P. Sadayappan. Dynamic trace-based analysis of vectorization potential of applications. In Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '12, pp. 371–382, New York, NY, USA, 2012.
G. C. Evans, S. Abraham, B. Kuhn, and D. A. Padua. Vector Seeker: A tool for finding vector potential. In Proceedings of the 2014 Workshop on Programming Models for SIMD/Vector Processing, WPMVP '14, pp. 41–48, New York, NY, USA, 2014.
R. Barik, J. Zhao, and V. Sarkar. Automatic vector instruction selection for dynamic compilation. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, PACT '10, pp. 573–574, New York, NY, USA, 2010.
QEMU. Emulator user documentation. URL: http://wiki.qemu.org/download/qemu-doc.html.
F. Bellard. QEMU, a fast and portable dynamic translator. In Proceedings of the USENIX Annual Technical Conference, ATEC '05, pp. 41–46. Berkeley, CA, USA: USENIX Association, 2005.
J. P. Shen and M. H. Lipasti. Modern Processor Design: Fundamentals of Superscalar Processors. New York: McGraw-Hill, 2005.
S. F. Kurmangaleyev. Methods for optimizing C/C++ applications distributed in LLVM bitcode, taking into account hardware specificity. Proceedings of ISP RAS, vol. 24, pp. 127–144, 2013. DOI: 10.15514/ISPRAS-2013-24-7.
R. Levin, I. Newman, and G. Haber. Complementing missing and inaccurate profiling using a minimum cost circulation algorithm. In Proceedings of the 3rd International Conference on High Performance Embedded Architectures and Compilers, HiPEAC '08, pp. 291–304. Berlin, Heidelberg: Springer-Verlag, 2008.
M. Hohenauer, F. Engel, R. Leupers, G. Ascheid, and H. Meyr. A SIMD optimization framework for retargetable compilers. ACM Trans. Archit. Code Optim., 6(1), pp. 1–27, 2009.
A. C. Bovik. Handbook of Image and Video Processing, 2nd ed. San Diego: Elsevier Academic Press, 2005.
GCC 4.8.2 Manual. URL: http://gcc.gnu.org/onlinedocs/gcc-4.8.2/gcc/.
