Journal of
Imaging

Image Processing
Using FPGAs
Edited by
Donald G. Bailey
Printed Edition of the Special Issue Published in Journal of Imaging

www.mdpi.com/journal/jimaging
Image Processing Using FPGAs

Special Issue Editor


Donald G. Bailey

MDPI • Basel • Beijing • Wuhan • Barcelona • Belgrade


Special Issue Editor
Donald G. Bailey
Massey University
New Zealand

Editorial Office
MDPI
St. Alban-Anlage 66
4052 Basel, Switzerland

This is a reprint of articles from the Special Issue published online in the open access journal
Journal of Imaging (ISSN 2313-433X) from 2018 to 2019 (available at: https://www.mdpi.com/journal/jimaging/special_issues/Image_FPGAs).

For citation purposes, cite each article independently as indicated on the article page online and as
indicated below:

LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. Journal Name Year, Article Number,
Page Range.

ISBN 978-3-03897-918-0 (Pbk)


ISBN 978-3-03897-919-7 (PDF)


© 2019 by the authors. Articles in this book are Open Access and distributed under the Creative
Commons Attribution (CC BY) license, which allows users to download, copy and build upon
published articles, as long as the author and publisher are properly credited, which ensures maximum
dissemination and a wider impact of our publications.
The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons
license CC BY-NC-ND.
Contents

About the Special Issue Editor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

Preface to ”Image Processing Using FPGAs” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix

Donald Bailey
Image Processing Using FPGAs
Reprinted from: Journal of Imaging 2019, 5, 53, doi:10.3390/jimaging5050053 . . . . . . . . . . . . 1

Fahad Siddiqui, Sam Amiri, Umar Ibrahim Minhas, Tiantai Deng, Roger Woods,
Karen Rafferty and Daniel Crookes
FPGA-Based Processor Acceleration for Image Processing Applications
Reprinted from: Journal of Imaging 2019, 5, 16, doi:10.3390/jimaging5010016 . . . . . . . . . . . . 5

Paulo Garcia, Deepayan Bhowmik, Robert Stewart, Greg Michaelson and Andrew Wallace
Optimized Memory Allocation and Power Minimization for FPGA-Based Image Processing
Reprinted from: Journal of Imaging 2019, 5, 7, doi:10.3390/jimaging5010007 . . . . . . . . . . . . . 27

Runbin Shi, Justin S.J. Wong and Hayden K.-H. So


High-Throughput Line Buffer Microarchitecture for Arbitrary Sized Streaming Image
Processing
Reprinted from: Journal of Imaging 2019, 5, 34, doi:10.3390/jimaging5030034 . . . . . . . . . . . . 50

Donald Bailey and Anoop Ambikumar


Border Handling for 2D Transpose Filter Structures on an FPGA
Reprinted from: Journal of Imaging 2018, 4, 138, doi:10.3390/jimaging4120138 . . . . . . . . . . . . 70

Andrew Tzer-Yeu Chen, Rohaan Gupta, Anton Borzenko, Kevin I-Kai Wang and
Morteza Biglari-Abhari
Accelerating SuperBE with Hardware/Software Co-Design
Reprinted from: Journal of Imaging 2018, 4, 122, doi:10.3390/jimaging4100122 . . . . . . . . . . . . 91

Aiman Badawi and Muhammad Bilal


High-Level Synthesis of Online K-Means Clustering Hardware for a Real-Time Image
Processing Pipeline
Reprinted from: Journal of Imaging 2019, 5, 38, doi:10.3390/jimaging5030038 . . . . . . . . . . . . 108

Haonan Zhou, Raju Machupalli and Mrinal Mandal


Efficient FPGA Implementation of Automatic Nuclei Detection in Histopathology Images
Reprinted from: Journal of Imaging 2019, 5, 21, doi:10.3390/jimaging5010021 . . . . . . . . . . . . 125

Donald Bailey and Michael Klaiber


Zig-Zag Based Single-Pass Connected Components Analysis
Reprinted from: Journal of Imaging 2019, 5, 45, doi:10.3390/jimaging5040045 . . . . . . . . . . . . 138

Zhe Wang, Trung-Hieu Tran, Ponnanna Kelettira Muthappa and Sven Simon
A JND-Based Pixel-Domain Algorithm and Hardware Architecture for Perceptual Image
Coding
Reprinted from: Journal of Imaging 2019, 5, 50, doi:10.3390/jimaging5050050 . . . . . . . . . . . . 164

v
About the Special Issue Editor
Donald G. Bailey received his Bachelor of Engineering (Honours) degree in Electrical Engineering
in 1982, and Ph.D. degree in Electrical and Electronic Engineering from the University of Canterbury,
New Zealand, in 1985. From 1985 to 1987, he applied image analysis to the wool and paper industries
of New Zealand. From 1987 to 1989, he was a Visiting Research Engineer at University of California,
Santa Barbara. Prof. Bailey joined Massey University in Palmerston North, New Zealand, as Director
of the Image Analysis Unit in November 1989. He was a Visiting Researcher at the University of
Wales, Cardiff, in 1996; University of California, Santa Barbara, in 2001–2002; and Imperial College
London in 2008. He is currently Professor of Imaging Systems in the Department of Mechanical
and Electrical Engineering in the School of Food and Advanced Technology at Massey University,
where he is Leader of the Centre for Research in Image and Signal Processing. Prof. Bailey has spent
over 35 years applying image processing to a range of industrial, machine vision, and robot vision
applications. For the last 18 years, one area of particular focus has been exploring different aspects
of using FPGAs for implementing and accelerating image processing algorithms. He is the author
of many publications in this field, including the book “Design for Embedded Image Processing on
FPGAs”, published by Wiley/IEEE Press. He is a Senior Member of the IEEE, and is active in the
New Zealand Central Section.

vii
Preface to ”Image Processing Using FPGAs”
Over the last 20 years, FPGAs have moved from glue logic through to computing platforms.
They effectively provide a reconfigurable hardware platform for implementing logic and algorithms.
Being fine-grained hardware, FPGAs are able to exploit the parallelism inherent within a hardware
design while at the same time maintaining the reconfigurability and programmability of software.
This has led to FPGAs being used as a platform for accelerating computationally intensive tasks. This
is particularly seen in the field of image processing, where the FPGA-based acceleration of imaging
algorithms has become mainstream. This is even more so within an embedded environment, where
the power and computational resources of conventional processors are not up to the task of managing
the data throughput and computational requirements of real-time imaging applications.
Unfortunately, the fine-grained nature of FPGAs also makes them difficult to programme
effectively. Conventional processors have a fixed computational architecture, which is able to provide
a high level of abstraction. By contrast, on an FPGA, it is necessary to design not only the algorithm
but also the computational architecture, which leads to an explosion in the design space complexity.
This, coupled with the complexities of managing the concurrency of a highly parallel design and the
bandwidth issues associated with the high volume of data associated with images and video, has
led to a wide range of approaches and architectures used for realising FPGA-based image processing
systems. This Special Issue provides an opportunity for researchers in this area to present some of
their latest results and designs. The diversity of presented techniques and applications reflects the
nature and current state of FPGA-based design for image processing.

Donald G. Bailey
Special Issue Editor

ix
Journal of
Imaging
Editorial
Image Processing Using FPGAs
Donald G. Bailey
Department of Mechanical and Electrical Engineering, School of Food and Advanced Technology,
Massey University, Palmerston North 4442, New Zealand; [email protected]

Received: 6 May 2019; Accepted: 7 May 2019; Published: 10 May 2019

Abstract: Nine articles have been published in this Special Issue on image processing using
field programmable gate arrays (FPGAs). The papers address a diverse range of topics relating
to the application of FPGA technology to accelerate image processing tasks. The range includes:
Custom processor design to reduce the programming burden; memory management for full frames,
line buffers, and image border management; image segmentation through background modelling,
online K-means clustering, and generalised Laplacian of Gaussian filtering; connected components
analysis; and visually lossless image compression.

Keywords: field programmable gate arrays (FPGA); image processing; hardware/software co-design;
memory management; segmentation; image analysis; compression

1. Introduction to This Special Issue


Field programmable gate arrays (FPGAs) are increasingly being used for the implementation
of image processing applications. This is especially the case for real-time embedded applications,
where latency and power are important considerations. An FPGA embedded in a smart camera is able
to perform much of the image processing directly as the image is streamed from the sensor, with the
camera providing a processed output data stream, rather than a sequence of images. The parallelism of
hardware is able to exploit the spatial (data level) and temporal (task level) parallelism implicit within
many image processing tasks. Unfortunately, simply porting a software algorithm onto an FPGA often
gives disappointing results, because many image processing algorithms have been optimised for a
serial processor. It is usually necessary to transform the algorithm to efficiently exploit the parallelism
and resources available on an FPGA. This can lead to novel algorithms and hardware computational
architectures, both at the image processing operation level and also at the application level.
The aim of this Special Issue is to present and highlight novel algorithms, architectures, techniques,
and applications of FPGAs for image processing. A total of 20 submissions were received for the
Special Issue, with nine papers being selected for final publication.

2. Contributions
Programming an FPGA to accelerate complex algorithms is difficult, with one of four approaches
commonly used [1]:

• Custom hardware design of the algorithm using a hardware description language, optimised for
performance and resources;
• implementing the algorithm by instantiating a set of application-specific intellectual property
cores (from a library);
• using high-level synthesis to convert a C-based representation of the algorithm to
synthesisable hardware; or
• mapping the algorithm onto a parallel set of programmable soft-core processors.

J. Imaging 2019, 5, 53; doi:10.3390/jimaging5050053 1 www.mdpi.com/journal/jimaging



The article by Siddiqui et al. [1] takes this last approach, describing the design of an efficient
16-bit integer soft-core processor, IPPro, capable of operating at 337 MHz, specifically targeting the
dataflow seen in complex image processing algorithms. The presented architecture uses dedicated
stream access instructions on the input and output, with a 32-element local memory for storing pixels
and intermediate results, and a separate 32-element kernel memory for storing filter coefficients
and other parameters and constants. The exploitation of both data-level parallelism and task-level
parallelism is demonstrated through the mapping of a K-means clustering algorithm onto the
architecture, showing good scalability of processing speed with multiple cores. A second case study of
traffic sign recognition is partitioned between the IPPro cores and an ARM processor, with the colour
conversion and morphological filtering stages mapped to the IPPro. Again, the use of parallel IPPro
cores can significantly accelerate these tasks, compared to conventional software, without having to
resort to the tedious effort of custom hardware design.
Garcia et al. [2] start from the observation that image processing operations which require random
access to the whole frame (including iterative algorithms) are particularly difficult to realise on FPGAs.
They investigate the mapping of a frame buffer onto the memory resources of an FPGA, and explore
the optimal mapping onto combinations of configurable on-chip memory blocks. They demonstrate
that, for many image sizes, the default mapping by the synthesis tools results in poor utilisation, and is
also inefficient in terms of power requirements. A procedure is described that determines the best
memory configuration, based on balancing resource utilisation and power requirements. The mapping
scheme is demonstrated with optical flow and mean shift tracking algorithms.
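The flavour of this mapping problem can be illustrated with a minimal sketch (not the authors' actual cost model): assuming a Xilinx-style set of 18 Kb block RAM aspect ratios, count the blocks a frame buffer needs under each configuration and pick the cheapest.

```python
import math

# Assumed 18 Kb block RAM configurations as (depth, width) pairs;
# the real primitives and the power/utilisation model in [2] are richer.
ASPECTS = [(16384, 1), (8192, 2), (4096, 4), (2048, 9), (1024, 18), (512, 36)]

def blocks_needed(num_words, bits_per_word, depth, width):
    """Blocks required to store num_words words of bits_per_word bits
    when each block is configured as depth x width."""
    cols = math.ceil(bits_per_word / width)   # blocks in parallel for word width
    rows = math.ceil(num_words / depth)       # blocks cascaded for address depth
    return cols * rows

def best_mapping(num_words, bits_per_word):
    """Aspect ratio minimising the block count for this frame buffer."""
    return min(ASPECTS, key=lambda dw: blocks_needed(num_words, bits_per_word, *dw))
```

For a 640 × 480, 8-bit frame, the narrow 16K × 1 configuration needs 152 blocks while 4K × 4 needs only 150, illustrating why the default tool mapping can waste resources.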
On the other hand, local operations (such as filters) only need part of the image to produce an
output, and operate efficiently in stream processing mode, using line buffers to cache data for scanning
a local window through the image. This works well when the image size is fixed, and is known in
advance. Two situations where this approach is less effective [3] are region-of-interest processing,
where only a small region of the image is processed (usually determined from the image contents at
run-time), and cloud processing of user-uploaded images (which may be of arbitrary size). This is
complicated further in high-speed systems, where the real-time requirements demand processing
multiple pixels in every clock cycle, because, if the line width is not a multiple of the number of pixels
processed each cycle, then it is necessary to assemble the output window pixels from more than one
memory block. Shi et al. [3], in their paper, extend their earlier work on assembling the output window
to allow arbitrary image widths. The resulting line buffer must be configurable at run-time, which is
achieved through a series of “instructions”, which control the assembly of the output processing
window when the required data spans two memory blocks. Re-configuration only takes a few clock
cycles (to load the instructions), rather than conventional approach of reconfiguring the FPGA each
time the image width changes. The results demonstrate better resource utilisation, higher throughput,
and lower power than their earlier approach.
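The line-buffer idea itself can be sketched behaviourally as follows (an illustration only, assuming a fixed, known image width and one pixel per cycle; a hardware design holds just the last two rows and a small register window rather than whole-row Python lists):

```python
def stream_3x3_windows(pixels, width):
    """Consume a raster-ordered pixel stream and yield 3x3 windows.
    The `rows` list plays the role of the line buffers: once three
    complete lines are cached, windows slide across them and the
    oldest line is discarded."""
    rows, current = [], []
    for p in pixels:
        current.append(p)
        if len(current) == width:      # a full line has streamed in
            rows.append(current)
            current = []
            if len(rows) == 3:         # enough lines cached to form windows
                for x in range(width - 2):
                    yield [r[x:x + 3] for r in rows]
                rows.pop(0)            # oldest line leaves the buffer
```

Streaming a 4 × 4 image through this produces four interior 3 × 3 windows; the complications addressed in [3] arise when `width` is not known at synthesis time or several pixels arrive per cycle.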
When applying window operations to an image, the size of the output image is smaller than
the input because data is not valid when the window extends beyond the image border. If necessary,
this may be mitigated by extending the input image to provide data to allow such border pixels to be
calculated. Prior work only considered border management using direct form filter structures, because
the window formation and filter function can be kept independent. However, in some applications,
transpose-form filter structures are desirable because the corresponding filter function is automatically
pipelined, leading to fewer resources and faster clock frequencies. Bailey and Ambikumar [4] provide
a design methodology for border management using transpose filter structures, and show that the
resource requirements are similar to those for direct-form border management.
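As an illustration of border extension in general (a generic sketch of two common policies, not the specific transpose-form structures of [4]):

```python
def extend_border(image, r, mode="replicate"):
    """Pad an image by r pixels on every side so a (2r+1)x(2r+1) window
    can be formed at every original pixel position. Two common policies
    are sketched: edge replication and mirroring (without repeating the
    edge pixel). Assumes r is smaller than the image dimensions."""
    h, w = len(image), len(image[0])

    def replicate(i, n):
        return max(0, min(i, n - 1))

    def mirror(i, n):
        i = abs(i)                     # reflect off the top/left edge
        if i >= n:
            i = 2 * (n - 1) - i       # reflect off the bottom/right edge
        return i

    idx = replicate if mode == "replicate" else mirror
    return [[image[idx(y, h)][idx(x, w)] for x in range(-r, w + r)]
            for y in range(-r, h + r)]
```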
An important task in computer vision is segmenting objects from a complex background. While
there are many background modelling algorithms, the complexity of robust algorithms makes them
difficult to realise on an FPGA, especially for larger image sizes. Chen et al. [5] address scalability issues
with increasing image size by using super-pixels—small blocks of adjacent pixels that are treated as a
single unit. As each super-pixel is considered to be either object or background, this means that fewer
models need to be maintained (less memory) and fewer elements need to be classified (reduced
computation time). Using hardware/software co-design, they accelerated the computationally
expensive steps of Gaussian filtering and calculating the mean and variance within each super-pixel
with hardware, with the rest of the algorithm being realised on the on-chip CPU. The resulting system
gave close to state-of-the-art classification accuracy.
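The hardware-accelerated mean and variance step can be illustrated with a single-pass accumulation sketch (square super-pixel blocks are an assumption made here for simplicity; the accumulators mirror what a streaming datapath would keep):

```python
def superpixel_stats(image, block):
    """Single-pass mean and variance for each (block x block) super-pixel.
    As each pixel streams past, only a sum, a sum of squares and a count
    are accumulated per super-pixel; means and variances are finalised
    once at the end, as hardware would do after the frame."""
    stats = {}
    for y, row in enumerate(image):
        for x, p in enumerate(row):
            key = (y // block, x // block)
            s, s2, n = stats.get(key, (0, 0, 0))
            stats[key] = (s + p, s2 + p * p, n + 1)
    return {k: (s / n, s2 / n - (s / n) ** 2)   # E[p^2] - E[p]^2
            for k, (s, s2, n) in stats.items()}
```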
A related paper, by Badawi and Bilal [6], used K-means clustering to segment objects within video
sequences. Rather than taking the conventional iterative approach to K-means clustering, they rely
on the temporal coherence of video streams and use the cluster centres from the previous frame as
initialisation for the current frame. Additionally, rather than waiting until the complete frame has
been accumulated before updating the cluster centres, an online algorithm is used, with the clusters
updated for each pixel. To reduce the computational requirements, the centres are updated using a
weighted average. They demonstrate that, for typical video streams, this gives similar performance to
conventional K-means algorithms, but with far less computation and power.
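A minimal sketch of such an online update on scalar (e.g., grey-level) pixels follows; the fixed learning-rate form of the weighted average is an assumption of this sketch, and the paper's exact update and colour handling differ:

```python
def online_kmeans_frame(pixel_stream, centres, alpha=0.05):
    """One frame of online K-means. Each pixel is assigned to the nearest
    centre, which is then nudged towards it by learning rate alpha (a
    weighted average, avoiding the division of a full batch recompute).
    Passing the returned centres into the next frame provides the warm
    start that exploits temporal coherence of the video stream."""
    labels = []
    for p in pixel_stream:
        k = min(range(len(centres)), key=lambda i: abs(p - centres[i]))
        centres[k] += alpha * (p - centres[k])   # incremental centre update
        labels.append(k)
    return labels, centres
```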
In another segmentation paper, Zhou et al. [7] describe the use of a generalised Laplacian of
Gaussian (LoG) filter for detecting cell nuclei for a histopathology application. The LoG filters detect
elliptical blobs at a range of scales and orientations. Local maxima of the responses are used as
candidate seeds for cell centres, and mean-shift clustering is used to combine multiple detections
from different scales and orientations. Their FPGA design gave modest acceleration over a software
implementation on a high-end computer.
Given a segmented image, a common task is to measure feature vectors of each connected
component for analysis. Bailey and Klaiber [8] present a new single-pass connected components
analysis algorithm, which does this with minimum latency and relatively few resources. The key novelty
of this paper is the use of a zig-zag based scan, rather than a conventional raster scan. This eliminates the
end-of-row processing for label resolution by integrating it directly within the reverse scan. The result is
true single-pixel-per-clock-cycle processing, with no overheads at the end of each row or frame.
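The scan order itself is simple to state (a sketch of the zig-zag traversal only; the label resolution logic of [8] is not reproduced here):

```python
def zigzag_order(height, width):
    """Boustrophedon scan: even rows left-to-right, odd rows right-to-left.
    The last pixel of one row is vertically adjacent to the first pixel of
    the next, so labels can be resolved on the turn instead of stalling
    for end-of-row processing as a raster scan would."""
    for y in range(height):
        xs = range(width) if y % 2 == 0 else range(width - 1, -1, -1)
        for x in xs:
            yield (y, x)
```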
An important real-time application of image processing is embedded online image compression
for reducing the data bandwidth for image transmission. In the final paper within this Special Issue,
Wang et al. [9] defined a new image compression codec which works efficiently with a streamed image,
and minimises the perceptual distortion within the reconstructed images. Through small local filters,
each pixel is classified as either an edge, a smooth region, or a textured region. These relate to a
perceptual model of contrast masking, allowing just noticeable distortion (JND) thresholds to be
defined. The image is compressed by downsampling; however, if the error in any of the contributing
pixels exceeds the visibility thresholds, the 2 × 2 block is considered a region of interest, with the
4 pixels coded separately. In both cases, the pixel values are predicted using a 2-dimensional predictor,
and the prediction residuals are quantised and entropy-encoded. Results typically give a visually
lossless 4:1 compression, which is significantly better than other visually lossless codecs.
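The per-block decision can be caricatured as follows (a sketch only; the JND threshold derivation, the 2-dimensional prediction, and the entropy coding of [9] are omitted, and the threshold values here are assumed inputs):

```python
def compress_block(block_2x2, jnd_2x2):
    """Downsample a 2x2 block to its mean unless any pixel's error would
    exceed its just-noticeable-distortion threshold, in which case the
    block becomes a region of interest and all 4 pixels are kept.
    jnd_2x2 holds per-pixel visibility thresholds, assumed to come from
    the edge/smooth/texture classification described in the paper."""
    pixels = [p for row in block_2x2 for p in row]
    thresholds = [t for row in jnd_2x2 for t in row]
    mean = sum(pixels) / 4.0
    if all(abs(p - mean) <= t for p, t in zip(pixels, thresholds)):
        return ("downsampled", mean)     # one value codes the whole block
    return ("roi", pixels)               # keep all 4 pixels separately
```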

3. Conclusions
Overall, this collection of papers reflects the diversity of approaches taken to applying FPGAs to
image processing applications: from using the programmable logic to design lightweight custom
processors that enable parallelism, through overcoming some of the limitations of current high-level
synthesis tools, to designing custom hardware at the register-transfer level.
The range of image processing techniques include filtering, segmentation, clustering, and
compression. Applications include traffic sign recognition for autonomous driving, histopathology,
and video compression.


Funding: This research received no external funding.


Acknowledgments: The Guest Editor would like to acknowledge the time and contributions of the authors
(both successful and unsuccessful) who prepared papers for this Special Issue. Special thanks go to all the
reviewers who provided constructive reviews of the papers in a timely manner; your analysis and feedback have
ensured the quality of the papers selected. It is also necessary to acknowledge the assistance given by the MDPI
editorial team, in particular Managing Editors Alicia Wang and Veronica Wang, who made my task as Guest
Editor much easier.
Conflicts of Interest: The author declares no conflict of interest.

References
1. Siddiqui, F.; Amiri, S.; Minhas, U.I.; Deng, T.; Woods, R.; Rafferty, K.; Crookes, D. FPGA-based processor
acceleration for image processing applications. J. Imaging 2019, 5, 16. [CrossRef]
2. Garcia, P.; Bhowmik, D.; Stewart, R.; Michaelson, G.; Wallace, A. Optimized memory allocation and power
minimization for FPGA-based image processing. J. Imaging 2019, 5, 7. [CrossRef]
3. Shi, R.; Wong, J.S.; So, H.K.H. High-throughput line buffer microarchitecture for arbitrary sized streaming
image processing. J. Imaging 2019, 5, 34. [CrossRef]
4. Bailey, D.G.; Ambikumar, A.S. Border handling for 2D transpose filter structures on an FPGA. J. Imaging
2018, 4, 138. [CrossRef]
5. Chen, A.T.Y.; Gupta, R.; Borzenko, A.; Wang, K.I.K.; Biglari-Abhari, M. Accelerating SuperBE with
hardware/software co-design. J. Imaging 2018, 4, 122. [CrossRef]
6. Badawi, A.; Bilal, M. High-level synthesis of online K-Means clustering hardware for a real-time image
processing pipeline. J. Imaging 2019, 5, 38. [CrossRef]
7. Zhou, H.; Machupalli, R.; Mandal, M. Efficient FPGA implementation of automatic nuclei detection in
histopathology images. J. Imaging 2019, 5, 21. [CrossRef]
8. Bailey, D.G.; Klaiber, M.J. Zig-zag based single pass connected components analysis. J. Imaging 2019, 5, 45.
[CrossRef]
9. Wang, Z.; Tran, T.H.; Muthappa, P.K.; Simon, S. A JND-based pixel-domain algorithm and hardware
architecture for perceptual image coding. J. Imaging 2019, 5, 50. [CrossRef]

© 2019 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Journal of
Imaging
Article
FPGA-Based Processor Acceleration for Image
Processing Applications
Fahad Siddiqui 1,†, Sam Amiri 2,†, Umar Ibrahim Minhas 1, Tiantai Deng 1, Roger Woods 1,*,
Karen Rafferty 1 and Daniel Crookes 1
1 School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast,
Belfast BT7 1NN, UK; [email protected] (F.S.); [email protected] (U.I.M.); [email protected] (T.D.);
[email protected] (K.R.); [email protected] (D.C.)
2 School of Computing, Electronics and Maths, Coventry University, Coventry CV1 5FB, UK;
[email protected]
* Correspondence: [email protected]; Tel.: +44-289-097-4081
† These authors contributed equally to this work.

Received: 27 November 2018; Accepted: 7 January 2019; Published: 13 January 2019

Abstract: FPGA-based embedded image processing systems offer considerable computing resources
but present programming challenges when compared to software systems. The paper describes an
approach based on an FPGA-based soft processor called Image Processing Processor (IPPro) which can
operate up to 337 MHz on a high-end Xilinx FPGA family and gives details of the dataflow-based
programming environment. The approach is demonstrated for a k-means clustering operation and
a traffic sign recognition application, both of which have been prototyped on an Avnet Zedboard
that has Xilinx Zynq-7000 system-on-chip (SoC). A number of parallel dataflow mapping options
were explored giving a speed-up of 8 times for the k-means clustering using 16 IPPro cores, and a
speed-up of 9.6 times for the morphology filter operation of the traffic sign recognition using
16 IPPro cores compared to their equivalent ARM-based software implementations. We show that for
k-means clustering, the 16 IPPro cores implementation is 57, 28 and 1.7 times more power efficient
(fps/W) than ARM Cortex-A7 CPU, nVIDIA GeForce GTX980 GPU and ARM Mali-T628 embedded
GPU respectively.

Keywords: FPGA; hardware acceleration; processor architectures; image processing;


heterogeneous computing

1. Introduction
With improved sensor technology, there has been considerable growth in the amount of data
being generated by security cameras. In many remote environments with limited communication
bandwidth, there is a clear need to overcome this by moving functionality into the system, such
as motion estimation in smart cameras [1]. As security requirements grow, the processing
demands will only increase.
New forms of computing architectures are needed. In the late 1970s, Lamport [2] laid the foundations
of parallel architectures exploiting data-level parallelism (DLP) using workload vectorisation and
shared-memory parallelisation, techniques used extensively in graphics processing units (GPUs). Current energy
requirements and limitations of Dennard scaling have acted to limit clock scaling and thus reduce
future processing capabilities of GPUs or multi-core architectures [3]. Recent field programmable gate
array (FPGA) architectures represent an attractive alternative for acceleration as they comprise ARM
processors and programmable logic for accelerating computing intensive operations.
FPGAs are proven computing platforms that offer reconfigurability, concurrency and pipelining,
but have not been accepted as a mainstream computing platform. The primary inhibitor is the need to

J. Imaging 2019, 5, 16; doi:10.3390/jimaging5010016 5 www.mdpi.com/journal/jimaging



use specialist programming tools, describing algorithms in a hardware description language (HDL), although
this has been alleviated by the introduction of high-level programming tools such as Xilinx’s Vivado
High-level Synthesis (HLS) and Intel’s (Altera’s) compiler for OpenCL. While the level of abstraction
has been raised, a gap still exists between adaptability, performance and efficient utilisation of FPGA
resources. Nevertheless, the FPGA design flow still requires design synthesis and place-and-route that
can be time-consuming depending on the complexity and size of the design [4,5]. This FPGA design
flow is alien to software/algorithm developers and inhibits wider use of the technology.
One way to approach this research problem is to develop an adaptable FPGA hardware architecture
that enables the edit-compile-run flow familiar to software and algorithm developers, instead of hardware
synthesis and place-and-route. This can be achieved by populating the FPGA logic with a number of efficient
soft-core processors used for programmable hardware acceleration. This underlying architecture will
be adaptable and can be programmed using conventional software development approaches. However,
the challenge is to build an FPGA solution that is more easily programmed whilst still providing high
performance. Whilst FPGA-based processor architectures such as Xilinx's MicroBlaze, Altera's
NIOS and others exist [6–9], we propose an Image Processing Processor (IPPro) [10] tailored to
accelerate image processing operations, thereby providing an excellent mapping between FPGA
resources, speed and programming efficiency. The main purpose of the paper is to give insights into
the multi-core processor architecture built using the IPPro architecture, its programming environment
and outline its applications to two image processing applications. Our main contributions are:

• Creation of an efficient, FPGA-based multicore processor which advances previous work [10,11],
and an associated dataflow-based compiler environment for programming a heterogeneous
FPGA resource comprising IPPro cores and ARM processors.
• Exploration of mapping the functionality for a k-means clustering function, resulting in a possible
speedup of up to 8 times that is 57, 28 and 1.7 times more power efficient (fps/W) than ARM
Cortex-A7 CPU, nVIDIA GeForce GTX980 GPU and ARM Mali-T628 embedded GPU.
• Acceleration of colour and morphology operations of traffic sign recognition application, resulting
in a speedup of 4.5 and 9.6 times respectively on a Zedboard.

The rest of the paper is organized as follows: Section 2 outlines the various image processing
requirements and shows how these can be matched to FPGAs; relevant research is also reviewed.
System requirements are outlined in Section 3, and the soft core processor architecture is briefly
reviewed in Section 4. The system architecture is outlined in Section 5. Experiments to accelerate a
k-means clustering algorithm and a traffic sign recognition example are presented in Sections 6 and 7,
respectively. Conclusions and future work are described in Section 8.

2. Background
Traditionally, vision systems have been created in a centralized manner, where video from multiple cameras is sent to a central back-end computing unit to extract significant features. However, with an increasing number of nodes and the use of wireless communications, this approach becomes increasingly limited, particularly with higher resolution cameras [12]. A distributed processing approach can instead be employed, where data-intensive, front-end preprocessing such as sharpening, object detection etc. is deployed remotely, thus avoiding the need to transmit high-bandwidth video streams back to the server.

2.1. Accelerating Image Processing Algorithms


Nugteren et al. have characterized image processing operations based on their computation and communication patterns [13], as highlighted in Table 1. A vision processing architecture can be composed of general and special purpose processors, FPGAs or combinations thereof. FPGAs offer opportunities to exploit the fine/coarse-grained parallelism that most image processing applications exhibit at the front-end processing stage. Heterogeneous architectures comprising CPUs and FPGA fabrics thus offer a good balance in terms of performance, cost and energy efficiency.

6
J. Imaging 2019, 5, 16

Brodtkorb et al. have compared the architectural and programming language properties of heterogeneous architectures comprising CPUs, GPUs and FPGAs [14], showing that FPGAs deliver a better performance/W ratio for fixed-point operations; however, they are difficult to program. Different design approaches have been adopted by the research community to build FPGA-based hardware accelerators. These include:

• Customised hardware accelerator designs in HDLs which require long development times but
can be optimised in terms of performance and area.
• Application specific hardware accelerators which are generally optimized for a single function,
non-programmable and created using IP cores.
• Designs created using high-level synthesis tools, such as Xilinx's Vivado HLS tool and Altera's OpenCL compiler, which convert a C-based specification into synthesizable RTL code [15], allowing pipelining and parallelization to be explored.
• Programmable hardware accelerators in the form of vendor-specific soft processors, such as Xilinx's MicroBlaze and Altera's NIOS II processors, and customized hard/soft processors.

Table 1. Categorisation of image processing operations based on their memory and execution patterns [13], highlighting the compute and memory features and therefore identifying what can be mapped onto an FPGA.

| Operation Type | Domain | Output Depends on | Memory Pattern | Execution Pattern | Examples |
|---|---|---|---|---|---|
| Point and Line | Spatial | Single input pixel | Pipelined | One-to-one | Intensity change by factor, negative image-inversion |
| Area/Local | Spatial | Neighbouring pixels | Coalesced | Tree | Convolution functions: Sobel, Sharpen, Emboss |
| Geometric | Spatial | Whole frame | Recursive non-coalesced | Large reduction tree | Rotate, Scale, Translate, Reflect, Perspective and Affine |

2.2. Soft Processor Architectures


Numerous FPGA multiprocessor architectures have been created to accelerate applications.
Strik et al. used a heterogeneous multiprocessor system with a reconfigurable network-on-chip to
process multiple video streams concurrently in real-time [16]. VectorBlox MXP [7] is the latest of a series of vector-based soft core processor architectures designed to exploit data-level parallelism (DLP) by processing vectors.
Optimizations employed include replacing a vector register file with a scratchpad memory to allow for
arbitrary data packing and access, removing vector length limits, enabling sub-word single-instruction,
multiple-data (SIMD) within each lane and a DMA-based memory interface.
Zhang et al. have created composable vector units [17], allowing a dataflow graph (DFG) of a vector program to be statically compiled and clusters of operations to be composed together to create new streaming instructions that use multiple operators and operands. This is similar to traditional vector chaining but is not easily extended to support wide SIMD-style parallelism, and the reported speed-ups were less than a factor of two. Further optimizations have been employed in a custom soft vector processor (SVP) written in Bluespec [18]; comparing a custom pipeline to the SVP implementation, the authors found that performance was within a factor of two given similar resource usage. Kapre et al. have proposed GraphSoC, a custom soft processor for accelerating graph algorithms [19]. It is a three-stage pipelined processor that supports graph semantics (node and edge operations). The processor was designed with Vivado HLS. Each core uses nine BRAMs and runs at 200 MHz.
Octavo [20] is a multi-threaded, ten-cycle processor that runs at 550 MHz on a Stratix IV, equivalent
to the maximum frequency supported by memory blocks. A deep pipeline is necessary to support this
high operating frequency, but suffers from the need to pad dependent instructions to overcome data
hazards. The authors sidestep this issue by designing Octavo as a multi-processor, thus dependent
instructions are always sufficiently far apart and NOP padding is not needed. Andryc et al. presented


a GPGPU architecture called FlexGrip [8] which, like vector processors, supports wide data-parallel, SIMD-style computation using multiple parallel compute lanes, provides support for conditional operations, and requires optimized interfaces to on- and off-chip memory. FlexGrip maps pre-compiled CUDA kernels onto soft core processors which are programmable and operate at 100 MHz.

3. System Implementation
Whilst earlier versions of FPGAs just comprised multiple Lookup Tables (LUTs) connected to registers and accelerated by fast adders, FPGAs now also comprise more coarse-grained functions such as dedicated, full-custom, low-power DSP slices. For example, the Xilinx DSP48E1 block comprises a 25-bit pre-adder, a 25 × 18-bit multiplier and a 48-bit adder/subtracter/logic unit; this is complemented by multiple distributed RAM blocks which offer high bandwidth capability (Figure 1) and a plethora of registers which support high levels of pipelining.

Figure 1. Bandwidth/memory distribution in a Xilinx Virtex-7 FPGA, highlighting how bandwidth and computation improve as we near the datapath parts of the FPGA.

Whilst FPGAs have been successfully applied in embedded systems and communications,
they have struggled as a mainstream computational platform. Addressing the following considerations
would make FPGAs a major platform rival for “data-intensive” applications:

• Programmability: there is a need for a design methodology which includes a flexible data
communication interface to exchange data. Intellectual Property (IP) cores and HLS tools [15]/
OpenCL design routes increase programming abstraction but do not provide the flexible system
infrastructure for image processing systems.
• Dataflow support: the dataflow model of computation is a recognized model for data-intensive
applications. Algorithms are represented as a directed graph composed of nodes (actors) as
computational units and edges as communication channels [21]. The actors run explicitly in parallel as decided by the user, while actor functionality can be either sequential or concurrent. Current FPGA realizations exploit the concurrency of the whole design at a higher level but eliminate reprogrammability. A better approach is to keep reprogrammability while still maximizing parallelism by running actors on simple "pipelined" processors; the actors still run their code explicitly in parallel (user-specified).
• Heterogeneity: the processing features of FPGAs should be integrated with CPUs. Since dataflow supports both sequential and concurrent platforms, the challenge is then to map sequential code effectively onto CPUs and parallelizable code onto the FPGA.
• Toolset availability: design tools should be available that compile user-defined, high-level dataflow programs to a fully reprogrammable heterogeneous platform.

High-Level Programming Environment


The proposed methodology employs a reprogrammable model comprising multi-core processors
supporting SIMD operation and an associated inter-processor communication methodology.
A dataflow design methodology has been chosen as the high-level programming approach as it offers concurrency, scalability, modularity and data-driven properties, all of which match the design requirements associated with image processing systems. A dataflow model allows algorithms to be realized as actors with specific firing rules that are mapped into directed graphs, where the nodes represent computations and the arcs represent the movement of data. The term data-driven expresses that the execution of a dataflow program is controlled by the availability of the data itself. In this context, an actor


is a standalone entity, which defines an execution procedure and can be implemented in the IPPro
processor. Actors communicate by passing data tokens through First-In-First-Out (FIFO) units. The combination of a set of actors with a set of connections between actors constructs a network, which maps well to the system-level architecture of the IPPro processors. An earlier version of the programming environment is detailed in [11]; it allows the user to explore parallel implementations and provides the necessary back-end compilation support.
In our flow, every processor can be thought of as an actor and data is fired through the FIFO structures, but the approach needs to be sensitive to FPGA-based limitations such as restricted memory. The CAL Actor Language (CAL) [22] is a dataflow programming language that has been targeted at image processing and FPGAs, and it offers the necessary constructs for expressing parallel or sequential coding, bitwise types, a consistent memory model, and communication between parallel tasks through queues. RVC-CAL is supported by Orcc, an open-source dataflow development environment and compiler framework that allows the trans-compilation of actors and generates equivalent code depending on the chosen back-ends [23]. An RVC-CAL based design is composed of a dataflow network file (.xdf file) that supports task and data-level parallelism.
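As an illustration of this token-passing model, the sketch below simulates a two-stage actor network in Python, with deques standing in for the blocking FIFOs. It is a toy model only: the actor names, firing rules and scheduler are invented for illustration and do not correspond to RVC-CAL/Orcc semantics or the IPPro implementation.

```python
from collections import deque

class Actor:
    """Minimal dataflow actor: fires only when its input FIFO holds enough tokens."""
    def __init__(self, name, func, tokens_per_fire=1):
        self.name = name
        self.func = func                  # computation applied to consumed tokens
        self.tokens_per_fire = tokens_per_fire
        self.inbox = deque()              # input FIFO
        self.outbox = deque()             # output FIFO

    def can_fire(self):
        return len(self.inbox) >= self.tokens_per_fire

    def fire(self):
        tokens = [self.inbox.popleft() for _ in range(self.tokens_per_fire)]
        self.outbox.append(self.func(tokens))

def run_network(actors, source_tokens):
    """Push tokens through a linear chain of actors until no actor can fire."""
    actors[0].inbox.extend(source_tokens)
    progress = True
    while progress:
        progress = False
        for i, actor in enumerate(actors):
            while actor.can_fire():
                actor.fire()
                progress = True
            if i + 1 < len(actors):       # forward produced tokens downstream
                while actor.outbox:
                    actors[i + 1].inbox.append(actor.outbox.popleft())
    return list(actors[-1].outbox)

# Two-stage chain: invert 8-bit pixels, then threshold at 128.
chain = [
    Actor("invert", lambda t: 255 - t[0]),
    Actor("threshold", lambda t: 1 if t[0] >= 128 else 0),
]
print(run_network(chain, [0, 100, 200, 255]))  # [1, 1, 0, 0]
```

Each actor fires only when enough tokens are queued at its input, mirroring the firing rules described above.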
Figure 2 illustrates the possible pipelined decomposition of dataflow actors. These dataflow
actors need to be balanced as the worst-case execution time of the actor determines the overall
achievable performance. Data-level parallelism is achieved by making multiple instances of an actor and requires SIMD operations that must be supported by the underlying processor architecture. In addition, it requires a software-configurable system-level infrastructure that manages control and data distribution/collection tasks. This involves the initialisation of the soft core processors (programming the decomposed dataflow actor description), receiving data from the host processor, distributing them to first-level actors, gathering processed data from the final-level actors and sending it back to the host processor.
Data-level parallelism directly impacts the system performance; the major limiting factor is the
number of resources available on FPGA. An example pipeline structure with an algorithm composed
of four actors each having different execution times, and multiple instances of the algorithm realised
in SIMD fashion, is shown in Figure 2. The performance metric, frames-per-second (fps), can be approximated from N(total_pixels), the number of pixels in a frame; N(pixel_consumption), the number of pixels consumed by an actor in each iteration; and f(processor), the operating frequency of the processor.

fps ≈ f(processor) ∗ N(pixel_consumption) / N(total_pixels)    (1)
To improve the fps, the following options are possible:

• Efficient FPGA-based processor design that operates at a higher operating frequency f(processor).
• Reducing the actor's execution time by decomposing it into multiple pipelined stages, thus reducing t(actor) to improve the fps. Shorter actors can be merged sequentially to minimise the data transfer overhead by localising data into FIFOs between processing stages.
• Vertical scaling to exploit data parallelism by mapping an actor onto multiple processor cores, thus reducing N(total_pixels)/(n ∗ N(pixel_consumption)) at the cost of additional system-level data distribution, control, and collection mechanisms.
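As a rough numerical illustration, Equation (1) and the vertical-scaling option can be sketched as below. The 337 MHz clock matches the IPPro f(processor) reported in Section 4.3; the 512 × 512 frame size and the core count are arbitrary examples, and the result is an upper bound that ignores per-token instruction counts.

```python
def fps_estimate(f_processor_hz, pixels_per_iteration, total_pixels, simd_cores=1):
    """Equation (1), with optional vertical scaling over n SIMD cores:
    fps ~= n * f(processor) * N(pixel_consumption) / N(total_pixels)."""
    return simd_cores * f_processor_hz * pixels_per_iteration / total_pixels

# 337 MHz clock, one pixel consumed per iteration, a 512 x 512 frame.
print(round(fps_estimate(337e6, 1, 512 * 512)))                # 1286
print(round(fps_estimate(337e6, 1, 512 * 512, simd_cores=4)))  # 5142
```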

Pipeline stage delay →
    Actor 1 → Actor 2 → Actor 3 → Actor 4
SIMD degree ↓
    Actor 1 → Actor 2 → Actor 3 → Actor 4
    Actor 1 → Actor 2 → Actor 3 → Actor 4
    ...

Figure 2. Illustration of possible data- and task-parallel decomposition of a dataflow algorithm found in image processing designs, where the number of rows indicates the level of parallelism.


The developed tool flow (Figure 3) starts with a user-defined RVC-CAL description composed of actors selected to execute on the FPGA-based soft cores, with the rest run on the host CPUs. By analyzing behaviour, software/hardware partitioning is decided by two main factors: the actors with the worst execution time (determined exactly by the number of instructions and the average waiting time to receive the input tokens and send the produced tokens), and the overheads incurred in transferring the image data to/from the accelerator. The behavioural description of an algorithm can be coded in different formats:

• No explicitly balanced actors or actions are provided by the user.
• The actors include actions which are balanced without depending on each other, e.g., no global variable in an actor is updated by one action and then used by the other ones; otherwise, these would need to be decomposed into separate actors.
• The actors are explicitly balanced and only require hardware/software partitioning.

[Design flow: Behavioural Description in RVC-CAL → Software/Hardware Partitioning → Redesign of CPU-Targeted Actors in RVC-CAL / Redesign of FPGA-Targeted Actors in RVC-CAL → RVC-CAL–C Compilation / Compiler Infrastructure (XDF Analysis, Actor Code Generation, SIMD Application, Interface Settings, Control-Register Value/Parameter Generation) → System Implementation]

Figure 3. A brief description of the design flow of a hardware and software heterogeneous system
highlighting key features. More detail of the flow is contained in reference [11].

There are two types of decomposition, “row-” and “column-wise”. The newly generated data-
independent actors can be placed row-wise at the same pipeline stage; otherwise they can be placed
column-wise as consecutive pipeline stages. Row-wise is preferred as the overhead incurred in token
transmission can be a limiting factor but typically a combination is employed.
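One way to sketch this row-/column-wise placement is to level the dataflow graph by dependency depth: actors at the same level are data-independent and can sit row-wise in the same pipeline stage, while consecutive levels form column-wise stages. The Python sketch below uses a hypothetical four-actor graph; it is an illustrative longest-path levelling, not the cost-driven placement of the actual tool flow.

```python
def pipeline_stage(deps, actor, memo=None):
    """Stage of an actor = 1 + longest chain of producers feeding it."""
    if memo is None:
        memo = {}
    if actor not in memo:
        memo[actor] = 1 + max(
            (pipeline_stage(deps, p, memo) for p in deps.get(actor, [])),
            default=0,
        )
    return memo[actor]

def place(deps):
    """Group actors by stage: same stage -> row-wise (parallel),
    consecutive stages -> column-wise (pipelined)."""
    stages = {}
    for a in deps:
        stages.setdefault(pipeline_stage(deps, a), []).append(a)
    return {s: sorted(actors) for s, actors in sorted(stages.items())}

# Hypothetical graph: A fans out to B and C (independent), both feed D.
deps = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
print(place(deps))  # {1: ['A'], 2: ['B', 'C'], 3: ['D']}
```

Here B and C land row-wise in stage 2, while A → {B, C} → D forms the column-wise pipeline.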
If the actors or actions are not balanced, then they need to be decomposed. This is done by detecting a sequence of instructions without branches (unless one occurs at the end) and then breaking the program into basic blocks. The "balance points" at which the actor is divided into multiple sets of basic blocks, each set placed in a new actor, then need to be found; the partition with the lowest overhead is selected, ensuring that transferring tokens among the sets does not create a bottleneck (see Ref. [11]). Once the graph is partitioned, the original xdf file no longer represents the network topology, so each set of actors must be redesigned separately, their input/output ports fixed, and a new set of xdf dataflow network description files generated. The actors to run on the host CPU are compiled from RVC-CAL to C using the C backend of the Orcc development environment, whereas the FPGA-based functionality is then created using the proposed compiler framework.
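The branch-delimited splitting step can be sketched as below. The mnemonics are placeholders rather than actual IPPro opcodes, and the cost-driven balance-point selection described in [11] is deliberately not modelled.

```python
def basic_blocks(program, branch_ops=("bz", "blt", "jmp")):
    """Split a linear instruction list into basic blocks: a block ends at a
    branch, which (as described above) may only appear at the end of a block."""
    blocks, current = [], []
    for instr in program:
        current.append(instr)
        if instr.split()[0] in branch_ops:   # branch terminates the block
            blocks.append(current)
            current = []
    if current:                              # trailing branch-free block
        blocks.append(current)
    return blocks

# Placeholder mnemonics, loosely styled after the stream instructions.
prog = ["get r1", "mulkm r2, r1, k0", "bz loop", "add r3, r2, r1", "push r3"]
print(basic_blocks(prog))
# [['get r1', 'mulkm r2, r1, k0', 'bz loop'], ['add r3, r2, r1', 'push r3']]
```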
The degree of SIMD applied will affect the controller interface settings. For a target board,
the design will have a fixed number of IPPro cores realized and interconnected with each other and


controllers, determined by the FPGA resources and fan-out delay; for the Zedboard considered here,
32 cores are selected. The compilation infrastructure is composed of three distinctive steps:

• Examination of the xdf dataflow network file and assignment and recording of the actor mapping
to the processors on the network.
• Compilation of each actor’s RVC-CAL code to IPPro assembly code.
• Generation of control register values, mainly for AXI Lite registers, and parameters required by the developed C-APIs running on the host CPU.

While FPGA-targeted actor interaction is handled by the compiler, the processes for receiving
the image data and storing the output in the edge actors need to be developed. Multiple controllers
(programmable by the host CPU) are designed to provide the interface to transfer the data to the
accelerators, gather the results and transfer them back to the host. With the host CPU running part
of the design and setting control registers, and the IPPro binary codes of the other actors loaded to
the proper cores on the accelerator, and the interface between the software/hardware sections set
accordingly, the system implementation is in place and ready to run.

4. Exploration of Efficient FPGA-Based Processor Design


Image processing applications extensively use multiply and accumulate operations for image segmentation and filtering, which can be efficiently mapped to FPGA. On the FPGA, the dedicated memory blocks are located next to the DSP blocks to minimise any timing delays, and it is this that determines the maximum operating frequency (f_max) of the processor. It is one of the reasons that many-core and multi-core architectures use simple, light-weight processing datapaths over complex and large out-of-order processors. However, maintaining the balance among soft processor functionality, scalability, performance and efficient utilisation of FPGA resources remains an open challenge.
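As a reminder of the pattern being mapped, a 3 × 3 window filter is simply a sliding-window multiply-accumulate; the nine-term inner product below is the operation that a DSP48E1 slice can perform at one multiply-accumulate per cycle. The Python version is only a functional reference, not an FPGA implementation.

```python
def filter3x3(image, kernel):
    """Apply a 3x3 window filter: each output pixel is a 9-term
    multiply-accumulate over the neighbourhood (correlation form)."""
    h, w = len(image), len(image[0])
    out = [[0] * (w - 2) for _ in range(h - 2)]
    for y in range(h - 2):
        for x in range(w - 2):
            acc = 0
            for ky in range(3):
                for kx in range(3):
                    acc += image[y + ky][x + kx] * kernel[ky][kx]  # one MAC
            out[y][x] = acc
    return out

# The identity kernel returns the centre pixel of each window.
print(filter3x3([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                [[0, 0, 0], [0, 1, 0], [0, 0, 0]]))  # [[5]]
```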
Figure 4 presents the impact of different configurations of the DSP48E1 and BRAM blocks on f_max using different FPGAs. The DSP48E1 has five configurations that offer different functionalities (multiplier, accumulator, pre-adder and pattern detector) based on different internal pipeline configurations that directly impact f_max. It varies by 15–52% for the same speed grade, and reduces by 12–20% when the same design is ported from a −3 to a −1 speed grade. Comparing single and true-dual port RAM configurations of the BRAM, Figure 4b shows that a true-dual port RAM configuration gives a reduction of 25% in f_max. However, an improvement of 16% is possible by migrating the design from Artix-7 to Kintex-7 FPGA technology.
Table 2 shows the distribution of compute (DSP48E1) and memory (BRAM) resources, and highlights the raw performance in GMAC/s (giga multiply-accumulates per second), across the largest FPGA devices covering both standalone and Zynq SoC chips. A BRAM/DSP ratio metric is reported to quantify the balance between compute and memory resources. In Zynq SoC devices, it is higher than in standalone devices because more memory is required to implement substantial data buffers to exchange data between the FPGA fabric and the host processor, while it is close to unity for standalone devices. This suggests that the BRAM/DSP ratio can be used to quantify the area efficiency of FPGA designs.
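The metric can be reproduced directly from the Table 2 figures:

```python
def bram_dsp_ratio(brams, dsps):
    """Area-efficiency metric from Table 2: 18 Kb BRAMs per DSP48E1 slice."""
    return round(brams / dsps, 2)

# Figures taken from Table 2.
print(bram_dsp_ratio(730, 740))   # Standalone Artix-7 XC7A200T -> 0.99
print(bram_dsp_ratio(280, 220))   # Zynq SoC XC7Z020            -> 1.27
```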


Figure 4. (a) Impact of DSP48E1 configurations on the maximum achievable clock frequency for different speed grades of Kintex-7 FPGAs: fully pipelined with no PATtern DETector (NOPATDET) and with one (PATDET); multiply with no MREG (MULT_NOMREG) and with pattern detector (MULT_NOMREG_PATDET); and multiply with pre-adder, no ADREG (PREADD_MULT_NOADREG). (b) Impact of BRAM configurations on the maximum achievable clock frequency of Artix-7, Kintex-7 and Virtex-7 FPGAs for single and true-dual port RAM configurations.

Table 2. Computing resources (DSP48E1) and BRAM memory resources for a range of Xilinx Artix-7,
Kintex-7, Virtex-7 FPGA families implemented using 28nm CMOS technology.

| Product | Family | Part Number | BRAM (18 Kb Each) | DSP48E1 | GMAC/s | BRAM/DSP |
|---|---|---|---|---|---|---|
| Standalone | Artix-7 | XC7A200T | 730 | 740 | 929 | 0.99 |
| Standalone | Kintex-7 | XC7K480T | 1910 | 1920 | 2845 | 0.99 |
| Standalone | Virtex-7 | XC7VX980T | 3000 | 3600 | 5335 | 0.83 |
| Zynq SoC | Artix-7 | XC7Z020 | 280 | 220 | 276 | 1.27 |
| Zynq SoC | Kintex-7 | XC7Z045 | 1090 | 900 | 1334 | 1.21 |

4.1. Exploration of FPGA Fabric for Soft Core Processor Architecture


A system composed of light-weight and high-performance soft core processors that supports
modular computation with fine and coarse-grained functional granularity is more attractive than fixed
dedicated hardware accelerators. A lightweight, soft core processor allows more programmable
hardware accelerators to be accommodated onto a single SoC chip which would lead to better
acceleration possibilities by exploiting data and task-level parallelism.
Gupta et al. [24,25] have reported different dataflow graph models whose functionality corresponds to the soft core datapath models (1), (2) and (3) shown in Figure 5. These dataflow models are used to find a trade-off between the functionality of the soft core processor and f_max, and laid the foundation for finding a suitable soft core datapath on which to map and execute the dataflow specification. The input/output interfaces are marked in red, while the grey box represents the functionality mapped onto the soft core datapath models shown in Figure 6.
The first model (1) exhibits the datapath of a programmable ALU, as shown in Figure 6a. It has an instruction register (IR) that defines a DFG node (OP1) programmed at system initialisation. On each clock cycle, the datapath explicitly reads a token from the input FIFO, processes it based on the programmed operation and stores the result into the output FIFO, which is then consumed by the following dataflow node (OP3). This model only allows the mapping of data-independent, fine-grained dataflow nodes, as shown in Figure 5a, which limits its applicability due to the lack of control and data-dependent execution commonly found in image processing applications, where the output pixel

depends on the input or neighbouring pixels. This model is only suitable for mapping a single
dataflow node.
The second model (2) increases the datapath functionality to a fine-grained processor by including a BRAM-based instruction memory (IM), a program counter (PC) and a kernel memory (KM) to store constants, as shown in Figure 6b. Conversely, (2) can support the mapping of multiple data-independent dataflow nodes, as shown in Figure 5b. The node (OP2) requires memory storage for a variable (t1) used to compute the output token (C), which is fed back from the output of the ALU for the next instruction in the following clock cycle. This model supports improved dataflow mapping functionality over (1) by introducing an IM, which comes at the cost of a variable execution time and a throughput proportional to the number of instructions required to implement the dataflow actor. This model is suitable for accelerating combinational logic computations.
The third model (3) increases the datapath functionality to map and execute a data-dependent dataflow actor, as shown in Figure 5c. The datapath has memory in the form of a register file (RF), which represents the coarse-grained processor shown in Figure 6c. The RF stores intermediate results to execute data-dependent operations, implements (feed-forward, split, merge and feedback) dataflow execution patterns and facilitates dataflow transformations (actor fusion/fission, pipelining etc.), constrained by the size of the RF. It can implement modular computations which are not possible in (1) and (2). In contrast to (1) and (2), the token production/consumption (P/C) rate of (3) can be controlled through program code, which allows software-controlled scheduling and load balancing possibilities.

Figure 5. A range of dataflow models taken from [24,25]. (a) DFG node without internal storage, called configuration (1); (b) DFG actor with internal storage t1 and constant i, called configuration (2); (c) programmable DFG actor with internal storage t1, t2 and t3 and constants i and j, called configuration (3).

Figure 6. FPGA datapath models resulting from Figure 5. (a) Programmable ALU corresponding to configuration (1); (b) fine-grained processor corresponding to configuration (2); (c) coarse-grained processor corresponding to configuration (3).

4.2. Functionality vs. Performance Trade-Off Analysis


The presented models show that the processor datapath functionality significantly impacts the
dataflow decomposition, mapping and optimisation possibilities, but also increases the processor
critical path length and affects f max by incorporating more memory elements and control logic.
Figure 6 shows the datapath models and their memory elements, where the memory resources
(IM, KM, RF) have been incrementally allocated to each model. Each presented model has been coded


in Verilog HDL, synthesised and placed and routed using the Xilinx Vivado Design Suite v2015.2 on
Xilinx chips installed on widely available development kits which are Artix-7 (Zedboard), Kintex-7
(ZC706) and Virtex-7 (VC707). The obtained f max results are reported in Figure 7.
In this analysis, f_max is considered as the performance metric for each processor datapath model; it reduces by 8% and 23% for (2) and (3) compared to (1) using the same FPGA technology. For (2), the addition of memory elements, specifically the IM realised using dedicated BRAM, affects f_max by ≈8% compared to (1). Nevertheless, the instruction decoder (ID), which is a combinational part of the datapath, significantly increases the critical path length of the design. A further 15% f_max degradation from (2) to (3) results from adding the memory elements KM and RF to support control and data-dependent execution, which requires additional control logic and data multiplexers. Comparing different FPGA fabrics, an f_max reduction of 14% and 23% is observed for Kintex-7 and Artix-7. When (3) is ported from Virtex-7 to Kintex-7 and Artix-7, a maximum f_max reduction of 5% and 33% is observed.
This analysis has laid firm foundations by comparing different processor datapath and dataflow models and showing how they impact the f_max of the resultant soft core processor. The trade-off analysis shows that an area-efficient, high-performance soft core processor architecture can be realised that supports the requirements to accelerate image pre-processing applications. Among the presented models, (3) provides the best balance among functionality, flexibility, dataflow mapping and optimisation possibilities, and performance. This model is used to develop a novel FPGA-based soft core IPPro architecture in Section 4.3.

Figure 7. Impact of the various datapath models (1), (2) and (3) on f_max across Xilinx Artix-7, Kintex-7 and Virtex-7 FPGA families.

4.3. Image Processing Processor (IPPro)


The IPPro is a 16-bit signed fixed-point, five-stage balanced pipelined RISC architecture that exploits the DSP48E1 features and provides a balance among performance, latency and efficient resource utilization [10]. The architecture here is modified to support the mapping of dataflow graphs by replacing the previous memory-mapped data memory with stream-driven blocking input/output FIFOs, as shown in Figure 8. The IPPro is designed as an in-order pipeline because: (1) it consumes fewer area resources and can achieve better timing closure, leading to a higher processor operating frequency f_max; (2) in-order pipeline execution is predictable and simplifies scheduling and compiler development. The datapath supports the identified execution and memory access patterns (Table 1), and can be used as a coarse-grained processing core. The IPPro has an IM of size 512 × 32, an RF of size 32 × 16 to store pixels and intermediate results, a KM of size 32 × 16 to store kernel coefficients and constant values, and blocking input/output FIFOs to buffer data tokens between a producer and a consumer to realise pipelined processing stages.
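A quick capacity check (ignoring the aspect-ratio constraints of the BRAM primitive) shows why the 512 × 32 instruction memory fits in a single 18 Kb block, consistent with the one BRAM reported for the IPPro in Table 4:

```python
BRAM_BITS = 18 * 1024  # one Xilinx 18 Kb block RAM

def bram_blocks(depth, width_bits):
    """18 Kb BRAMs needed for a depth x width memory (capacity only;
    the primitive's depth/width aspect ratios are ignored here)."""
    total_bits = depth * width_bits
    return -(-total_bits // BRAM_BITS)  # ceiling division

print(bram_blocks(512, 32))  # IM 512x32 = 16,384 bits -> 1 BRAM
print(bram_blocks(32, 16))   # RF/KM 32x16 = 512 bits  -> 1 (fits easily)
```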


Figure 8. Block diagram of FPGA-based soft core Image Processing Processor (IPPro) datapath
highlighting where relevant the fixed Xilinx FPGA resources utilised by the approach.

Table 3 outlines the relationship between the data abstraction and the addressing modes, along with some supported instructions for the IPPro architecture, facilitating the programmable implementation of point and area image processing algorithms. The stream access reads a stream of tokens/pixels from the input FIFO using the GET instruction and allows processing either with constant values (Kernel Memory–FIFO) or with neighbouring pixel values (Register File–FIFO or Register File–Register File). The processed stream is then written to the output FIFO using the PUSH instruction. The IPPro supports arithmetic, logical, branch and data handling instructions. The presented instruction set was optimized after profiling the use cases presented in [10,26].

Table 3. IPPro supported addressing modes highlighting the relation to the data processing
requirements and the instruction set.

| Addressing Mode | Data Abstraction | Supported Instructions |
|---|---|---|
| FIFO handling | Stream access | get, push |
| Register File–FIFO | Stream and randomly accessed data | addrf, subrf, mulrf, orrf, minrf, maxrf etc. |
| Register File–Register File | Randomly accessed data | str, add, mul, mulacc, and, min, max etc. |
| Kernel Memory–FIFO | Stream and fixed values | addkm, mulkm, minkm, maxkm etc. |
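The Kernel Memory–FIFO mode amounts to a get/compute/push loop over the stream. The Python stand-in below mimics a get → mulkm → push sequence; it is not IPPro assembly, and the 16-bit masking is a deliberate simplification of the signed fixed-point datapath.

```python
from collections import deque

def run_kernel_stream(pixels, k):
    """Sketch of Kernel Memory-FIFO processing: GET a token from the input
    FIFO, multiply it by a kernel constant, PUSH the result downstream."""
    in_fifo, out_fifo = deque(pixels), deque()
    while in_fifo:
        t = in_fifo.popleft()      # GET: read token from input FIFO
        r = (t * k) & 0xFFFF       # mulkm-style op, truncated to 16 bits
        out_fifo.append(r)         # PUSH: write token to output FIFO
    return list(out_fifo)

print(run_kernel_stream([1, 2, 3], k=4))  # [4, 8, 12]
```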

The IPPro supports branch instructions to handle control flow graphs to implement commonly
known constructs such as if-else and case statements. The DSP48E1 block has a pattern detector that
compares the input operands or the generated output results depending on the configuration and
sets/resets the PATTERNDETECT (PD) bit. The IPPro datapath uses the PD bit along with some
additional control logic to generate four flags zero (ZF), equal (EQF), greater than (GTF) and sign (SF)
bits. When the IPPro encounters a branch instruction, the branch controller (BC) compares the flag
status and branch handler (BH) updates the PC as shown in Figure 8.
The IPPro architecture has been coded in Verilog HDL and synthesized using the Xilinx Vivado v2015.4 design suite on Kintex-7 FPGA fabric, giving an f_max of 337 MHz. Table 4 shows that the IPPro architecture achieves a 1.6–3.3 times higher operating frequency (f_max) than the relevant processors highlighted in Section 2.2 by adopting the approach presented in Section 4. Comparing the FPGA resource usage in Table 4, the flip-flop (FF) utilisation is relatively similar, except for FlexGrip, which uses 30× more flip-flops. Considering LUTs, the IPPro uses 50% fewer LUT resources than MicroBlaze and GraphSoC. Analysing design area efficiency, a significant difference (0.76–9.00) in BRAM/DSP ratio is observed among the processors, which makes the IPPro an area-efficient design based on the proposed metric.


Table 4. Comparison of IPPro against other FPGA-based processor architectures in terms of FPGA
resources used and timing results achieved.

| Resource | IPPro | Graph-SoC [19] | FlexGrip 8 SP * [8] | MicroBlaze |
|---|---|---|---|---|
| FFs | 422 | 551 | (103,776/8 =) 12,972 | 518 |
| LUTs | 478 | 974 | (71,323/8 =) 8916 | 897 |
| BRAMs | 1 | 9 | (120/8 =) 15 | 4 |
| DSP48E1 | 1 | 1 | (156/8 =) 19.5 | 3 |
| Stages | 5 | 3 | 5 | 5 |
| Freq. (MHz) | 337 | 200 | 100 | 211 |

* Scaled to a single streaming processor.

4.4. Processor Micro-Benchmarks


A commonly used performance metric for a processor is the time required to accomplish a defined
task. Therefore, a set of commonly used micro-benchmarks [9,27] has been chosen, implemented
on the IPPro and compared against the well-established MicroBlaze soft-core processor, as shown in
Table 5a. Each of the chosen micro-benchmarks is a fundamental kernel of larger algorithms and
often the core computation of more extensive practical applications. The micro-benchmarks were
written in standard C and implemented using Xilinx Vivado SDK v2015.1 (Xilinx, San Jose, CA, USA).
The MicroBlaze has been configured for performance, with no debug module or instruction/data cache,
and a single AXI-Stream link enabled to stream data into the MicroBlaze using the getfsl and putfsl
instructions in C (equivalent to GET and PUT in assembly).
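As a concrete, host-testable sketch of one streaming micro-benchmark, the following plain-C 5-tap FIR mirrors the structure such a kernel would take. On the MicroBlaze, the `in[]`/`out[]` array accesses would instead be getfsl/putfsl stream reads and writes over the AXI-Stream link; the coefficients and Q8 output scaling here are illustrative assumptions, not the benchmark's actual parameters.

```c
#include <stddef.h>
#include <stdint.h>

/* 5-tap FIR over a sample stream.  taps[] is the delay line; each new
   sample is multiply-accumulated against the coefficients c[] (a
   single-cycle MAC chain on the IPPro datapath). */
void fir5(const int16_t *in, int16_t *out, size_t n, const int16_t c[5])
{
    int16_t taps[5] = {0, 0, 0, 0, 0};
    for (size_t i = 0; i < n; i++) {
        /* shift the delay line and insert the new sample (getfsl on MB) */
        for (int t = 4; t > 0; t--) taps[t] = taps[t - 1];
        taps[0] = in[i];
        int32_t acc = 0;
        for (int t = 0; t < 5; t++) acc += (int32_t)c[t] * taps[t];
        out[i] = (int16_t)(acc >> 8);   /* illustrative Q8 scaling (putfsl on MB) */
    }
}
```

With coefficients {256, 0, 0, 0, 0} the kernel reduces to the identity, which makes it easy to sanity-check on a host before targeting the soft core.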
Table 5a reports the performance results of the micro-benchmarks and Table 5b shows the area
utilisation comparison of the IPPro and the MicroBlaze, both implemented on the same Xilinx Kintex-7
FPGA. It shows that the IPPro consumes 1.7 and 2.3 times fewer FFs and LUTs respectively than
the MicroBlaze. It can be observed that for the streaming functions (3 × 3 filter, 5-tap FIR and
degree-2 polynomial), the IPPro achieved 4.41, 8.94 and 1.80 times better performance respectively
than the MicroBlaze, due to its support for single-cycle multiply-accumulate with data forwarding
and its get/push instructions. However, the IPPro datapath does not support branch prediction,
which impacts its performance when implementing data-dependent or conditional functions
(Fibonacci and sum of absolute differences); thus, the SAD implementation using the IPPro resulted
in a 5% performance degradation compared to the MicroBlaze. On the other hand, for memory-bounded
functions such as matrix multiplication, the IPPro performed 6.7 times better than the MicroBlaze
due to its higher operating frequency.

Table 5. Performance comparison of IPPro and MicroBlaze implementations: (a) comparison of
micro-benchmarks; (b) area comparison.

a
Processor MicroBlaze IPPro
FPGA Fabric Kintex-7
Freq (MHz) 287 337
Micro-benchmark  Exec. Time MicroBlaze (us)  Exec. Time IPPro (us)  Speed-up
Convolution 0.60 0.14 4.41
Degree-2 Polynomial 5.92 3.29 1.80
5-tap FIR 47.73 5.34 8.94
Matrix Multiplication 0.67 0.10 6.7
Sum of Abs. Diff. 0.73 0.77 0.95
Fibonacci 4.70 3.56 1.32
b
Processor MicroBlaze IPPro Ratio
FFs 746 422 1.77
LUTs 1114 478 2.33
BRAMs 4 2 2.67
DSP48E1 0 1 0.00


5. System Architecture
The k-means clustering and traffic sign recognition algorithms have been used to explore and
analyse the impact of both data and task parallelism using a multi-core IPPro implemented on a
ZedBoard. The platform has a Xilinx Zynq XC7Z020 SoC device interfaced to 256 MB of flash memory
and 512 MB of DDR3 memory. The SoC is composed of a host processor, known as the processing system
(PS), which configures and controls the system architecture, and the FPGA programmable logic (PL),
on which the IPPro hardware accelerator is implemented, as illustrated in Figure 9. The SoC data
communication bus (ARM AMBA-AXI) transfers the data between PS and PL using the AXI-DMA
protocol, and the Xillybus IP core is deployed as a bridge between PS and PL to feed data into the
image processing pipeline. The IPPro hardware accelerator is interfaced with the Xillybus IP core
via FIFOs. The Linux application running on the PS streams data between the FIFO and the file handler
opened by the host application. The Xillybus-Lite interface allows control registers from the user-space
program running on Linux to manage the underlying hardware architecture.
Figure 9 shows the implemented system architecture, which consists of the necessary control
and data infrastructure. The data interfaces comprise streaming interfaces (Xillybus-Send and
Xillybus-Read); a uni-directional memory-mapped interface (Xillybus-Write) to program the IPPro
cores; and Xillybus-Lite to manage the line buffer, scatter, gather, IPPro cores and the FSM. Xillybus
Linux device drivers are used to access each of these data and control interfaces. An additional layer
of C functions has been developed on top of the Xillybus device drivers to configure and manage the
system architecture, program the IPPro cores and exchange pixels between PS and PL.
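The PS-side streaming loop described above can be sketched as follows. The device-file plumbing is the standard Xillybus model (the host reads and writes ordinary file descriptors), but the exact device names depend on the generated IP core configuration and are assumptions here.

```c
#include <stddef.h>
#include <unistd.h>

/* Push a frame into the PL pipeline and read the processed frame back.
   In the real system, fd_tx/fd_rx would be handles opened on Xillybus
   device files (e.g. open("/dev/xillybus_write_32", O_WRONLY) and the
   matching read device); here they are plain file descriptors so the
   loop can be exercised with any loopback. */
ssize_t stream_frame(int fd_tx, int fd_rx,
                     const unsigned char *in, unsigned char *out, size_t n)
{
    size_t sent = 0, got = 0;
    while (sent < n) {                       /* write may be partial */
        ssize_t w = write(fd_tx, in + sent, n - sent);
        if (w <= 0) return -1;
        sent += (size_t)w;
    }
    while (got < n) {                        /* read back the processed data */
        ssize_t r = read(fd_rx, out + got, n - got);
        if (r <= 0) return -1;
        got += (size_t)r;
    }
    return (ssize_t)got;
}
```

Because both ends are plain file descriptors, the same loop works unchanged against a pipe for host-side testing.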

Figure 9. System architecture of IPPro-based hardware acceleration highlighting data distribution and
control infrastructure, FIFO configuration and Finite-State-Machine control.

Control Infrastructure
To exploit parallelism, a configurable control infrastructure has been implemented using the PL
resources of the Zynq SoC. It statically decomposes the data into equal-sized parts, where each
part can be processed by a separate processing core. A row-cyclic data distribution [28] has been used
because it allows buffering of data/pixels in a pattern suitable for point and area image processing


operations after storing them into the line buffers. The system-level architecture (Figure 9) is composed
of line buffers, a scatter module to distribute the buffered pixels, a gather module to collect the
processed pixels and a finite-state-machine (FSM) to manage and synchronise these modules.
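The row-cyclic distribution can be pictured with a minimal software model (the core count and helper names are illustrative, not the scatter module's RTL):

```c
/* Row-cyclic scatter model: row r of the frame is routed to core
   r mod ncores; the gather module reassembles rows in the same order. */
int core_for_row(int row, int ncores)
{
    return row % ncores;
}

/* Count how many rows of an h-row frame each core receives; with this
   static decomposition the per-core counts differ by at most one. */
void rows_per_core(int h, int ncores, int counts[])
{
    for (int c = 0; c < ncores; c++) counts[c] = 0;
    for (int r = 0; r < h; r++) counts[core_for_row(r, ncores)]++;
}
```

For a 512-row frame on four cores, each core receives exactly 128 rows, which is the equal-sized decomposition the text describes.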

6. Case Study 1: k-Means Clustering Algorithm


k-means clustering classifies a data set into k clusters based on a measure, e.g., the distance
between each data item and the k centroid values. It involves two stages: Distance Calculation,
where the distance from each data point to each centroid is computed, giving k distances, and the
minimum of these determines the cluster to which the pixel is assigned; and Averaging, where the
data pixels of each cluster are summed and divided by the cluster population, giving an updated
centroid value for the following frame. Here we accelerate a functional core of the k-means clustering
algorithm with 4 centroids, applied to a 512 × 512 image.
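The two stages can be sketched in plain C for reference. This is a host-side model only, assuming k = 4, 8-bit pixels and squared distance as the measure (mirroring the squaring in Distance.cal); it shows the structure of the computation, not the accelerated implementation.

```c
#include <stdint.h>
#include <stddef.h>

enum { K = 4 };  /* number of centroids, as in the case study */

/* Stage 1: distance calculation — assign a pixel to its nearest
   centroid using squared distance. */
int nearest_centroid(uint8_t pix, const uint8_t cent[K])
{
    int best = 0;
    int32_t best_d = INT32_MAX;
    for (int k = 0; k < K; k++) {
        int32_t d = (int32_t)pix - cent[k];
        d *= d;                       /* squared distance */
        if (d < best_d) { best_d = d; best = k; }
    }
    return best;
}

/* Stage 2: averaging — sum the pixels of each cluster and divide by the
   cluster population, giving the centroids for the following frame. */
void update_centroids(const uint8_t *pix, const int *label, size_t n,
                      uint8_t cent[K])
{
    uint32_t sum[K] = {0}, cnt[K] = {0};
    for (size_t i = 0; i < n; i++) {
        sum[label[i]] += pix[i];
        cnt[label[i]]++;
    }
    for (int k = 0; k < K; k++)
        if (cnt[k]) cent[k] = (uint8_t)(sum[k] / cnt[k]);
}
```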

6.1. High-Level System Description


The behavioural description is captured in RVC-CAL using Orcc and mainly includes the actor
CAL files and the xdf network description, an XML-based format. A dataflow network is constructed
with FIFO channels between actors to allow high-throughput passage of tokens from one actor's output
port to another's input port. The size of the FIFO channels is configurable. Whilst execution time is
the key factor for FPGA acceleration, the overheads incurred in transferring data to/from the PL
and the accelerators are also important. The SIMD degree was explored by redesigning the FPGA-targeted
actors in RVC-CAL and using the compiler to generate the IPPro assembly code. This is done by
analysing the xdf file to decide the allocation of actors to processors, and then compiling the functions
and interconnections.
Every IPPro core sets up the hardware units around its input/output port connections for the proper
flow of tokens, and the compiler is designed to provide the signals required by each core.
The compiler also generates the setup register settings and C-API parameters, in order to help the
controllers distribute the tokens among the cores and gather the produced results. Figure 10 shows the
two stages of the k-means clustering algorithm to be accelerated, along with the core port connections,
sample distance-calculation code in RVC-CAL and its compiled IPPro assembly code. As the Xillybus IP
has been used in the system architecture (Section 5), it restricts the clock rate to 100 MHz on the
ZedBoard. To evaluate the IPPro architecture and the different dataflow mapping possibilities by
exploiting data- and task-level parallelism, the k-means clustering is accelerated using the four
acceleration designs listed in Table 6 and illustrated in Figure 11.

Table 6. Dataflow actor mapping and supported parallelism of the IPPro hardware accelerator designs
presented in Figure 11.

Design  Acceleration Paradigm  Mapping  Data Parallelism  Task Parallelism

1 Single core IPPro Single actor No No

2 8-way SIMD IPPro Single actor Yes No

3 Dual core IPPro Dual actor No Yes

4 Dual core 8-way SIMD IPPro Dual actor Yes Yes


[Panel (a), the graphical view of the Orcc dataflow network, is an image and is not reproduced here.]

(b) TopKMeansOrcc.xdf:

<?xml version="1.0" encoding="UTF-8"?>
<XDF name="TopKMeansOrcc">
  ...
  <Instance id="Distance">
    ...
  </Instance>
  <Instance id="Averaging">
    ...
  </Instance>
  <Connection dst="Averaging" dst-port="AvInput" src="Distance" src-port="DisOutput"/>
  <Connection dst="Distance" dst-port="DisInput" src="" src-port="InputPort"/>
  <Connection dst="" dst-port="OutputPort" src="Averaging" src-port="AvOutput"/>
</XDF>

(c) Distance.cal:

package org.proj.kmeansorcc;
actor Distance() int(size=8) DisInput ==> int(size=8) DisOutput:
  // Get 2 8-bit pixels and push each with its associated centroid
  DistCal: action DisInput:[Pix1, Pix2] ==> DisOutput:[Pix1, Cent1, Pix2, Cent2]
  var
    uint Cent1,
    uint Cent2,
    uint Cent[4] = [31, 40, 50, 76],  // 4 initial centroids
    uint Temp1[4],
    uint Temp2[4],
    uint Temp3[4],
    uint Temp4[4]
  do
    // Manhattan distance estimation
    foreach int(size=8) count in 1..4 do
      // Pixel 1's distance from every centroid
      Temp1[count] := Pix1 - Cent[count];
      // Pixel 1's absolute value estimation by squaring
      Temp3[count] := Temp1[count] * Temp1[count];
      // Pixel 2's distance from every centroid
      Temp2[count] := Pix2 - Cent[count];
      // Pixel 2's absolute value estimation by squaring
      Temp4[count] := Temp2[count] * Temp2[count];
    end
    ...
  end

(d) Distance.ippro:

DISTCAL:
  GET R1, 1
  GET R2, 1
  STR R3, 31
  STR R4, 40
  STR R5, 50
  STR R6, 76
  SUB R7, R1, R3
  SUB R8, R1, R4
  SUB R9, R1, R5
  SUB R10, R1, R6
  SUB R11, R2, R3
  SUB R12, R2, R4
  SUB R13, R2, R5
  SUB R14, R2, R6
  MUL R15, R7, R7
  MUL R16, R8, R8
  MUL R17, R9, R9
  MUL R18, R10, R10
  MUL R19, R11, R11
  MUL R20, R12, R12
  MUL R21, R13, R13
  MUL R22, R14, R14
  ...

Figure 10. High-level implementation of k-means clustering algorithm: (a) Graphical view of Orcc
dataflow network; (b) Part of dataflow network including the connections; (c) Part of Distance.cal file
showing distance calculation in RVC-CAL where two pixels are received through an input FIFO channel,
processed and sent to an output FIFO channel; (d) Compiled IPPro assembly code of Distance.cal.

Figure 11. IPPro-based hardware accelerator designs to explore and analyse the impact of parallelism
on area and performance: 1 single-core IPPro, 2 eight-way parallel SIMD IPPro, 3 parallel dual-core
IPPro, and 4 combined dual-core 8-way SIMD IPPro.

6.2. IPPro-Based Hardware Acceleration Designs


Table 6 shows the dataflow actor mapping and the exploited parallelism for each design. The block
diagram of each IPPro hardware acceleration design is illustrated in Figure 11. Designs 1 and 2 are
used to accelerate the Distance Calculation and Averaging stages, where each stage is mapped separately
onto individual IPPro cores. To investigate the impact of data and task parallelism, designs 3 and 4 are
used to accelerate both the Distance Calculation and Averaging stages, as shown in Figure 11. The detailed
area and performance results are reported in Tables 7 and 8. The execution time depends on the
number of IPPro instructions required to compute the operation and the time required to execute an
instruction, which corresponds to the operating frequency (f_max) of the IPPro.
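This execution-time relation can be made explicit as t = (instruction count × CPI) / f; taking CPI = 1 for the pipelined datapath is an illustrative assumption here, not a figure stated in the text.

```c
/* Execution-time model implied above: t = (instructions * CPI) / f.
   With f in MHz, the result is in microseconds.  CPI = 1 assumed. */
double exec_time_us(double n_instr, double cpi, double f_mhz)
{
    return n_instr * cpi / f_mhz;
}
```

For example, under this model 337 single-cycle instructions at 337 MHz take 1 us.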
Table 7 reports the results obtained by individually accelerating the stages of k-means clustering
using designs 1 and 2. In each iteration, the distance calculation takes two pixels and classifies them into

Had Vera been able to see through lath and plaster, her views would
have undergone a change. Working with a skill and deftness that
aroused Wyndham’s reluctant admiration, Beverly Thorne made a
thorough examination of the body and the bed, taking care not to
disarrange anything. Each piece of furniture and the articles on
tables, dresser, and mantel received his attention, even the curtains
before the window were scrutinized.
“Has anyone besides you and Miss Deane been in this room since
the discovery of the tragedy?” asked Thorne, breaking his long
silence.
“No.”
“When was Mr. Brainard taken ill?”
“During dinner last night. Dr. Noyes said it would be unwise for him
to return to Washington, so Mrs. Porter suggested that he stay here
all night, and I loaned him a pair of pajamas.” Wyndham, talking in
short, jerky sentences, felt Thorne’s eyes boring into him.
“I should like to see Dr. Noyes,” began Thorne. “Where—”
“I’ll get him,” Wyndham broke in, hastening to the door; he
disappeared out of the room just as Thorne picked up the razor and
holding it between thumb and forefinger examined it with deep
interest.
However, Wyndham was destined to forget his errand for, as he sped
down the hall, a door opened and his aunt confronted him.
“Wait, Hugh.” Mrs. Porter held up an imperative hand. “Millicent has
told me of poor Bruce’s tragic death, and Murray,” indicating the
footman standing behind her, “informs me that Dr. Beverly Thorne
has had the effrontery to force his way into this house—and at such
a time.”
She spoke louder than customary under the stress of indignation,
and her words reached Beverly Thorne as he appeared in the hall.
He never paused in his rapid stride until he joined the little group,
and his eyes did not fall before the angry woman’s gaze.
“It is only at such a time as this that I would think of intruding,” he
said. “Kindly remember, madam, that I am here in my official
capacity only. Before I sign a death certificate, an inquest must
decide whether your guest, Bruce Brainard, committed suicide—or
was murdered.”