Journal of
Imaging
Image Processing
Using FPGAs
Edited by
Donald G. Bailey
Printed Edition of the Special Issue Published in Journal of Imaging
www.mdpi.com/journal/jimaging
Image Processing Using FPGAs
Editorial Office
MDPI
St. Alban-Anlage 66
4052 Basel, Switzerland
This is a reprint of articles from the Special Issue published online in the open access journal
Journal of Imaging (ISSN 2313-433X) from 2018 to 2019 (available at: https://www.mdpi.com/journal/jimaging/special_issues/Image_FPGAs).
For citation purposes, cite each article independently as indicated on the article page online and as
indicated below:
LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. Journal Name Year, Article Number,
Page Range.
© 2019 by the authors. Articles in this book are Open Access and distributed under the Creative
Commons Attribution (CC BY) license, which allows users to download, copy and build upon
published articles, as long as the author and publisher are properly credited, which ensures maximum
dissemination and a wider impact of our publications.
The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons
license CC BY-NC-ND.
Contents
Donald Bailey
Image Processing Using FPGAs
Reprinted from: Journal of Imaging 2019, 5, 53, doi:10.3390/jimaging5050053 . . . . . . . . . . . . 1
Fahad Siddiqui, Sam Amiri, Umar Ibrahim Minhas, Tiantai Deng, Roger Woods,
Karen Rafferty and Daniel Crookes
FPGA-Based Processor Acceleration for Image Processing Applications
Reprinted from: Journal of Imaging 2019, 5, 16, doi:10.3390/jimaging5010016 . . . . . . . . . . . . 5
Paulo Garcia, Deepayan Bhowmik, Robert Stewart, Greg Michaelson and Andrew Wallace
Optimized Memory Allocation and Power Minimization for FPGA-Based Image Processing
Reprinted from: Journal of Imaging 2019, 5, 7, doi:10.3390/jimaging5010007 . . . . . . . . . . . . . 27
Andrew Tzer-Yeu Chen, Rohaan Gupta, Anton Borzenko, Kevin I-Kai Wang and
Morteza Biglari-Abhari
Accelerating SuperBE with Hardware/Software Co-Design
Reprinted from: Journal of Imaging 2018, 4, 122, doi:10.3390/jimaging4100122 . . . . . . . . . . . . 91
Zhe Wang, Trung-Hieu Tran, Ponnanna Kelettira Muthappa and Sven Simon
A JND-Based Pixel-Domain Algorithm and Hardware Architecture for Perceptual Image
Coding
Reprinted from: Journal of Imaging 2019, 5, 50, doi:10.3390/jimaging5050050 . . . . . . . . . . . . 164
About the Special Issue Editor
Donald G. Bailey received his Bachelor of Engineering (Honours) degree in Electrical Engineering
in 1982, and Ph.D. degree in Electrical and Electronic Engineering from the University of Canterbury,
New Zealand, in 1985. From 1985 to 1987, he applied image analysis to the wool and paper industries
of New Zealand. From 1987 to 1989, he was a Visiting Research Engineer at the University of California,
Santa Barbara. Prof. Bailey joined Massey University in Palmerston North, New Zealand, as Director
of the Image Analysis Unit in November 1989. He was a Visiting Researcher at the University of
Wales, Cardiff, in 1996; University of California, Santa Barbara, in 2001–2002; and Imperial College
London in 2008. He is currently Professor of Imaging Systems in the Department of Mechanical
and Electrical Engineering in the School of Food and Advanced Technology at Massey University,
where he is Leader of the Centre for Research in Image and Signal Processing. Prof. Bailey has spent
over 35 years applying image processing to a range of industrial, machine vision, and robot vision
applications. For the last 18 years, one area of particular focus has been exploring different aspects
of using FPGAs for implementing and accelerating image processing algorithms. He is the author
of many publications in this field, including the book “Design for Embedded Image Processing on
FPGAs”, published by Wiley/IEEE Press. He is a Senior Member of the IEEE, and is active in the
New Zealand Central Section.
Preface to “Image Processing Using FPGAs”
Over the last 20 years, FPGAs have moved from glue logic through to computing platforms.
They effectively provide a reconfigurable hardware platform for implementing logic and algorithms.
Being fine-grained hardware, FPGAs are able to exploit the parallelism inherent within a hardware
design while at the same time maintaining the reconfigurability and programmability of software.
This has led to FPGAs being used as a platform for accelerating computationally intensive tasks. This
is particularly seen in the field of image processing, where the FPGA-based acceleration of imaging
algorithms has become mainstream. This is even more so within an embedded environment, where
the power and computational resources of conventional processors are not up to the task of managing
the data throughput and computational requirements of real-time imaging applications.
Unfortunately, the fine-grained nature of FPGAs also makes them difficult to programme
effectively. Conventional processors have a fixed computational architecture, which is able to provide
a high level of abstraction. By contrast, on an FPGA, it is necessary to design not only the algorithm
but also the computational architecture, which leads to an explosion in the design space complexity.
This, coupled with the complexities of managing the concurrency of a highly parallel design and the
bandwidth issues associated with the high volume of data associated with images and video, has
led to a wide range of approaches and architectures used for realising FPGA-based image processing
systems. This Special Issue provides an opportunity for researchers in this area to present some of
their latest results and designs. The diversity of presented techniques and applications reflects the
nature and current state of FPGA-based design for image processing.
Donald G. Bailey
Special Issue Editor
Editorial
Image Processing Using FPGAs
Donald G. Bailey
Department of Mechanical and Electrical Engineering, School of Food and Advanced Technology,
Massey University, Palmerston North 4442, New Zealand; [email protected]
Abstract: Nine articles have been published in this Special Issue on image processing using
field programmable gate arrays (FPGAs). The papers address a diverse range of topics relating
to the application of FPGA technology to accelerate image processing tasks. The range includes:
Custom processor design to reduce the programming burden; memory management for full frames,
line buffers, and image border management; image segmentation through background modelling,
online K-means clustering, and generalised Laplacian of Gaussian filtering; connected components
analysis; and visually lossless image compression.
Keywords: field programmable gate arrays (FPGA); image processing; hardware/software co-design;
memory management; segmentation; image analysis; compression
2. Contributions
Programming an FPGA to accelerate complex algorithms is difficult, with one of four approaches
commonly used [1]:
• Custom hardware design of the algorithm using a hardware description language, optimised for
performance and resources;
• implementing the algorithm by instantiating a set of application-specific intellectual property
cores (from a library);
• using high-level synthesis to convert a C-based representation of the algorithm to
synthesisable hardware; or
• mapping the algorithm onto a parallel set of programmable soft-core processors.
The article by Siddiqui et al. [1] takes this last approach, describing the design of an efficient 16-bit integer soft-core processor, IPPro, capable of operating at 337 MHz and specifically targeting the dataflow seen in complex image processing algorithms. The presented architecture uses dedicated
stream access instructions on the input and output, with a 32-element local memory for storing pixels
and intermediate results, and a separate 32-element kernel memory for storing filter coefficients
and other parameters and constants. The exploitation of both data-level parallelism and task-level
parallelism is demonstrated through the mapping of a K-means clustering algorithm onto the
architecture, showing good scalability of processing speed with multiple cores. A second case study of
traffic sign recognition is partitioned between the IPPro cores and an ARM processor, with the colour
conversion and morphological filtering stages mapped to the IPPro. Again, the use of parallel IPPro
cores can significantly accelerate these tasks, compared to conventional software, without having to
resort to the tedious effort of custom hardware design.
Garcia et al. [2] worked on the thesis that the image processing operations which require random
access to the whole frame (including iterative algorithms) are particularly difficult to realise in FPGAs.
They investigate the mapping of a frame buffer onto the memory resources of an FPGA, and explore
the optimal mapping onto combinations of configurable on-chip memory blocks. They demonstrate
that, for many image sizes, the default mapping by the synthesis tools results in poor utilisation, and is
also inefficient in terms of power requirements. A procedure is described that determines the best
memory configuration, based on balancing resource utilisation and power requirements. The mapping
scheme is demonstrated with optical flow and mean shift tracking algorithms.
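The trade-off Garcia et al. analyse can be illustrated with a back-of-the-envelope calculation. The sketch below is a software illustration, not the authors' tool: it estimates how many 18 Kb block RAMs a frame buffer needs under each depth × width aspect ratio of a Xilinx-style BRAM, and how well each configuration is utilised. The configuration list and the assumption of 8-bit pixels are illustrative.

```python
# Sketch (not the authors' tool): estimate how efficiently a frame buffer
# maps onto 18 Kb block RAMs under each depth x width configuration.
# The depth/width pairs below are the typical aspect ratios of a
# Xilinx-style 18 Kb BRAM; adapt them for a real device.
import math

BRAM_CONFIGS = [(16384, 1), (8192, 2), (4096, 4), (2048, 9), (1024, 18), (512, 36)]

def frame_buffer_cost(width, height, bits_per_pixel=8):
    """Return {(depth, width): (blocks, utilisation)} for each configuration."""
    pixels = width * height
    results = {}
    for depth, data_width in BRAM_CONFIGS:
        lanes = math.ceil(bits_per_pixel / data_width)   # BRAMs in parallel
        banks = math.ceil(pixels / depth)                # BRAMs in depth
        blocks = lanes * banks
        used_bits = pixels * bits_per_pixel
        total_bits = blocks * depth * data_width
        results[(depth, data_width)] = (blocks, used_bits / total_bits)
    return results

for cfg, (blocks, util) in frame_buffer_cost(640, 480).items():
    print(f"{cfg[0]:>5} x {cfg[1]:<2}: {blocks:3d} blocks, {util:.1%} utilised")
```

For a 640 × 480 frame, the 512 × 36 shape needs four times as many blocks as the 2048 × 9 shape at far lower utilisation, which is exactly the kind of gap between default and optimal mappings the paper measures.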
On the other hand, local operations (such as filters) only need part of the image to produce an
output, and operate efficiently in stream processing mode, using line buffers to cache data for scanning
a local window through the image. This works well when the image size is fixed, and is known in
advance. Two situations where this approach is less effective [3] are region-of-interest processing, where only a small region of the image is processed (usually determined from the image contents at run-time), and cloud processing of user-uploaded images (which may be of arbitrary size). This is
complicated further in high-speed systems, where the real-time requirements demand processing
multiple pixels in every clock cycle, because, if the line width is not a multiple of the number of pixels
processed each cycle, then it is necessary to assemble the output window pixels from more than one
memory block. Shi et al. [3] extend their earlier work on assembling the output window
to allow arbitrary image widths. The resulting line buffer must be configurable at run-time, which is
achieved through a series of “instructions”, which control the assembly of the output processing
window when the required data spans two memory blocks. Re-configuration only takes a few clock cycles (to load the instructions), rather than the conventional approach of reconfiguring the FPGA each
time the image width changes. The results demonstrate better resource utilisation, higher throughput,
and lower power than their earlier approach.
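The line-buffer idea behind stream processing can be modelled in software. The following sketch is an illustration of the basic mechanism, not Shi et al.'s microarchitecture: two line buffers cache the previous rows, and a 3 × 3 window is assembled for each pixel as it streams past in raster order.

```python
# Minimal software model of stream processing with line buffers:
# pixels arrive one per "clock", two line buffers cache the two
# previous rows, and a 3x3 window is formed at each valid position.
from collections import deque

def stream_3x3(image):
    """Yield (y, x, 3x3 window) for every fully valid window centre,
    visiting pixels in raster order as a streamed pipeline would."""
    h, w = len(image), len(image[0])
    above2 = [0] * w   # line buffer: row y-2
    above1 = [0] * w   # line buffer: row y-1
    for y in range(h):
        # the three most recent window columns shift as row y streams in
        cols = deque([(0, 0, 0)] * 3, maxlen=3)
        for x in range(w):
            p = image[y][x]
            cols.append((above2[x], above1[x], p))
            above2[x], above1[x] = above1[x], p   # update line buffers
            if y >= 2 and x >= 2:                 # window fully valid
                yield y - 1, x - 1, [list(r) for r in zip(*cols)]
```

The complication the paper addresses arises when several pixels per clock are processed and a window may straddle two physical memory blocks; this model hides that by working one pixel at a time.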
When applying window operations to an image, the size of the output image is smaller than
the input because data is not valid when the window extends beyond the image border. If necessary,
this may be mitigated by extending the input image to provide data to allow such border pixels to be
calculated. Prior work only considered border management using direct form filter structures, because
the window formation and filter function can be kept independent. However, in some applications,
transpose-form filter structures are desirable because the corresponding filter function is automatically
pipelined, leading to fewer resources and faster clock frequencies. Bailey and Ambikumar [4] provide
a design methodology for border management using transpose filter structures, and show that the
resource requirements are similar to those for direct-form border management.
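Border extension itself is simple to state in software terms: extending the input lets a window filter produce an output the same size as its input. A minimal sketch follows, using edge replication (the choice of extension scheme is an assumption; the paper's contribution is doing this within transpose-form filter structures, which this sketch does not model).

```python
# Sketch of border management by input extension: replicating edge
# pixels lets a 3x3 (or larger) filter produce an output the same
# size as the input, instead of shrinking by the window radius.
import numpy as np

def filter_with_border(image, kernel):
    k = kernel.shape[0] // 2
    padded = np.pad(image, k, mode="edge")   # replicate border pixels
    out = np.zeros_like(image, dtype=float)
    for y in range(image.shape[0]):
        for x in range(image.shape[1]):
            out[y, x] = np.sum(padded[y:y + 2*k + 1, x:x + 2*k + 1] * kernel)
    return out
```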
An important task in computer vision is segmenting objects from a complex background. While
there are many background modelling algorithms, the complexity of robust algorithms makes them difficult to realise on an FPGA, especially for larger image sizes. Chen et al. [5] address scalability issues
with increasing image size by using super-pixels—small blocks of adjacent pixels that are treated as a
single unit. As each super-pixel is considered to be either object or background, this means that fewer
models need to be maintained (less memory) and fewer elements need to be classified (reduced
computation time). Using hardware/software co-design, they accelerated the computationally
expensive steps of Gaussian filtering and calculating the mean and variance within each super-pixel
with hardware, with the rest of the algorithm being realised on the on-chip CPU. The resulting system
gave close to state-of-the-art classification accuracy.
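The accelerated statistics step can be sketched as follows. For simplicity the sketch uses fixed B × B blocks rather than true super-pixels, and computes mean and variance from running sums of x and x² — the formulation that makes a streamed hardware implementation cheap, since each block needs only two accumulators.

```python
# Sketch of the hardware-accelerated step: per-block mean and variance
# from running sums of x and x^2 (two accumulators per block).
# Fixed BxB blocks stand in for super-pixels here.
import numpy as np

def block_stats(image, B):
    h, w = image.shape
    means = np.empty((h // B, w // B))
    variances = np.empty((h // B, w // B))
    for by in range(h // B):
        for bx in range(w // B):
            blk = image[by*B:(by+1)*B, bx*B:(bx+1)*B].astype(float)
            s, s2, n = blk.sum(), (blk ** 2).sum(), B * B
            means[by, bx] = s / n
            variances[by, bx] = s2 / n - (s / n) ** 2   # E[x^2] - E[x]^2
    return means, variances
```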
A related paper, by Badawi and Bilal [6], used K-means clustering to segment objects within video
sequences. Rather than taking the conventional iterative approach to K-means clustering, they rely
on the temporal coherence of video streams and use the cluster centres from the previous frame as
initialisation for the current frame. Additionally, rather than waiting until the complete frame has
been accumulated before updating the cluster centres, an online algorithm is used, with the clusters
updated for each pixel. To reduce the computational requirements, the centres are updated using a
weighted average. They demonstrate that, for typical video streams, this gives similar performance to
conventional K-means algorithms, but with far less computation and power.
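A minimal software model of this online scheme is sketched below; the learning weight alpha is an illustrative parameter, not the paper's value, and scalar pixels stand in for colour vectors.

```python
# Sketch of online, weighted-average K-means: cluster centres carry
# over between frames, and each streamed pixel nudges its nearest
# centre instead of re-iterating over the whole frame.
# `alpha` is an illustrative learning weight.

def online_kmeans(pixels, centres, alpha=0.05):
    centres = list(centres)
    labels = []
    for p in pixels:
        i = min(range(len(centres)), key=lambda j: abs(p - centres[j]))
        centres[i] += alpha * (p - centres[i])   # weighted-average update
        labels.append(i)
    return centres, labels
```

Seeding `centres` with the previous frame's result is what exploits the temporal coherence of video described above.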
In another segmentation paper, Zhou et al. [7] describe the use of a generalised Laplacian of
Gaussian (LoG) filter for detecting cell nuclei for a histopathology application. The LoG filters detect
elliptical blobs at a range of scales and orientations. Local maxima of the responses are used as
candidate seeds for cell centres, and mean-shift clustering is used to combine multiple detections
from different scales and orientations. Their FPGA design gave modest acceleration over a software
implementation on a high-end computer.
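A simplified, single-scale, circular version of the detection pipeline can be sketched as below; the generalised LoG in the paper additionally varies scale and orientation to match elliptical nuclei, and its thresholds and seed clustering are more sophisticated than this illustration.

```python
# Sketch of LoG blob detection (single scale, circular kernel):
# build a zero-mean LoG kernel, correlate, and take positive local
# maxima of the response as candidate seeds.
import numpy as np

def log_kernel(sigma, size):
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    r2 = xx**2 + yy**2
    k = (r2 - 2 * sigma**2) / sigma**4 * np.exp(-r2 / (2 * sigma**2))
    return k - k.mean()   # zero mean: flat regions give no response

def detect_blobs(image, sigma=2.0):
    size = int(6 * sigma) | 1          # odd kernel size ~ 6 sigma
    k = log_kernel(sigma, size)
    h, w = image.shape
    pad = size // 2
    padded = np.pad(image.astype(float), pad, mode="edge")
    resp = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            # negate k so bright blobs give positive peaks
            resp[y, x] = np.sum(padded[y:y+size, x:x+size] * (-k))
    seeds = [(y, x) for y in range(1, h - 1) for x in range(1, w - 1)
             if resp[y, x] > 0 and resp[y, x] == resp[y-1:y+2, x-1:x+2].max()]
    return seeds, resp
```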
Given a segmented image, a common task is to measure feature vectors of each connected
component for analysis. Bailey and Klaiber [8] present a new single-pass connected components
analysis algorithm, which does this with minimum latency and relatively few resources. The key novelty
of this paper is the use of a zig-zag based scan, rather than a conventional raster scan. This eliminates the
end-of-row processing for label resolution by integrating it directly within the reverse scan. The result is
true single-pixel-per-clock-cycle processing, with no overheads at the end of each row or frame.
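For contrast, a conventional raster-scan labelling with union-find (not the authors' zig-zag algorithm) shows the label merges that end-of-row or end-of-image resolution must normally handle:

```python
# Not the authors' zig-zag algorithm: a minimal raster-scan connected
# components labelling with union-find, illustrating the provisional
# labels and merges that label resolution must clean up.

def label_components(binary):            # 4-connectivity
    h, w = len(binary), len(binary[0])
    parent = [0]                         # parent[0] = background
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a
    labels = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if not binary[y][x]:
                continue
            up = labels[y - 1][x] if y else 0
            left = labels[y][x - 1] if x else 0
            if up and left:
                a, b = find(up), find(left)
                parent[max(a, b)] = min(a, b)   # merge equivalence classes
                labels[y][x] = min(a, b)
            elif up or left:
                labels[y][x] = up or left
            else:
                parent.append(len(parent))      # new provisional label
                labels[y][x] = len(parent) - 1
    # final resolution pass over provisional labels
    return [[find(l) if l else 0 for l in row] for row in labels]
```

The zig-zag scan in the paper folds this resolution work into the reverse scan direction, so no separate end-of-row pass is needed.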
An important real-time application of image processing is embedded online image compression
for reducing the data bandwidth for image transmission. In the final paper within this Special Issue,
Wang et al. [9] defined a new image compression codec which works efficiently with a streamed image,
and minimises the perceptual distortion within the reconstructed images. Through small local filters,
each pixel is classified as either an edge, a smooth region, or a textured region. These relate to a
perceptual model of contrast masking, allowing just noticeable distortion (JND) thresholds to be
defined. The image is compressed by downsampling; however, if the error in any of the contributing
pixels exceeds the visibility thresholds, the 2 × 2 block is considered a region of interest, with the
4 pixels coded separately. In both cases, the pixel values are predicted using a 2-dimensional predictor,
and the prediction residuals are quantised and entropy-encoded. Results typically give a visually
lossless 4:1 compression, which is significantly better than other visually lossless codecs.
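The region-of-interest decision at the heart of the codec can be sketched as below; the JND thresholds here are supplied as plain inputs, standing in for the paper's perceptual contrast-masking model, and the subsequent prediction and entropy coding stages are omitted.

```python
# Sketch of the ROI decision: each 2x2 block is downsampled to its
# mean; if any contributing pixel's error would exceed its JND
# threshold, the block is kept as a region of interest with its
# 4 pixels coded separately. Thresholds are illustrative inputs,
# not the paper's perceptual model.

def classify_blocks(image, jnd):     # image, jnd: 2D lists, even dims
    blocks = []
    for y in range(0, len(image), 2):
        for x in range(0, len(image[0]), 2):
            pix = [image[y+dy][x+dx] for dy in (0, 1) for dx in (0, 1)]
            thr = [jnd[y+dy][x+dx] for dy in (0, 1) for dx in (0, 1)]
            mean = sum(pix) / 4.0
            roi = any(abs(p - mean) > t for p, t in zip(pix, thr))
            blocks.append(("roi", pix) if roi else ("down", mean))
    return blocks
```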
3. Conclusions
Overall, this collection of papers reflects the diversity of approaches taken to applying FPGAs to image processing applications: at one end, using the programmable logic to design lightweight custom processors that enable parallelism; through overcoming some of the limitations of current high-level synthesis tools; to, at the other end, designing custom hardware at the register-transfer level.
The range of image processing techniques includes filtering, segmentation, clustering, and compression. Applications include traffic sign recognition for autonomous driving, histopathology, and video compression.
References
1. Siddiqui, F.; Amiri, S.; Minhas, U.I.; Deng, T.; Woods, R.; Rafferty, K.; Crookes, D. FPGA-based processor
acceleration for image processing applications. J. Imaging 2019, 5, 16. [CrossRef]
2. Garcia, P.; Bhowmik, D.; Stewart, R.; Michaelson, G.; Wallace, A. Optimized memory allocation and power
minimization for FPGA-based image processing. J. Imaging 2019, 5, 7. [CrossRef]
3. Shi, R.; Wong, J.S.; So, H.K.H. High-throughput line buffer microarchitecture for arbitrary sized streaming
image processing. J. Imaging 2019, 5, 34. [CrossRef]
4. Bailey, D.G.; Ambikumar, A.S. Border handling for 2D transpose filter structures on an FPGA. J. Imaging
2018, 4, 138. [CrossRef]
5. Chen, A.T.Y.; Gupta, R.; Borzenko, A.; Wang, K.I.K.; Biglari-Abhari, M. Accelerating SuperBE with
hardware/software co-design. J. Imaging 2018, 4, 122. [CrossRef]
6. Badawi, A.; Bilal, M. High-level synthesis of online K-Means clustering hardware for a real-time image
processing pipeline. J. Imaging 2019, 5, 38. [CrossRef]
7. Zhou, H.; Machupalli, R.; Mandal, M. Efficient FPGA implementation of automatic nuclei detection in
histopathology images. J. Imaging 2019, 5, 21. [CrossRef]
8. Bailey, D.G.; Klaiber, M.J. Zig-zag based single pass connected components analysis. J. Imaging 2019, 5, 45.
[CrossRef]
9. Wang, Z.; Tran, T.H.; Muthappa, P.K.; Simon, S. A JND-based pixel-domain algorithm and hardware
architecture for perceptual image coding. J. Imaging 2019, 5, 50. [CrossRef]
© 2019 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Article
FPGA-Based Processor Acceleration for Image
Processing Applications
Fahad Siddiqui 1,†, Sam Amiri 2,†, Umar Ibrahim Minhas 1, Tiantai Deng 1, Roger Woods 1,*, Karen Rafferty 1 and Daniel Crookes 1
1 School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast,
Belfast BT7 1NN, UK; [email protected] (F.S.); [email protected] (U.I.M.); [email protected] (T.D.);
[email protected] (K.R.); [email protected] (D.C.)
2 School of Computing, Electronics and Maths, Coventry University, Coventry CV1 5FB, UK;
[email protected]
* Correspondence: [email protected]; Tel.: +44-289-097-4081
† These authors contributed equally to this work.
Abstract: FPGA-based embedded image processing systems offer considerable computing resources
but present programming challenges when compared to software systems. The paper describes an
approach based on an FPGA-based soft processor called Image Processing Processor (IPPro), which can operate at up to 337 MHz on a high-end Xilinx FPGA family, and gives details of the dataflow-based
programming environment. The approach is demonstrated for a k-means clustering operation and
a traffic sign recognition application, both of which have been prototyped on an Avnet Zedboard that has a Xilinx Zynq-7000 system-on-chip (SoC). A number of parallel dataflow mapping options
were explored giving a speed-up of 8 times for the k-means clustering using 16 IPPro cores, and a
speed-up of 9.6 times for the morphology filter operation of the traffic sign recognition using
16 IPPro cores compared to their equivalent ARM-based software implementations. We show that for
k-means clustering, the 16 IPPro cores implementation is 57, 28 and 1.7 times more power efficient
(fps/W) than ARM Cortex-A7 CPU, nVIDIA GeForce GTX980 GPU and ARM Mali-T628 embedded
GPU respectively.
1. Introduction
With improved sensor technology, there has been a considerable growth in the amount of data
being generated by security cameras. In many remote environments with limited communication bandwidth, there is a clear need to overcome this by deploying functionality remotely in the system, such as motion estimation in smart cameras [1]. As security requirements grow, the processing needs will only increase.
New forms of computing architectures are needed. In the late 1970s, Lamport [2] laid the foundation of parallel architectures exploiting data-level parallelism (DLP) using workload vectorisation and
shared memory parallelisation, used extensively in Graphical Processing Units (GPUs). Current energy
requirements and limitations of Dennard scaling have acted to limit clock scaling and thus reduce
future processing capabilities of GPUs or multi-core architectures [3]. Recent field programmable gate
array (FPGA) architectures represent an attractive alternative for acceleration as they comprise ARM
processors and programmable logic for accelerating computing intensive operations.
FPGAs are proven computing platforms that offer reconfigurability, concurrency and pipelining,
but have not been accepted as a mainstream computing platform. The primary inhibitor is the need to
use specialist programming tools, describing algorithms in a hardware description language (HDL), although
this has been alleviated by the introduction of high-level programming tools such as Xilinx’s Vivado
High-level Synthesis (HLS) and Intel’s (Altera’s) compiler for OpenCL. While the level of abstraction
has been raised, a gap still exists between adaptability, performance and efficient utilisation of FPGA
resources. Moreover, the FPGA design flow still requires design synthesis and place-and-route, which
can be time-consuming depending on the complexity and size of the design [4,5]. This FPGA design
flow is alien to software/algorithm developers and inhibits wider use of the technology.
One way to approach this research problem is to develop adaptable FPGA hardware architecture
that enables edit-compile-run flow familiar to software and algorithm developers instead of hardware
synthesis and place-and-route. This can be achieved by populating FPGA logic with a number of efficient
soft core processors used for programmable hardware acceleration. This underlying architecture will
be adaptable and can be programmed using conventional software development approaches. However,
the challenge is to build an FPGA solution that is more easily programmed whilst still providing high
performance. Whilst FPGA-based processor architectures exist such as Xilinx’s MicroBlaze, Altera’s
NIOS and others [6–9], we propose the Image Processing Processor (IPPro) [10], tailored to
accelerate image processing operations, thereby providing an excellent mapping between FPGA
resources, speed and programming efficiency. The main purpose of the paper is to give insights into
the multi-core processor architecture built using the IPPro architecture, its programming environment
and outline its applications to two image processing applications. Our main contributions are:
• Creation of an efficient, FPGA-based multicore processor which advances previous work [10,11], and an associated dataflow-based compiler environment for programming a heterogeneous FPGA resource comprising IPPro cores and ARM processors.
• Exploration of mapping the functionality for a k-means clustering function, resulting in a possible speedup of up to 8 times, which is 57, 28 and 1.7 times more power efficient (fps/W) than an ARM Cortex-A7 CPU, an nVIDIA GeForce GTX980 GPU and an ARM Mali-T628 embedded GPU, respectively.
• Acceleration of colour and morphology operations of traffic sign recognition application, resulting
in a speedup of 4.5 and 9.6 times respectively on a Zedboard.
The rest of the paper is organized as follows: Section 2 outlines the various image processing
requirements and outlines how these can be matched to FPGA; relevant research is also reviewed.
System requirements are outlined in Section 3 and the soft core processor architecture is also briefly
reviewed in Section 4. The system architecture is outlined in Section 5. Experiments to accelerate a k-means clustering algorithm and a traffic sign recognition example are presented in Sections 6 and 7, respectively. Conclusions and future work are described in Section 8.
2. Background
Traditionally, vision systems have been created in a centralized manner where video from
multiple cameras is sent to a central back-end computing unit to extract significant features. However,
with an increasing number of nodes and wireless communications, this approach becomes increasingly limited, particularly with higher-resolution cameras [12]. A distributed processing approach can be employed where data-intensive, front-end preprocessing, such as sharpening and object detection, can be deployed remotely, thus avoiding the need to transmit high-bandwidth video streams back to the server.
• Customised hardware accelerator designs in HDLs which require long development times but
can be optimised in terms of performance and area.
• Application specific hardware accelerators which are generally optimized for a single function,
non-programmable and created using IP cores.
• Designs created using high-level synthesis tools, such as Xilinx’s Vivado HLS tool and Altera’s OpenCL compiler, which convert a C-based specification into synthesizable RTL code [15], allowing pipelining and parallelization to be explored.
• Programmable hardware accelerators in the form of vendor-specific soft processors, such as Xilinx’s MicroBlaze and Altera’s NIOS II processors, and customized hard/soft processors.
Table 1. Categorisation of image processing operations based on their memory and execution patterns [13], allowing compute and memory characteristics to be highlighted and therefore identifying what can be mapped onto an FPGA.
a GPGPU architecture called FlexGrip [8] which, like vector processors, supports wide data-parallel,
SIMD-style computation using multiple parallel compute lanes, provides support for conditional
operations, and requires optimized interfaces to on- and off-chip memory. FlexGrip maps pre-compiled
CUDA kernels on soft core processors which are programmable and operate at 100 MHz.
3. System Implementation
Whilst earlier versions of FPGAs just comprised multiple Lookup Tables (LUT) connected to
registers and accelerated by fast adders, FPGAs now comprise more coarse-grained functions such as
dedicated, full-custom, low-power DSP slices. For example, the Xilinx DSP48E1 block comprises a 25-bit pre-adder, a 25 × 18-bit multiplier and a 48-bit adder/subtracter/logic unit. FPGAs also provide multiple distributed RAM blocks, which offer high bandwidth capability (Figure 1), and a plethora of registers which support high levels of pipelining.
Figure 1. Bandwidth/memory distribution in the Xilinx Virtex-7 FPGA, highlighting how bandwidth and computation improve closer to the datapath parts of the FPGA.
Whilst FPGAs have been successfully applied in embedded systems and communications,
they have struggled as a mainstream computational platform. Addressing the following considerations
would make FPGAs a major rival platform for “data-intensive” applications:
• Programmability: there is a need for a design methodology which includes a flexible data
communication interface to exchange data. Intellectual Property (IP) cores and HLS tools [15]/
OpenCL design routes increase programming abstraction but do not provide the flexible system
infrastructure for image processing systems.
• Dataflow support: the dataflow model of computation is a recognized model for data-intensive
applications. Algorithms are represented as a directed graph composed of nodes (actors) as
computational units and edges as communication channels [21]. While the actors run explicitly in parallel, as decided by the user, actor functionality can be either sequential or concurrent. Current
FPGA realizations use the concurrency of the whole design at a higher level but eliminate
reprogrammability. A better approach is to keep reprogrammability while still maximizing
parallelism by running actors on simple “pipelined” processors; the actors still run their code
explicitly in parallel (user-specified).
• Heterogeneity: the processing features of FPGAs should be integrated with CPUs. Since dataflow supports both sequential and concurrent platforms, the challenge is then to allow effective mapping of sequential code onto CPUs and parallelizable code onto the FPGA.
• Toolset availability: design tools created to specifically compile user-defined dataflow programs at
higher levels to fully reprogrammable heterogeneous platform should be available.
An actor is a standalone entity, which defines an execution procedure and can be implemented in the IPPro
processor. Actors communicate with other actors by passing data tokens, and the execution is done
through the token passing through First-In-First-Out (FIFO) units. The combination of a set of actors
with a set of connections between actors constructs a network, which maps well to the system level
architecture of the IPPro processors. An earlier version of the programming environment is detailed in [11], allowing the user to explore parallel implementations and providing the necessary back-end compilation support.
In our flow, every processor can be thought of as an actor and data is fired through the FIFO
structures but the approach needs to be sensitive to FPGA-based limitations such as restricted memory.
Cal Actor Language (CAL) [22] is a dataflow programming language focused on image processing and FPGAs; it offers the necessary constructs for expressing parallel or sequential coding, bitwise types, a consistent memory model, and communication between parallel tasks
through queues. RVC-CAL is supported by an open source dataflow development environment and
compiler framework, Orcc, that allows the trans-compilation of actors and generates equivalent code
depending on the chosen back-ends [23]. An RVC-CAL based design is composed of a dataflow
network file (.xdf file) that supports task and data-level parallelism.
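The actor/FIFO model can be illustrated with a toy software network (a stand-in for the RVC-CAL/Orcc flow, not its output): each actor fires once per input token and passes results downstream through queues.

```python
# Toy model of the dataflow idea: actors are functions reading tokens
# from an input FIFO and writing to an output FIFO; a pipeline is a
# chain of actors wired by queues.
from queue import Queue

def run_pipeline(source, actors):
    """Chain actors via FIFOs and pull all tokens through."""
    q_in = Queue()
    for tok in source:
        q_in.put(tok)
    q_in.put(None)                       # end-of-stream marker
    for fire in actors:
        q_out = Queue()
        while True:
            tok = q_in.get()
            if tok is None:
                q_out.put(None)
                break
            q_out.put(fire(tok))         # actor fires once per token
        q_in = q_out
    out = []
    while True:
        tok = q_in.get()
        if tok is None:
            return out
        out.append(tok)

print(run_pipeline([1, 2, 3], [lambda t: t + 1, lambda t: t * 2]))
```

On the real system, each actor would run on its own IPPro core and the queues would be hardware FIFOs, so the stages execute concurrently rather than in sequence as here.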
Figure 2 illustrates the possible pipelined decomposition of dataflow actors. These dataflow
actors need to be balanced, as the worst-case execution time of an actor determines the overall
achievable performance. Data-level parallelism is achieved by making multiple instances of an actor
and requires SIMD operations that must be supported by the underlying processor architecture.
In addition, it requires a software-configurable system-level infrastructure that manages control and data
distribution/collection tasks. This involves initialising the soft core processors (programming the
decomposed dataflow actor descriptions), receiving data from the host processor, distributing them to
first-level actors, gathering processed data from the final-level actors and sending it back to the host processor.
Data-level parallelism directly impacts the system performance; the major limiting factor is the
number of resources available on the FPGA. An example pipeline structure, with an algorithm composed
of four actors each having different execution times, and multiple instances of the algorithm realised
in SIMD fashion, is shown in Figure 2. The performance metric, frames per second (fps), can be
approximated using N_total_pixels, the number of pixels in a frame; N_pixel_consumption, the number of
pixels consumed by an actor in each iteration; and f_processor, the operating frequency of the processor:

    fps ≈ (f_processor × N_pixel_consumption) / N_total_pixels    (1)
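Equation (1) can be sketched numerically; the 337 MHz clock is the IPPro f_max reported later in this chapter, while the 512×512 frame size and one pixel consumed per iteration are assumed example values:

```python
def fps_estimate(f_processor_hz, n_pixel_consumption, n_total_pixels, n_cores=1):
    """Equation (1); n_cores > 1 models the vertical-scaling factor n."""
    return f_processor_hz * n_pixel_consumption * n_cores / n_total_pixels

# 337 MHz core; 512x512 frame and single-pixel consumption are assumptions.
single = fps_estimate(337e6, 1, 512 * 512)
simd8 = fps_estimate(337e6, 1, 512 * 512, n_cores=8)
print(round(single), round(simd8))  # 1286 10284
```

The n_cores parameter anticipates the vertical-scaling option discussed next: spreading the frame over n cores multiplies the achievable fps by n, resources permitting.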
To improve the fps, the following options are possible:
• Efficient FPGA-based processor design that operates at a higher operating frequency f_processor.
• Reducing the actor's execution time by decomposing it into multiple pipelined stages, thus reducing
t_actor to improve the fps. Shorter actors can be merged sequentially to minimise the data transfer
overhead by localising data in FIFOs between processing stages.
• Vertical scaling to exploit data parallelism by mapping an actor onto multiple processor cores, thus
reducing N_total_pixels/(n × N_pixel_consumption) at the cost of additional system-level data distribution, control, and
collection mechanisms.
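The decomposition trade-off in the second option can be illustrated with a toy calculation (the stage times below are hypothetical): the worst-case stage sets the firing rate of the whole pipeline, so splitting the slowest actor raises throughput:

```python
def pipeline_throughput(stage_times_us):
    """Steady-state firing rate of a pipeline of actors: the worst-case
    stage execution time bounds the whole pipeline."""
    return 1.0 / max(stage_times_us)     # firings per microsecond

# Hypothetical stage times (us) for a four-actor algorithm
unbalanced = [0.2, 0.9, 0.3, 0.2]
# Decomposing the 0.9 us actor into three pipelined 0.3 us stages
balanced = [0.2, 0.3, 0.3, 0.3, 0.3, 0.2]
print(pipeline_throughput(unbalanced))   # ~1.11 firings/us
print(pipeline_throughput(balanced))     # ~3.33 firings/us
```

This is why balance points matter: decomposition only helps until token-transfer overhead between the extra stages outweighs the gain.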
Figure 2. Illustration of possible data- and task-parallel decomposition of a dataflow algorithm found in
image processing designs, where the number of rows indicates the level of parallelism.
The developed tool flow (Figure 3) starts with a user-defined RVC-CAL description composed
of actors selected to execute in FPGA-based soft cores, with the rest run on the host CPUs.
By analysing behaviour, software/hardware partitioning is decided by two main factors: the actors
with the worst execution time (determined exactly by the number of instructions and the average waiting
time to receive the input tokens and send the produced tokens), and the overheads incurred in
transferring the image data to/from the accelerator. The behavioural description of an algorithm could
be coded in different formats:
[Figure 3 flowchart: behavioural description in RVC-CAL → software/hardware partitioning →
redesign of CPU-targeted and FPGA-targeted actors in RVC-CAL → RVC-CAL-to-C compilation on the
CPU side; SIMD application and compiler infrastructure (XDF analysis, actor code generation) on the
FPGA side → interface settings and control-register value/parameter generation → system implementation.]
Figure 3. A brief description of the design flow of a hardware and software heterogeneous system
highlighting key features. More detail of the flow is contained in reference [11].
There are two types of decomposition, "row-wise" and "column-wise". Newly generated
data-independent actors can be placed row-wise at the same pipeline stage; otherwise, they are placed
column-wise as consecutive pipeline stages. Row-wise is preferred, as the overhead incurred in token
transmission can be a limiting factor, but typically a combination is employed.
If the actors or actions are not balanced, then they need to be decomposed. This is done by
detecting sequences of instructions without branches (unless a branch occurs at the end) and breaking
the program into basic blocks. The "balance points", at which the actor is divided into multiple sets
of basic blocks with each set placed in a new actor, then need to be found; this ensures that the
overhead of transferring tokens among the sets does not create a bottleneck, and the partition with
the lowest overhead is selected (see Ref. [11]). Once the graph is partitioned, the original .xdf file
no longer represents the network topology, so each set of actors must be redesigned separately, their
input/output ports fixed, and a new set of .xdf dataflow network description files generated.
The actors to run on the host CPU are compiled from RVC-CAL to C using the C back-end of the Orcc
development environment, whereas the FPGA-based functionality is created using the proposed
compiler framework.
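The basic-block detection step can be sketched as follows; the branch mnemonic `BZ` is hypothetical (the IPPro branch encodings are not reproduced here), while GET/SUB/MUL/PUSH appear in the assembly of Figure 10:

```python
def basic_blocks(program, branch_ops=("BZ", "BNZ", "JMP")):
    """Cut the instruction list after every branch so each resulting
    block is branch-free except possibly at its end."""
    blocks, current = [], []
    for instr in program:
        current.append(instr)
        if instr.split()[0] in branch_ops:   # a branch ends the block
            blocks.append(current)
            current = []
    if current:                              # trailing branch-free block
        blocks.append(current)
    return blocks

# 'BZ' is a hypothetical branch mnemonic used only for illustration.
prog = ["GET R1, 1", "SUB R7, R1, R3", "BZ done", "MUL R15, R7, R7", "PUSH R15, 1"]
blocks = basic_blocks(prog)
print(len(blocks))  # 2
```

The resulting blocks are the candidate units that the partitioner groups into sets around the balance points, trading token-transfer overhead against stage balance.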
The degree of SIMD applied will affect the controller interface settings. For a target board,
the design will have a fixed number of IPPro cores realized and interconnected with each other and
controllers, determined by the FPGA resources and fan-out delay; for the Zedboard considered here,
32 cores are selected. The compilation infrastructure is composed of three distinctive steps:
• Examination of the xdf dataflow network file and assignment and recording of the actor mapping
to the processors on the network.
• Compilation of each actor’s RVC-CAL code to IPPro assembly code.
• Generation of control register values, mainly for AXI-Lite registers, and parameters required by
the developed C-APIs running on the host CPU.
While FPGA-targeted actor interaction is handled by the compiler, the processes for receiving
the image data and storing the output in the edge actors need to be developed. Multiple controllers
(programmable by the host CPU) are designed to provide the interface to transfer the data to the
accelerators, gather the results and transfer them back to the host. With the host CPU running part
of the design and setting control registers, and the IPPro binary codes of the other actors loaded to
the proper cores on the accelerator, and the interface between the software/hardware sections set
accordingly, the system implementation is in place and ready to run.
[Figure 4: bar charts of maximum achievable clock frequency (MHz). (a) DSP48E1 configurations
versus Kintex-7 speed grade (−3, −2, −1); (b) single-port versus true-dual-port RAM across Virtex-7,
Kintex-7 and Artix-7 fabrics.]
Figure 4. (a) Impact of DSP48E1 configurations on the maximum achievable clock frequency across
speed grades of Kintex-7 FPGAs, for fully pipelined designs without (NOPATDET) and with
(PATDET) the PATtern DETector, multiply with no MREG (MULT_NOMREG), multiply with no MREG and
pattern detector (MULT_NOMREG_PATDET), and multiply with pre-adder and no ADREG
(PREADD_MULT_NOADREG). (b) Impact of BRAM configurations on the maximum achievable clock
frequency of Artix-7, Kintex-7 and Virtex-7 FPGAs for single-port and true-dual-port RAM configurations.
Table 2. Computing resources (DSP48E1) and BRAM memory resources for a range of Xilinx Artix-7,
Kintex-7, Virtex-7 FPGA families implemented using 28nm CMOS technology.
depends on the input or neighbouring pixels. This model is only suitable for mapping a single
dataflow node.
The second model (2) increases the datapath functionality to a fine-grained processor by including
BRAM-based instruction memory (IM), a program counter (PC) and kernel memory (KM) to store
constants, as shown in Figure 6b. Conversely, (2) can support the mapping of multiple data-independent
dataflow nodes, as shown in Figure 5b. The node (OP2) requires memory storage for a variable (t1)
used to compute the output token (C), which feeds back from the output of the ALU for the next
instruction in the following clock cycle. This model supports improved dataflow mapping functionality
over (1) by introducing an IM, which comes at the cost of variable execution time and throughput
proportional to the number of instructions required to implement the dataflow actor. This model is
suitable for accelerating combinational logic computations.
The third model (3) increases the datapath functionality to map and execute a data-dependent
dataflow actor, as shown in Figure 5c. The datapath has memory in the form of a register file (RF),
giving the coarse-grained processor shown in Figure 6c. The RF stores intermediate results to execute
data-dependent operations, implements the (feed-forward, split, merge and feedback) dataflow execution
patterns, and facilitates dataflow transformations (actor fusion/fission, pipelining, etc.), constrained
by the size of the RF. It can implement modular computations which are not possible in (1) and (2).
In contrast to (1) and (2), the token production/consumption (P/C) rate of (3) can be controlled through
program code, which allows software-controlled scheduling and load-balancing possibilities.
Figure 5. A range of dataflow models taken from [24,25]. (a) DFG node without internal storage,
called configuration 1; (b) DFG actor with internal storage t1 and constant i, called configuration 2;
(c) programmable DFG actor with internal storage t1, t2 and t3 and constants i and j, called
configuration 3.
in Verilog HDL, synthesised, and placed and routed using the Xilinx Vivado Design Suite v2015.2 on
Xilinx chips of widely available development kits: Artix-7 (Zedboard), Kintex-7 (ZC706) and
Virtex-7 (VC707). The obtained f_max results are reported in Figure 7.
In this analysis, f_max is considered as the performance metric for each processor datapath model;
it shows a reduction of 8% and 23% for (2) and (3) compared to (1) using the same FPGA technology.
For (2), the addition of memory elements, specifically the IM realised using dedicated BRAM, affects
f_max by ≈8% compared to (1). Moreover, the instruction decoder (ID), a combinational part of the
datapath, significantly increases the critical path length of the design. A further 15% f_max degradation
from (2) to (3) results from adding the memory elements KM and RF to support control- and
data-dependent execution, which requires additional control logic and data multiplexers. Comparing
different FPGA fabrics, an f_max reduction of 14% and 23% is observed for Kintex-7 and Artix-7. When (3)
is ported from Virtex-7 to Kintex-7 and Artix-7, a maximum f_max reduction of 5% and 33% is observed.
This analysis has laid firm foundations by comparing different processor datapath and dataflow
models and how they impact the f max of the resultant soft-core processor. The trade-off analysis
shows that an area-efficient, high-performance soft core processor architecture can be realised that
supports requirements to accelerate image pre-processing applications. Among the presented models,
(3) provides the best balance among functionality, flexibility, dataflow mapping and optimisation
possibilities, and performance. This model is used to develop a novel FPGA-based soft core IPPro
architecture in Section 4.3.
[Figure 7: maximum clock frequency (MHz) achieved by datapath models 1, 2 and 3.]
Figure 8. Block diagram of FPGA-based soft core Image Processing Processor (IPPro) datapath
highlighting where relevant the fixed Xilinx FPGA resources utilised by the approach.
Table 3 outlines the relationship between data abstraction and the addressing modes, along with
some supported instructions for the IPPro architecture, facilitating programmable implementation of
point and area image processing algorithms. The stream access reads a stream of tokens/pixels from
the input FIFO using the GET instruction and allows processing either with constant values (Kernel
Memory-FIFO) or neighbouring pixel values (Register File-FIFO or Register File-Register File).
The processed stream is then written to the output FIFO using the PUSH instruction. The IPPro supports
arithmetic, logical, branch and data handling instructions. The presented instruction set is optimized
after profiling use cases presented in [10,26].
Table 3. IPPro supported addressing modes highlighting the relation to the data processing
requirements and the instruction set.
The IPPro supports branch instructions to handle control flow graphs to implement commonly
known constructs such as if-else and case statements. The DSP48E1 block has a pattern detector that
compares the input operands or the generated output results depending on the configuration and
sets/resets the PATTERNDETECT (PD) bit. The IPPro datapath uses the PD bit along with some
additional control logic to generate four flags zero (ZF), equal (EQF), greater than (GTF) and sign (SF)
bits. When the IPPro encounters a branch instruction, the branch controller (BC) compares the flag
status and branch handler (BH) updates the PC as shown in Figure 8.
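The flag evaluation can be sketched as follows; this is a hedged approximation, since the exact IPPro flag semantics and the pattern-detector configuration are not specified in the text:

```python
def branch_flags(a, b):
    """Sketch of the four IPPro flags for a compare of operands (a, b);
    the hardware derives these via the DSP48E1 pattern detector plus
    extra control logic, whose precise behaviour is not shown here."""
    diff = a - b
    return {
        "ZF": diff == 0,   # zero flag: compare result is zero
        "EQF": a == b,     # equal flag: operands match
        "GTF": a > b,      # greater-than flag
        "SF": diff < 0,    # sign flag: negative result
    }

print(branch_flags(5, 3))  # {'ZF': False, 'EQF': False, 'GTF': True, 'SF': False}
```

On a branch instruction, the branch controller would test one of these flags and, if it is set, the branch handler updates the PC, as in Figure 8.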
The IPPro architecture has been coded in Verilog HDL and synthesised using the Xilinx Vivado
v2015.4 design suite on Kintex-7 FPGA fabric, giving an f_max of 337 MHz. Table 4 shows that the
IPPro architecture achieves a 1.6–3.3× higher operating frequency (f_max) than the relevant
processors highlighted in Section 2.2, by adopting the approach presented in Section 4. Comparing
the FPGA resource usage in Table 4, the flip-flop (FF) utilisation is relatively similar, except for
FlexGrip, which uses 30× more flip-flops. Considering LUTs, the IPPro uses 50% fewer LUT resources
than MicroBlaze and GraphSoC. Analysing design area efficiency, a significant difference (0.76–9.00)
in BRAM/DSP ratio is observed among the processors, which makes IPPro an area-efficient design
based on the proposed metric.
Table 4. Comparison of IPPro against other FPGA-based processor architectures in terms of FPGA
resources used and timing results achieved.

(a) Timing and micro-benchmark results

  Processor                 MicroBlaze    IPPro
  FPGA Fabric               Kintex-7      Kintex-7
  Freq (MHz)                287           337

  Micro-benchmark           Exec. Time (us)         Speed-up
                            MicroBlaze    IPPro
  Convolution               0.60          0.14      4.41
  Degree-2 Polynomial       5.92          3.29      1.80
  5-tap FIR                 47.73         5.34      8.94
  Matrix Multiplication     0.67          0.10      6.70
  Sum of Abs. Diff.         0.73          0.77      0.95
  Fibonacci                 4.70          3.56      1.32

(b) FPGA resource usage

  Resource     MicroBlaze    IPPro    Ratio
  FFs          746           422      1.77
  LUTs         1114          478      2.33
  BRAMs        4             2        2.67
  DSP48E1      0             1        0.00
5. System Architecture
The k-means clustering and Traffic Sign Recognition algorithms have been used to explore and
analyse the impact of both data and task parallelism using a multi-core IPPro implemented on a
ZedBoard. The platform has a Xilinx Zynq XC7Z020 SoC device interfaced to 256 MB of flash memory
and 512 MB of DDR3 memory. The SoC is composed of a host processor, known as the programmable
system (PS), which configures and controls the system architecture, and the FPGA programmable logic
(PL) on which the IPPro hardware accelerator is implemented, as illustrated in Figure 9. The SoC data
communication bus (ARM AMBA-AXI) transfers the data between PS and PL using the AXI-DMA
protocol, and the Xillybus IP core is deployed as a bridge between PS and PL to feed data into the
image processing pipeline. The IPPro hardware accelerator is interfaced with the Xillybus IP core
via FIFOs. The Linux application running on the PS streams data between the FIFO and the file handler
opened by the host application. The Xillybus-Lite interface allows a user-space program running on
Linux to access control registers and manage the underlying hardware architecture.
Figure 9 shows the implemented system architecture, which consists of the necessary control
and data infrastructure. The data interfaces involve stream interfaces (Xillybus-Send and Xillybus-Read);
a uni-directional memory-mapped interface (Xillybus-Write) to program the IPPro cores; and Xillybus-Lite
to manage the line buffer, scatter, gather, IPPro cores and the FSM. Xillybus Linux device drivers are used
to access each of these data and control interfaces. An additional layer of C functions is developed
using the Xillybus device drivers to configure and manage the system architecture, program the IPPro
cores and exchange pixels between PS and PL.
Figure 9. System architecture of IPPro-based hardware acceleration highlighting data distribution and
control infrastructure, FIFO configuration and Finite-State-Machine control.
Control Infrastructure
To exploit parallelism, a configurable control infrastructure has been implemented using the PL
resources of the Zynq SoC. It statically decomposes the data into many equal-sized parts, where each
part can be processed by a separate processing core. A row-cyclic data distribution [28] has been used,
because it allows buffering of data/pixels in a pattern suitable for point and area image processing
operations after storing them into the line buffers. The system-level architecture (Figure 9) is composed
of line buffers, a scatter module to distribute the buffered pixels, a gather module to collect the
processed pixels and a finite-state-machine (FSM) to manage and synchronise these modules.
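The row-cyclic scatter and gather above can be sketched in software (a model of the hardware scatter and gather modules; the function names are illustrative):

```python
def scatter_rows_cyclic(image_rows, n_cores):
    """Row-cyclic scatter: row r is sent to core r mod n_cores, matching
    the row-cyclic data distribution [28] of the control infrastructure."""
    parts = [[] for _ in range(n_cores)]
    for r, row in enumerate(image_rows):
        parts[r % n_cores].append(row)
    return parts

def gather_rows_cyclic(parts):
    """Gather: re-interleave the processed rows back into frame order."""
    n_cores = len(parts)
    total = sum(len(p) for p in parts)
    return [parts[r % n_cores][r // n_cores] for r in range(total)]

rows = [[r] * 4 for r in range(6)]        # toy 6-row, 4-pixel-wide frame
parts = scatter_rows_cyclic(rows, 2)      # core 0: rows 0,2,4; core 1: rows 1,3,5
assert gather_rows_cyclic(parts) == rows  # the round trip preserves the frame
```

The regular row stride seen by each core is what makes this pattern suitable for point operations and, with line buffering, for area operations.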
Table 6. Dataflow actor mapping and supported parallelism of IPPro hardware accelerator design
presented in Figure 11.
  Design   Acceleration Paradigm         Mapping        Parallelism
                                                        Data    Task
  1        Single core IPPro             Single actor   No      No
  2        8-way SIMD IPPro              Single actor   Yes     No
  3        Dual core IPPro               Dual actor     No      Yes
  4        Dual core 8-way SIMD IPPro    Dual actor     Yes     Yes
[Figure 10 panels: (a) graphical view of the Orcc dataflow network; (b) part of the dataflow network
file TopKMeansOrcc.xdf, connecting a Distance actor and an Averaging actor through FIFO ports;
(c) part of Distance.cal, in which the DistCal action consumes two 8-bit pixels, computes each pixel's
distance from the four initial centroids [31, 40, 50, 76] and squares the differences; (d) the compiled
IPPro assembly (DISTCAL: GET the pixels, STR the centroids, SUB the differences, MUL to square
them, ...).]
Figure 10. High-level implementation of k-means clustering algorithm: (a) Graphical view of Orcc
dataflow network; (b) Part of dataflow network including the connections; (c) Part of Distance.cal file
showing distance calculation in RVC-CAL where two pixels are received through an input FIFO channel,
processed and sent to an output FIFO channel; (d) Compiled IPPro assembly code of Distance.cal.
Figure 11. IPPro-based hardware accelerator designs to explore and analyse the impact of parallelism
on area and performance: single-core IPPro (1), eight-way parallel SIMD IPPro (2), parallel dual-core
IPPro (3) and combined dual-core 8-way SIMD IPPro (4).
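The distance stage of the k-means example (Figure 10) can be sketched in software; the four initial centroids come from Distance.cal, while the nearest-centroid selection is an illustrative completion of the actor's role:

```python
CENTROIDS = [31, 40, 50, 76]   # initial centroids from Distance.cal

def nearest_centroid(pixel):
    """Distance stage of the k-means actor: difference to each centroid,
    squared in place of an absolute value (as the RVC-CAL code does),
    then the closest centroid is selected (illustrative completion)."""
    dists = [(pixel - c) * (pixel - c) for c in CENTROIDS]
    return CENTROIDS[dists.index(min(dists))]

print([nearest_centroid(p) for p in (30, 45, 60, 200)])  # [31, 40, 50, 76]
```

In the accelerator, each firing consumes two pixels from the input FIFO and pushes each pixel with its associated centroid to the Averaging actor; the per-centroid SUB/MUL pairs in the generated IPPro assembly correspond directly to the squared differences here.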
the discovery of the tragedy?” asked Thorne, breaking his long
silence.
“No.”
“When was Mr. Brainard taken ill?”
“During dinner last night. Dr. Noyes said it would be unwise for him
to return to Washington, so Mrs. Porter suggested that he stay here
all night, and I loaned him a pair of pajamas.” Wyndham, talking in
short, jerky sentences, felt Thorne’s eyes boring into him.
“I should like to see Dr. Noyes,” began Thorne. “Where—”
“I’ll get him,” Wyndham broke in, hastening to the door; he
disappeared out of the room just as Thorne picked up the razor and
holding it between thumb and forefinger examined it with deep
interest.
However, Wyndham was destined to forget his errand for, as he sped
down the hall, a door opened and his aunt confronted him.
“Wait, Hugh.” Mrs. Porter held up an imperative hand. “Millicent has
told me of poor Bruce’s tragic death, and Murray,” indicating the
footman standing behind her, “informs me that Dr. Beverly Thorne
has had the effrontery to force his way into this house—and at such
a time.”
She spoke louder than customary under the stress of indignation,
and her words reached Beverly Thorne as he appeared in the hall.
He never paused in his rapid stride until he joined the little group,
and his eyes did not fall before the angry woman’s gaze.
“It is only at such a time as this that I would think of intruding,” he
said. “Kindly remember, madam, that I am here in my official
capacity only. Before I sign a death certificate, an inquest must
decide whether your guest, Bruce Brainard, committed suicide—or
was murdered.”