Compiling Algorithms
for Heterogeneous Systems
Synthesis Lectures on
Computer Architecture
Editor
Margaret Martonosi, Princeton University
Founding Editor Emeritus
Mark D. Hill, University of Wisconsin, Madison
Synthesis Lectures on Computer Architecture publishes 50- to 100-page publications on topics
pertaining to the science and art of designing, analyzing, selecting and interconnecting hardware
components to create computers that meet functional, performance and cost goals. The scope will
largely follow the purview of premier computer architecture conferences, such as ISCA, HPCA,
MICRO, and ASPLOS.

Compiling Algorithms for Heterogeneous Systems


Steven Bell, Jing Pu, James Hegarty, and Mark Horowitz
2018

Architectural and Operating System Support for Virtual Memory


Abhishek Bhattacharjee and Daniel Lustig
2017

Deep Learning for Computer Architects


Brandon Reagen, Robert Adolf, Paul Whatmough, Gu-Yeon Wei, and David Brooks
2017

On-Chip Networks, Second Edition


Natalie Enright Jerger, Tushar Krishna, and Li-Shiuan Peh
2017

Space-Time Computing with Temporal Neural Networks


James E. Smith
2017

Hardware and Software Support for Virtualization


Edouard Bugnion, Jason Nieh, and Dan Tsafrir
2017
Datacenter Design and Management: A Computer Architect’s Perspective
Benjamin C. Lee
2016

A Primer on Compression in the Memory Hierarchy


Somayeh Sardashti, Angelos Arelakis, Per Stenström, and David A. Wood
2015

Research Infrastructures for Hardware Accelerators


Yakun Sophia Shao and David Brooks
2015

Analyzing Analytics
Rajesh Bordawekar, Bob Blainey, and Ruchir Puri
2015

Customizable Computing
Yu-Ting Chen, Jason Cong, Michael Gill, Glenn Reinman, and Bingjun Xiao
2015

Die-stacking Architecture
Yuan Xie and Jishen Zhao
2015

Single-Instruction Multiple-Data Execution


Christopher J. Hughes
2015

Power-Efficient Computer Architectures: Recent Advances


Magnus Själander, Margaret Martonosi, and Stefanos Kaxiras
2014

FPGA-Accelerated Simulation of Computer Systems


Hari Angepat, Derek Chiou, Eric S. Chung, and James C. Hoe
2014

A Primer on Hardware Prefetching


Babak Falsafi and Thomas F. Wenisch
2014

On-Chip Photonic Interconnects: A Computer Architect’s Perspective


Christopher J. Nitta, Matthew K. Farrens, and Venkatesh Akella
2013
Optimization and Mathematical Modeling in Computer Architecture
Tony Nowatzki, Michael Ferris, Karthikeyan Sankaralingam, Cristian Estan, Nilay Vaish, and
David Wood
2013

Security Basics for Computer Architects


Ruby B. Lee
2013

The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Second edition
Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle
2013

Shared-Memory Synchronization
Michael L. Scott
2013

Resilient Architecture Design for Voltage Variation


Vijay Janapa Reddi and Meeta Sharma Gupta
2013

Multithreading Architecture
Mario Nemirovsky and Dean M. Tullsen
2013

Performance Analysis and Tuning for General Purpose Graphics Processing Units
(GPGPU)
Hyesoon Kim, Richard Vuduc, Sara Baghsorkhi, Jee Choi, and Wen-mei Hwu
2012

Automatic Parallelization: An Overview of Fundamental Compiler Techniques


Samuel P. Midkiff
2012

Phase Change Memory: From Devices to Systems


Moinuddin K. Qureshi, Sudhanva Gurumurthi, and Bipin Rajendran
2011

Multi-Core Cache Hierarchies


Rajeev Balasubramonian, Norman P. Jouppi, and Naveen Muralimanohar
2011

A Primer on Memory Consistency and Cache Coherence


Daniel J. Sorin, Mark D. Hill, and David A. Wood
2011
Dynamic Binary Modification: Tools, Techniques, and Applications
Kim Hazelwood
2011

Quantum Computing for Computer Architects, Second Edition


Tzvetan S. Metodi, Arvin I. Faruque, and Frederic T. Chong
2011

High Performance Datacenter Networks: Architectures, Algorithms, and Opportunities


Dennis Abts and John Kim
2011

Processor Microarchitecture: An Implementation Perspective


Antonio González, Fernando Latorre, and Grigorios Magklis
2010

Transactional Memory, 2nd edition


Tim Harris, James Larus, and Ravi Rajwar
2010

Computer Architecture Performance Evaluation Methods


Lieven Eeckhout
2010

Introduction to Reconfigurable Supercomputing


Marco Lanzagorta, Stephen Bique, and Robert Rosenberg
2009

On-Chip Networks
Natalie Enright Jerger and Li-Shiuan Peh
2009

The Memory System: You Can’t Avoid It, You Can’t Ignore It, You Can’t Fake It
Bruce Jacob
2009

Fault Tolerant Computer Architecture


Daniel J. Sorin
2009

The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines
Luiz André Barroso and Urs Hölzle
2009
Computer Architecture Techniques for Power-Efficiency
Stefanos Kaxiras and Margaret Martonosi
2008

Chip Multiprocessor Architecture: Techniques to Improve Throughput and Latency


Kunle Olukotun, Lance Hammond, and James Laudon
2007

Transactional Memory
James R. Larus and Ravi Rajwar
2006

Quantum Computing for Computer Architects


Tzvetan S. Metodi and Frederic T. Chong
2006
Copyright © 2018 by Morgan & Claypool

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations
in printed reviews, without the prior permission of the publisher.

Compiling Algorithms for Heterogeneous Systems


Steven Bell, Jing Pu, James Hegarty, and Mark Horowitz
www.morganclaypool.com

ISBN: 9781627059619 paperback


ISBN: 9781627057301 ebook
ISBN: 9781681732633 hardcover

DOI 10.2200/S00816ED1V01Y201711CAC043

A Publication in the Morgan & Claypool Publishers series


SYNTHESIS LECTURES ON COMPUTER ARCHITECTURE

Lecture #43
Series Editor: Margaret Martonosi, Princeton University
Founding Editor Emeritus: Mark D. Hill, University of Wisconsin, Madison
Series ISSN
Print 1935-3235 Electronic 1935-3243
Compiling Algorithms
for Heterogeneous Systems

Steven Bell
Stanford University

Jing Pu
Google

James Hegarty
Oculus

Mark Horowitz
Stanford University

SYNTHESIS LECTURES ON COMPUTER ARCHITECTURE #43

Morgan & Claypool Publishers
ABSTRACT
Most emerging applications in imaging and machine learning must perform immense amounts
of computation while holding to strict limits on energy and power. To meet these goals, archi-
tects are building increasingly specialized compute engines tailored for these specific tasks. The
resulting computer systems are heterogeneous, containing multiple processing cores with wildly
different execution models. Unfortunately, the cost of producing this specialized hardware—and
the software to control it—is astronomical. Moreover, the task of porting algorithms to these
heterogeneous machines typically requires that the algorithm be partitioned across the machine
and rewritten for each specific architecture, which is time consuming and prone to error.
Over the last several years, the authors have approached this problem using domain-
specific languages (DSLs): high-level programming languages customized for specific domains,
such as database manipulation, machine learning, or image processing. By giving up general-
ity, these languages are able to provide high-level abstractions to the developer while producing
high-performance output. The purpose of this book is to spur the adoption and the creation of
domain-specific languages, especially for the task of creating hardware designs.
In the first chapter, a short historical journey explains the forces driving computer archi-
tecture today. Chapter 2 describes the various methods for producing designs for accelerators,
outlining the push for more abstraction and the tools that enable designers to work at a higher
conceptual level. From there, Chapter 3 provides a brief introduction to image processing al-
gorithms and hardware design patterns for implementing them. Chapters 4 and 5 describe and
compare Darkroom and Halide, two domain-specific languages created for image processing
that produce high-performance designs for both FPGAs and CPUs from the same source code,
enabling rapid design cycles and quick porting of algorithms. The final section describes how
the DSL approach also simplifies the problem of interfacing between application code and the
accelerator by generating the driver stack in addition to the accelerator configuration.
This book should serve as a useful introduction to domain-specialized computing for com-
puter architecture students and as a primer on domain-specific languages and image processing
hardware for those with more experience in the field.

KEYWORDS
domain-specific languages, high-level synthesis, compilers, image processing accel-
erators, stencil computation

Contents
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii

Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 CMOS Scaling and the Rise of Specialization . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 What Will We Build Now? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.1 Performance, Power, and Area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.2 Flexibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3 The Cost of Specialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 Good Applications for Acceleration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2 Computations and Compilers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17


2.1 Direct Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2 Compilers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 High-level Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 Domain-specific Languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3 Image Processing with Stencil Pipelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27


3.1 Image Signal Processors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2 Example Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

4 Darkroom: A Stencil Language for Image Processing . . . . . . . . . . . . . . . . . . . . 33


4.1 Language Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2 A Simple Pipeline in Darkroom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.3 Optimal Synthesis of Line-buffered Pipelines . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.3.1 Generating Line-buffered Pipelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.3.2 Shift Operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.3.3 Finding Optimal Shifts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.4.1 ASIC and FPGA Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.4.2 CPU Compilation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.5.1 Scheduling for Hardware Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.5.2 Scheduling for General-purpose Processors . . . . . . . . . . . . . . . . . . . . . . 49
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

5 Programming CPU/FPGA Systems from Halide . . . . . . . . . . . . . . . . . . . . . . . . 51


5.1 The Halide Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.2 Mapping Halide to Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.3 Compiler Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.3.1 Architecture Parameter Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.3.2 IR Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.3.3 Loop Perfection Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.3.4 Code Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.4 Implementation and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.4.1 Programmability and Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.4.2 Quality of Hardware Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

6 Interfacing with Specialized Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69


6.1 Common Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.2 The Challenge of Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.3 Solutions to the Interface Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.3.1 Compiler Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.3.2 Library Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.3.3 API plus DSL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.4 Drivers for Darkroom and Halide on FPGA . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.4.1 Memory and Coherency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.4.2 Running the Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.4.3 Generating Systems and Drivers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.4.4 Generating the Whole Stack with Halide . . . . . . . . . . . . . . . . . . . . . . . 76
6.4.5 Heterogeneous System Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

7 Conclusions and Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

Authors’ Biographies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

Preface
Cameras are ubiquitous, and computers are increasingly being used to process image data to
produce better images, recognize objects, build representations of the physical world, and extract
salient bits from massive streams of video, among countless other things. But while the data
deluge continues to increase, and while the number of transistors that can be cost-effectively
placed on a silicon die is still going up (for now), limitations on power and energy mean that
traditional CPUs alone are insufficient to meet the demand. As a result, architects are building
more and more specialized compute engines tailored to provide energy and performance gains
on these specific tasks.
Unfortunately, the cost of producing this specialized hardware—and the software to con-
trol it—is astronomical. Moreover, the resulting computer systems are heterogeneous, contain-
ing multiple processing cores with wildly different execution models. The task of porting al-
gorithms to these heterogeneous machines typically requires that the algorithm be partitioned
across the machine and rewritten for each specific architecture, which is time consuming and
prone to error.
Over the last several years, we have approached this problem using domain-specific lan-
guages (DSLs)—high-level programming languages customized for specific domains, such as
database manipulation, machine learning, or image processing. By giving up generality, these
languages are able to provide high-level abstractions to the developer while producing high-
performance output. Our purpose in writing this book is to spur the adoption and the creation
of domain-specific languages, especially for the task of creating hardware designs.
This book is not an exhaustive description of image processing accelerators, nor of domain-
specific languages. Instead, we aim to show why DSLs make sense in light of the current state
of computer architecture and development tools, and to illustrate with some specific examples
what advantages DSLs provide, and what tradeoffs must be made when designing them. Our
examples will come from image processing, and our primary targets are mixed CPU/FPGA
systems, but the underlying techniques and principles apply to other domains and platforms as
well. We assume only passing familiarity with image processing, and focus our discussion on the
architecture and compiler sides of the problem.
In the first chapter, we take a short historical journey to explain the forces driving com-
puter architecture today. Chapter 2 describes the various methods for producing designs for
accelerators, outlining the push for more abstraction and the tools that enable designers to work
at a higher conceptual level. In Chapter 3, we provide a brief introduction to image processing
algorithms and hardware design patterns for implementing them, which we use through the
rest of the book. Chapters 4 and 5 describe Darkroom and Halide, two domain-specific lan-
guages created for image processing. Both are able to produce high-performance designs for
both FPGAs and CPUs from the same source code, enabling rapid design cycles and quick
porting of algorithms. We present both of these examples because comparing and contrasting
them illustrates some of the tradeoffs and design decisions encountered when creating a DSL.
The final portion of the book discusses the task of controlling specialized hardware within a het-
erogeneous system running a multiuser operating system. We give a brief overview of how this
works on Linux and show how DSLs enable us to automatically generate the necessary driver
and interface code, greatly simplifying the creation of that interface.
This book assumes at least some background in computer architecture, such as an advanced
undergraduate or early graduate course in CPU architecture. We also build on ideas from com-
pilers, programming languages, FPGA synthesis, and operating systems, but the book should
be accessible to those without extensive study on these topics.

Steven Bell, Jing Pu, James Hegarty, and Mark Horowitz


January 2018

Acknowledgments
Any work of this size is necessarily the result of many collaborations. We are grateful to John
Brunhaver, Zachary DeVito, Pat Hanrahan, Jonathan Ragan-Kelley, Steve Richardson, Jeff Set-
ter, Artem Vasilyev, and Xuan Yang, who influenced our thinking on these topics and helped
develop portions of the systems described in this book. We’re also thankful to Mike Morgan,
Margaret Martonosi, and the team at Morgan & Claypool for shepherding us through the
writing and production process, and to the reviewers whose feedback made this a much bet-
ter manuscript than it would have been otherwise.

Steven Bell, Jing Pu, James Hegarty, and Mark Horowitz


January 2018

CHAPTER 1

Introduction
When the International Technology Roadmap for Semiconductors organization announced its
final roadmap in 2016, it was widely heralded as the official end of Moore’s law [ITRS, 2016].
As we write this, 7 nm technology is still projected to provide cheaper transistors than current
technology, so it isn’t over just yet. But after decades of transistor scaling, the ITRS report
revealed at least modest agreement across the industry that cost-effective scaling to 5 nm and
below was hardly a guarantee.
While the death of Moore’s law remains a topic of debate, there isn’t any debate that the
nature and benefit of scaling has decreased dramatically. Since the early 2000s, scaling has not
brought the power reductions it used to provide. As a result, computing devices are limited by
the electrical power they can dissipate, and this limitation has forced designers to find more
energy-efficient computing structures. In the 2000s this power limitation led to the rise of mul-
ticore processing, and is the reason that practically all current computing devices (outside of
embedded systems) contain multiple CPUs on each die. But multiprocessing was not enough to
continue to scale performance, and specialized processors were also added to systems to make
them more energy efficient. GPUs were added for graphics and data-parallel floating point op-
erations, specialized image and video processors were added to handle video, and digital signal
processors were added to handle the processing required for wireless communication.
On one hand, this shift in structure has made computation more energy efficient; on the
other, it has made programming the resulting systems much more complex. The vast major-
ity of algorithms and programming languages were created for an abstract computing machine
running a single thread of control, with access to the entire memory of the machine. Changing
these algorithms and languages to leverage multiple threads is difficult, and mapping them to
use the specialized processors is near impossible. As a result, accelerators only get used when
performance is essential to the application; otherwise, the code is written for CPU and declared
“good enough.” Unless we develop new languages and tools that dramatically simplify the task
of mapping algorithms onto these modern heterogeneous machines, computing performance
will stagnate.
This book describes one approach to address this issue. By restricting the application do-
main, it is possible to create programming languages and compilers that can ease the burden of
creating and mapping applications to specialized computing resources, allowing us to run com-
plete applications on heterogeneous platforms. We will illustrate this with examples from image
processing and computer vision, but the underlying principles extend to other domains.
The rest of this chapter explains the constraints that any solution to this problem must
work within. The next section briefly reviews how computers were initially able to take advantage
of Moore’s law scaling without changing the programming model, why that is no longer the case,
and why energy efficiency is now key to performance scaling. Section 1.2 then shows how to
compare different power-constrained designs to determine which is best. Since performance
and power are tightly coupled, they both need to be considered to make the best decision. Using
these metrics, and some information about the energy and area cost of different operations, this
section also points out the types of algorithms that benefit the most from specialized compute
engines. While these metrics show the potential of specialization, Section 1.3 describes the costs
of this approach, which historically required large teams to design the customized hardware and
develop the software that ran on it. The remaining chapters in this book describe one approach
that addresses these cost issues.

1.1 CMOS SCALING AND THE RISE OF SPECIALIZATION


From the earliest days of electronic computers, improvements in physical technology have con-
tinually driven computer performance. The first few technology changes were discrete jumps,
first from vacuum tubes to bipolar transistors in the 1950s, and then from discrete transistors to
bipolar integrated circuits (ICs) in the 1960s. Once computers were built with ICs, they were
able to take advantage of Moore’s law, the prediction-turned-industry-roadmap which stated
that the number of components that could be economically packed onto an integrated circuit
would double every two years [Moore, 1965].
As MOS transistor technology matured, gates built with MOS transistors used less power
and area than gates built with bipolar transistors, and it became clear in the late 1970s that MOS
technology would dominate. During this time Robert Dennard at IBM Research published his
paper on MOS scaling rules, which showed different approaches that could be taken to scale
MOS transistors [Dennard et al., 1974]. In particular, he observed that if a transistor’s operating
voltage and doping concentration were scaled along with its physical dimensions, then a number
of other properties scaled nicely as well, and the resized transistor would behave predictably.
If a MOS transistor is shrunk by a factor of 1/κ in each linear dimension, and the operating
voltage is lowered by the same 1/κ, then several things follow:
1. Transistors get smaller, allowing κ² more logic gates in the same silicon area.
2. Voltages and currents inside the transistor scale by a factor of 1/κ.
3. The effective resistance of the transistor, V/I, remains constant, due to 2 above.
4. The gate capacitance C shrinks by a factor of 1/κ (1/κ² due to decreased area, multiplied
by κ due to reduced electrode spacing).
The switching time for a logic gate is proportional to the resistance of the driving transistor
multiplied by the capacitance of the driven transistor. If the effective resistance remains constant
while the capacitance decreases by 1/κ, then the overall delay also decreases by 1/κ, and the chip
can be run faster by a factor of κ.
Taken together, these scaling factors mean that κ² more logic gates are switched κ× faster,
for a total increase of κ³ more gate evaluations per second. At the same time, the energy required
to switch a logic gate is proportional to CV². With both capacitance and voltage decreasing by
a factor of 1/κ, the energy per gate evaluation decreased by a factor of 1/κ³.
During this period, roughly every other year, a new technology process yielded transistors
which were about 1/√2 as large in each dimension. Following Dennard scaling, this would give
a chip with twice as many gates and a faster clock by a factor of 1.4, making it 2.8× more
powerful than the previous one. Simultaneously, however, the energy dissipated by each gate
evaluation dropped by 2.8×, meaning that the total power required was the same as the previous
chip. This remarkable result allowed each new generation to achieve nearly a 3× improvement
for the same die area and power.
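
To make this arithmetic concrete, the short Python sketch below (our own illustration; the variable names and the code itself are not from the text) computes the ideal Dennard-scaling factors for one generation with a linear shrink of 1/√2:

import math

# Ideal Dennard scaling for one process generation with a linear shrink of 1/kappa.
kappa = math.sqrt(2)                  # the "every other year" shrink described above

gate_density  = kappa ** 2            # ~2x more gates in the same silicon area
clock_speedup = kappa                 # delay ~ R*C falls by 1/kappa, so a ~1.4x faster clock
throughput    = gate_density * clock_speedup   # ~2.8x more gate evaluations per second
energy_per_op = 1 / kappa ** 3        # E ~ C*V^2, with C and V each scaled by 1/kappa
power         = throughput * energy_per_op     # ~1.0: total power stays flat

print(f"gates {gate_density:.2f}x, clock {clock_speedup:.2f}x, "
      f"throughput {throughput:.2f}x, energy/op {energy_per_op:.2f}x, power {power:.2f}x")

Running it prints roughly 2×, 1.41×, 2.83×, 0.35×, and 1.00×, matching the numbers above.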
This scaling is great in theory, but what happened in practice is somewhat more circuitous.
First, until the mid-1980s, most complex ICs were made with nMOS rather than CMOS gates,
which dissipate power even when they aren’t switching (known as static power). Second, during
this period power supply voltages remained at 5 V, a standard set in the bipolar IC days. As
a result of both of these, the power per gate did not change much even as transistors scaled
down. As nMOS chips grew more complex, the power dissipation of these chips became a
serious problem. This eventually forced the entire industry to transition from nMOS to CMOS
technology, despite the additional manufacturing complexity and lower intrinsic gate speed of
CMOS.
After transitioning to CMOS ICs in the mid-1980s, power supply voltages began to scale
down, but not exactly in sync with technology. While transistor density and clock speed contin-
ued to scale, the energy per logic gate dropped more slowly. With the number of gate evaluations
per second increasing faster than the energy of gate evaluation was scaling down, the overall chip
power grew exponentially.
This power scaling is exactly what we see when we look at historical data from CMOS
microprocessors, shown in Figure 1.1. From 1980 to 2000, the number of transistors on a chip
increased by about 500× (Figure 1.1a), which corresponds to scaling transistor feature size by
roughly 20×. During this same period of time, processor clock frequency increased by 100×,
which is 5× faster than one would expect from simple gate speed (Figure 1.1b). Most of this ad-
ditional clock speed gain came from microarchitectural changes to create more deeply pipelined
“short tick” machines with fewer gates per cycle, which were enabled by better circuit designs
of key functional units. While these fast clocks were good for performance, they were bad from
a power perspective.
By 2000, computers were executing 50,000× more gate evaluations per second than they
had in the 1980s. During this time the average capacitance had scaled down, providing a 20×
energy savings, but power supply voltages had only scaled by 4–5× (Figure 1.1c), giving roughly
a 25× savings. Taken together, the capacitance and supply scaling only reduce the gate energy
by around 500×, which means that the power dissipation of the processors should increase by
two orders of magnitude during this period. Figure 1.1d shows that is exactly what happened.

[Figure 1.1: four log-scale plots covering 1970–2020: (a) Transistors Per Chip, (b) CPU Frequency, (c) Operating Voltage, (d) Power Dissipation (TDP).]

Figure 1.1: From the 1960s until the early 2000s, transistor density and operating frequency
scaled up exponentially, providing exponential performance improvements. Power dissipa-
tion increased but was kept in check by lowering the operating voltage. Data from CPUDB
[Danowitz et al., 2012].

Up to this point, all of these additional transistors were used for a host of architectural im-
provements that increased performance even further, including pipelined datapaths, superscalar
instruction issue, and out-of-order execution. However, the instruction set architectures (ISAs)
for various processors generally remained the same through multiple hardware revisions, mean-
ing that existing software could run on the newer machine without modification—and reap a
performance improvement.
But around 2004, Dennard scaling broke down. Lowering the gate threshold voltage fur-
ther caused the leakage power to rise unacceptably high, so the supply voltage began to level out just below 1 V.
Without the possibility to manage the power density by scaling voltage, manufacturers hit
the “power wall” (the red line in Figure 1.1d). Chips such as the Intel Pentium 4 were dissipating
a little over 100 W at peak performance, which is roughly the limit of a traditional package
with a heatsink-and-fan cooling system. Running a CPU at significantly higher power than this
requires an increasingly complex cooling system, both at a system level and within the chip itself.
Pushed up against the power wall, the only choice was to stop increasing the clock fre-
quency and find other ways to increase performance. Although Intel had predicted processor
clock rates over 10 GHz, actual numbers peaked around 4 GHz and settled back between 2 and
4 GHz (Figure 1.1b).
Even though Dennard scaling had stopped, taking down frequency scaling with it,
Moore’s law continued its steady march forward. This left architects with an abundance of tran-
sistors, but the traditional microarchitectural approaches to improving performance had been
mostly mined out. As a result, computer architecture has turned in several new directions to
improve performance without increasing power consumption.
The first major tack was symmetric multicore, which stamped down two (and then four,
and then eight) copies of the CPU on each chip. This has the obvious benefit of delivering more
computational power for the same clock rate. Doubling the core count still doubles the total
power, but if the clock frequency is dialed back, the chip runs at a lower voltage, keeping the
total power constant while maintaining some of the performance advantage of having multiple cores.
This is especially true if the parallel cores are simplified and designed for energy efficiency rather
than single-thread performance. Nonetheless, even simple CPU cores incur significant overhead
to compute their results, and there is a limit to how much efficiency can be achieved simply by
making more copies.
The next theme was to build processors to exploit regularity in certain applications, lead-
ing to the rise of single-instruction-multiple-data (SIMD) instruction sets and general-purpose
GPU computing (GPGPU). These go further than symmetric multicore in that they amortize
the instruction fetch and decode steps across many hardware units, taking advantage of data
parallelism. Neither SIMD nor GPUs were new; SIMD had existed for decades as a staple
of supercomputer architectures and made its way into desktop processors for multimedia ap-
plications along with GPUs in the late 1990s. But in the mid-2000s, they started to become
prominent as a way to accelerate traditional compute-intensive applications.
A third major tack in architecture was the proliferation of specialized accelerators, which
go even further in stripping out control flow and optimizing data movement for particular appli-
cations. This trend was hastened by the widespread migration to mobile devices and “the cloud,”
where power is paramount and typical use is dominated by a handful of tasks. A modern smart-
phone System-on-chip (SoC) contains more than a dozen custom compute engines, created
specifically to perform intensive tasks that would be impossible to run in real time on the main
CPU. For example, communicating over WiFi and cellular networks requires complex coding
and modulation/demodulation, which is performed on a small collection of hardware units spe-
cialized for these signal processing tasks. Likewise, decoding or encoding video—whether for
watching Netflix, video chatting, or camera filming—is handled by hardware blocks that only
perform this specific task. And the process of capturing raw pixels and turning them into a
pleasing (or at least presentable) image is performed by a long pipeline of hardware units that
demosaic, color balance, denoise, sharpen, and gamma-correct the image.
Even low-intensity tasks are getting accelerators. For example, playing music from an
MP3 file requires relatively little computational work, but the CPU must wake up a few dozen
times per second to fill a buffer with sound samples. For power efficiency, it may be better to
have a dedicated chip (or accelerator within the SoC, decoupled from the CPU) that just handles
audio.
While there remain some performance gains still to be squeezed out of thread and data
parallelism by incrementally advancing CPU and GPU architectures, they cannot close the gap
to a fully customized ASIC. The reason, as we’ve already hinted, comes down to power.
Cell phones are power-limited both by their battery capacity (roughly 8–12 Wh) and the
amount of heat it is acceptable to dissipate in the user’s hand (around 2 W). The datacenter is
the same story at a different scale. A warehouse-sized datacenter consumes tens of megawatts,
requiring a dedicated substation and a cheap source of electrical power. And like phones, data
center performance is constrained partly by the limits of our ability to get heat out, as evidenced
by recent experiments and plans to build datacenters in caves or in frigid parts of the ocean.
Thus, in today’s power-constrained computing environment, the formula for improvement is
simple: performance per watt is performance.
Only specialized architectures can optimize the data storage and movement to achieve the
energy reduction we want. As we will discuss in Section 1.4, specialized accelerators are able to
eliminate the overhead of instructions by “baking” them into the computation hardware itself.
They also eliminate waste for data movement by designing the storage to match the algorithm.
Of course, general-purpose processors are still necessary for most code, and so modern
systems are increasingly heterogeneous. As mentioned earlier, SoCs for mobile devices contain
dozens of processors and specialized hardware units, and datacenters are increasingly adding
GPUs, FPGAs, and ASIC accelerators [AWS, 2017, Norman P. Jouppi et al., 2017].
In the remainder of this chapter, we’ll describe the metrics that characterize a “good”
accelerator and explain how these factors will determine the kind of systems we will build in the
future. Then we lay out the challenges to specialization and describe the kinds of applications
for which we can expect accelerators to be most effective.
1.2 WHAT WILL WE BUILD NOW?
Given that specialized accelerators are—and will continue to be—an important part of computer
architecture for the foreseeable future, the question arises: What makes a good accelerator? Or
said another way, if I have a potential set of designs, how do I choose what to add to my SoC
or datacenter, if anything?

1.2.1 PERFORMANCE, POWER, AND AREA


On the surface, the good things we want are obvious. We want high performance, low power,
and low cost.
Raw performance—the speed at which a device is able to perform a computation—is
the most obvious measure of “good-ness.” Consumers will throw down cash for faster devices,
whether that performance means quicker web page loads or richer graphics. Unfortunately, this
isn’t easy to quantify with the most commonly advertised metrics.
Clock speed matters, but we also need to account for how much work is done on each
clock cycle. Multiplying clock speed by the number of instructions issued per cycle is better, but
still ignores the fact that some instructions might do much more work than others. And on top
of this, we have the fact that utilization is rarely 100% and depends heavily on the architecture
and application.
We can quantify performance in a device-independent way by counting the number of
essential operations performed per unit time. For the purposes of this metric, we define “essen-
tial operations” to include only the operations that form the actual result of the computation.
Most devices require a great deal of non-essential computation, such as decoding instructions or
loading and storing intermediate data. These are “non-essential” not because they are pointless
or unnecessary but because they are not intrinsically required to perform the computation. They
are simply overhead incurred by the specific architecture.
With this definition, adding two pieces of data to produce an intermediate result is an
essential operation, but incrementing a loop counter is not since the latter is required by the
implementation and not the computation itself.
To make things concrete, a 3×3 convolution on a single-channel image requires nine multi-
plications (multiplying 3×3 pixels by their corresponding weights) and eight 2-input additions
per output pixel. For a 640×480 image (307,200 pixels), this is a little more than 5.2 million
total operations.
A CPU implementation requires many more instructions than this to compute the result
since the instruction stream includes conditional branches, loop index computations, and so
forth. On the flip side, some implementations might require fewer instructions than operations,
if they process multiple pieces of data on each instruction or have complex instructions that
fuse multiple operations. But implementations across this whole spectrum can be compared if
we calculate everything in terms of device-independent operations, rather than device-specific
instructions.
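
As a sketch of this accounting (our own example; the image and kernel sizes match the text, everything else, including the runtime, is illustrative), the following Python counts essential operations for a convolution and turns a measured runtime into a device-independent throughput:

def conv_essential_ops(width, height, k):
    # Essential operations for a k x k convolution on a width x height single-channel
    # image: k*k multiplies and k*k - 1 two-input adds per output pixel. Loop counters,
    # address arithmetic, and loads/stores are deliberately excluded.
    per_pixel = k * k + (k * k - 1)
    return width * height * per_pixel

ops = conv_essential_ops(640, 480, 3)
print(f"{ops:,} essential operations")         # 5,222,400: "a little more than 5.2 million"

runtime_s = 0.002                               # hypothetical measured runtime for one frame
print(f"{ops / runtime_s / 1e9:.2f} Gops/s")    # device-independent performance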
The second metric is power consumption, measured in Watts. In a datacenter context,
the power consumption is directly related to the operating cost, and thus to the total cost of
ownership (TCO). In a mobile device, power consumption determines how long the battery will
last (or how large a battery is necessary for the device to survive all day). Power consumption also
determines the maximum computational load that can be sustained without causing the device
to overheat and throttle back.
The third metric is cost. We’ll discuss development costs further in the following section,
but for now it is sufficient to observe that the production cost of the final product is closely related
to the silicon area of the chip, typically measured in square millimeters (mm²). More chips of a
smaller design will fit on a fixed-size wafer, and smaller chips are likely to have somewhat higher
yield percentages, both of which reduce the manufacturing cost.
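
The area-to-cost relationship can be sketched with a standard dies-per-wafer approximation and a simple Poisson defect-yield model; neither model nor any of the numbers below come from the text, so treat this purely as an illustration of why smaller dies are cheaper:

import math

def dies_per_wafer(die_area_mm2, wafer_diameter_mm=300):
    # Common approximation: wafer area over die area, minus an edge-loss term.
    r = wafer_diameter_mm / 2
    return (math.pi * r ** 2 / die_area_mm2
            - math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2))

def yield_fraction(die_area_mm2, defects_per_mm2=0.001):
    # Poisson defect model: larger dies are more likely to catch a random defect.
    return math.exp(-defects_per_mm2 * die_area_mm2)

for area in (50, 100, 200, 400):                # hypothetical die sizes in mm^2
    good = dies_per_wafer(area) * yield_fraction(area)
    print(f"{area:4d} mm^2 -> ~{good:5.0f} good dies per wafer")

In this model, halving the die area more than doubles the number of good dies per wafer, which is why production cost tracks silicon area so closely.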
However, as important as performance, power, and silicon area are as metrics, they can’t
be used directly to compare designs, because it is relatively straightforward to trade one for the
other.
Running a chip at a higher operating voltage causes its transistors to switch more rapidly,
allowing us to increase the clock frequency and get increased performance, at the cost of in-
creased power consumption. Conversely, lowering the operating voltage along with the clock
frequency saves energy, at the cost of lower performance.¹
It isn’t fair to compare the raw performance of a desktop Intel Core i7 to an ARM phone
SoC, if for no other reason than that the desktop processor has a 20–50× power advantage.
Instead, it is more appropriate to divide the power (Joules per second) by the performance (op-
erations per second) to get the average energy used per computation (Joules per operation).
Throughout the rest of this book, we’ll refer to this as “energy per operation” or pJ/op. We
could equivalently think about maximizing the inverse: operations/Joule.
For a battery-powered device, energy per operation relates directly to the amount of com-
putation that can be performed with a single battery charge; for anything plugged into the wall,
this relates the amount of useful computation that was done with the money you paid to the
electric company.
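
A quick sketch of the metric follows; the power and throughput figures are hypothetical, chosen only to show the spread between device classes, and are not measurements from the text:

def pj_per_op(watts, ops_per_second):
    # Energy per operation: power (Joules/second) divided by throughput (ops/second),
    # converted to picojoules.
    return watts / ops_per_second * 1e12

# Hypothetical devices, for illustration only.
for name, watts, ops in [("desktop CPU", 90.0, 200e9),
                         ("phone SoC", 2.0, 20e9),
                         ("fixed-function accelerator", 0.5, 500e9)]:
    print(f"{name:>28}: {pj_per_op(watts, ops):7.1f} pJ/op")

This prints roughly 450, 100, and 1 pJ/op for the three hypothetical devices, making the efficiency gap explicit even though the raw performance numbers differ as well.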
A similar difficulty is related to the area metric. For applications with sufficient parallelism,
we can double performance simply by stamping down two copies of the same processor on a chip.
This benefit requires no increase in clock speed or operating voltage—only more silicon. This
was, of course, the basic impetus behind going to multicore computation.
Even further, it is possible to lower the voltage and clock frequency of the two cores,
trading performance for energy efficiency as described earlier. As a result, it is possible to improve
either power or performance by increasing silicon area as long as there is enough parallelism.
Thus, when comparing between architectures for highly parallel applications, it is helpful to

¹Of course, modern CPUs do this scaling on the fly to match their performance to the ever-changing CPU load, known as
“Dynamic Voltage and Frequency Scaling” (DVFS).
normalize performance by the silicon area used. This gives us operations/Joule divided by area,
or ops/(mm²·J).
These two compound metrics, pJ/operation and ops/(mm²·J), give us meaningful ways to com-
pare and evaluate vastly different architectures. However, it isn’t sufficient to simply minimize
these in the abstract; we must consider the overall system and application workload.

1.2.2 FLEXIBILITY
Engineers building a system are concerned with a particular application, or perhaps a collection
of applications, and the metrics discussed are only helpful insofar as they represent performance
on the applications of interest. If a specialized hardware module cannot run our problem, its
energy and area efficiency are irrelevant. Likewise, if a module can only accelerate parts of the
application, or only some applications out of a larger suite, then its benefit is capped by Am-
dahl's law. As a result, we have a flexibility tradeoff: more flexible devices allow us to accelerate
computation that would otherwise remain on the CPU, but increased flexibility often means
reduced efficiency.
Suppose a hypothetical fixed-function device can accelerate 50% of a computation by a
factor of 100, reducing the total computation time from 1 second to 0.505 seconds. If adding
some flexibility to the device drops the performance to only 10× but allows us to accelerate 70%
of the computation, we will now complete the computation in 0.37 seconds—a clear win.
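
The arithmetic here is just Amdahl's law; a minimal Python sketch (the function name is ours) reproduces the two cases above:

def accelerated_runtime(total_s, covered_fraction, speedup):
    # The covered fraction runs 'speedup' times faster; the rest stays on the CPU.
    return total_s * ((1 - covered_fraction) + covered_fraction / speedup)

print(accelerated_runtime(1.0, 0.50, 100))   # 0.505 s: fixed-function, 50% coverage at 100x
print(accelerated_runtime(1.0, 0.70, 10))    # 0.37 s: more flexible, 70% coverage at 10x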
Moreover, many applications demand flexibility, whether the product is a networking de-
vice that needs to support new protocols or an augmented-reality headset that must incorporate
the latest advances in computer vision. As more and more devices are connected to the internet,
consumers increasingly expect that features can be upgraded and bugs can be fixed via over-the-
air updates. In this market, a fixed-function device that cannot support rapid iteration during
prototyping and cannot be reconfigured once deployed is a major liability.
The tradeoff is that flexibility isn't free, as we have already alluded to. It almost always hurts
efficiency (performance per watt or ops/(mm²·J)) since overhead is spent processing the configuration.
Figure 1.2 illustrates this by comparing the performance and efficiency for a range of designs
proposed at ISSCC a number of years ago. While newer semiconductor processes have reduced
energy across the board, the same trend holds: the most flexible devices (CPUs) are the least
efficient, and increasing specialization also increases performance, by as much as three orders of
magnitude.
In certain domains, this tension has created something of a paradox: applications that were
traditionally performed completely in hardware are moving toward software implementations,
even while competing forces push related applications away from software toward hardware. For
example, the fundamental premise of software defined radio (SDR) is that moving much (or all)
of the signal processing for a radio from hardware to software makes it possible to build a system
that is simpler, cheaper, and more flexible. With only a minimal analog front-end, an SDR
system can easily run numerous different coding and demodulation schemes, and be upgraded
[Figure 1.2: energy efficiency (MOPS/mW), plotted on a log scale from 0.01 to 1,000, for designs grouped as Microprocessors, General Purpose DSPs, and Dedicated.]

Figure 1.2: Comparison of efficiency for a number of designs from ISSCC, showing the clear
tradeoff between flexibility and efficiency. Designs are sorted by efficiency and grouped by overall
design paradigm. Figure from Marković and Brodersen [2012].

over the air. But because real-time signal processing requires extremely high computation rates,
many SDR platforms use an FPGA, and carefully optimized libraries have been written to fully
exploit the SIMD and digital signal processing (DSP) hardware in common SoCs. Likewise,
software-defined networking aims to provide software-based reconfigurability to networks, but
at the same time more and more effort is being poured into custom networking chips.

1.3 THE COST OF SPECIALIZATION


To fit these metrics together, we must consider one more factor: cost. After all, given the enor-
mous benefits of specialization, the only thing preventing us from making a specialized acceler-
ator for everything is the expense.
Figure 1.3 compares the non-recurring engineering (NRE) cost of building a new high-
end SoC on the past few silicon process nodes. The price tags for the most recent technologies
are now well out of reach for all but the largest companies. Most ASICs are less expensive than
this, by virtue of being less complex, using purchased or existing IP, having lower performance
targets, and being produced on older and mature processes [Khazraee et al., 2017]. Yet these
costs still run into the millions of dollars and remain risky undertakings for many businesses.
Several components contribute to this cost. The most obvious is the price of the lithogra-
phy masks and tooling setup, which has been driven up by the increasingly high precision of each
process node. Likewise, these processes have ever-more-stringent design rules, which require
more engineering effort during the place and route process and in verification. The exponen-
tial increase in number of transistors has enabled a corresponding growth in design complexity,
which comes with increased development expense. Some of these additional transistors are used
[Figure 1.3: stacked bar chart of SoC development cost (million USD) by process node from 65 nm (2006) to 5 nm, broken down into IP, Architecture, Verification, Physical design, Software, Prototype, and Validation.]

Figure 1.3: Estimated cost breakdown to build a large SoC. The overall cost is increasing expo-
nentially, and software comprises nearly half of the total cost. (Data from International Business
Strategies [IBS, 2017].)

in ways that do not appreciably increase the design complexity, such as additional copies of pro-
cessor cores or larger caches. But while the exact slope of the correlation is debatable, the trend
is clear: More transistors means more complexity, and therefore higher design costs. Moreover,
with increased complexity comes increased costs for testing and verification.
Last, but particularly relevant to this book, is the cost of developing software to run the
chip, which in the IBS estimates accounts for roughly 40% of the total cost. The accelerator
must be configured, whether with microcode, a set of registers, or something else, and it must
be interfaced with the software running on the rest of the system. Even the most rigid of “fixed”
devices usually have some degree of configurability, such as the ability to set an operating mode
or to control specific parameters or coefficients.
This by itself is unremarkable, except that all of these “configurations” are tied to a pro-
gramming model very different than the idealized CPU that most developers are used to. Timing
details become crucial, instructions execute out of order or in a massively parallel fashion, and
concurrency and synchronization are handled with device-specific primitives. Accelerators are,
almost by definition, difficult to program.
To state the obvious, the more configurable a device is, the more effort must go into con-
figuring it. In highly configurable accelerators such as GPUs or FPGAs, it is quite easy—even
typical—to produce configurations that do not perform well. Entire job descriptions revolve
around being able to work the magic to create high-performance configurations for accelera-
tors. These people, informally known as “the FPGA wizards” or “GPU gurus,” have an intimate
knowledge of the device hardware and carry a large toolbox of techniques for optimizing appli-
cations. They also have excellent job security.
This difficulty is exacerbated by a lack of tools. Specialized accelerators need specialized
tools, often including a compiler toolchain, debugger, and perhaps even an operating system.
This is not a problem in the CPU space: there are only a handful of competitive CPU archi-
tectures, and many groups are developing tools, both commercial and open source. Intel is but
one of many groups with an x86 C++ compiler, and the same is true for ARM. But specialized
accelerators are not as widespread, and making tools for them is less profitable. Unsurprisingly,
NVIDIA remains the primary source of compilers, debuggers, and development tools for their
GPUs. This software design effort cannot easily be pushed onto third-party companies or the
open-source community, and becomes part of the chip development cost.
As we stand today, bringing a new piece of silicon to market is as much about writing
software as it is designing logic. It isn’t sufficient to just “write a driver” for the hardware; what
is needed is an effective bridge to application-level code.
Ultimately, companies will only create and use accelerators if the improvement justifies
the expense. That is, an accelerator is only worthwhile if the engineering cost can be recouped
by savings in the operating cost, or if the accelerator enables an application that was previously
impossible. The operating cost is closely tied to the efficiency of the computing system, both in
terms of the number of units necessary (buying a dozen CPUs vs. a single customized accelerator)
and in terms of time and electricity. Because it is almost always easier to implement an algorithm
on a more flexible device, this cost optimization results in a tug-of-war between performance
and flexibility, illustrated in Figure 1.4.
This is particularly true for low-volume products, where the NRE cost dominates the
overall expense. In such cases, the cheapest solution—rather than the most efficient—might be
the best. Often, the most cost-effective solution to speed up an application is to buy a more
powerful computer (or a whole rack of computers!) and run the same horribly inefficient code
on it. This is why an enormous amount of code, even deployed production code, is written in
languages like Python and Matlab, which have poor runtime performance but terrific developer
productivity.
Our goal is to reduce the cost of developing accelerators and of mapping emerging applica-
tions onto heterogeneous systems, pushing down the NRE of the high-cost/high-performance
areas of this tradeoff space. Unless we do so, it will remain more cost effective to use general-
purpose systems, and computer performance in many areas will suffer.

[Figure 1.4 plots Operating Cost (vertical axis) against Engineering Cost (horizontal axis), with
points for CPU, Optimized CPU, GPU, FPGA, and ASIC.]

Figure 1.4: Tradeoff of operating cost (which is inversely related to runtime performance) vs.
non-recurring engineering cost (which is inversely related to flexibility). More flexible devices
(CPUs and GPUs) require less development effort but achieve worse performance compared to
FPGAs and ASICs. We aim to reduce the engineering development cost (red arrows), making
it more feasible to adopt specialized computing.

1.4 GOOD APPLICATIONS FOR ACCELERATION


Before we launch into systems for programming accelerators, we’ll examine which applications
can be accelerated most effectively. Can all applications be accelerated with specialized proces-
sors, or just some of them?
The short answer is that only a few types of applications are worth accelerating. To see
why, we have to go back to the fundamentals of power and energy. Given that, for a modern chip,
performance per watt is equivalent to performance, we want to minimize the energy consumed
per unit of computation. That is, if the way to maximize operations per second is to maximize
operations per second per watt, we can cancel “seconds,” and simply maximize operations per
Joule.
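Written out symbolically, the cancellation is simply

    \[
    \frac{\mathrm{ops}/\mathrm{s}}{\mathrm{W}}
      \;=\; \frac{\mathrm{ops}/\mathrm{s}}{\mathrm{J}/\mathrm{s}}
      \;=\; \frac{\mathrm{ops}}{\mathrm{J}},
    \]

so under a fixed power budget, maximizing operations per second is the same thing as maximizing
operations per joule.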
Table 1.1 shows the energy required for a handful of fundamental operations in a 45 nm
process. The numbers are smaller for more recent process nodes, but the relative scale remains
essentially the same.
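To get a feel for the scale, here is a back-of-the-envelope calculation in the spirit of Table 1.1.
The per-operation energies below are representative 45 nm estimates assumed for illustration (chosen
to be consistent with the ratios discussed next), not a reproduction of the table itself.

    // Assumed, representative per-operation energies at 45 nm, in picojoules.
    constexpr double add8_pj  = 0.03;    // 8-bit integer add
    constexpr double mul32_pj = 3.1;     // 32-bit integer multiply
    constexpr double dram_pj  = 1600.0;  // off-chip DRAM access (about 1.6 nJ)

    constexpr double dram_vs_mul = dram_pj / mul32_pj;  // roughly 500x
    constexpr double dram_vs_add = dram_pj / add8_pj;   // roughly 50,000x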
The crucial observation here is that a DRAM fetch requires 500× more energy than a
32-bit multiplication, and 50,000× more than an 8-bit addition. The cost of fetching data from
memory completely dwarfs the cost of computing with it. The cache hierarchy helps, of course,