
DNA Data Storage and Hybrid Molecular-Electronic Computing

This document discusses using DNA as an alternative substrate for data storage and computing as existing technologies approach physical limits. It presents a vision for a hybrid molecular-electronic system that uses DNA for both storage and near-data processing. It describes work done on a DNA-based archival storage system demonstrating storage of over 400MB of data in 3 billion nucleotides. It proposes using this approach for massive parallel image similarity search and models its feasibility.
DNA Data Storage and Hybrid Molecular-Electronic Computing


Douglas Carmean2 Luis Ceze1 Georg Seelig1 Kendall Stewart1 Karin Strauss2 Max Willsey1

Douglas Carmean and Karin Strauss are with Microsoft.
Luis Ceze, Max Willsey, Kendall Stewart and Georg Seelig are with University of Washington.
Digital Object Identifier: 10.1109/JPROC.2018.2875386
1558-2256 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Abstract: Moore’s Law may be slowing, but our ability to manipulate molecules is improving faster than ever. DNA could provide alternative substrates for computing and storage as existing ones approach physical limits. In this paper, we explore the implications of this trend in computer architecture.
We present a computer systems perspective on molecular processing and storage, positing a hybrid molecular-electronic architecture that plays to the strengths of both domains. We cover the design and implementation of all stages of the pipeline: encoding, DNA synthesis, system integration with digital microfluidics, DNA sequencing (including emerging technologies like nanopores), and decoding. We first draw on our experience designing a DNA-based archival storage system, which includes the largest demonstration to date of DNA digital data storage of over 3 billion nucleotides encoding over 400MB of data. We then propose a more ambitious hybrid-electronic design that uses a molecular form of near-data processing for massive parallelism. We present a model that demonstrates the feasibility of these systems in the near future.
We think the time is ripe to consider molecular storage seriously and explore system designs and architectural implications.

[Figure 1: bar chart of storage density in Gb/mm^3 (log scale, 1.0E-02 to 1.0E+10) for Disk (Rot), Disk (SSD), Tape, Optical, Flash (Chip), and DNA, with series Today, Projection, and Limit. Annotation: “Potential of DNA: >10^7 Improvement”. Lifetime (years), in the same order: 3-5, 5, 10-30, 1000+, 5, 1000+.]
Fig. 1: Comparing DNA with mainstream storage media.

I. INTRODUCTION

Exponentially growing data poses a significant challenge to the landscape of current storage technologies. If we are to store and make use of the world’s information, we need fundamentally denser and cheaper storage technologies. We believe going to the molecular level is inevitable, as also observed by Zhirnov et al. [1].

Synthetic DNA is an attractive storage medium for many reasons: its theoretical information density of about 10^18 B/mm^3 is 10^7 times denser than magnetic tape (Figure 1), it can potentially last for thousands of years, and it will never go obsolete since we will always be interested in reading DNA for health purposes. The biotechnology industry has developed the basic tools to manipulate DNA, including writing and reading DNA, which can now be leveraged and improved for digital data storage. Importantly, there is rapid exponential progress in DNA reading and writing, arguably surpassing Moore’s law [2] (though in the analysis provided in this paper, we chose to model sequencing and synthesis rates that are achievable today). Given the current trends in data production and the rapid progress of DNA manipulation technologies, we believe the time is ripe to make DNA-based storage and computing systems a reality.

In this paper we articulate a vision towards an end-to-end system for archival and retrieval, discuss the challenges in building it, and consider additional applications once it is built. The key challenge is scaling throughput and cost of DNA synthesis and sequencing orders of magnitude beyond the needs of the life sciences industry. Other challenges include system integration, fluidics automation, reliable interfaces between electronics and wet system components, stable preservation, and random access of data stored in molecular form.

Molecular data storage creates opportunities for near-data processing; for example, pattern matching and search could be performed directly on the molecular representation. Adleman [3] noted that DNA’s stable double-stranded structure comes with a simple computational primitive: matching single-stranded molecules will stochastically “bump into each other” in solution. Adleman used this property to compute a solution to the Hamiltonian path problem, pioneering the field of DNA computing. While this area of work has advanced rapidly over the last two decades, the path to large-scale systems remains unclear.

Motivated by progress in DNA data storage, we envision a hybrid molecular-electronic architecture that combines the strengths of molecular and conventional electronics. This approach takes advantage of DNA as both a storage medium and computing substrate. It promises to achieve nearly unlimited bandwidth: data and processing units float free in solution, so computation can diffuse through data and effectively occur everywhere simultaneously. We call this phenomenon near-molecule processing. This property effectively breaks the fixed capacity/bandwidth ratio on typical storage devices in traditional systems, making it especially promising for data-intensive applications such as content-based media search.

In the remainder of this paper, we provide background in Section II, discuss hybrid molecular-electronic systems in general in Section III, and then present our work on DNA data storage in detail in Section IV. In Section V, we propose a new hybrid molecular-electronic system for image similarity search and model its feasibility. Finally, we conclude with a discussion of future technology trends that impact the design space of these systems in Section VI.
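As a rough check on the density comparison in the introduction, the following sketch works through the arithmetic implied by the two quoted figures (about 10^18 B/mm^3 for DNA, and a 10^7 advantage over magnetic tape). The one-exabyte archive size and all names in the code are illustrative assumptions, not values taken from the paper.

```python
# Back-of-the-envelope arithmetic for the density figures quoted in the
# introduction. Only the two constants below come from the text; the archive
# size used in the example is a hypothetical value chosen for illustration.

DNA_DENSITY_B_PER_MM3 = 10**18   # theoretical DNA density, bytes per mm^3
DNA_OVER_TAPE_FACTOR = 10**7     # stated density advantage of DNA over tape


def implied_tape_density_b_per_mm3() -> int:
    """Tape density implied by the two quoted figures (bytes per mm^3)."""
    return DNA_DENSITY_B_PER_MM3 // DNA_OVER_TAPE_FACTOR


def volume_mm3(data_bytes: int, density_b_per_mm3: int) -> float:
    """Volume in mm^3 needed to hold `data_bytes` at the given density."""
    return data_bytes / density_b_per_mm3


if __name__ == "__main__":
    archive_bytes = 10**18  # hypothetical 1 EB archive
    dna_mm3 = volume_mm3(archive_bytes, DNA_DENSITY_B_PER_MM3)
    tape_mm3 = volume_mm3(archive_bytes, implied_tape_density_b_per_mm3())
    # 1 liter = 10^6 mm^3
    print(f"DNA:  {dna_mm3:g} mm^3")
    print(f"Tape: {tape_mm3:g} mm^3 ({tape_mm3 / 1e6:g} liters)")
```

Under these assumptions, an archive that would occupy about one cubic millimeter at DNA's theoretical density would need about ten liters at the implied tape density. Both numbers are theoretical limits; a practical system would need additional volume for redundancy, addressing, and physical handling.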
II. BACKGROUND

DNA’s potential as a substrate for molecular computation and storage has been the subject of research for over two decades, dating back to Adleman’s exploration of combinatorial problems [3] and Baum’s proposal for a massive DNA-based database with associative search capability [4].

A. DNA structure

DNA molecules are biopolymers consisting of a sequence of nucleotides. Each nucleotide can have one of four bases: adenine (A), cytosine (C), guanine (G), or thymine (T). A single DNA molecule (also called an oligonucleotide or oligo for short) consists of a sequence of bases, written with initials, e.g., AGTATC. The direction is significant: as normally written, the left end is called the 5′ (“five prime”) end, and the right end is called the 3′ (“three prime”) end.

Two oligos can come together to form a double-stranded duplex, where bases are paired with their complement: A with T, and C with G. The two oligos in a duplex run in opposite directions, therefore a sequence will be fully complementary with its reverse complement. This is illustrated in Figure 2a: each base on the 5′ end of the upper strand is paired with a complementary base on the 3′ end of the lower strand.

The process of duplex formation is called hybridization. Two complementary strands suspended in solution will eventually form a stable structure that requires energy to break. Fully complementary sequences will form the famous double helix structure (Figure 2a).

Two sequences do not have to be fully complementary to hybridize (Figure 2b). Partially hybridized structures are less thermodynamically stable, and occur less frequently at higher solution temperatures. Given two strands, the solution temperature at which 50% of the strands have formed a duplex at equilibrium is called the duplex’s melting temperature. A higher melting temperature indicates a more stable duplex. The number of unpaired bases is not necessarily related to melting temperature [5]. For instance, changing the mismatched A on the lower strand of Figure 2b to a mismatched G raises the melting temperature to 44.6 °C, despite the fact that the number of unpaired bases remains the same. Melting temperature for a pair of sequences can be calculated precisely by thermodynamic simulation software such as NUPACK [6]. In addition to temperature, one can also control hybridization via pH or ionic strength of the solution.

[Figure 2a: upper strand 5′-CCTGCTCAATATATGGATAG-3′ over lower strand 3′-GGACGAGTTATATACCTATC-5′. A fully hybridized duplex with complementary sequences. Melting temperature = 66.5 °C.]
[Figure 2b: upper strand 5′-CCTGCTCAATATATGGATAG-3′ over lower strand 3′-GGACGCCTTATAAACCTATC-5′. A partially hybridized duplex with mismatched sequences (indicated with red arrows). Melting temperature = 42.5 °C.]
Fig. 2: DNA molecules can form double-stranded duplexes even when their sequences are not fully complementary, but these structures are less stable and thus have lower melting temperatures.

B. DNA writing (synthesis) and reading (sequencing)

DNA synthesis is the process of making arbitrary DNA molecules from a specification. One of the most established methods is based on phosphoramidite chemistry due to Caruthers [7]. The method uses “protected” monomers (individual nucleotides) to prevent the formation of a long homopolymer chain. Removing the protecting group is done with an acid solution. The synthesis cycle works by: (1) incorporating a chosen nucleotide into an existing polymer; (2) strengthening the bond via oxidation; (3) washing out excess monomers; (4) deprotecting the last added base; (5) repeat. DNA synthesis can be made very parallel via an array-based control of either deposition of the next base or localized removal of the protecting group.

There are several technologies for DNA synthesis [8]. Enzymatic synthesis is a potential alternative to phosphoramidite chemistry. In this process, engineered enzymes incorporate bases in a controllable fashion without a template. This method promises to be cheaper, faster and cleaner (water-based, as opposed to needing to use solvents). Making synthesis scale requires a control mechanism to select which bases to add to which sequences. This is often called array-based synthesis: sequences are seeded on a surface and reagents flow in succession to add bases in a cyclic fashion. There are several basic technologies, of which three are most commonly used: electrochemical and light-based arrays, which selectively deblock sequences and add the same base to all deblocked sequences simultaneously; and deposition-based arrays that use inkjet to selectively deposit bases where they are to be added.

The most commercially adopted DNA sequencing platform today is based on image processing and the concept called sequencing by synthesis. Single-stranded DNA sequences are attached to a substrate and complementary bases with fluorescent markers are attached one by one to individual sequences (yet, in parallel for all sequences). The spatial fluorescence pattern created by the fluorescent markers is captured in an image, which is then processed, and fluorescent spots are correlated to individual bases in the sequences. The fluorescent markers are then chemically removed, leaving complementary bases behind and setting up the next base in the sequence to be recognized. Scaling such technology to higher throughputs will depend on more precise optical setups and improvements in image processing, and once optical resolution limits are reached, this style of sequencing will probably no longer be appropriate.

Another DNA sequencing solution that has been gaining momentum is nanopore technology. The cornerstone of nanopore technology is to capture DNA molecules and force them through a nanoscale pore, which causes small fluctuations in electrical current depending on the passing DNA. The main challenges in using nanopore devices for DNA storage are controlling the high error rates resulting from sensing
these minute current fluctuations, which may require heavy signal processing and more precise sensors, and increasing the density of nanopores on a physical substrate, as well as solving problems with clogging and endurance of pores.

C. Brief History of DNA Data Storage

The general idea of using DNA as storage of synthetic information has been around since at least the mid 1960s, when Norbert Wiener suggested the idea of “genetic” memory for computers. In the past 6 years, work from Harvard [8] and the European Bioinformatics Institute [9] showed that progress in modern DNA manipulation methods could make it both possible and practical soon. Many research groups, including groups at ETH Zurich, University of Illinois at Urbana-Champaign, and Columbia University, are working on this problem. Our own group at the University of Washington and Microsoft holds the world record for the amount of data successfully stored in and retrieved from DNA: over 500 megabytes as of June 2018.

D. DNA-based computation

The kinetics of DNA hybridization enable more than just a lookup operation. For instance, partial hybridization can implement “fuzzy matching”, where the query and target do not have to be entirely complementary, and the “fuzziness” can be controlled by varying the temperature [5]. This property can be leveraged to perform distance computations, which we discuss further in Section III.

More recently, researchers have shown that hybridization reactions can form complex cascades called strand displacement reactions, which can be used to implement general purpose computations, including boolean circuits [10] and neural networks [11].

Beyond hybridization, evolution has led to a variety of enzymes for processing DNA, including cutting, joining, replication, and editing. These enzymes can be used to create even more complex circuits.

III. HYBRID MOLECULAR-ELECTRONIC SYSTEMS

A hybrid molecular-electronic system aims to leverage the best properties of each domain (Figure 3). As with any heterogeneous system, the strengths and weaknesses of each domain raise a series of critical design questions. In this section, we discuss challenges and trade-offs pertaining to physical constraints, communication, storage, and computation. We discuss these dimensions in general, and we provide examples of how they guided design decisions in the systems presented in Section IV and Section V.

[Figure 3: two blocks, Electronics (complex control flow and operations; e.g., sparse linear algebra, image encoding) and Molecules (dense, parallel, near-data compute; e.g., high-dimensional feature search, distance computations), connected by arrows labeled DNA Synthesis, Fluidic Control, DNA Sequencing, and Sensor Data.]
Fig. 3: Hybrid electronic-molecular architecture. Benefits of electronic and molecular components. Different applications may better fit the strengths of either domain. The arrows show ways of getting data from electronic to molecular components and vice versa.

A. Physical Constraints

Molecular systems are unique because they require the storage and manipulation of various solutions, including mixing, splitting, diluting, and incubating them. Architects must take care to ensure that a system is physically realizable. Adleman’s famous DNA-based algorithm for solving the Hamiltonian path problem [3] provides a cautionary tale: the amount of DNA required grows exponentially with the graph size. A system is not feasible if modestly sized problems require swimming pools or oceans of DNA. The systems presented in this paper, however, demonstrate that some applications require only a small reaction volume and are thus feasible.

Since we are trying to build computer systems, physical manipulation also necessitates automation. The various steps of preparing, operating on, and analyzing samples are typically done by humans in a wetlab. Microfluidic technology could provide the needed automation, but it is not yet advanced enough to support a practical computer system. Some instances of the technology are not flexible enough, and those that are remain error-prone and difficult to program [12]. Furthermore, programming these hybrid molecular-electronic systems will require intertwined control code, sample manipulation, data analysis, and conventional computation; these challenges remain to be explored.

B. Communication Considerations

How to move information between domains is a primary concern for any heterogeneous system, and it is especially important for hybrid molecular-electronic systems, where communication can be expensive.

There are many ways to communicate from the electronic domain to the molecular domain. DNA synthesis adds new molecules representing data to the system. Physical manipulation also adds data: the choice of which samples to combine determines the behavior of the system. Changes to the environment (e.g., temperature, humidity) can also control the system by influencing chemical properties.

The ways of getting data from the molecular domain back into the electronic domain vary as well. Some operations may obtain enough data from a simple sensor reading: for example, fluorescent markers can indicate the presence of a particular substance or the occurrence of a reaction. DNA sequencing provides even more information by reconstructing the exact sequence of bases from a sample.

The cost of getting data into and out of molecular components is a crucial consideration. The extreme density and parallelism afforded by the molecular domain are of limited use if the interface is a bottleneck. An efficient hybrid system would send a relatively small amount of information to the molecular domain, where lots of work would be done in parallel, and return a relatively small amount of information again to the electronic domain. In this respect, hybrid molecular-electronic systems are similar to heterogeneous systems with hardware accelerators.
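To make the base-pairing rules of Section II-A and the fuzzy matching of Section II-D concrete, the sketch below computes the complementary partner of a strand and counts mismatched positions against a candidate partner, using the sequences shown in Figure 2. The helper names are hypothetical. Mismatch count is only a crude proxy for how readily two strands hybridize: as noted in Section II-A, the number of unpaired bases does not by itself determine melting temperature, and a thermodynamic tool such as NUPACK [6] is needed for real estimates.

```python
# Base pairing and mismatch counting for two aligned strands.
# Illustrative sketch only: it ignores thermodynamics entirely.

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand_5to3: str) -> str:
    """Partner strand written 3'->5', aligned base-for-base with the input."""
    return "".join(COMPLEMENT[base] for base in strand_5to3)


def reverse_complement(strand_5to3: str) -> str:
    """Partner strand written in the conventional 5'->3' direction."""
    return complement(strand_5to3)[::-1]


def mismatches(upper_5to3: str, lower_3to5: str) -> int:
    """Count positions where the lower strand is not the complement."""
    if len(upper_5to3) != len(lower_3to5):
        raise ValueError("strands must have the same length")
    return sum(COMPLEMENT[u] != l for u, l in zip(upper_5to3, lower_3to5))


# Sequences from Figure 2 (upper strand 5'->3', lower strands 3'->5').
UPPER = "CCTGCTCAATATATGGATAG"
LOWER_2A = "GGACGAGTTATATACCTATC"   # fully complementary
LOWER_2B = "GGACGCCTTATAAACCTATC"   # partially complementary

if __name__ == "__main__":
    print(mismatches(UPPER, LOWER_2A))
    print(mismatches(UPPER, LOWER_2B))
```

For the Figure 2 sequences this reports zero mismatches for the duplex in Figure 2a and three for the duplex in Figure 2b, matching the full and partial hybridization shown there.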
C. Storage Considerations

Molecular computation is based on strand interaction, so having some information already in the molecular domain would reduce the amount of data that crosses the interface. Bandwidth into and out of the molecular domain is limited, so an existing database in molecular form could greatly improve performance at execution time. Information stored in DNA is dense and long-lived, so this molecular preprocessing could be done out-of-band with actual execution.

The nature of molecular interactions may lead to destructive reads or edits of molecular information. We envision getting around this potential issue via periodic molecular amplification like polymerase chain reaction (PCR) or re-synthesis. Re-amplification of the entire molecular database could lead to errors accumulating over time (e.g., polymerase errors are estimated to be 10^-6 to 10^-5 per base). If resorting to re-synthesis, it is important to include only data that was read out and not the entire database, which would possibly oversubscribe the electronic domain with excessive data volumes.

D. Computation Considerations

When it comes to computation, our goal is to harness the best of both the electronic and molecular systems. Electronic platforms can be highly general and precise; they can perform a wide variety of operations exactly as specified. No molecular platforms exist today that match the generality and precision of electronic systems, but they may offer orders of magnitude improvements in performance and/or energy efficiency.

Computationally, the main benefit of molecular systems is that certain computations can be performed in a massively parallel fashion. For example, the systems presented below use hybridization to search for exact and approximate matches. Since both query and data are in solution and there are multiple copies of both (DNA synthesis naturally creates multiple copies of the molecules), the search operation is entirely parallel. We refer to this molecular version of near-data processing as near-molecule processing.

Interestingly, the latencies of these parallel operations do not change with the size of the dataset: performing an operation on a few items takes as long as doing it over trillions. This “constant-time” performance is offset by a large overhead; operations could take on the order of hours to complete. As such, it may only be profitable to perform such operations in molecular form when the dataset is above a certain offload break-even size. As with electronic systems, this break-even point also determines the granularity of communication between the two domains.

IV. DNA DATA STORAGE

A DNA storage system (Figure 4) takes digital data as input, synthesizes DNA molecules to represent that data, and stores them in a physical container or pool. To read data back, the system selects molecules from the pool, amplifies them using polymerase chain reaction (a standard molecular biology protocol), and sequences them back to digital data. One can think of a DNA data storage system as a key-value store, in which input data is associated with a key, and read operations identify the key they wish to recover.

[Figure 4: pipeline diagram with stages Encoding, Synthesis (write), Preservation, Random Access, Sequencing (read), and Decoding, grouped into Electronic, Molecular, and Electronic domains.]
Fig. 4: Overview of DNA-based data storage.

A. Requirements for End-to-End DNA-based Archival Storage

The requirements of a storage system are low read/write latency, high throughput (bits/s), random access and reliability. DNA manipulation latency is significantly higher than that of electronics. However, write and read throughput (bits/second) can be competitive. This makes DNA-based storage a good fit for archival purposes, where latency is not critical if throughput is high enough. For example, current archival storage services quote access times in minutes to hours and sometimes service-level agreements (SLA) specify times on the order of a day. But to be competitive with other commercial systems, a DNA archival storage system will need to offer throughputs of about 1 GB/s in a few years.

B. Encoding and Synthesis

Writing to DNA storage involves encoding binary data as DNA nucleotides and synthesizing the corresponding molecules. Synthesizing and sequencing DNA is far from perfect (errors on the order of 1% per position), hence we need a robust error correction scheme. This process involves two non-trivial steps. First, the direct encoding from binary into the four DNA nucleotides (A, C, T, G) produces problematic sequences such as long stretches of repeated letters. We avoid that with a rotating code [9] and randomization using one-time pads [13]. Second, DNA synthesis technology effectively manufactures molecules one nucleotide at a time, so it cannot synthesize molecules of arbitrary length without error. Based on current efficiency of synthesis methods and technologies, a reasonably efficient strand length for DNA synthesis is about 150 nucleotides (a couple hundred bits of information). The write process therefore splits input data into small blocks which correspond to separate DNA sequences.

Because DNA molecules do not offer spatial organization like traditional storage media, we must explicitly include addressing information in the DNA molecule. Figure 5 shows the layout of an individual DNA strand in our system. Each strand contains a payload, which is a substring of the input data to encode. An address includes both a key identifier and an index into the input data (to allow data longer than one strand). At each end of the strand, special primer sequences [13], [14]
(which correspond to the key identifier according to a hash function) allow for efficient sequencing during read operations.

Splitting data into smaller strands requires a coding method that provisions information for later reassembly. Previous work [9] overlapped multiple small blocks, but our experimental and simulation results show that this approach sacrifices too much density for little gain. Our coding scheme embeds indexing information within each block and uses a Reed-Solomon-based outer coding scheme [15]. Such coding methods provision what we refer to as logical redundancy. Note that DNA synthesis makes many copies of each sequence, and hence also naturally offers physical redundancy, in the form of multiple copies of each sequence (on the order of hundreds of millions). Overheads in addressing and error correction can be amortized with longer strands, but because of diminishing returns and higher errors in longer synthesis processes, it is not advantageous to go beyond 500-1,000 nucleotides (Figure 6).

[Figure 5: an input sequence TCTACGCTCGAGTGATACGAATGCGTCGTACTACGTCGTGTACGTA… above one strand, 5′ TCTACGATC | TCTACGCTCGAGTGATACGA | TCTACG | CCAGTATCA 3′, whose four regions are labeled Primer Target, Payload, Index, and Primer Target.]
Fig. 5: Layout of a DNA strand for data storage. The primer regions in both extremities are used to both enable molecular amplification and to map molecules to the object [13], [14]. The index region is necessary when reassembling the right order of payloads, since molecular storage does not have fixed 3D physical structure across data items.

[Figure 6: plot of Overhead (0% to 100%) against DNA strand length (0 to 3000), with the flat region of the curve annotated “Diminishing returns”.]
Fig. 6: Overheads as a function of strand length.

C. Random Access

Random access is fundamental because it is not practical to have to sift through a vast data archive to retrieve a desired data item. Our design allows for random access by using polymerase chain reaction (PCR). The read process first determines the primers for the given key (analogous to a hash function) and synthesizes them into new DNA molecules. Then, rather than applying sequencing to the entire pool of stored molecules, we first apply PCR to the pool using these primers. PCR amplifies the strands in the pool whose primers match the given ones, creating many copies of those strands. To recover the file, we now take a sample of the resulting pool, which contains a large number of copies of all the relevant strands but only a few other irrelevant strands. Sequencing this sample therefore returns the data for the relevant key rather than all data in the system. As a side note, while PCR amplification is not even (i.e., there may be bias) and may amplify undesired strands, it is not a problem for DNA data storage because of the underlying error tolerance of the encoding/decoding schemes.

While PCR-based random access [13], [14], [16] is a viable implementation, we do not believe it is practical to put all data in a single pool. We instead envision a “library” of pools offering spatial isolation. We estimate each pool to contain about 1TB of data. An address then maps to both a pool location and a PCR primer. This design is analogous to a magnetic tape storage library, where robotic arms are used to retrieve tapes. A production DNA-based storage system would require the use of microfluidic automation to perform the necessary reactions. Tape libraries offer random access by robotic movement of cartridges and fast-forwarding to specific tape segments. The equivalent in DNA would be physically isolated “containers” with DNA, along with some form of molecular selection prior to sequencing and decoding. While PCR is the mechanism we have focused on so far, one can also use magnetic-bead based and other DNA random access methods.

D. Reading and Decoding

Reading back the data involves selecting the appropriate pool where the data of interest is stored, retrieving a sample, and sequencing the DNA. No matter the DNA sequencing method, the result is a large number of reads. Recall that each unique strand is replicated many times in the sequenced sample, so the result will contain many reads for each unique DNA sequence. The decode process will then have to use this physical redundancy to cope with errors introduced by synthesis and sequencing.

The decoder operates in three basic stages. The first step is to cluster noisy reads by similarity to collect all available reads that likely correspond to a unique originally stored DNA sequence. To do so, we employ an algorithm that leverages the input randomization done during encoding. The next step is to process each cluster to recover the original sequence using a variant of the Bitwise Majority Alignment algorithm (BMA) [17] adapted to support insertions, deletions, and substitutions. Finally, the bits are recovered by using a Reed-Solomon (RS) code to correct errors and erasures.

We have used an Illumina NextSeq instrument that implements sequencing by synthesis to read over 200MB of encoded data so far. We have re-sequenced the data several times, which brings the total of digital data read from DNA to the equivalent of well over 1GB. Sequencing error rates have been reasonably low, typically below 1%, and have not prevented us from decoding any files. The largest commercial nanopore DNA sequencing device to which we have access contains about 2,000 nanopores and delivers error rates of about 12.5%, after recent improvements in its chemistry. Despite this high error rate, we have been able to decode a file read with this platform.
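The second decoding stage can be illustrated with a deliberately simplified consensus step. The sketch below takes one cluster of reads and returns the per-position majority base. It assumes that all reads in the cluster have the same length and differ only by substitutions, so it is a stand-in for, not an implementation of, the BMA variant described above, which also handles insertions and deletions. The function name and the example reads are hypothetical.

```python
from collections import Counter
from typing import Sequence


def majority_consensus(reads: Sequence[str]) -> str:
    """Return the per-position majority base for one cluster of reads.

    Simplifying assumption: every read has the same length, i.e. errors
    are substitutions only (no insertions or deletions).
    """
    if not reads:
        raise ValueError("cluster is empty")
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("this sketch only supports equal-length reads")

    consensus = []
    # zip(*reads) yields one tuple per position holding that position's
    # base from every read in the cluster.
    for column in zip(*reads):
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


if __name__ == "__main__":
    # Hypothetical cluster: five noisy reads of the strand ACGTACGT,
    # three of which carry a single substitution error.
    cluster = ["ACGTACGT", "ACGAACGT", "TCGTACGT", "ACGTACGG", "ACGTACGT"]
    print(majority_consensus(cluster))
```

Because each error in the example cluster affects a different position, the majority at every position is the original base and the strand is recovered. With real data, residual errors after this stage would be left to the Reed-Solomon code in the third stage.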

E. Our results so far

Our work so far demonstrates an end-to-end approach toward the viability of DNA data storage with large-scale random access. Although we have only reported on the initial 35 files and 200MB of data [13], we have so far encoded, stored, retrieved, and successfully recovered about 40 distinct files totaling about 400MB of data in more than 25 million unique DNA oligonucleotides synthesized by Twist Bioscience (over 3 billion nucleotides in total). Our results represent an advance of more than an order of magnitude over prior work.

Our dataset focused on technologically advanced data formats and historical relevance, including the Universal Declaration of Human Rights in over 100 languages, a high-definition music video of the band OK Go, and a CropTrust database listing seeds stored in the Svalbard Global Seed Vault.

We demonstrated our random access methodology based on selective PCR amplification, for which we designed and validated a large library of primers, and randomly accessed arbitrarily chosen items from our whole pool with zero-byte error. Moreover, we developed a novel coding scheme that dramatically reduces the sequencing reads per DNA sequence required for error-free decoding to about 6x, while maintaining levels of logical redundancy comparable to the best prior codes. Finally, we further stress-tested our coding approach by successfully decoding a file using the more error-prone nanopore sequencing.

V. NEAR-MOLECULE PROCESSING

Most computer systems consist of a few processors surrounded by memory. To perform computation, the processor must load data from memory, operate on it, and write it back. Even parallel processors and GPUs still have to load all of the relevant data before doing computation. As applications become bandwidth-bound, instead of compute-bound, researchers have sought to bring compute closer to the data [18].

In the molecular setting, we can take advantage of nature to

[Figure 7: stack diagram with two columns, Hybrid Molecular-Electronic and Purely Electronic. Labels recoverable from the figure include Query Image Data, Feature Extraction, Query Image Features, Feature/Address Seq. Encoding, Synthesis, Query Strands, Partial Hybridization, Feature/Address Oligos, Matching Strands, Sequencing, Query Feature Distance Computation, Feature/Address Pairs, Distance/Address Pairs, Ranking/Thresholding, Matching Image Addresses, Image Database, and Matching Image.]
Fig. 7: Stack diagram for a hybrid and a purely electronic content-based image retrieval system. Electronic components are green; molecular components are pink.

section is to, given a new molecular mechanism, discuss how to design a feasible hybrid molecular-electronic system. We also explore the practicality of such a system with a model of its latency, necessary reaction volume, and scalability.

A. Molecular Accelerated Similarity Search

Similarity search is a mechanism for searching a large dataset for objects similar to some given query. We focus on a particular instance of this problem, content-based image retrieval (CBIR), the task of finding images that appear visually similar to a given query image. CBIR systems power real-world applications such as Google’s reverse image search. Figure 7 shows a stack diagram for our proposed CBIR implementation and a purely electronic one.

The first step in building a CBIR system is to extract visual features from each image in the database. Visual features are usually real-valued numbers that represent the activity of some filter applied to the image. These can be hand-engineered features like scale-invariant feature transform (SIFT) [19], or learned features such as intermediate layer activations from a deep neural network [20]. Pairs of feature vectors can be
perform massively parallel near-molecule computation. If we compared using familiar functions such as Euclidean or cosine
can formulate the operation and data such that the result we distance. To find images that are visually similar to a query,
want is thermodynamically favorable, the operation will dif- the system searches for image feature vectors within some
fuse through solution and happen everywhere simultaneously. distance of the query’s feature vector.
Random access through hybridization and PCR as discussed Ordinarily, such searches could be accelerated by partition-
in the previous section is an example of this. The query strand ing the data into a tree-like data structure. However, when
“searches” for target in the entire dataset, all at once. However, feature vectors are high-dimensional, partitioning schemes
random access in electronic systems does not scan the entire become no better than a linear search. This phenomenon is
dataset, so molecular retrieval does not offer any performance popularly known as “the curse of dimensionality”. Real sys-
gain. tems overcome this limitation by using approximation schemes
Here we explore a more compelling case for near-molecular that reduce the amount of data to be sifted through, at the cost
processing. We describe a DNA-based hybrid system for of potentially missing similar images [21], [22], [23].
content-based image retrieval which we call MASS (molecular Ultimately, a high-recall CBIR system must examine a large
accelerated similarity search). MASS relies on a biomolecu- part of the dataset. This provides an opportunity for MASS
lar mechanism for “fuzzy matching” that has not yet been to outperform its purely electronic counterparts by using the
demonstrated but we believe is feasible. The purpose of this near-molecular compute afforded by the molecular domain.
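
For reference, the purely electronic baseline described above (compare feature vectors with Euclidean or cosine distance and keep everything within some distance of the query) can be sketched as a linear scan. This is a minimal illustration under stated assumptions, not code from the system: the function names, the dictionary-based database, and the toy three-dimensional vectors are all hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """One minus the cosine similarity of two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def linear_search(database, query, radius, distance=euclidean):
    """Scan every entry; return addresses within `radius`, nearest first."""
    hits = []
    for address, features in database.items():
        d = distance(features, query)
        if d <= radius:
            hits.append((d, address))
    hits.sort()
    return [address for _, address in hits]

# Hypothetical toy database mapping image addresses to feature vectors.
database = {
    "img-001": [0.0, 0.0, 1.0],
    "img-002": [0.1, 0.0, 0.9],
    "img-003": [1.0, 1.0, 0.0],
}
result = linear_search(database, [0.0, 0.0, 1.0], radius=0.5)
print(result)
```

The scan touches every entry in the database, which is the cost that grows with dataset size in the electronic setting.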
B. Architecture

Much like our DNA data storage system described in Section IV, the database of the MASS system consists of DNA strands that associate a primer with some data. Instead of mapping an address (the primer) to some data, the strands in the MASS database map encoded feature vectors to the address of the image in some other database. So the MASS system deals with the image feature vectors and the addresses of the images instead of the actual image data.

Figure 7 shows the lifetime of a single query in the MASS system. The input to the system is a query feature vector which is encoded into a string of bases and then synthesized into (many copies of) a query strand. The query strands are then combined with a small sample of the database in the reaction vessel. In the reaction, the query strands will partially hybridize with matching targets, performing the similarity search with massive parallelism. The matching targets can then be sequenced, yielding the addresses of the similar images. The images themselves can then be retrieved from a different database downstream, potentially a DNA data store like ours described in Section IV.

The encoder is a critical part of the system that we leave unspecified. We believe that it is possible to encode feature vectors into DNA such that their similarity correlates with partial hybridization efficiency, but this remains to be demonstrated in future work.

The following section will describe the protocol for the molecular search in detail. We will also introduce a simple analytical model that relates the important quantities describing the protocol. The model will let us predict the system’s latency and physical feasibility.

C. Modeling a Hybrid Molecular-Electronic System

Figure 8 shows the equations that comprise the model. Parameters marked with syn refer to the synthesized query, rxn refers to the reaction vessel, and seq refers to the solution that actually gets sequenced. They will be introduced as they become relevant, but Figure 8 shows a summary of all model parameters and their values in a potential design.

Param          Value       Description
$l_{syn}$      100 bases   Length of query strand
$r_{syn}$      0.02 b/s    Synthesis rate in bases per second
$t_{syn}$      83 min      Synthesis latency
$n_{rxn}$      1e16        Number of unique strands in reaction
$c_{rxn}$      10          Copies of each unique strand in reaction
$v_{rxn}$      1.7 mL      Volume of reaction
$\rho_{rxn}$   100 µM      Concentration of strands in reaction
$n_{seq}$      10,000      Number of unique strands sequenced
$c_{seq}$      40          Copy number required for sequencing
$l_{seq}$      100 bases   Length of strands that get sequenced
$r_{seq}$      226 b/s     Sequencing rate in bases per second
$t_{seq}$      2 min       Sequencing latency

$t_{syn} = l_{syn} / r_{syn}$   (1)
$v_{rxn} \, \rho_{rxn} = n_{rxn} \, c_{rxn}$   (2)
$t_{seq} = \frac{c_{seq} \, l_{seq} \, n_{seq}}{r_{seq}}$   (3)

Fig. 8: Parameters for the content-based retrieval model and equations that describe their relationships.

1) Query Synthesis: The protocol starts by synthesizing many copies of the query strand, which represents the encoded feature vector of the query image. DNA synthesis makes many copies of a strand at once, so the latency is proportional to the length of the strand, not the number of copies. Synthesizing many copies helps ensure that partial hybridization happens quickly and allows us to perform PCR.

To get the latency of synthesis, we model the length of the synthesized strand $l_{syn}$ and the rate of synthesis $r_{syn}$. Equation 1 shows how to calculate the latency of query synthesis.

2) Reaction: The reaction vessel initially contains a sample from the database (see Figure 7). The rxn parameters describe this sample of target strands in the reaction vessel, not the synthesized query strands.

The number of unique targets in the reaction vessel is $n_{rxn}$, and each unique strand is replicated $c_{rxn}$ times. This replication factor $c_{rxn}$ is also called the copy number. There are $n_{rxn} c_{rxn}$ total target strands in the reaction. These are stored at some concentration $\rho_{rxn}$, which determines the volume $v_{rxn}$; Equation 2 shows this relation.

Because the targets come from a sample of the database, the reaction has the same concentration ($\rho_{rxn}$) and number of unique targets ($n_{rxn}$) as the database. The number of unique targets determines the capacity of the system; the reaction is effectively searching over $n_{rxn}$ unique image feature vectors in parallel.

Once the query strands are added to the database sample, partial hybridization binds the query to similar targets. These targets can be retrieved with a procedure similar to the one used in DNA data storage (Section IV). For example, PCR can amplify strands that hybridized, leaving the reaction vessel dominated by target image features that were similar to (and bound to) the query.

3) Sequencing: Once the reaction vessel is dominated by similar target strands, we take a sample of the vessel to avoid unnecessary sequencing. We sample such that we only sequence $c_{seq}$ copies of each unique strand.

The number of unique strands ($n_{seq}$) to be sequenced, their copy number ($c_{seq}$), and their length ($l_{seq}$) together determine the number of bases to be sequenced. This and the sequencing rate $r_{seq}$ determine the latency (Equation 3).

Note that the amount to be sequenced is not dependent on the size of the dataset, $n_{rxn}$. Unlike electronic systems, whose time-to-solution is proportional to the size of the dataset, the molecular system instead depends on the size of the result. This is the fundamental benefit provided by the near-molecule computation.

For massive datasets on the order of trillions of images or more, the number of images similar to a given query could be quite large, so controlling the number of desired results (those that end up getting sequenced, $n_{seq}$) independently of the dataset size $n_{rxn}$ is crucial to maintain good performance.
To control the number of results, the temperature of the reaction vessel can be raised or lowered to get more or fewer similar results.

D. Model Instantiation

Using the model in Figure 8, we can derive the latencies ($t_{syn}$ and $t_{seq}$) and capacity of the system ($n_{rxn}$). The remaining model parameters are constrained by either the biomolecular protocol or technology limits.

1) Protocol Constraints: We choose the reaction concentration $\rho_{rxn}$ = 100 µM, a common concentration for synthetic DNA [24]. We choose the reaction copy number $c_{rxn}$ = 10. PCR is incredibly specific; we have observed it working when the copy number is as low as 5.

We chose the length of the synthesized query, $l_{syn}$, to be 100 bases. We believe that this is sufficient to encode feature vectors given a dimensionality reduction. The length of the target strands that get sequenced, $l_{seq}$, is 160 bases. This allocates 100 bases for the encoded feature vector and 60 bases for the address. At a density of 1 bit per base [13], 60 bases is sufficient to uniquely address $n_{rxn}$ = 1e16 images.

2) Technological Constraints: Sequencing and synthesis are expected to get exponentially faster, improving at a rate exceeding Moore’s Law [2]. However, we chose to model sequencing and synthesis rates that are achievable today.

We draw the synthesis rate for our model, $r_{syn}$, from recent literature proposing a method to synthesize a base every 50 seconds [25]. Recall that synthesis time (Equation 1) is proportional only to the length of the strand, not the copy number. Synthesis of a single unique strand is already commercially available on the scale of millimoles, which is well above the amount we require.

Note that we are assuming the existence of a large database of potentially up to 10 quadrillion unique targets. This is beyond the capability of DNA synthesis today. Today’s technology can synthesize many unique strands of DNA at once, on the order of millions [13], but making a database that references 10 quadrillion images would only become feasible with further advancements.

3) System Capability: Plugging the above constraints into the model yields a synthesis latency $t_{syn}$ of 83 minutes and a sequencing latency $t_{seq}$ of 2 minutes. These are of course rough estimations due to the coarse granularity of our model. The partial hybridization and PCR reactions would take on the order of hours. The bottlenecks are clearly DNA synthesis and the reactions, not sequencing.

If we plug in a dataset size (equal to the number of unique strands in the reaction, $n_{rxn}$) of $10^{16}$, the model shows we only require a reaction volume of 1.7 mL. Modeling other systems is outside the scope of this paper, but we believe that MASS would be competitive with or outperform purely electronic systems at this scale.

VI. DISCUSSION

Both synthesis and sequencing need to be lower cost and higher throughput than they are today for DNA data storage and computing to succeed. The gap in both dimensions is daunting, estimated to be about 6 orders of magnitude, but it is important to note that, when used for data storage, DNA synthesis and sequencing have different requirements than for life sciences. First, when storing data, control over the sequences to be synthesized allows for the use of smart error correction to tolerate error rates orders of magnitude higher than those required for life sciences applications. Second, storage applications can tolerate completely missing sequences as well as contamination. Third, data storage needs very few copies of each sequence, compared to the much higher life sciences requirements. Higher synthesis and sequencing density implies simultaneously higher throughput and lower costs, so it will be key to a practical, large-scale end-to-end DNA storage system.

ACKNOWLEDGEMENTS

We would like to thank the anonymous reviewers for their helpful feedback on the manuscript, and MISL members for input on the research and feedback on how to present this work. This work was partially supported by Microsoft, the National Science Foundation and DARPA under the Molecular Informatics Program.

AUTHOR BIOS

Douglas Carmean is a Distinguished Engineer at Microsoft. His current work explores new architectures on future device technology. Carmean holds a BS in electrical and electronics engineering from Oregon State University.

Luis Ceze is a Professor at the Paul G. Allen School of Computer Science and Engineering at the University of Washington. His research focuses on the intersection between computer architecture, programming languages, machine learning and biology. His current focus is on approximate computing for efficient machine learning and DNA-based data storage. He co-directs the Molecular Information Systems Lab (MISL), the Systems and the Architectures and Programming Languages for Machine Learning lab (SAMPL). He received his Ph.D. in Computer Science from UIUC and his M.Eng. and B.Eng. from USP, Brazil. He is a Senior Member of IEEE and ACM.
Georg Seelig is an associate professor in the Paul G. Allen School of Computer Science & Engineering and the Department of Electrical and Computer Engineering at the University of Washington. He is an adjunct associate professor in Bioengineering. Seelig holds a PhD in physics from the University of Geneva in Switzerland and did postdoctoral work in synthetic biology and DNA nanotechnology at Caltech.

Kendall Stewart is a third-year PhD student at the University of Washington. Her focus is on designing practical systems that leverage unique properties of unconventional devices. She is currently working on architectures for high-throughput parallel computing using synthetic DNA.

Karin Strauss is a Senior Researcher at Microsoft and an Affiliate Professor in the Allen School for Computer Science and Engineering at University of Washington. Her research lies at the intersection of computer architecture, systems, and biology. Lately, her focus has been on DNA data storage. In the past, she has studied other emerging memory technologies and hardware accelerators for machine learning, among others. Previously, she worked for AMD Research, and before that she got her Ph.D. in 2007 from the Department of Computer Science at University of Illinois, Urbana-Champaign. She is a Senior Member of IEEE and ACM.

Max Willsey is a Ph.D. student at the Paul G. Allen School for Computer Science & Engineering at the University of Washington. He received his B.S. in Computer Science from Carnegie Mellon University (2016). He is a recipient of a Qualcomm Innovation Fellowship and an NSF Graduate Research Fellowship honorable mention. His research interests are in programming languages and computer architecture with applications in biology.

REFERENCES

[1] Victor Zhirnov, Reza M. Zadegan, Gurtej S. Sandhu, George M. Church, and William L. Hughes. Nucleic acid memory. Nature Materials, 15(4):366–370, 2016.
[2] Rob Carlson. Time for new DNA synthesis and sequencing cost curves, 2014. https://synbiobeta.com/time-new-dna-synthesis-sequencing-cost-curves-rob-carlson.
[3] L M Adleman. Molecular computation of solutions to combinatorial problems. Science, 266(5187):1021–1024, November 1994.
[4] E B Baum. Building an associative memory vastly larger than the brain. Science, 268(5210):583–585, April 1995.
[5] Sotirios A Tsaftaris, A K Katsaggelos, T N Pappas, and T E Papoutsakis. DNA-based matching of digital signals. In 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, pages V–581–4. IEEE, 2004.
[6] Joseph N Zadeh, Conrad D Steenberg, Justin S Bois, Brian R Wolfe, Marshall B Pierce, Asif R Khan, Robert M Dirks, and Niles A Pierce. NUPACK: Analysis and design of nucleic acid systems. Journal of Computational Chemistry, 32(1):170–173, January 2011.
[7] Marvin H Caruthers. The Chemical Synthesis of DNA/RNA: Our Gift to Science. The Journal of Biological Chemistry, 288(2):1420–1427, 2013.
[8] George M Church, Yuan Gao, and Sriram Kosuri. Next-Generation Digital Information Storage in DNA. Science, 337(6102):1628–1628, September 2012.
[9] Nick Goldman, Paul Bertone, Siyuan Chen, Christophe Dessimoz, Emily M LeProust, Botond Sipos, and Ewan Birney. Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature, 494(7435):77–80, January 2013.
[10] David Yu Zhang and Georg Seelig. Dynamic DNA nanotechnology using strand-displacement reactions. Nature Chemistry, 3(2):103–113, February 2011.
[11] Lulu Qian, Erik Winfree, and Jehoshua Bruck. Neural network computation with DNA strand displacement cascades. Nature, 475(7356):368–372, July 2011.
[12] Nathan Blow. Microfluidics: The great divide. Nature Methods, 6(9):683–686, 2009.
[13] Lee Organick, Siena Dumas Ang, Yuan-Jyue Chen, Randolph Lopez, Sergey Yekhanin, Konstantin Makarychev, Miklos Z. Racz, Govinda Kamath, Parikshit Gopalan, Bichlien Nguyen, Christopher Takahashi, Sharon Newman, Hsing-Yeh Parker, Cyrus Rashtchian, Kendall Stewart, Gagan Gupta, Robert Carlson, John Mulligan, Douglas Carmean, Georg Seelig, Luis Ceze, and Karin Strauss. Random access in large scale DNA data storage. Nature Biotechnology, 2018.
[14] James Bornholt, Randolph Lopez, Douglas M Carmean, Luis Ceze, Georg Seelig, and Karin Strauss. A DNA-Based Archival Storage System. In ASPLOS ’16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems. Microsoft Research, ACM, March 2016.
[15] Robert N. Grass, Reinhard Heckel, Michela Puddu, Daniela Paunescu, and Wendelin J Stark. Robust chemical preservation of digital information on DNA in silica with error-correcting codes. Angewandte Chemie International Edition, 54(8):2552–2555, February 2015.
[16] S M Hossein Tabatabaei Yazdi, Yongbo Yuan, Jian Ma, Huimin Zhao, and Olgica Milenkovic. A Rewritable, Random-Access DNA-Based Storage System. Scientific Reports, 5(1):1763, September 2015.
[17] Tugkan Batu, Sampath Kannan, Sanjeev Khanna, and Andrew McGregor. Reconstructing Strings from Random Traces. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 910–918, 2004.
[18] Rajeev Balasubramonian and Boris Grot. Near-Data Processing [Guest editors’ introduction]. IEEE Micro, 36(1):4–5, 2016.
[19] David G Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 2004.
[20] Ji Wan, Dayong Wang, Steven Chu Hong Hoi, Pengcheng Wu, Jianke Zhu, Yongdong Zhang, and Jintao Li. Deep learning for content-based image retrieval: A comprehensive study. pages 157–166, 2014.
[21] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC ’98, pages 604–613, New York, NY, USA, 1998. ACM.
[22] Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Communications of the ACM, 51(1):117–122, January 2008.
[23] Marius Muja and David G Lowe. Fast approximate nearest neighbors with automatic algorithm configuration. In VISAPP International Conference on Computer Vision Theory and Applications, 2009.
[24] Integrated DNA Technologies. https://www.idtdna.com. Accessed: 2017-08-11.
[25] Matej Sack, Kathrin Hölz, Ann-Katrin Holik, Nicole Kretschy, Veronika Somoza, Klaus-Peter Stengele, and Mark M. Somoza. Express photolithographic DNA microarray synthesis with optimized chemistry and high-efficiency photolabile groups. Journal of Nanobiotechnology, 14(1):14, Mar 2016.
