DNA Data Storage and Hybrid Molecular-Electronic Computing
Abstract—Moore’s Law may be slowing, but our ability to manipulate molecules is improving faster than ever. DNA could [...] the implications of this trend in computer architecture.

[Figure residue: density in Gb/mm^3 on a logarithmic axis, with series labeled Today, Projection, and Limit.]
1558-2256 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
[Fig. 3 panel labels: Electronics (complex control flow and operations; e.g., sparse linear algebra, image encoding). Molecules (dense, parallel, near-data compute; e.g., high-dimensional feature search, distance computations). Electronic-to-molecular paths: DNA synthesis, fluidic control. Molecular-to-electronic paths: DNA sequencing, sensor data.]
Fig. 3: Hybrid electronic-molecular architecture. Benefits of electronic and molecular components. Different applications may better fit the strengths of either domain. The arrows show ways of getting data from electronic to molecular components and vice versa.

these minute current fluctuations, which may require heavy signal processing and more precise sensors, and increasing the density of nanopores on a physical substrate, as well as solving problems with clogging and endurance of pores.

C. Brief History of DNA Data Storage

The general idea of using DNA as storage of synthetic information has been around since at least the mid-1960s, when Norbert Wiener suggested the idea of “genetic” memory for computers. In the past 6 years, work from Harvard [8] and the European Bioinformatics Institute [9] showed that progress in modern DNA manipulation methods could make it both possible and practical soon. Many research groups, including groups at ETH Zurich, the University of Illinois at Urbana-Champaign, and Columbia University, are working on this problem. Our own group at the University of Washington and Microsoft holds the world record for the amount of data successfully stored in and retrieved from DNA: over 500 megabytes as of June 2018.

D. DNA-based computation

The kinetics of DNA hybridization enable more than just a lookup operation. For instance, partial hybridization can implement “fuzzy matching”, where the query and target do not have to be entirely complementary, and the “fuzziness” can be controlled by varying the temperature [5]. This property can be leveraged to perform distance computations, which we discuss further in Section III.

More recently, researchers have shown that hybridization reactions can form complex cascades called strand displacement reactions, which can be used to implement general-purpose computations, including Boolean circuits [10] and neural networks [11].

Beyond hybridization, evolution has led to a variety of enzymes for processing DNA, including cutting, joining, replication, and editing. These enzymes can be used to create even more complex circuits.

III. HYBRID MOLECULAR-ELECTRONIC SYSTEMS

A hybrid molecular-electronic system aims to leverage the best properties of each domain (Figure 3). As with any heterogeneous system, the strengths and weaknesses of each domain raise a series of critical design questions. In this section, we discuss challenges and trade-offs pertaining to physical constraints, communication, storage, and computation. We discuss these dimensions in general, and we provide examples of how they guided design decisions in the systems presented in Section IV and Section V.

A. Physical Constraints

Molecular systems are unique because they require the storage and manipulation of various solutions, including mixing, splitting, diluting, and incubating them. Architects must take care to ensure that a system is physically realizable. Adleman’s famous DNA-based algorithm for solving the Hamiltonian path problem [3] provides a cautionary tale: the amount of DNA required grows exponentially with the graph size. A system is not feasible if modestly sized problems require swimming pools or oceans of DNA. The systems presented in this paper, however, demonstrate that some applications require only a small reaction volume and are thus feasible.

Since we are trying to build computer systems, physical manipulation also necessitates automation. The various steps of preparing, operating on, and analyzing samples are typically done by humans in a wet lab. Microfluidic technology could provide the needed automation, but it is not yet advanced enough to support a practical computer system. Some instances of the technology are not flexible enough, and those that are remain error-prone and difficult to program [12]. Furthermore, programming these hybrid molecular-electronic systems will require intertwined control code, sample manipulation, data analysis, and conventional computation; these challenges remain to be explored.

B. Communication Considerations

How to move information between domains is a primary concern for any heterogeneous system, and it is especially important for hybrid molecular-electronic systems, where communication can be expensive.

There are many ways to communicate from the electronic domain to the molecular domain. DNA synthesis adds new molecules representing data to the system. Physical manipulation also adds data: the choice of which samples to combine determines the behavior of the system. Changes to the environment (e.g., temperature, humidity) can also control the system by influencing chemical properties.

Getting data from the molecular domain back into the electronic domain varies as well. Some operations may obtain enough data from a simple sensor reading: for example, fluorescent markers can indicate the presence of a particular substance or the occurrence of a reaction. DNA sequencing provides even more information by reconstructing the exact sequence of bases from a sample.

The cost of getting data into and out of molecular components is a crucial consideration. The extreme density and parallelism afforded by the molecular domain are of limited use if the interface is a bottleneck. An efficient hybrid system would send a relatively small amount of information to the molecular domain, where lots of work would be done in parallel, and
[Fig. 6 plot: x-axis DNA strand length (0 to 3000); y-axis in percent (0% to 60%), axis label partially lost ("returns").]
Fig. 6: Overheads as a function of strand length.

after recent improvements in its chemistry. Despite this high error rate, we have been able to decode a file read with this

E. Our results so far

Our work so far demonstrates an end-to-end approach toward the viability of DNA data storage with large-scale random access. Although we have only reported on the initial 35 files and 200 MB of data [13], we have so far encoded, stored, retrieved, and successfully recovered about 40 distinct files totaling about 400 MB of data in more than 25 million unique DNA oligonucleotides synthesized by Twist Bioscience (over 3 billion nucleotides in total). Our results represent an

to retrieve tapes. A production DNA-based storage system would require the use of microfluidic automation to perform the necessary reactions. Tape libraries offer random access by robotic movement of cartridges and fast-forwarding to specific tape segments. The equivalent in DNA would be physically isolated “containers” with DNA, along with some form of molecular selection prior to sequencing and decoding. While PCR is the mechanism we have focused on so far, one can also use magnetic-bead-based and other DNA random access methods.

[Figure residue: similarity-search pipeline with two paths. Hybrid Molecular-Electronic: Seq. Encoding, Synthesis, Query Strands, Partial Hybridization (with Feature / Address Oligos), Matching Strands, Sequencing. Purely Electronic: Feature Extraction, Query Feature, Distance Computation (over Feature / Address Pairs), Distance / Address Pairs, Ranking / Thresholding. Other labels: Query Image Data, Matching Addresses, Image Database, Matching Image.]
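As a rough illustration of the purely electronic path named in the pipeline figure (distance computation followed by ranking and thresholding), the sketch below searches a toy database of feature/address pairs. The binary feature strings, the Hamming distance metric, and the names `hamming` and `search` are assumptions made for illustration; they are not taken from the paper. The `threshold` parameter plays a role loosely analogous to reaction temperature in partial hybridization: loosening it admits less similar matches.

```python
from typing import List, Tuple


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have equal length")
    return sum(x != y for x, y in zip(a, b))


def search(query: str, database: List[Tuple[str, int]], threshold: int) -> List[int]:
    """Return addresses whose features lie within `threshold` of the query,
    ordered from most to least similar (ties broken by address)."""
    # Distance computation: one (distance, address) pair per database entry.
    scored = [(hamming(query, feature), address) for feature, address in database]
    # Thresholding, then ranking.
    kept = sorted(pair for pair in scored if pair[0] <= threshold)
    return [address for _, address in kept]


# Hypothetical toy database of (feature, address) pairs.
database = [("10110010", 0), ("10110011", 1), ("01001101", 2), ("10100010", 3)]
query = "10110010"

loose = search(query, database, threshold=1)   # admits near matches
strict = search(query, database, threshold=0)  # exact matches only
print(loose, strict)
```

An electronic implementation visits every entry in sequence, whereas the hybrid path described in this paper relies on hybridization acting on all strands in the reaction at once.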
To that end, the temperature of the reaction vessel can be raised or lowered to get more or fewer similar results.

D. Model Instantiation

Using the model in Figure 8, we can derive the latencies (t_syn and t_seq) and capacity of the system (n_rxn). The remaining model parameters are constrained by either the biomolecular protocol or technology limits.

1) Protocol Constraints: We choose the reaction concentration ρ_rxn = 100 µM, a common concentration for synthetic DNA [24]. We choose the reaction copy number c_rxn = 10. PCR is incredibly specific: we have observed it working when the copy number is as low as 5.

We choose the length of the synthesized query, l_syn, to be 100 bases. We believe that this is sufficient to encode feature vectors given a dimensionality reduction. The length of the target strands that get sequenced, l_seq, is 160 bases. This allocates 100 bases for the encoded feature vector and 60 bases for the address. At a density of 1 bit per base [13], 60 bases are sufficient to uniquely address n_rxn = 10^16 images.

2) Technological Constraints: Sequencing and synthesis are expected to get exponentially faster, improving at a rate exceeding Moore’s Law [2]. However, we choose to model sequencing and synthesis rates that are achievable today.

We draw the synthesis rate for our model, r_syn, from recent literature proposing a method to synthesize a base every 50 seconds [25]. Recall that synthesis time (Equation 1) is proportional only to the length of the strand, not the copy number. Synthesis of a single unique strand is already commercially available on the scale of millimoles, which is well above the amount we require.

Note that we are assuming the existence of a large database of potentially up to 10 quadrillion unique targets. This is beyond the capability of DNA synthesis today. Today’s technology can synthesize many unique strands of DNA at once, on the order of millions [13], but making a database that references 10 quadrillion images would only become feasible with further advancements.

3) System Capability: Plugging the above constraints into the model yields a synthesis latency t_syn of 83 minutes and a sequencing latency t_seq of 2 minutes. These are of course rough estimates due to the coarse granularity of our model. The partial hybridization and PCR reactions would take on the order of hours. The bottlenecks are clearly DNA synthesis and the reactions, not sequencing.

If we plug in a dataset size (equal to the number of unique strands in the reaction, n_rxn) of 10^16, the model shows we only require a reaction volume of 1.7 mL. Modeling other systems is outside the scope of this paper, but we believe that MASS would be competitive with or outperform purely electronic systems at this scale.

VI. DISCUSSION

Both synthesis and sequencing need to be lower cost and higher throughput than they are today for DNA data storage and computing to succeed. The gap in both dimensions is daunting, estimated to be about 6 orders of magnitude, but it is important to note that, when used for data storage, DNA synthesis and sequencing have different requirements than for life sciences. First, when storing data, control over the sequences to be synthesized allows for the use of smart error correction to tolerate error rates orders of magnitude higher than those required for life sciences applications. Second, storage applications can tolerate completely missing sequences as well as contamination. Third, data storage needs very few copies of each sequence, compared to the much higher life sciences requirements. Higher synthesis and sequencing density implies simultaneously higher throughput and lower costs, so it will be key to a practical, large-scale end-to-end DNA storage system.

ACKNOWLEDGEMENTS

We would like to thank the anonymous reviewers for their helpful feedback on the manuscript, and MISL members for input on the research and feedback on how to present this work. This work was partially supported by Microsoft, the National Science Foundation and DARPA under the Molecular Informatics Program.

AUTHOR BIOS

Douglas Carmean is a Distinguished Engineer at Microsoft. His current work explores new architectures on future device technology. Carmean holds a BS in electrical and electronics engineering from Oregon State University.

Luis Ceze is a Professor at the Paul G. Allen School of Computer Science and Engineering at the University of Washington. His research focuses on the intersection between computer architecture, programming languages, machine learning and biology. His current focus is on approximate computing for efficient machine learning and DNA-based data storage. He co-directs the Molecular Information Systems Lab (MISL), the Systems and the Architectures and Programming Languages for Machine Learning lab (SAMPL). He received his Ph.D. in Computer Science from UIUC and his M.Eng. and B.Eng. from USP, Brazil. He is a Senior Member of IEEE and ACM.
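The figures quoted under Model Instantiation (Section V-D) can be sanity-checked with a few lines of arithmetic. The sketch below is not the paper's model (Figure 8 and Equation 1 are not reproduced in this text). It assumes that synthesis time equals strand length times seconds per base, and that reaction volume equals the total number of molecules (n_rxn times c_rxn) divided by Avogadro's number and the molar concentration. All function names are hypothetical.

```python
AVOGADRO = 6.02214076e23  # molecules per mole (SI defining value)


def synthesis_latency_s(l_syn_bases: int, seconds_per_base: float) -> float:
    """Assumed form of Equation 1: latency depends on strand length only."""
    return l_syn_bases * seconds_per_base


def reaction_volume_l(n_rxn: float, c_rxn: float, rho_rxn_molar: float) -> float:
    """Assumed volume model: moles of DNA divided by molar concentration."""
    moles = n_rxn * c_rxn / AVOGADRO
    return moles / rho_rxn_molar


def address_capacity(address_bases: int, bits_per_base: int = 1) -> int:
    """Number of distinct addresses encodable at the given density."""
    return 2 ** (address_bases * bits_per_base)


# Parameters quoted in the text: l_syn = 100 bases, one base every 50 s,
# n_rxn = 10^16 strands, c_rxn = 10 copies, rho_rxn = 100 uM, 60 address bases.
t_syn_min = synthesis_latency_s(100, 50.0) / 60.0
volume_ml = reaction_volume_l(1e16, 10, 100e-6) * 1e3
enough_addresses = address_capacity(60) >= 10**16

print(f"t_syn = {t_syn_min:.1f} min")
print(f"volume = {volume_ml:.2f} mL")
print(f"60 bases can address 10^16 images: {enough_addresses}")
```

Under these assumptions the synthesis latency comes to about 83 minutes and the volume to about 1.66 mL, in line with the 83 minutes and 1.7 mL reported in the text. The sequencing latency is not checked because the sequencing rate is not stated in this excerpt.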