Complexity Analysis

The document discusses a DNA-based encryption algorithm called DNA Indexing that uses genomic databases to generate one-time pads for encryption. It works by searching plaintext bytes in a chromosomal sequence chosen as the key, and memorizing the positions as possible substitution values for each byte during encryption. The algorithm is analyzed for security through statistical measurements, cryptanalysis techniques, and evaluating its large key space from the genomic databases. Its time complexity is also theoretically and practically analyzed.

Security and Complexity of a DNA-Based Cipher

Olga TORNEA, Monica E. BORDA


Communications Department
Technical University of Cluj-Napoca
Cluj-Napoca, Romania
[email protected], [email protected]

Abstract - DNA cryptography is a new and promising field in information security. It combines classical cryptographic solutions with the strength of genetic material. This work evaluates an encryption algorithm that uses genomic databases, in which DNA sequences are stored in digital form. Genomic databases represent a feasible solution to the One-Time-Pad (OTP) symmetric key generation problem. The complexity of the algorithm was evaluated by theoretical analysis and by practical measurements of its execution time. The security level of the algorithm was evaluated through key space analysis, cryptanalysis, and statistical measurements.

Keywords - DNA cryptography; one-time-pad; genomic databases; symmetric encryption; security; complexity

I. INTRODUCTION
One of the newest directions in cryptography is the use of genetics and biomolecular computation. Genetic material such as DNA can be used as a vast storage space. This idea is inspired by the fact that DNA is a natural carrier of information, which is encoded by a 4-letter alphabet: A, C, G, and T. This alphabet can easily be transposed into the binary alphabet (A - 00, C - 01, G - 10, T - 11). Therefore DNA can be used as a storage medium for any kind of information. The property of hybridization between complementary DNA nucleotide bases (A-T, C-G) is exploited in the biomolecular computing field as a central process of computation. It is a natural process that occurs between complementary DNA strands of nucleotides, which is why it is called a self-assembling process. DNA computing started with Adleman's research [1], while some basic directions of DNA cryptography are described in [2].
The sequencing of genomes and their publication in the form of electronic databases was a big step for the growth of the genomic research domain [3]. The benefits of digital genomic databases can also be extended to the information security domain. For example, these databases can be used for the practical application of the OTP encryption scheme. The OTP properties correspond to the characteristics of the unbreakable encryption system defined by Claude Shannon as follows: the key must be truly random, at least as large as the plaintext, never reused in whole or in part, and kept secret [4].
This work analyzes the security level and performance of a DNA-based encryption algorithm presented in [5]. This algorithm does not use the DNA biological medium, and still benefits from the huge randomness that the DNA medium offers; it uses publicly available genome databases to provide the OTP symmetric key. A variety of genes and chromosomes from different organisms are good material for creating random, non-repeating, single-use pads.
The security level of the algorithm was analyzed using the following techniques: statistical measurements, some basic cryptanalytic attacks, and analysis of the key space. Statistical measurements such as the histogram, correlation coefficient, and entropy reveal patterns in the analyzed information. The presence of patterns in the ciphertext gives attackers the opportunity to define a rule by which they can retrieve the information without using the key. Statistical techniques are useful in the case of a ciphertext-only attack, where an attacker has access to the ciphertext, but not to the key or the related plaintext. The major cryptanalytic attacks [6] can be classified in decreasing order of difficulty, or increasing order of available information, as follows: ciphertext-only attack, known-plaintext attack, chosen-plaintext or chosen-ciphertext attack, adaptive chosen-plaintext or chosen-ciphertext attack, related-key attack. Kerckhoffs's principle stipulates that the security of a cryptosystem must lie only in the key. Thus the key space should be large enough to make a brute-force attack infeasible.
Computational complexity estimates the amount of resources required for solving a certain problem. In this work a theoretical complexity analysis of the algorithm was performed. The obtained estimates were confirmed by runtime measurements of the implemented algorithm. The common notation for the complexity function is O(n), where n is the input parameter. Normally the execution time of an algorithm grows with the input size, and this function can be: logarithmic - O(log n), linear - O(n), quadratic - O(n^2), cubic - O(n^3), or exponential - O(2^n). Among these, logarithmic growth of the runtime is the most favorable and exponential time is to be avoided. On the other hand, if the time needed for breaking a cipher is exponential, then it is considered a secure method of encryption [7].
Section II presents the principle of the algorithm, and Section III covers its time complexity, with the results of the theoretical and practical analysis. The security level of the algorithm is discussed in Section IV. Final conclusions and the bibliography end the paper.

"Improvement of the doctoral studies quality in engineering science for development of the knowledge based society-QDOC", contract no. POSDRU/107/1.5/S/78534, project co-funded by the European Social Fund through the Sectorial Operational Program Human Resources 2007-2013.

II. DNA INDEXING ENCRYPTION ALGORITHM
DNA Indexing is a symmetric key encryption algorithm [5]. The OTP principle is ensured by the use of genomic databases. The secret OTP key is a chromosomal sequence that can be downloaded from any genomic database, such as GenBank or DDBJ. Encryption is performed with a very long sequence made of thousands of DNA letters (nucleotide bases). Each DNA sequence in the database has a unique identification number composed of 6 to 8 characters [8]. In a symmetric cryptosystem, encryption and decryption are performed with the same key. The receiver needs to know the ID number of the sequence used as the key in order to find it in the agreed database.

A. Encryption process
DNA Indexing is a stream cipher in which information is processed one byte at a time. The principle is to transform one plaintext byte into a sequence of 4 DNA letters. The next step is to search for this short sequence in the chromosomal sequence that was chosen as the encryption key. Each time the sequence of a plaintext byte is found in the chromosomal sequence, its position is memorized in a vector as one of the possible substitution values for this byte. The substitution vectors for all bytes are stored in a key dictionary. Therefore, for each plaintext byte there is a range of possible values, from which one is chosen randomly for encryption by substitution. In order to obtain a substantial number of substitutions for each byte, the key sequence needs to be sufficiently long, for example 30 000 bases. The steps of encryption are listed below, and an example of encryption is shown in Fig. 1.
1) Key dictionary computation:
a) Each of the 256 possible byte values is transformed into a sequence of 4 letters by the following principle: 10 00 11 01 (141) → GATC.
b) For every byte value a search is performed through the key, a long chromosomal sequence composed of the letters A, C, G, and T.
c) Each time the byte sequence is found in the key sequence, the index of that position is memorized in a vector dedicated to that byte.
d) The result of these operations is a key table of size 256xN, where N is a variable length, because each byte can have a different number of corresponding values in this table.
2) Encryption is performed one byte at a time. It consists of substituting the byte with a value randomly retrieved from its vector in the key table.
3) The final ciphertext is an array of integer values.

B. Decryption process
Decryption is performed with the same key sequence. Each number from the ciphertext is used as a pointer into the DNA sequence, indicating where to read the sequence of the plaintext byte.
The principle of this algorithm is similar to that of a book cipher. The idea of using books for cryptographic purposes dates from 1526 [9], when Jacobus Silvestri proposed in his work a code book cipher for secret communications. Any encryption algorithm that has a long piece of text as the key can be called a book cipher. An example of such an algorithm is the substitution of each plaintext word with its position in a certain book, where the position is given by counting the words. A genome sequence can be considered such a book, and the genomic databases a digital library available for this cipher.
The DNA Indexing encryption algorithm can also be considered, in part, a homophonic substitution cipher. The principle of homophonic substitution is to create a table in which each letter of the alphabet has a certain number of substitution values. The number of substitution values corresponds to the frequency with which the letter appears in the language.

Fig. 1. Encryption process of the DNA Indexing algorithm
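
The encryption and decryption steps of Section II can be illustrated with a short Python sketch. It is not the authors' implementation: the function names are hypothetical, the search is assumed to slide one nucleotide at a time over the key, the key is assumed to contain only the letters A, C, G, and T, and a randomly generated string stands in for a chromosomal sequence downloaded from a genomic database.

```python
import random

# Bit-pair to nucleotide mapping from the Introduction: A=00, C=01, G=10, T=11.
BASES = "ACGT"


def byte_to_dna(value):
    """Encode one byte (0..255) as 4 nucleotides, most significant bit pair first."""
    return "".join(BASES[(value >> shift) & 0b11] for shift in (6, 4, 2, 0))


def dna_to_byte(quad):
    """Decode 4 nucleotides back into one byte value."""
    value = 0
    for base in quad:
        value = (value << 2) | BASES.index(base)
    return value


def build_key_table(key_sequence):
    """Map each byte value to the list of key positions where its 4-letter code occurs.

    Assumption: every offset of the key is a candidate position (sliding
    window with step 1); the paper does not state the step explicitly.
    """
    table = {value: [] for value in range(256)}
    for pos in range(len(key_sequence) - 3):
        table[dna_to_byte(key_sequence[pos:pos + 4])].append(pos)
    return table


def encrypt(plaintext, table, rng=random):
    """Substitute each plaintext byte by a randomly chosen position from its vector."""
    cipher = []
    for byte in plaintext:
        positions = table[byte]
        if not positions:
            raise ValueError(f"byte {byte} ({byte_to_dna(byte)}) never occurs in the key")
        cipher.append(rng.choice(positions))
    return cipher


def decrypt(ciphertext, key_sequence):
    """Use each ciphertext integer as a pointer into the key and decode 4 letters."""
    return bytes(dna_to_byte(key_sequence[pos:pos + 4]) for pos in ciphertext)


# Usage with a synthetic stand-in key of 30 000 bases (not a real genome).
demo_rng = random.Random(2024)
demo_key = "".join(demo_rng.choices(BASES, k=30_000))
demo_table = build_key_table(demo_key)
demo_cipher = encrypt(b"DNA", demo_table, demo_rng)
print(demo_cipher, decrypt(demo_cipher, demo_key))
```

Building the table in a single pass over the key visits each position once, which is one way to obtain the linear dependence on the key length discussed in Section III. A short key may leave some byte values without any position, which is why the sketch raises an error in that case.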


III. COMPUTATIONAL COMPLEXITY OF THE ALGORITHM
Complexity analysis of an algorithm is important because it reveals its efficiency for real-time applications. In this work the computational time of the algorithm was analyzed using methods from complexity theory. The conclusions obtained were then checked against the results of the implementation.
The execution time of an algorithm is considered to be the sum of all its operations. The number of operations can be constant, or it can be variable and depend on the input parameters. Following the approximations of complexity theory, the smallest possible class of functions is used to express the growth rate of the algorithm's runtime [10]. Therefore, if the number of operations is, for example, 1 + 2n, then the complexity is O(n); if the number of operations is 4 + n + n^3, then the complexity is O(n^3).
In this work complexity was analyzed for 3 important operations of the DNA Indexing algorithm: key dictionary computation, encryption, and decryption. The key dictionary is computed in 2*256*n operations, where 256 is the number of possible values for a byte and n is the length of the key sequence. Encryption and decryption are performed in 2*m operations, where m is the number of plaintext and ciphertext words. The ciphertext has the same number of words as the plaintext. Taking the smallest class of functions, the complexity of the key dictionary computation is O(n) and that of the encryption-decryption process is O(m). This means that the computational time grows linearly with the input size.
The experimental results confirmed the estimated complexity. In order to observe the progression of the runtime, the program was executed for different, progressively increasing values of n and m. Figs. 2 - 4 present plots of the runtime growth for the key table computation, encryption, and decryption processes. Some of the execution time measurements are presented in Tables I - III.

Fig. 2. Growing rate of the key table computation runtime

Fig. 3. Growing rate of the encryption runtime

TABLE I. MEASUREMENTS OF THE KEY-DICTIONARY COMPUTATION RUNTIME

Key Length (nucleotides)    Runtime (ms)
1000                        62
5000                        359
10000                       702
15000                       1029
25000                       1763

TABLE II. MEASUREMENTS OF THE ENCRYPTION RUNTIME

Plaintext Size (KB)    Runtime (ms)
74.6                   5
214                    16
419.9                  31
720.4                  49
957.4                  63

TABLE III. MEASUREMENTS OF THE DECRYPTION RUNTIME

Ciphertext Size (KB)    Runtime (ms)
54.3                    31
74.6                    43
214                     123
419.9                   248
720.4                   421

Fig. 4. Growing rate of the decryption runtime
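
One way to check the linear growth claimed for Section III against the published measurements is to fit a straight line to each of Tables I - III and look at the correlation coefficient of the fit. The sketch below uses an ordinary least-squares fit; the helper name `linear_fit` is hypothetical and the fit is an illustration added here, not part of the authors' evaluation.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept, plus Pearson r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r


# Input size and runtime (ms) copied from Tables I - III.
tables = {
    "key dictionary (Table I)": ([1000, 5000, 10000, 15000, 25000],
                                 [62, 359, 702, 1029, 1763]),
    "encryption (Table II)": ([74.6, 214, 419.9, 720.4, 957.4],
                              [5, 16, 31, 49, 63]),
    "decryption (Table III)": ([54.3, 74.6, 214, 419.9, 720.4],
                               [31, 43, 123, 248, 421]),
}

fits = {name: linear_fit(xs, ys) for name, (xs, ys) in tables.items()}
for name, (slope, intercept, r) in fits.items():
    print(f"{name}: slope={slope:.4f} ms per unit, intercept={intercept:.1f} ms, r={r:.4f}")
```

A correlation coefficient close to 1 for each table is consistent with O(n) and O(m) growth, although five points per table cannot by themselves rule out other growth rates.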
IV. SECURITY LEVEL OF THE ALGORITHM
The security of the algorithm was analyzed using different approaches: statistical measurements, cryptanalytic attacks, key space analysis, and secure transmission of the ID number.

A. Statistical Measurements
The statistical probability distribution of a signal can be visualized through a histogram. Given a discrete range of values for a signal, a histogram shows the number of occurrences of each value in the signal. The statistical distribution of the ciphertext can be analyzed in comparison with the plaintext distribution or alone. In cryptography it is important that the distribution of the ciphertext does not contain patterns of the plaintext distribution, and a more uniform distribution of the ciphertext offers better security. Implementation results (Fig. 5) show that the distribution of the ciphertext is random and does not contain patterns of the plaintext distribution.
The other statistical measurements we used to assess the security strength of the ciphertext were entropy and the correlation coefficient (CC). Entropy measures the uncertainty and randomness of the signal. Thus, the entropy of the ciphertext is desired to be high. CC indicates the degree of correlation between neighboring values, such as pixels or letters; its value should be small for the ciphertext. In our measurements (Tables IV - V) the ciphertext entropy is almost twice as high as the plaintext entropy, and the CC of the data is on average about three times smaller after encryption.

B. Cryptanalytic Attacks
We analyzed the resistance of the DNA Indexing algorithm to cryptanalytic attacks. The OTP principle on which it is based protects it from most of the classic attacks. Resistance to the known-plaintext attack comes from the fact that each plaintext byte has not just one, but a range of corresponding ciphertext values. The sample of known plaintext-ciphertext pairs needs to be very large for this attack to have a chance of success. A ciphertext-only attack exploits patterns in the current ciphertext, given a set of previous ciphertexts. Due to the OTP principle, plaintext bytes always have a different range of substitution values from one encryption to another; thus a set of ciphertexts or plaintext-ciphertext pairs from previous encryptions is not useful for breaking the current ciphertext.
An interesting study would be a related-key attack on the DNA Indexing algorithm. The keys for this encryption method are DNA sequences from different databases. Similarities between different sequences can be analyzed using algorithms based on string kernels [11]. Depending on the number and length of the repeating intervals between two random key sequences, a certain level of vulnerability to this kind of attack can be established.

TABLE IV. PLAINTEXT AND CIPHERTEXT ENTROPY MEASUREMENTS

Plaintext       Entropy of the Plaintext    Entropy of the Ciphertext
Images
Mandrill.png    5,80911966936174            11,8045198613451
Lena.jpeg       6,71751693770751            12,7449024853442
Geometr.png     3,70751796253665            8,30648025043256
Text files
Text 1          4,61331253051971            8,39557497165078
Text 2          4,42641922538442            8,21010031257357

TABLE V. PLAINTEXT AND CIPHERTEXT CC MEASUREMENTS

Images            Correlation Coefficient of the Plaintext    Correlation Coefficient of the Ciphertext
Cameramen.tiff    0,991091310592027                           0,3632402094301
Lena.tiff         0,965094635609251                           0,319845388706723
Geometry.png      0,352053005688481                           0,125730787177559

Fig. 5. Plaintext, ciphertext and their statistical measurements
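
The statistical measurements of Section IV.A can be sketched in a few lines of Python. The paper does not give its exact formulas, so the definitions below are assumptions: entropy is the Shannon entropy in bits per symbol, and the correlation coefficient is the Pearson correlation between each value and its immediate successor. The function names are hypothetical.

```python
import math
from collections import Counter


def histogram(symbols):
    """Number of occurrences of each value in the signal."""
    return Counter(symbols)


def shannon_entropy(symbols):
    """Shannon entropy of the empirical symbol distribution, in bits per symbol."""
    total = len(symbols)
    return -sum((count / total) * math.log2(count / total)
                for count in histogram(symbols).values())


def neighbor_correlation(symbols):
    """Pearson correlation between each value and the value that follows it."""
    xs = symbols[:-1]
    ys = symbols[1:]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant signal carries no measurable correlation
    return cov / math.sqrt(var_x * var_y)


# Usage on a small byte string; any sequence of integers works the same way.
sample = list(b"aaaaabbbbbcccccddddd")
print(shannon_entropy(sample), neighbor_correlation(sample))
```

Under these definitions the plaintext is a sequence of bytes, so its entropy cannot exceed 8 bits, while the ciphertext is a sequence of key positions drawn from a much larger range, which is consistent with the ciphertext entropies above 8 reported in Table IV.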


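The key-space figures of the Key Space Analysis subsection (27 bits for a public database, 4^30,000 for a private one, and up to 64 bits for the ID number) can be reproduced with simple arithmetic. The sketch below only restates the numbers quoted in the text; the helper names are hypothetical, and the 8 bits per ID character is the assumption implied by the "up to 64 bits" figure.

```python
import math


def bits_for_choices(count):
    """Equivalent key length in bits when the key is one of `count` candidates."""
    return math.log2(count)


def bits_for_sequence(length, alphabet_size=4):
    """Equivalent key length in bits for a sequence over a given alphabet."""
    return length * math.log2(alphabet_size)


GENBANK_RECORDS = 135_440_924   # record count quoted in the text
MIN_KEY_BASES = 30_000          # minimum key length suggested in Section II

public_bits = bits_for_choices(GENBANK_RECORDS)   # search over database entries
private_bits = bits_for_sequence(MIN_KEY_BASES)   # search over all 4-letter sequences
id_bits = (6 * 8, 8 * 8)                          # 6 to 8 characters, 8 bits each

print(f"public database: about {public_bits:.2f} bits")
print(f"private database: {private_bits:.0f} bits (4^{MIN_KEY_BASES} keys)")
print(f"ID number: {id_bits[0]} to {id_bits[1]} bits")
```

Since a brute-force search succeeds on average after half of the attempts, the expected work is one bit less than each of these figures.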
C. Key Space Analysis
The key space of an encryption algorithm should be as large as possible in order to resist a brute-force attack. If the key is a sequence of bits, then its space is 2^length(Key). Trying all the possible keys takes 2^length(Key) attempts to guarantee a successful break. On average the correct answer is found in half of this number of tries.
In the case of DNA Indexing the key is composed of two parts: the genetic sequence and its ID number. The key for encryption and decryption is the genetic sequence. In this case, trying all the possible keys means trying all the genetic sequences in a database. One such database is the NIH genetic sequence database named GenBank. It is a collection of all publicly available DNA sequences [8]. It contains approximately 135 440 924 DNA sequence records. Trying all these sequences is equivalent to 2^27 attempts, which corresponds to a key of 27 bits. On the other hand, the genetic sequence is long; as mentioned in Section II, it should be at least 30 000 bases. The alphabet of the key sequence is composed of 4 letters: A, C, G, and T. Thus, when a private database is used, the number of possible keys becomes 4^30,000. It is also possible to create a large publicly available database; in this case the key space is equivalent to the number of sequences in the database.

D. Secure Transmission of the ID Number
Secure transmission of the sequence ID number is important when a public database is used. If the database is public, access to the ID number means direct access to the key sequence. The length of the ID number in GenBank is 6 to 8 characters, which means up to 64 bits. A block of 64 bits can be encrypted with a traditional symmetric algorithm, like DES, and then sent through an existing encrypted channel, using a previously exchanged key. The accession number can also be sent using public-key cryptography. Anyone who has a copy of a public key can easily encrypt information that can be read only with the private key. Communication involves only public keys, and no private key is ever transmitted or shared. This type of key distribution is used in PGP [13] and in many other systems, because public keys are easy to distribute.

V. CONCLUSIONS
This paper analyzed the security level and time performance of the DNA-based algorithm. This algorithm represents a practical application of DNA cryptography and brings the great advantage of using genomic databases, a feature that was not exploited before in cryptography. Encryption schemes based on the OTP principle provide strong security. The drawback of the OTP is the generation of long, random, single-use keys. With electronically available genetic databases this problem can be solved. Very long keys (genetic sequences) do not have to be sent or stored; they are available in the internet database and only the ID of the sequence must be sent. In a cryptosystem with a private database, transmission of the ID number does not have to be secure and the key space becomes much larger. The analyses show that the DNA Indexing algorithm provides a good level of security and can be adapted to real-time applications.

ACKNOWLEDGMENT
This paper was supported by the project "Improvement of the doctoral studies quality in engineering science for development of the knowledge based society-QDOC", contract no. POSDRU/107/1.5/S/78534, project co-funded by the European Social Fund through the Sectorial Operational Program Human Resources 2007-2013.

REFERENCES
[1] L.M. Adleman, "Molecular computation of solution to combinatorial problems", Science, vol. 266, pp. 1021-1024, 1994.
[2] A. Gehani, T. LaBean, J. Reif, "DNA-based cryptography", Dimacs Series In Discrete Mathematics & Theoretical Computer Science, vol. 54, pp. 233-249, 2000.
[3] E.S. Lander, "Initial impact of the sequencing of the human genome", Nature, vol. 470, pp. 187-197, 2011.
[4] C.E. Shannon, "Communication theory of secrecy systems", Bell System Technical Journal, vol. 28, no. 4, pp. 656-715, 1949.
[5] O. Tornea, M.E. Borda, T. Hodorogea, M. Vaida, "Encryption system with Indexing DNA chromosomes cryptographic algorithm", IASTED International Conference, vol. 680-099, pp. 12-15, 2010.
[6] D.R. Kohel, "Cryptography", 11th July, 2008.
[7] O.S. Rao, S.P. Setty, "Efficient mapping methods for Elliptic Curve Cryptosystems", International Journal of Engineering Science and Technology, vol. 2(8), pp. 3651-3656, 2010.
[8] http://www.dsimb.inserm.fr/~fuchs/M2BI/AnalSeq/Annexes/Sequences/Accession_Numbers.htm
[9] A.C. Leighton, S.M. Matyas, "The history of book ciphers", CRYPTO'84, pp. 101-113, 1984.
[10] J. Hastad, "Lower bounds in computational complexity theory", Notices of the AMS, vol. 35, no. 5, pp. 677-683, 1988.
[11] A. Mohapatra, P.M. Mishra, S. Padhy, "Discriminative DNA classification and motif prediction using weighted degree string kernels with shift and mismatch", Advances in Computing, Communication and Control, pp. 56-61, 2009.
[12] http://www.ncbi.nlm.nih.gov/genbank/
[13] PGP Corporation, Phil Zimmermann, "An introduction to cryptography", 2004.
