What Does Scalable Mean?

The document discusses the concept of scalability in operations, algorithms, and data sizes. Operationally, scalability now means being able to use thousands of cheap computers rather than merely handling data that does not fit in main memory. Algorithmically, scalable algorithms should perform on the order of N log N operations rather than N^m as data sizes grow. With large datasets such as telescope images, an algorithm may get only one pass over streaming data, so it must make that pass count. Relational databases can solve "needle in a haystack" problems scalably by indexing large datasets. However, some tasks, such as read trimming, must touch every record and therefore cannot use an index to improve scalability.

What Does Scalable Mean?

Operationally:
In the past: Works even if the data doesn't fit in main memory
Now: Can make use of 1000s of cheap computers

Algorithmically:
In the past: If you have N data items, you must do no more
than N^m operations -- polynomial-time algorithms

Now: If you have N data items, you must do no more than N^m/k
operations, for some large k
Polynomial-time algorithms must be parallelized

Soon: If you have N data items, you should do no more than
N * log(N) operations

As data sizes go up, you may only get one pass at the data
The data is streaming -- you better make that one pass count
Ex: Large Synoptic Survey Telescope (30TB / night)
5/15/13

Bill Howe, eScience Institute
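The one-pass constraint can be made concrete with a small sketch (not code from the slides; the function name is hypothetical): a generator can be consumed only once, so any statistics must be gathered in that single pass.

```python
def stream_stats(records):
    """Compute record count and max length in a single pass over a stream.

    `records` may be a generator that can only be consumed once,
    e.g. observations arriving from a telescope pipeline.
    """
    count = 0
    max_len = 0
    for rec in records:          # each record is seen exactly once
        count += 1
        max_len = max(max_len, len(rec))
    return count, max_len

# Simulate a one-pass stream with a generator expression.
stream = (seq for seq in ["TACCTGCCGTAA", "GATTACGATATTA", "CCCCC"])
print(stream_stats(stream))      # -> (3, 13)
```

Once `stream_stats` returns, the generator is exhausted; a second pass would see no data, which is exactly the streaming constraint described above.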

Example: Find matching DNA sequences


Given a set of sequences
Find all sequences equal to
GATTACGATATTA


TACCTGCCGTAA

GATTACGATATTA

GATTACGATATTA

TACCTGCCGTAA = GATTACGATATTA?

No.

time = 0

GATTACGATATTA

CCCCCAATGAC = GATTACGATATTA?

No.

time = 1

GATTACGATATTA

GATTACGATATTA = GATTACGATATTA?


Yes!
Send it to the output.

time = 17

GATTACGATATTA

40 records, 40 comparisons
N records, N comparisons
The algorithmic complexity is order N: O(N)
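The linear scan just described can be sketched as follows (a minimal sketch; the toy sequences are taken from the slides):

```python
def find_matches_linear(sequences, query):
    """Scan every record and collect those equal to the query: O(N)."""
    matches = []
    for seq in sequences:        # one comparison per record
        if seq == query:
            matches.append(seq)
    return matches

sequences = ["TACCTGCCGTAA", "CCCCCAATGAC", "GATTACGATATTA", "AAAATCCTGCA"]
print(find_matches_linear(sequences, "GATTACGATATTA"))  # -> ['GATTACGATATTA']
```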


TTTTCGTAATT
AAAATCCTGCA
AAACGCCTGCA

TTTACGTCAA
GATTACGATATTA

What if we sort the sequences?

Start at the 50% mark

CTGTACACAACCT
GATTACGATATTA

0%

100%

CTGTACACAACCT < GATTACGATATTA

time = 0

No match.
Skip to 75% mark

GATTACGATATTA
GGATACACATTTA

0%

100%

GGATACACATTTA > GATTACGATATTA

time = 1

No match.
Go back to 62.5% mark

GATATTTTAAGC
GATTACGATATTA

0%

100%

GATATTTTAAGC < GATTACGATATTA


No match.
Skip back to 68.75% mark

GATTACGATATTA

0%

100%

GATTACGATATTA = GATTACGATATTA
Match!
Walk through the records until we
fail to match.

GATTACGATATTA

0%

100%

How many comparisons did we do?


40 records, only 4 comparisons
N records, log(N) comparisons
This algorithm is O(log(N))

Far better scalability
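The sorted-search walkthrough above can be sketched with Python's standard bisect module (a sketch, not code from the slides): binary search finds the first candidate position, then we walk through the records until we fail to match.

```python
import bisect

def find_matches_sorted(sorted_sequences, query):
    """Binary-search a sorted list, then walk through equal records.

    O(log N) comparisons to locate the query, plus one step per match.
    """
    i = bisect.bisect_left(sorted_sequences, query)  # first position >= query
    matches = []
    while i < len(sorted_sequences) and sorted_sequences[i] == query:
        matches.append(sorted_sequences[i])          # walk until mismatch
        i += 1
    return matches

sequences = sorted(["TACCTGCCGTAA", "GATTACGATATTA", "CCCCCAATGAC",
                    "GATTACGATATTA", "AAAATCCTGCA"])
print(find_matches_sorted(sequences, "GATTACGATATTA"))
```

Note the trade-off: sorting costs O(N log N) up front, which pays off when the sorted data is queried many times.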

Relational Databases

Databases are good at Needle in Haystack problems:


Extracting small results from big datasets
Transparently provide old-style scalability
Your query will always* finish, regardless of dataset size.
Indexes are easily built and automatically used when appropriate
CREATE INDEX seq_idx ON sequence(seq);
SELECT seq
FROM sequence
WHERE seq = 'GATTACGATATTA';


*almost
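The index idea can be tried end-to-end with Python's built-in sqlite3 module (a sketch; the table and index names follow the slide, the in-memory database is a stand-in for a large dataset):

```python
import sqlite3

# In-memory database standing in for a large sequence table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sequence (seq TEXT)")
conn.executemany("INSERT INTO sequence VALUES (?)",
                 [("TACCTGCCGTAA",), ("GATTACGATATTA",), ("CCCCCAATGAC",)])

# Build the index; the query planner uses it automatically for equality lookups.
conn.execute("CREATE INDEX seq_idx ON sequence(seq)")

rows = conn.execute(
    "SELECT seq FROM sequence WHERE seq = ?", ("GATTACGATATTA",)
).fetchall()
print(rows)                      # -> [('GATTACGATATTA',)]
```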


New task: Read Trimming


Given a set of DNA sequences
Trim the final n bps of each sequence
Generate a new dataset



TACCTGCCGTAA

GATTACGATATTA

TACCTGCCGTAA becomes TACCT

time = 0

CCCCCAATGAC becomes CCCCC

time = 1

GATTACGATATTA becomes GATTA

time = 17

Can we use an index?


No. We have to touch every record no matter what.
The task is fundamentally O(N).
Can we do any better?

time = 0

time = 1

time = 2

time = 3

How much time did this take?


7 cycles

time = 7

40 records, 6 workers
O(N/k)
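The k-worker idea can be sketched with Python's concurrent.futures (a sketch, not code from the slides; a thread pool is used for simplicity, whereas a real deployment would spread chunks across machines). Following the slides' examples, `trim` keeps the first 5 bases of each read:

```python
from concurrent.futures import ThreadPoolExecutor

def trim(seq, keep=5):
    """Trim a read down to its first `keep` bases (O(1) per record)."""
    return seq[:keep]

def trim_all_parallel(sequences, workers=6):
    """Split the O(N) trimming work across k workers: O(N/k) wall time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(trim, sequences))

sequences = ["TACCTGCCGTAA", "CCCCCAATGAC", "GATTACGATATTA"]
print(trim_all_parallel(sequences))  # -> ['TACCT', 'CCCCC', 'GATTA']
```

With 40 records and 6 workers, each worker handles ceil(40/6) = 7 records, matching the 7 cycles counted above.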
