What Does Scalable Mean?

The document discusses the concept of scalability in operations, algorithms, and data sizes. Operationally, scalability now means being able to use thousands of cheap computers rather than merely handling data that does not fit in main memory. Algorithmically, scalable algorithms should perform on the order of N log N operations rather than N^m as data sizes grow. With large datasets such as telescope images, an algorithm may get only one pass over streaming data, so it must make that pass count. Relational databases can solve "needle in a haystack" problems scalably by indexing large datasets. However, some tasks, such as read trimming, must touch every record and therefore cannot use an index to improve scalability.

What Does Scalable Mean?

Operationally:
In the past: Works even if the data doesn't fit in main memory
Now: Can make use of 1000s of cheap computers

Algorithmically:
In the past: If you have N data items, you must do no more
than N^m operations -- polynomial-time algorithms

Now: If you have N data items, you must do no more than N^m/k
operations, for some large k
Polynomial-time algorithms must be parallelized

Soon: If you have N data items, you should do no more than
N * log(N) operations

As data sizes go up, you may only get one pass at the data
The data is streaming -- you better make that one pass count
Ex: Large Synoptic Survey Telescope (30TB / night)
5/15/13

Bill Howe, eScience Institute
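The one-pass constraint can be made concrete with a small sketch (not code from the slides; the function name is hypothetical): a generator can be consumed only once, so any statistics must be gathered in that single pass.

```python
def stream_stats(records):
    """Compute record count and max length in a single pass over a stream.

    `records` may be a generator that can only be consumed once,
    e.g. observations arriving from a telescope pipeline.
    """
    count = 0
    max_len = 0
    for rec in records:          # each record is seen exactly once
        count += 1
        max_len = max(max_len, len(rec))
    return count, max_len

# Simulate a one-pass stream with a generator expression.
stream = (seq for seq in ["TACCTGCCGTAA", "GATTACGATATTA", "CCCCC"])
print(stream_stats(stream))      # -> (3, 13)
```

Once `stream_stats` returns, the generator is exhausted; a second pass would see no data, which is exactly the streaming constraint described above.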

Example: Find matching DNA sequences


Given a set of sequences
Find all sequences equal to
GATTACGATATTA


TACCTGCCGTAA

GATTACGATATTA

GATTACGATATTA

TACCTGCCGTAA = GATTACGATATTA?

No.

time = 0

GATTACGATATTA

CCCCCAATGAC = GATTACGATATTA?

No.

time = 1

GATTACGATATTA

GATTACGATATTA = GATTACGATATTA?


Yes!
Send it to the output.

time = 17

GATTACGATATTA

40 records, 40 comparisons
N records, N comparisons
The algorithmic complexity is order N: O(N)
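The linear scan just described can be sketched as follows (a minimal sketch; the toy sequences are taken from the slides):

```python
def find_matches_linear(sequences, query):
    """Scan every record and collect those equal to the query: O(N)."""
    matches = []
    for seq in sequences:        # one comparison per record
        if seq == query:
            matches.append(seq)
    return matches

sequences = ["TACCTGCCGTAA", "CCCCCAATGAC", "GATTACGATATTA", "AAAATCCTGCA"]
print(find_matches_linear(sequences, "GATTACGATATTA"))  # -> ['GATTACGATATTA']
```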


TTTTCGTAATT
AAAATCCTGCA
AAACGCCTGCA

TTTACGTCAA
GATTACGATATTA

What if we sort the sequences?

Start at the 50% mark

CTGTACACAACCT
GATTACGATATTA

0%

100%

CTGTACACAACCT < GATTACGATATTA

time = 0

No match.
Skip to 75% mark

GATTACGATATTA
GGATACACATTTA

0%

100%

GGATACACATTTA > GATTACGATATTA

time = 1

No match.
Go back to 62.5% mark

GATATTTTAAGC
GATTACGATATTA

0%

100%

GATATTTTAAGC < GATTACGATATTA


No match.
Skip back to 68.75% mark

GATTACGATATTA

0%

100%

GATTACGATATTA = GATTACGATATTA
Match!
Walk through the records until we
fail to match.

GATTACGATATTA

0%

100%

How many comparisons did we do?


40 records, only 4 comparisons
N records, log(N) comparisons
This algorithm is O(log(N))

Far better scalability
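The sorted-search walkthrough above can be sketched with Python's standard bisect module (a sketch, not code from the slides): binary search finds the first candidate position, then we walk through the records until we fail to match.

```python
import bisect

def find_matches_sorted(sorted_sequences, query):
    """Binary-search a sorted list, then walk through equal records.

    O(log N) comparisons to locate the query, plus one step per match.
    """
    i = bisect.bisect_left(sorted_sequences, query)  # first position >= query
    matches = []
    while i < len(sorted_sequences) and sorted_sequences[i] == query:
        matches.append(sorted_sequences[i])          # walk until mismatch
        i += 1
    return matches

sequences = sorted(["TACCTGCCGTAA", "GATTACGATATTA", "CCCCCAATGAC",
                    "GATTACGATATTA", "AAAATCCTGCA"])
print(find_matches_sorted(sequences, "GATTACGATATTA"))
```

Note the trade-off: sorting costs O(N log N) up front, which pays off when the sorted data is queried many times.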

Relational Databases

Databases are good at Needle in Haystack problems:


Extracting small results from big datasets
Transparently provide old-style scalability
Your query will always* finish, regardless of dataset size.
Indexes are easily built and automatically used when appropriate
CREATE INDEX seq_idx ON sequence(seq);
SELECT seq
FROM sequence
WHERE seq = 'GATTACGATATTA';


*almost
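The index idea can be tried end-to-end with Python's built-in sqlite3 module (a sketch; the table and index names follow the slide, the in-memory database is a stand-in for a large dataset):

```python
import sqlite3

# In-memory database standing in for a large sequence table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sequence (seq TEXT)")
conn.executemany("INSERT INTO sequence VALUES (?)",
                 [("TACCTGCCGTAA",), ("GATTACGATATTA",), ("CCCCCAATGAC",)])

# Build the index; the query planner uses it automatically for equality lookups.
conn.execute("CREATE INDEX seq_idx ON sequence(seq)")

rows = conn.execute(
    "SELECT seq FROM sequence WHERE seq = ?", ("GATTACGATATTA",)
).fetchall()
print(rows)                      # -> [('GATTACGATATTA',)]
```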


New task: Read Trimming


Given a set of DNA sequences
Trim the final n bps of each sequence
Generate a new dataset



TACCTGCCGTAA

GATTACGATATTA

TACCTGCCGTAA becomes TACCT

time = 0

CCCCCAATGAC becomes CCCCC

time = 1

GATTACGATATTA becomes GATTA

time = 17

Can we use an index?


No. We have to touch every record no matter what.
The task is fundamentally O(N).
Can we do any better?

time = 0

time = 1

time = 2

time = 3

How much time did this take?


7 cycles

time = 7

40 records, 6 workers
O(N/k)
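The k-worker idea can be sketched with Python's concurrent.futures (a sketch, not code from the slides; a thread pool is used for simplicity, whereas a real deployment would spread chunks across machines). Following the slides' examples, `trim` keeps the first 5 bases of each read:

```python
from concurrent.futures import ThreadPoolExecutor

def trim(seq, keep=5):
    """Trim a read down to its first `keep` bases (O(1) per record)."""
    return seq[:keep]

def trim_all_parallel(sequences, workers=6):
    """Split the O(N) trimming work across k workers: O(N/k) wall time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(trim, sequences))

sequences = ["TACCTGCCGTAA", "CCCCCAATGAC", "GATTACGATATTA"]
print(trim_all_parallel(sequences))  # -> ['TACCT', 'CCCCC', 'GATTA']
```

With 40 records and 6 workers, each worker handles ceil(40/6) = 7 records, matching the 7 cycles counted above.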
