Seeing Past SQL

The document discusses how different types of data like events, objects, reports and tables have differing access requirements that may not all be optimally supported by traditional SQL databases. It proposes categorizing data according to these four types to help organizations choose appropriate data stores to match the specific needs. Recognizing the different data types early in application design could help avoid issues later on from trying to use a single data store for requirements that may conflict.

Uploaded by

Patrick Thompson
Copyright
© Attribution Non-Commercial (BY-NC)

Looking at data in terms of common usage patterns provides context for the SQL-NoSQL-big data debate and a basis for professionals building applications to choose between the available alternatives.

Seeing Past SQL


SQL has been an increasingly prominent part of the Information Technology landscape for over 30 years [4]. Current SQL implementations offer a veritable smorgasbord of features seemingly capable of dealing with any data management scenario. While there's little doubt that relational algebra can be used to describe any data representation and its manipulation (SQL's perversions of the algebra aside), it seems reasonable to ask: is SQL always the right answer? Two developments, map-reduce and NoSQL data stores, reflect the fact that, for some, the answer is No! There has been lively debate around both developments arguing the merits and demerits of the approaches [2,5,6,3]; it is useful to frame the discussion in terms of the types of data involved.

[Figure: the four data types]
Events: high volume in, low volume out; unpredictable content; scan, scrub & grep
Objects: simple key lookup; no joins; no scans
Reports: cubes & spreadsheets
Tables: joins, inequalities, complex fixed structure

The picture is not intended as a data architecture but rather as a reflection of the way that data is commonly stored and manipulated within an organization. For example, a single application may exhibit all four types of data, and events might be handled by several different implementations across an organization. Using ecommerce as a background:

Objects are used to deliver web pages, serve ads, and handle messages, user profiles and so on. Objects are geographically distributed and (mostly) require low consistency, high availability, medium volume, and medium schema stability.

Events represent web page interactions (by-products of navigation, searching, and URL query content). They have a low consistency requirement, low availability, high volume, and a volatile schema.

Tables support data analysis, for example product search optimization or user profiling. They require high consistency, low availability, medium volume, and a very stable schema.

Reports provide actionable data, principally used to control event handling and analysis. Reports require high consistency, high availability, low volume, and have a very volatile schema.
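
The ecommerce characterization above can be written down as a small lookup table, which makes it easy to see where two data types pull in different directions. This is only an illustrative sketch: the levels are transcribed from the text, while the names `Profile`, `PROFILES` and `differing_requirements` are hypothetical and not part of the original article.

```python
from dataclasses import dataclass

REQUIREMENTS = ("consistency", "availability", "volume", "schema_stability")


@dataclass(frozen=True)
class Profile:
    """Access profile of one data type, using the levels given in the text."""
    consistency: str
    availability: str
    volume: str
    schema_stability: str


# Levels transcribed from the ecommerce example; other industries will differ.
PROFILES = {
    "objects": Profile("low", "high", "medium", "medium"),
    "events": Profile("low", "low", "high", "volatile"),
    "tables": Profile("high", "low", "medium", "very stable"),
    "reports": Profile("high", "high", "low", "very volatile"),
}


def differing_requirements(a: str, b: str) -> list:
    """Return the requirement names on which data types a and b disagree."""
    pa, pb = PROFILES[a], PROFILES[b]
    return [name for name in REQUIREMENTS if getattr(pa, name) != getattr(pb, name)]


if __name__ == "__main__":
    for pair in [("events", "tables"), ("objects", "tables"), ("events", "reports")]:
        print(pair, differing_requirements(*pair))
```

A pair that disagrees on several requirements is a candidate for separate stores; the sketch only counts disagreements and says nothing about how severe each one is in practice.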

The precise characterization of the four data types will vary depending on the industry. The distinction between the types is important mainly because the access requirements differ significantly across the four types and, if one data store is used for all four, it is very likely that some of the requirements will be poorly met.

For example, tables require complex indexing, and the set-based operations typical of table accesses require significant amounts of memory relative to the size of the data being manipulated. By contrast, events must optimize for I/O throughput, not I/O access: what matters is transfer rate, not access time. Indexes are just overhead, as they have to be built up and then torn down. Memory utilization for data processing is bounded by buffer size. If the application is reading a buffer, processing it and then writing it back, all it needs is enough memory to accommodate whatever parallelism exists in the I/O system.

There are similar collisions between the requirements typical of object stores versus events, object stores versus tables, events versus reports, and tables versus reports. It is possible to accommodate all four in a single data store but, particularly once a system starts to scale up, collisions between the requirements are likely to become a critical problem.

Any given organization will have what might be regarded as an ideal profile with respect to the four data types. For example, the ecommerce realm is heavily biased towards events, leading to a picture like this:

[Figure: bar chart of relative size (0% to 100%) for objects, events, tables and reports]
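
The earlier point that event processing reads a buffer, processes it and writes it back, with memory bounded by buffer size, can be sketched as a streaming filter. This is a minimal illustration under stated assumptions: events are newline-delimited byte records, and `scrub_events`, `keep` and the sample log lines are hypothetical names and data, not taken from the article.

```python
import io


def scrub_events(source, sink, keep, buffer_size=64 * 1024):
    """Stream newline-delimited records from source to sink, keeping those
    for which keep(line) is true.

    Data is read in chunks of at most buffer_size bytes and no index is
    built, so memory use is bounded by the buffer plus one partial line.
    """
    kept = 0
    tail = b""  # incomplete line carried over from the previous chunk
    while True:
        chunk = source.read(buffer_size)
        if not chunk:
            break
        lines = (tail + chunk).split(b"\n")
        tail = lines.pop()  # the last piece may be an incomplete line
        for line in lines:
            if keep(line):
                sink.write(line + b"\n")
                kept += 1
    if tail and keep(tail):  # final record without a trailing newline
        sink.write(tail + b"\n")
        kept += 1
    return kept


if __name__ == "__main__":
    raw = io.BytesIO(b"GET /home 200\nGET /cart 500\nGET /search?q=x 200\n")
    out = io.BytesIO()
    count = scrub_events(raw, out, keep=lambda line: b" 500" not in line, buffer_size=16)
    print(count, out.getvalue())
```

The whole pass is sequential, which is why transfer rate rather than access time dominates; a real pipeline would run several such passes in parallel, one buffer per I/O stream.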

In practice, events, objects and reports tend to get mixed in with tables, resulting in a departure from the ideal and in issues with report generation times, event load times, object access times, data distribution and so on. Obviously keeping to the ideal profile will not solve these problems, but it will mitigate them.

The sheer success of SQL data stores is, perversely, a part of the problem. There are adequate SQL implementations of all four data types, providing an insidious path of least resistance for the developer to follow. SQL expertise is the norm. Most developers faced with a data manipulation problem will come up with a SQL answer, regardless of whether they are dealing with events, objects, reports or (appropriately) tables.

Breaking the data realm up into function-specific stores is no harder than dealing with the issues around the various data types within a single store (just ask anyone who has had to deal with a distributed object store built on SQL databases). Making such a break in the tail end of the development cycle, however, is practically impossible. Once an application has been built around a unitary SQL database, introducing separate stores for events, objects, reports and tables will amount to starting again from scratch. It seems reasonable to look carefully at the application's data architecture in these terms before any commitment is made to a specific store. The application should be designed with an appropriate allocation of data types to data stores, rather than confronting the decision when it is too late.

Organizations need to recognize that, particularly so far as events and objects are concerned, it is unlikely that individual development projects will come up with optimal solutions that can meet any requirements beyond those of the project at hand. Common infrastructure for dealing with objects and events is essential if server proliferation and consequent scaling limits are to be avoided.
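
One way to allocate data types to stores at design time without yet choosing the stores themselves is to have application code depend on a narrow interface per data type. The sketch below is an assumption-laden illustration, not a method from the article: `ObjectStore`, `EventLog`, the in-memory implementations and `record_page_view` are all hypothetical names.

```python
from typing import Dict, Iterator, List, Optional, Protocol


class ObjectStore(Protocol):
    """Objects: simple key lookup, no joins, no scans."""
    def get(self, key: str) -> Optional[bytes]: ...
    def put(self, key: str, value: bytes) -> None: ...


class EventLog(Protocol):
    """Events: high volume in, read back by sequential scan."""
    def append(self, record: bytes) -> None: ...
    def scan(self) -> Iterator[bytes]: ...


class InMemoryObjectStore:
    """Placeholder implementation; a real store can replace it later."""
    def __init__(self) -> None:
        self._items: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        return self._items.get(key)

    def put(self, key: str, value: bytes) -> None:
        self._items[key] = value


class InMemoryEventLog:
    """Placeholder implementation backed by a list."""
    def __init__(self) -> None:
        self._records: List[bytes] = []

    def append(self, record: bytes) -> None:
        self._records.append(record)

    def scan(self) -> Iterator[bytes]:
        return iter(self._records)


def record_page_view(objects: ObjectStore, events: EventLog, user_id: str) -> Optional[bytes]:
    """Application code sees only the two interfaces, never a concrete store."""
    events.append(f"page_view user={user_id}".encode())
    return objects.get(f"profile:{user_id}")


if __name__ == "__main__":
    objects, events = InMemoryObjectStore(), InMemoryEventLog()
    objects.put("profile:42", b'{"name": "example"}')
    print(record_page_view(objects, events, "42"), list(events.scan()))
```

Because `record_page_view` never names a concrete store, the object and event implementations can be swapped independently once the real requirements are known.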
So far as application design is concerned, it is important to be able to recognize as soon as possible how the application's data will be allocated across the data types, while at the same time being able to defer for as long as possible a commitment to a specific underlying store.

References

1. Pavlo, A., et al., "A comparison of approaches to large-scale data analysis", Proceedings of the 35th SIGMOD International Conference on Management of Data, 2009.
2. Dean, J., Ghemawat, S., "MapReduce: a flexible data processing tool", Communications of the ACM, Volume 53, Issue 1, January 2010.
3. Codd, E.F., "A Relational Model of Data for Large Shared Data Banks", Communications of the ACM, Volume 13, Issue 6, June 1970, pp. 377-387.
4. Chamberlin, D.D., Boyce, R.F., "SEQUEL: A Structured English Query Language", Proceedings of the 1974 ACM SIGFIDET Workshop on Data Description, Access and Control, 1974, pp. 249-264.
5. Stonebraker, M., "SQL databases v. NoSQL databases", Communications of the ACM, Volume 53, Issue 4, April 2010.
6. Pujol, J.M., et al., "The little engine(s) that could: scaling online social networks", SIGCOMM '10: Proceedings of the ACM SIGCOMM 2010 Conference.