Seeing Past SQL
Figure: the four data types.
Events: high volume in, low volume out; unpredictable content; scan, scrub and grep.
Objects: simple key lookup; no joins; no scans.
Reports: cubes and spreadsheets.
Tables: joins, inequalities, complex fixed structure.
The picture is not intended as a data architecture but rather as a reflection of the way that data is commonly stored and manipulated within an organization. For example, a single application may exhibit all four types of data, and events might be handled by several different implementations across an organization. Using ecommerce as a background:
Objects are used to deliver web pages, serve ads, and handle messages, user profiles and so on. Objects are geographically distributed and (mostly) require low consistency, high availability, medium volume, and medium schema stability.
Events represent web page interactions (by-products of navigation, searching, and URL query content). They have a low consistency requirement, low availability, high volume, and a volatile schema.
Tables support data analysis, for example product search optimization or user profiling. They require high consistency, low availability, medium volume, and a very stable schema.
Reports provide actionable data principally used to control event handling and analysis. Reports require high consistency, high availability, low volume, and have a very volatile schema.
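The ecommerce characterization above can be restated as a small lookup structure, which makes it easier to see where two data types pull in different directions. This is only a sketch: the class name `DataTypeProfile`, the `PROFILES` dictionary and the `conflicts` helper are hypothetical names introduced for illustration, and the values are copied from the descriptions above rather than measured.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class DataTypeProfile:
    """Requirement levels for one data type (hypothetical structure)."""
    consistency: str
    availability: str
    volume: str
    schema_stability: str


# Values taken from the ecommerce characterization in the text.
PROFILES = {
    "objects": DataTypeProfile("low", "high", "medium", "medium"),
    "events": DataTypeProfile("low", "low", "high", "volatile"),
    "tables": DataTypeProfile("high", "low", "medium", "very stable"),
    "reports": DataTypeProfile("high", "high", "low", "very volatile"),
}


def conflicts(a: str, b: str) -> List[str]:
    """Return the requirement names on which two data types differ."""
    pa, pb = PROFILES[a], PROFILES[b]
    return [f.name for f in fields(DataTypeProfile)
            if getattr(pa, f.name) != getattr(pb, f.name)]


if __name__ == "__main__":
    # Events and tables agree only on availability.
    print(conflicts("events", "tables"))
```

A single store serving two types has to satisfy both profiles at once, so every name returned by `conflicts` marks a requirement where one of the two is likely to be poorly served.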
The precise characterization of the four data types will vary depending on the industry. The distinction between the types is important mainly because the access requirements differ significantly across the four types and, if one data store is used for all four, it is very likely that some of the requirements will be poorly met.
For example, tables require complex indexing, and the set-based operations typical of table accesses require significant amounts of memory relative to the size of the data being manipulated. By contrast, event stores must optimize for I/O throughput, not I/O access: what matters is transfer rate, not access time. Indexes are just overhead, as they have to be built up and then torn down. Memory utilization for data processing is bounded by buffer size. If the application is reading a buffer, processing it and then writing it back, all it needs is enough memory to accommodate whatever parallelism exists in the I/O system.
There are similar collisions between the requirements typical of object stores and events, object stores and tables, events and reports, and tables and reports. It is possible to accommodate all four in a single data store but, particularly once a system starts to scale up, collisions between the requirements are likely to become a critical problem.
Any given organization will have what might be regarded as an ideal profile with respect to the four data types. For example, the ecommerce realm is heavily biased towards events, leading to a picture like this:
Figure: bar chart of relative size (0% to 100%) for objects, events, tables and reports.
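The buffer-bounded style of event processing described earlier (read a buffer, process it, write it back, with no index) can be sketched as follows. This is a minimal illustration, not a production design: the function name `scan_events`, the newline-delimited record format and the default buffer size are all assumptions made for the example.

```python
import io
import re
from typing import BinaryIO


def scan_events(src: BinaryIO, dst: BinaryIO, pattern: bytes,
                buffer_size: int = 64 * 1024) -> int:
    """Copy lines matching `pattern` from src to dst, one buffer at a time.

    Memory use is bounded by the buffer size plus the longest single line;
    no index is built. Returns the number of lines written.
    """
    matcher = re.compile(pattern)
    carry = b""  # partial line left over from the previous buffer
    kept = 0
    while True:
        chunk = src.read(buffer_size)
        if not chunk:
            break
        lines = (carry + chunk).split(b"\n")
        carry = lines.pop()  # last piece is an incomplete line (or empty)
        for line in lines:
            if matcher.search(line):
                dst.write(line + b"\n")
                kept += 1
    # Handle a final line that has no trailing newline.
    if carry and matcher.search(carry):
        dst.write(carry + b"\n")
        kept += 1
    return kept


if __name__ == "__main__":
    log = io.BytesIO(b"GET /a?q=shoes\nGET /b\nGET /c?q=hats")
    out = io.BytesIO()
    print(scan_events(log, out, rb"\?q=", buffer_size=8), out.getvalue())
```

The only tunable resource is `buffer_size`; running several such scans in parallel multiplies memory by the degree of parallelism, which is the bound the text describes.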
In practice, events, objects and reports tend to get mixed in with tables. The result is a departure from the ideal, with consequent issues in report generation times, event load times, object access times, data distribution and so on. Obviously keeping to the ideal profile will not solve these problems, but it will mitigate them.
The sheer success of SQL data stores is, perversely, a part of the problem. There are adequate SQL implementations of all four data types, providing an insidious path of least resistance for the developer to follow. SQL expertise is the norm. Most developers faced with a data manipulation problem will come up with a SQL answer, regardless of whether they are dealing with events, objects, reports or (appropriately) tables.
Breaking the data realm up into function-specific stores is no harder than dealing with the issues around the various data types within a single store (just ask anyone who has had to deal with a distributed object store built on SQL databases). Making such a break at the tail end of the development cycle, however, is practically impossible. Once an application has been built around a unitary SQL database, introducing separate stores for events, objects, reports and tables will amount to starting again from scratch.
It seems reasonable to look carefully at the application's data architecture in these terms before any commitment is made to a specific store. The application should be designed with an appropriate allocation of data types to data stores, rather than confronting the decision when it is too late.
Organizations need to recognize that, particularly so far as events and objects are concerned, it is unlikely that individual development projects will come up with optimal solutions that can meet any requirements beyond those of the project at hand. Common infrastructure for dealing with objects and events is essential if server proliferation and consequent scaling limits are to be avoided.
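One way to allocate data types to stores at design time is to have application code depend on a narrow interface per data type rather than on a particular database. The sketch below is an assumption-laden illustration: the `ObjectStore` and `EventStore` protocols, the in-memory stand-ins and `record_page_view` are hypothetical names, not part of any existing library.

```python
from typing import Dict, Iterator, List, Optional, Protocol


class ObjectStore(Protocol):
    """Simple key lookup: no joins, no scans."""
    def get(self, key: str) -> Optional[bytes]: ...
    def put(self, key: str, value: bytes) -> None: ...


class EventStore(Protocol):
    """Append in bulk, read back by sequential scan."""
    def append(self, record: bytes) -> None: ...
    def scan(self) -> Iterator[bytes]: ...


class InMemoryObjectStore:
    """Stand-in implementation; a real store could replace it later."""
    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value


class InMemoryEventStore:
    """Stand-in implementation backed by a plain list."""
    def __init__(self) -> None:
        self._log: List[bytes] = []

    def append(self, record: bytes) -> None:
        self._log.append(record)

    def scan(self) -> Iterator[bytes]:
        return iter(self._log)


def record_page_view(profiles: ObjectStore, events: EventStore,
                     user: str, url: str) -> None:
    """Application code sees only the two roles, never a concrete store."""
    if profiles.get(user) is None:
        profiles.put(user, b"{}")  # create an empty profile on first visit
    events.append(f"{user}\t{url}".encode())


if __name__ == "__main__":
    profiles, events = InMemoryObjectStore(), InMemoryEventStore()
    record_page_view(profiles, events, "u1", "/home")
    print(profiles.get("u1"), list(events.scan()))
```

Because `record_page_view` names only the roles, the choice of what sits behind each role can change without rewriting the application logic.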
So far as application design is concerned, it is important to be able to recognize as soon as possible how the application's data will be allocated across the data types, while at the same time being able to defer for as long as possible a commitment to a specific underlying store.
References
1. Pavlo, A., et al. A comparison of approaches to large-scale data analysis. Proceedings of the 35th SIGMOD International Conference on Management of Data, 2009.
2. Dean, J., Ghemawat, S. MapReduce: a flexible data processing tool. Communications of the ACM, Volume 53, Issue 1, January 2010.
3. Codd, E.F. A Relational Model of Data for Large Shared Data Banks. Communications of the ACM, Volume 13, Issue 6, June 1970, 377–387.
4. Chamberlin, Donald D., Boyce, Raymond F. SEQUEL: A Structured English Query Language. Proceedings of the 1974 ACM SIGFIDET Workshop on Data Description, Access and Control, 1974, 249–264.
5. Stonebraker, M. SQL databases v. NoSQL databases. Communications of the ACM, Volume 53, Issue 4, April 2010.
6. Pujol, Josep M., et al. The little engine(s) that could: scaling online social networks. SIGCOMM '10: Proceedings of the ACM SIGCOMM 2010 Conference.