Where is the database schema? #SQL #NoSQL

Although SQL databases can evolve using DDL (Data Definition Language), they are known for rigid schemas. In contrast, NoSQL is frequently labeled schemaless because data can be stored without a declared structure. This description is misleading and may lead to poor application design.

Schema-on-read?

Storing data without a schema is futile. You store data to retrieve it later, and a schema is essential for understanding what was stored. You likely used the same schema when writing to avoid saving garbage. In a key-value database, the value has a schema, which may be as simple as knowing whether it is an integer, a float, or text encoded in UTF-8. Document databases handle more complex structures, with JSON describing the schema within each document. SQL databases establish a standard schema for a table, and all rows must adopt it, although its flexibility can vary (allowing nulls, enabling constraints, and using large object datatypes like JSON). While NoSQL databases do not require declaring a standard schema for a table or collection, they can enforce some rules through schema validation. Either way, the application will likely know the schema when getting the document. Instead of being labeled schemaless, NoSQL is sometimes called schema-on-read. However, that is not sufficient. You need a schema before reading. There's always a schema-on-write.
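To make the key-value case concrete, here is a minimal sketch in Python. The four bytes are a made-up value, not taken from any real database. It shows that the same stored bytes produce different results, or no result at all, depending on the schema the reader assumes.

```python
import struct

# Four raw bytes as they might sit in a key-value store (hypothetical value).
raw = b"\x00\x00\x80\x3f"

# Reader assumes a little-endian 32-bit signed integer.
as_int = struct.unpack("<i", raw)[0]

# Reader assumes a little-endian 32-bit IEEE 754 float.
as_float = struct.unpack("<f", raw)[0]

# Reader assumes UTF-8 text: these bytes are not valid UTF-8.
try:
    as_text = raw.decode("utf-8")
except UnicodeDecodeError:
    as_text = None

print(as_int, as_float, as_text)
```

Only the writer knows which interpretation is the intended one, which is why the schema has to exist before the read.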

Schema-on-write?

If values or documents are stored without a standard schema on write, it is impossible to query efficiently. You don't need a database if you only want to scan an entire table or collection and see what is there while reading it ("schema on read"). File or object storage, like S3, may suffice for this purpose. A database allows for efficient data retrieval, which requires a schema known before reading and used when writing to store data accordingly. SQL and NoSQL databases use a primary key to organize data and find it through queries quickly. This key is part of a standard schema, including a datatype, possibly multiple attributes (composite or compound key), and hashing or sorting properties. To enable more complex queries beyond retrieving data by a primary key, SQL and NoSQL databases allow the declaration of secondary indexes, and they have a schema: you must define what is indexed and how the index value is found in the row, the value, or the document that is being written.
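The following Python sketch illustrates why a secondary index implies a schema at write time. `TinyStore` and its methods are hypothetical names for a toy in-memory store, not the API of any real database. The point is that the store must know which attribute to extract from each document while it is being written.

```python
from collections import defaultdict


class TinyStore:
    """Toy document store with a primary key and one secondary index."""

    def __init__(self, index_field):
        self.by_key = {}                 # primary key -> document
        self.index_field = index_field   # the declared part of the schema
        self.index = defaultdict(set)    # indexed value -> set of primary keys

    def put(self, key, doc):
        # Remove the previous index entry if the document is overwritten.
        old = self.by_key.get(key)
        if old is not None and self.index_field in old:
            self.index[old[self.index_field]].discard(key)
        self.by_key[key] = doc
        # The index is maintained on write, using the declared field.
        if self.index_field in doc:
            self.index[doc[self.index_field]].add(key)

    def get(self, key):
        return self.by_key.get(key)

    def find(self, value):
        # Lookup through the secondary index, without scanning documents.
        return sorted(self.index.get(value, ()))


store = TinyStore(index_field="city")
store.put("u1", {"name": "Ann", "city": "Paris"})
store.put("u2", {"name": "Bob", "city": "Lyon"})
store.put("u3", {"name": "Cy", "city": "Paris"})
store.put("u2", {"name": "Bob", "city": "Paris"})  # update moves the index entry

print(store.find("Paris"))
print(store.find("Lyon"))
```

Documents may carry any other attributes, but the key and the indexed field are a schema shared by every writer and reader.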

Schema in application, in database, or both?

There are no "schemaless" or "schema on read" databases. The database schema always exists and involves a balance of the following:

  • Some declarations must be in the application to read and understand what has been written, allocate the memory to retrieve it, and map it to the application data structures and data types.

  • Some declarations must be in the database to validate data independently of the application, optimize access paths, and maintain indexes for fast queries.

Relational databases (RDBMS) and SQL aim to provide data independence by centralizing all schemas in the database, referred to as metadata, a catalog, or a dictionary. In the application, you can insert data into a table without defining all the structures. An INSERT statement does not require listing the target columns or sending data in the correct data type. SQL attempts to match the data to the target table using implicit data type casting and default values for nulls. You can insert data into a view without knowing the table name. This capability was beneficial for monolithic applications, where the code was deployed alongside the database in stored procedures or pre-compiled packages, and errors were detected during deployment. The application code did not need to be aware of the schema, as it could be read from the catalog when required, during compilation, or at runtime. For instance, if it needed to display a text column, the presentation layer would directly query the database dictionary (through queries on the catalog or by describing the parsed cursor) to determine the maximum length and set the size of the textbox accordingly in the user interface.
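As a small illustration of a schema held in the database, the sketch below uses Python's built-in `sqlite3` module. SQLite is only a convenient stand-in here, and the `item` table is a hypothetical example. Other SQL databases apply different casting rules and expose their catalog differently (for example through `information_schema`).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE item (
        name   TEXT NOT NULL,
        qty    INTEGER,
        status TEXT DEFAULT 'new'
    )
    """
)

# No column list, and qty is sent as text: the database matches the values
# to the table definition and converts '42' to an integer.
conn.execute("INSERT INTO item VALUES ('pen', '42', 'active')")

# Omitted columns get their declared default, or NULL when there is none.
conn.execute("INSERT INTO item (name) VALUES ('ink')")

rows = conn.execute("SELECT name, qty, status FROM item ORDER BY rowid").fetchall()

# The application can discover the schema from the catalog at runtime.
# Each row is (cid, name, type, notnull, dflt_value, pk).
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(item)")]

print(rows)
print(columns)
```

The application code never declared the column names or types: everything it needs is read from the database.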

Software architecture evolution

What was defined for application code in the database was still helpful with client-server deployments where multiple application versions could access the database. Views, default values, and generated columns could provide this data independence because a shared dictionary, stored in the database catalog, maps the logical view to the physical tables. At that time, having a single shared catalog was not a problem because databases were monolithic and didn't try to scale horizontally. The dictionary was stored alongside the data.

Software architecture has since evolved. The application is deployed on a limited number of servers, with a significant portion of data logic running in the application tier backend, and the presentation layer is distinct from the backend that accesses the database. This setup can only work with the schema present in the application. Instead of relying on a text column length set in the database, the application handles this so that the presentation layer can run without interacting with the database, and the database column is often simply declared with the maximum length, such as TEXT in PostgreSQL or VARCHAR2(4000) in Oracle.

This marked just the beginning of a shift. As object-oriented languages gained traction, interacting with SQL databases through JDBC became more complicated, prompting the use of object-relational mappers like Hibernate. As a result, the application began to define the schema using JPA annotations. The SQL database's schema, constructed with DDL statements, still played an essential role in validating data and assisting the query planner. Still, the source of truth was the JPA annotations in the application. The DDL is often generated from the JPA annotations and enhanced with more physical attributes dependent on the database platform. For more on this, read "DB-first vs. JPA-first approach" by Andrey Belyaev.
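The idea of generating DDL from a schema declared in application code can be sketched in a few lines of Python. This is not how Hibernate or JPA work internally. The `Customer` class, the `SQL_TYPES` mapping, and the `ddl_for` function are all hypothetical and only illustrate the direction of the dependency, from application to database.

```python
from dataclasses import dataclass, fields

# Assumed mapping from Python types to SQL types (illustrative only).
SQL_TYPES = {int: "BIGINT", str: "TEXT", float: "DOUBLE PRECISION", bool: "BOOLEAN"}


@dataclass
class Customer:
    # The application class is the source of truth for the schema.
    id: int
    name: str
    balance: float


def ddl_for(entity, primary_key="id"):
    """Derive a CREATE TABLE statement from a dataclass definition."""
    cols = []
    for f in fields(entity):
        col = f"{f.name} {SQL_TYPES[f.type]}"
        if f.name == primary_key:
            col += " PRIMARY KEY"
        cols.append(col)
    return f"CREATE TABLE {entity.__name__.lower()} ({', '.join(cols)})"


ddl = ddl_for(Customer)
print(ddl)
```

A real mapper would then add platform-specific physical attributes, such as index or storage options, on top of the generated statement.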

Creating a unified and centralized schema in the database quickly became a challenge for scalability. While sharding and data distribution are viable options, synchronizing a shared catalog introduces specific challenges, especially during application migrations or database upgrades. This difficulty led to the emergence of NoSQL, which eliminated reliance on a central dictionary, allowing each application instance to create its own document with its structure. It is not entirely schemaless, nor is it simply schema-on-read. Instead, it positions the application as the primary authority on the database schema.

So, where is the database schema?

In modern applications, the application code typically defines the schema, with some exceptions such as business logic and presentation embedded in stored procedures (like in Oracle APEX). A schema is always present when writing data, even if some attributes may only be identified when reading. Moreover, certain schema elements must be communicated to the database for effective data validation and to optimize access paths. NoSQL databases generally store minimal information, such as key and index definitions and some schema validation. In contrast, SQL databases maintain a shared catalog or dictionary accessible to all sessions. Even distributed SQL databases, which utilize a shared-nothing architecture for data, still require a shared catalog across all nodes (YugabyteDB stores the PostgreSQL catalog in a single tablet deployed with the yb-master that holds cluster metadata). This necessity can pose scalability challenges and complicate application schema migrations because a query execution plan depends on a schema version.

A schema always describes the data in your database, and the application defines it with proper data modeling. SQL and NoSQL differ in whether the entire schema is stored in the database or just a tiny subset necessary for efficient data storage and retrieval.

Schemaless is a myth!

Shemal Gandhi ☁️

Senior Database Engineer | RDBMS | NoSQL | Multi-Cloud | DevOps | Tech Lead @ Invenco By GVR [A Vontier company]

3mo

Interesting read... thanks for sharing.

Dario Vega

Product Manager @Oracle ☁️ OSS | noSQL | inMemory | OCI | cloud-native | serverless | Architecture | Development

3mo

I prefer to say there is always a data model somewhere, whatever the data modeling techniques, patterns/approaches, format, or store you choose.

Frederic Brunet

CEO, Trois O a Funky Business company

3mo

Wouldn't this be a false debate? Many people are ignorant when it comes to databases, particularly relational ones (the relations are not the links between the Tables but rather the tables themselves (a French misconception because there is a translation confusion between RELATION and RELATIONSHIP) ), claim or defend misconceptions to hide their incompetence and ignorance in the field of databases. The objectives and issues are fundamentally different between the use of relational databases (those which use the SQL language) and the bases which are called by the acronym NOSQL (Not Only SQL, to indicate that these are systems which do not use a language like SQL but it would be more practical to be able to do so anyway ;-) and the so-called "NOSQL" databases. This is a characteristic illustration of the Dunning Kruger effect. 😊

Jeff Smith

Product Manager | Databases | Blogger | Software Development | Cloud | Social | Community Management | Product Marketing

3mo

There is always a schema ✔️ 
