poetix

this time for sure

On Persistence and Data Management

CTM, page 654:

A database is a collection of data that has a well-defined structure. Usually, it is assumed that the data are long-lived, in some loose sense, e.g. they survive independently of whether the applications or the computer itself is running. The latter property is often called persistence. In this section we will not be concerned about persistence [emphasis mine - DF].

It is so rare these days to hear anybody getting this even remotely right. Databases are not just about persistence. Databases are about data management. Persistence is a separate problem from data management. Any usable database system will probably, but not necessarily, offer some approach to both problems. If all you have is persistence, then you don’t have a database.
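To make the distinction concrete, here is a minimal sketch using Python's standard pickle module. The `people` records and the `save` and `load` helpers are hypothetical names invented for illustration. The point is only that the data survives a round trip to disk, while any question we ask of it still has to be answered by hand-written application code.

```python
import os
import pickle
import tempfile

# Hypothetical application data: plain dicts standing in for application objects.
people = [
    {"name": "Ada", "city": "London"},
    {"name": "Grace", "city": "New York"},
    {"name": "Alan", "city": "London"},
]


def save(objects, path):
    """Persistence only: write the whole object graph to disk."""
    with open(path, "wb") as f:
        pickle.dump(objects, f)


def load(path):
    """Read the whole object graph back into memory."""
    with open(path, "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "people.pickle")
    save(people, path)
    restored = load(path)

# The data came back intact, but nothing here manages it: there is no
# schema, no constraints and no query mechanism. Asking "who lives in
# London?" is an ad hoc loop written in the application.
londoners = [p["name"] for p in restored if p["city"] == "London"]
print(londoners)
```

Nothing in this sketch would stop us pickling a record with a misspelt key or a missing city; that kind of checking is exactly what persistence alone does not give you.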

The Zope Object DataBase might, I suppose, qualify as a real database, because besides offering object persistence (via pickling) it also provides some basic features for management of its collection of pickles (via object IDs). But it isn’t really about data management. Here’s a quote from an Introduction to the Zope Object Database:

Object databases provide a tighter integration between an application’s object model and data storage. Data are not stored in tables, but in ways that reflect the organization of the information in the problem domain. Application developers are freed from writing logic for moving data to and from storage.

That, apparently, is what the dreaded object-relational mapping problem is all about: writing logic for moving data to and from storage. Not writing logic for managing data. The structuring of the data is done within the application’s object model. But the structure of the object model is not the same as the organization of information in the problem domain. The object model is a model of the problem domain, not the domain itself. (This becomes apparent very quickly, as soon as we need structures other than hierarchical arrangements of objects and sub-objects to organize data.)

The real task of object-relational mapping is to map between two models of the same problem domain. The object model is intended to make the entities in that domain tractable by application code. The schema of the relational database is intended to make the entities in that domain tractable by the general mechanisms provided by a relational query language. The relational model is more general than the object-oriented approach, and relational database schemata are typically more general than the object models adopted by specific applications, meaning that they don’t structure data in ways that are tightly coupled to the needs of any one application or programming language. (Try getting anything meaningful out of a ZODB from Java.) None of this has anything to do with persistence.

Now, in many cases it turns out that Worse Is Better, because many applications don’t require general mechanisms for data management. It will often be the case that we’ve already created a set of ad hoc data management mechanisms while writing the application code, and all we need now is persistence. Hence the demand for ZODB, Prevayler and co.

The purpose of the relational model was to replace ad hoc data management mechanisms with a general approach. It turns out, though, that providing such an approach doesn’t make programmers stop creating ad hoc data management mechanisms, and once they’ve created them, mapping them onto the more general model means extra work. Often that work is done not for the sake of exposing the application’s data to a wider range of uses, but merely to gain access to the persistence mechanisms provided by some database product. This is a recipe for resentment and misunderstanding.

Why do programmers still create ad hoc data management mechanisms, when the relational model is demonstrably more robust, coherent and general? I think the reason has to do with the ways programming languages have of pushing data around. Name me a programming language with a relation datatype (if you answered SQL, you’re wrong). In most mainstream programming languages, you work with primitive values, structs and maybe objects (or ADTs). You don’t select from a relation; you filter a list. You don’t join relations; you dereference pointers (or perform lookups in dictionaries).
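Here is roughly what that looks like in practice, as a small Python sketch. The `Employee` and `Department` classes and their data are hypothetical, made up for illustration. The "select" is a list comprehension, and the "join" is a dictionary lookup standing in for a pointer dereference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Department:
    dept_id: int
    name: str


@dataclass(frozen=True)
class Employee:
    name: str
    dept_id: int
    salary: int


# Hypothetical data, invented for this example.
departments = [Department(1, "Research"), Department(2, "Sales")]
employees = [
    Employee("Ann", 1, 52000),
    Employee("Bob", 2, 38000),
    Employee("Cy", 1, 41000),
]

# Not "select from a relation": filter a list.
well_paid = [e for e in employees if e.salary > 40000]

# Not "join two relations": build an index by hand, then look each key up.
dept_by_id = {d.dept_id: d for d in departments}
joined = [(e.name, dept_by_id[e.dept_id].name) for e in well_paid]

print(joined)
```

Note that the access path is baked into the code: the index runs from department ID to department, so asking the question the other way round (which employees belong to Research?) means writing another loop or building another index.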

This gives rise to an approach to data manipulation that, whether the language be functional or imperative, will be largely algorithmic rather than declarative. (This is much less the case in a language like Oz, CTM’s native tongue, which includes logic programming among its rich stew of supported paradigms.) When we execute an SQL query and pass the results back to some piece of application code for processing, we’re crossing several boundaries at once: the boundary between our application and the database server, the boundary between our programming language and SQL, and the boundary between data processing (transformation of state) and data management (maintenance and verification of state).
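A rough sketch of those crossings, using Python's standard sqlite3 module with an in-memory database. The schema and data are hypothetical, and SQLite is embedded rather than a separate server, so the first boundary is only simulated here. The query is declarative SQL, the formatting of results is ordinary application code, and the CHECK constraint is the database doing verification of state on its own side of the line.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Data management lives in the schema: keys, references and constraints.
# Table and column names are assumptions made up for this example.
conn.executescript("""
    CREATE TABLE department (
        dept_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL
    );
    CREATE TABLE employee (
        name    TEXT NOT NULL,
        dept_id INTEGER NOT NULL REFERENCES department(dept_id),
        salary  INTEGER NOT NULL CHECK (salary >= 0)
    );
    INSERT INTO department VALUES (1, 'Research'), (2, 'Sales');
    INSERT INTO employee VALUES
        ('Ann', 1, 52000), ('Bob', 2, 38000), ('Cy', 1, 41000);
""")

# Language boundary: we say what we want in SQL, not how to compute it.
rows = conn.execute(
    """
    SELECT e.name, d.name
    FROM employee AS e
    JOIN department AS d ON e.dept_id = d.dept_id
    WHERE e.salary > ?
    ORDER BY e.name
    """,
    (40000,),
).fetchall()

# Processing boundary: the results are now plain tuples in application code.
report = [f"{emp} works in {dept}" for emp, dept in rows]
print(report)

# Verification of state: the database, not the application, rejects bad data.
try:
    conn.execute("INSERT INTO employee VALUES ('Dee', 1, -5)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)

conn.close()
```

Once `fetchall()` returns, the rows are just a list of tuples, and everything done to them afterwards is back in the algorithmic world.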

Fabian Pascal’s oft-repeated accusation that application programmers are largely ignorant of what he calls database fundamentals is probably true, but it’s true for other reasons besides the general inadequacy of the US education system (see Pascal’s Database Debunkings, passim). It’s true because database fundamentals - the mathematical foundations of the relational model - are not what programmers need to know in order to write programs, and would not actually be much help to them if they did know them. How much of Knuth’s TAOCP (a densely mathematical text, way over the head of your average ignoramus) is about relations as such?

One of the most interesting things about the Concurrency-Oriented Programming approach discussed in CTM is that it goes some way towards changing this state of affairs, by bringing logic programming to the table and demonstrating the deep affinity between declarative techniques for managing concurrency and the first-order predicate calculus used by logic programming and - incidentally - relational databases. That CTM has a section on relational programming that is not simply about how to hook your Oz programs up to your DBMS of choice is evidence of how much the terrain has shifted.