Saturday, April 21, 2007

Heresy! Web apps without SQL databases

Apple's Jens Alfke posted a thought-provoking piece the other day about data storage in the web world. In it he questions the conventional wisdom that SQL databases are the one and only solution for that problem space. While discussing the scalability problems faced by twitter.com (built on Ruby on Rails), he ponders over using plain files as a storage format and playing some clever tricks to get more performance in a simple way. The comments on his post are also interesting and the debate is still going.

Although I've been working on web applications backed by SQL databases for seven years, this is not the first time I've heard this heretical opinion. As one of his readers accurately points out, Paul Graham's Viaweb (now Yahoo! Store) also used (and perhaps still uses) flat files on a FreeBSD UFS file system to store its data. Many large-scale server applications (say, Directory Servers) use storage managers like Berkeley DB to keep their data, without depending on a SQL query engine or the relational model. Also, highly transactional applications in the banking sector frequently use distributed in-memory object caches like Tangosol's (now Oracle's) Coherence as an ultra-fast data store, with an RDBMS as a backup.
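
To make the storage-manager idea concrete, here is a minimal sketch of a key-value store with no SQL engine in sight. It uses Python's standard dbm.dumb module purely as a stand-in for a storage manager like Berkeley DB (it is not Berkeley DB, and it is far slower). The function names, the JSON encoding and the sample record are my own assumptions for illustration.

```python
import dbm.dumb
import json
import os
import tempfile


def save_record(db_path, key, record):
    """Store one record under its key: no schema, no query engine."""
    # Flag "c" opens the store read-write and creates it if it is missing.
    with dbm.dumb.open(db_path, "c") as db:
        db[key.encode("utf-8")] = json.dumps(record).encode("utf-8")


def load_record(db_path, key):
    """Fetch a record by key, or return None if the key is absent."""
    with dbm.dumb.open(db_path, "c") as db:
        raw = db.get(key.encode("utf-8"))
    return None if raw is None else json.loads(raw.decode("utf-8"))


# Hypothetical usage: the path and the record are made up for this sketch.
store_path = os.path.join(tempfile.mkdtemp(), "users")
save_record(store_path, "alice", {"name": "Alice", "followers": 3})
found = load_record(store_path, "alice")
missing = load_record(store_path, "bob")
print(found, missing)
```

Lookups by key are trivial in such a store, but any other question about the data means scanning every record yourself.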

There is a lesson to be learned here, and it is finely articulated in Poul-Henning Kamp's presentation on Varnish, a web accelerator. Poul-Henning, a veteran FreeBSD kernel developer, created a Squid-killer by, to use his own words, not fighting with the operating system. The architecture of his web proxy is closely aligned with the way a contemporary UNIX-like operating system works, thus avoiding pointless layers of abstraction and architectural mismatch. The same logic accounts for Coherence's performance gains: the data storage architecture (in-memory object maps) maps well onto the web application architecture (Java EE or a lightweight alternative). Just as Varnish avoids traditional file I/O by using the OS's virtual memory (directly mapping files into pages via mmap), Coherence avoids the database overhead (SQL query optimizers, table indexes, disk access, etc.) by always dealing with objects stored in memory.
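
For readers who have not used mmap, the following sketch contrasts the two ways of getting at a file's bytes. It is not Varnish's code (Varnish is written in C); it only illustrates the general technique with Python's standard mmap module, and the file name, contents and function names are assumptions made up for the example.

```python
import mmap
import os
import tempfile


def read_with_syscalls(path, offset, length):
    """Traditional file I/O: seek, then copy bytes into a user-space buffer."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def read_with_mmap(path, offset, length):
    """Map the file into virtual memory and slice it like a byte string."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the kernel pages data in on demand.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[offset:offset + length]


# Hypothetical cache file, created only so the sketch can run.
cache_path = os.path.join(tempfile.mkdtemp(), "cache.bin")
with open(cache_path, "wb") as f:
    f.write(b"0123456789" * 1000)

via_syscalls = read_with_syscalls(cache_path, 5000, 10)
via_mmap = read_with_mmap(cache_path, 5000, 10)
print(via_syscalls == via_mmap)
```

Both functions return the same bytes. The difference is who decides what stays in memory: with the mapping, that decision is left to the operating system's virtual memory manager instead of being duplicated in the application.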

If for a particular web-based application problem we can devise a non-SQL storage solution that maps well to its use case requirements, then why suffer the (financial, support, technical) overhead of a SQL database? If there are no requirements for OLAP functionality or data warehousing, why bother? Remember the KISS principle: Keep It Simple, Stupid! Who would knowingly subject himself to the evils of the object-relational mismatch (the Vietnam of Computer Science), if there were another way to do it?

Admittedly, the great thing about SQL databases is that they let you query your data in ways that you may not have anticipated. In a system with constantly changing requirements for data views (or with requirements for ad-hoc, dynamic viewing) you may well have no other choice but to use a SQL DBMS. Perhaps object-oriented DBMSs can be a solution too, but I believe they are just starting to acquire OLAP and data warehousing capabilities, and the performance may not be quite there yet.

Can it be an accident that Google uses Bigtable, a distributed storage system of their own devising for internal use? When distributed/federated databases have major drawbacks (namely cost and vendor lock-in), where do you go when you have to scale? Horizontal partitioning is a solution, but you'd better plan ahead.
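
Here is a small sketch of why horizontal partitioning rewards planning ahead. It places keys on shards with a plain hash-modulo scheme, which is only one of several possible placement strategies and is my own choice for the illustration; the key names and shard counts are made up.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard number with a hash that is stable across runs."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Hypothetical workload: 1000 made-up user names.
keys = ["user%d" % i for i in range(1000)]

# Count the keys that land on a different shard when growing from 4 to 5.
moved = sum(1 for k in keys if shard_for(k, 4) != shard_for(k, 5))
print("keys that must move:", moved, "of", len(keys))
```

With plain modulo placement, most keys change shards as soon as one shard is added. Schemes such as consistent hashing or a fixed number of logical partitions are meant to reduce that reshuffling, but they have to be chosen before the data grows.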

In such circumstances, it may be appropriate to consider even heretical solutions.


Update: For those who won't take the time to watch the whole Varnish presentation (hey, it's fun, really!) Kostas Kalevras reminds me that you can find the meat of the presentation in the architect's notes. Apart from Poul-Henning's witty comments, they contain pretty much all the important points made there.

15 comments:

Dionysios G. Synodinos said...

The inherent support for ad-hoc queries in relational DBs seems to be an overwhelming reason to use this data model rather than object-oriented, hybrid, or older models like the hierarchical or network model. At least for most developers, hence the overwhelming number of RDBMS installations. Also, the focus on this paradigm has elevated the relational model to a “swiss-army-knife”-like solution that fits almost all occasions. Maybe it is not optimal for many domains, but it does the job, and the facilities that come with a contemporary RDBMS are very powerful.

For projects with “exotic” requirements like the one that Google handles maybe the solution is an exotic data storage solution, but I’m afraid that for down-to-earth cases the quest for differentiation usually leads to an inferior implementation. The latter is a general rule :-)

If you are interested in OODBs, next time we go out for coffee remind me to lend you the book “Object-Oriented Database Design” by Jan L. Harrington, which was for me an excellent starting point for getting to know OODBs.

Since we're talking about data storage, I see an emerging trend for XML DBs (again), fueled by the fact that rich internet applications have become heavy XML consumers, both in the JavaScript-based AJAX realm and in Adobe’s Flex realm. I’ll try to find time to write about this trend in my blog later this week…

Anonymous said...

@past:
In such circumstances, it may be appropriate to consider even heretical solutions.

Not using a relational database in an application is not heretical. It may well be the correct thing to do given the nature of the application.

Relational Database Systems are not the hammer; for if they are, everything else is a nail.

@synodinos:
For projects with “exotic” requirements like the one that Google handles maybe the solution is an exotic data storage solution

You do not have to have an exotic problem to decide not to use a relational system. The down-to-earth thing is to know when to use a relational system and when not to. Not knowing that is what leads to an inferior implementation.

past said...

adamo, I wholeheartedly agree. However not many fellow computer engineers appear to feel the same way, since the percentage of web apps that I have seen practicing that mantra is zero. That includes every project I've ever worked on all these years and every project I've heard about via a colleague or a client. That's what bugs me.

Dionysis's comment on RDBMSs as a swiss-army knife is, I believe, spot-on. For most people, if you have something tried and true, why bother charting new territories and taking the risk? Especially if your development team is staffed mostly by junior developers.

Another reason is that RDBMS licensing is a lucrative business for the big vendors, and they wouldn't mind getting the extra cash from a consulting/development contract either. The good thing is that this leaves open a window of opportunity for smaller groups to innovate.

Anonymous said...

@past:
adamo, I wholeheartedly agree. However not many fellow computer engineers appear to feel the same way, since the percentage of web apps that I have seen practicing that mantra is zero.

Well, 90% of everything is crap.

As for Dionysis' comment, the problem is exactly what you both state: "why bother?". Well, the developer should bother, if the developer is interested in building something that lasts, instead of taking the money and running.

Read also "when you hold a hammer" (in Greek).

Dionysios G. Synodinos said...

Since relational databases can be applied in the majority of situations and the adoption of ad-hoc solutions for every project or the use of new technologies has unnecessary side-effects like a learning curve, acquisition of new software, etc., I feel there is only a small number of cases (hence the characterization “exotic”) where you would consider alternatives. If you feel uncomfortable with the word “exotic”, try “rare”.

Of course there is that little demon inside all of us that wants to try new things and it is not uncommon for projects to suffer due to the use of the “coolest new technology”:

“The ‘Neat Technology’ Trap

Avoid using technologies for their own sake. Every new technology added to a project makes it harder to maintain, unless alternative approaches are clearly inadequate. Adding new technologies is a strategic decision, and shouldn't be taken lightly, to solve a particular problem.”

(“Expert One-on-One - J2EE Design and Development”, Ch. 2 “J2EE Projects: Choices and Risks”).


Also, to quote a friend: “To use a new technology in the place of an established one, it does not just need to be a little better than the one I know. It needs to be 10 times better, in order for me to make the transition”. Maybe a little exaggerated, but I find his point valid.

Anonymous said...

Since relational databases can be applied in the majority of situations

This is a grossly overstated assumption.

Of course there is that little demon inside all of us that wants to try new things and it is not uncommon for projects to suffer due to the use of the “coolest new technology”

And the error here lies in the "coolest new technology". It is pretty simple:

1. When you need a relational storage you use a relational storage.
2. When you do not need a relational storage you use what fits the purpose.
3. When you do not use a relational storage this does not mean that the tool that you are going to use has to be "cool and new". It can just as well be tried and proven, but not relational.

Using relational storage as a hammer (because everybody else does) only improves your laziness and in the long run impacts the product that you are building. This is the same erroneous logic that makes people think they know about databases just because they have built a form application in VB and/or Access.

It needs to be 10 times better, in order for me to make the transition

It depends on the context.

Dionysios G. Synodinos said...

Using relational storage as a hammer (because everybody else does) only improves your laziness

Usually this is something good :-)

It depends on the context.

Well most of the points on this thread do depend on the context and as far as I understand our views are somewhat similar.

past said...

Well most of the points on this thread do depend on the context and as far as I understand our views are somewhat similar.

To put it another way, we are all in violent agreement. It is of minor significance if we see the glass as half-full or half-empty, as long as we can agree that it definitely needs a refill. At least in some contexts :-)

Dionysios G. Synodinos said...

Not that I consider CW the ultimate source for information but here is something I read today:

http://www.computerworld.com/action/article.do?command=viewArticleBasic&articleId=9020942

"The top 10 dead (or dying) computer skills"

#2 -> Nonrelational DBMS :)

past said...

Yeah, but look at what it is compared against: hierarchical databases!
I mean, yikes!

But if we're collecting opinions all around, here is what Mr. GMail has to offer:

The secret to making things easy: avoid hard problems

and

The problem with conventional databases

Anonymous said...

@dionysios g. synodinos:

So is LDAP dead yet?

Dionysios G. Synodinos said...

Just got a hold of a copy of "Java Persistence with Hibernate" by Christian Bauer and Gavin King.

I quote a comment they make that represents my view:

"Because the data access tasks are often so tedious, we have to ask: Are the relational data model and (especially) SQL the right choices for persistence in OO applications? We answer this question immediately: Yes! There are many reasons why SQL databases dominate the computing industry - relational database systems are the only proven data management technology, and they're almost always a requirement in any Java project".

Dionysios G. Synodinos said...

@adamo

not dead but "shrunk" down to a fraction of its size :)

past said...

Since you quote Gavin King, here is a rebuttal to a more recent pro-DBMS position of his, by Ted Neward. Both are interesting reading material. My take: technical issues should always be approached without religious prejudices and the bias of each commenter's opinion should be taken into account as well.

Anonymous said...

I'm a bit late for this discussion, but anyway...

I'm working every day in Web Application development, and I'm not using any SQL databases at all. I've been doing this for a few years now.

The secret? I'm using Zope. Zope has its own object-oriented database, the ZODB. So if you want to learn about OO databases, instead of reading some theory textbooks from 20 years back, you could just install Zope and play with such a database.

You will be disappointed though. Programming with Zope mostly feels like there is no database at all. I don't write stuff to the database - it just "gets in there". I usually don't "query" stuff from the database either, unless I'm really doing "user search functionality".

It's not like I wouldn't use a SQL database if it really made sense, but in all those years that has happened only a couple of times. Most web application developers use SQL for one reason only: That's what they know.

Unless otherwise expressly stated, all original material in this weblog is licensed under a Creative Commons Attribution 3.0 License.