Heresy! Web apps without SQL databases
Apple's Jens Alfke posted a thought-provoking piece the other day about data storage in the web world. In it he questions the conventional wisdom that SQL databases are the one and only solution for that problem space. While discussing the scalability problems faced by twitter.com (built on Ruby on Rails), he ponders over using plain files as a storage format and playing some clever tricks to get more performance in a simple way. The comments on his post are also interesting and the debate is still going.
Although I've been working for seven years in web applications with SQL databases, that's not the first time I've heard of this heretic opinion. As one of his readers accurately points out, Paul Graham's Viaweb (now Yahoo! Store) was (still is?) also using flat files to store its data in a FreeBSD UFS file system. Many large-scale server applications (say, Directory Servers) use storage managers like Berkeley DB to keep their data, without having the SQL query engine or the relational model to depend upon. Also, very highly transactional applications in the financial banking sector frequently use distributed in-memory object caches like Tangosol's (now Oracle's) Coherence as an ultra fast data store, with an RDBMS as a backup.
There is a lesson to be learned here, and it is finely articulated by Poul-Henning Kamp's presentation on Varnish, a web accelerator. Poul-Henning, a veteran FreeBSD kernel developer, created a Squid-killer by, to use his own words, not fighting with the operating system. The architecture of his web proxy is closely aligned with the way a contemporary UNIX-like operating system works, thus avoiding pointless layers of abstraction and architectural mismatch. The same logic accounts for Coherence's performance gains, a good mapping of the data storage architecture (in-memory object maps) to the web application architecture (Java EE or a lightweight alternative). The same way Varnish avoids traditional file I/O by using the OS's virtual memory (directly mapping files into pages via mmap), Coherence avoids the database overhead (SQL query optimizers, table indexes, disk access, etc.) by always dealing with objects stored in memory.
If for a particular web-based application problem we can devise a non-SQL storage solution that maps well to its use case requirements, then why suffer the (financial, support, technical) overhead of a SQL database? If there are no requirements of OLAP functionality or data warehousing, why bother? Remember the KISS principle: Keep It Simple Stupid! Who would knowingly subject himself to the evils of the object-relational mismatch (the Vietnam of Computer Science), if there was another way to do it?
Admittedly, the great thing about SQL databases is that they let you query your data in ways that you may have not contemplated as necessary. In a system with constantly changing requirements for data view (or with requirements for ad-hoc, dynamic viewing) you may well not have any other choice, but to use a SQL DBMS. Perhaps Object-oriented DBMSs can be a solution too, but I believe they are just starting to obtain OLAP and data warehousing capabilities and the performance may not be quite there yet.
Can it be an accident that Google uses Bigtable, a distributed storage system of their own devising for internal use? When distributed/federated databases have major drawbacks (namely cost and vendor lock-in), where do you go when you have to scale? Horizontal partitioning is a solution, but you'd better plan ahead.
In such circumstances, it may be appropriate to consider even heretic solutions.
Update: For those who won't take the time to watch the whole Varnish presentation (hey, it's fun, really!) Kostas Kalevras reminds me that you can find the meat of the presentation in the architect's notes. Apart from Poul-Henning's witty comments, they contain pretty much all the important points made there.