Status: This article is currently pretty dry. I’d like feedback on how to make it more eloquent.
In my previous blog post, I promised to write more about using databases as the main integration strategy. In this post, I want to cover perhaps the most important question: “Why?”
Imagine an application that, every time it wants to communicate with another system, reads from or writes to the database. For now, let’s ignore how this would work and how it would evolve; those will be the subjects of later posts. What advantages does this offer?
The alternative is usually to integrate with another system through a variety of means. In Java, the most common ones are Web Services, RMI, EJBs (which offer their own quirks in addition to those of RMI), Sockets, and various tricks using the file system.
The most important issue to me is invariably productivity. When I work with databases, I can generally use Object-Relational Mapping tools, which are a very productive way of accessing database data in an application. RMI offers similar advantages, but you have to build lazy loading on top of the domain model if you want a rich model where the objects are interconnected. Web Services generally have some bindings to Java, but in my experience these are really inadequate. Either the Java side suffers (by forcing you to have getters and setters, to use arrays instead of collections, or to use strings as the main data type), or the XML side suffers by having non-specific types (if you use collections). Sockets, of course, are very unproductive; they give up productivity for simplicity.
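To make the productivity point concrete, here is a minimal sketch of database-backed object access. I’m using Python with the standard-library sqlite3 module purely so the example is self-contained; the schema and the `Order`/`find_order` names are hypothetical, and the same argument applies to Java ORMs.

```python
import sqlite3
from dataclasses import dataclass

# Hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
conn.execute("INSERT INTO orders VALUES (1, 'ACME', 99.5)")

@dataclass
class Order:
    id: int
    customer: str
    total: float

def find_order(conn, order_id):
    # One query and one line of mapping -- no remoting stubs,
    # no string-typed payloads, no arrays instead of collections.
    row = conn.execute(
        "SELECT id, customer, total FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    return Order(*row) if row else None

order = find_order(conn, 1)
```

The point is not this particular mapper but the shape of the code: the data arrives as properly typed objects, with nothing lost in translation along the way.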
The data managed by the remote service will generally come from a database anyway. This means the data access code will have to be developed somewhere regardless; a remoting layer then has to be developed in addition.
To maintain sustainable productivity, we need unit tests. For me, unit testing has proved hard to do well for both Web Services and RMI, and EJBs are of course out of the question. As my regular readers know, using a test database for standalone unit testing is quite simple. As an added bonus, tests that use the database will essentially have verified the integration. When I use a remoting protocol, I always run into strange problems very late in the test process.
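A standalone test against a throwaway database can look like the sketch below. Again I’m using Python’s sqlite3 so the example runs as-is; the table and test names are invented for illustration. Note that the test needs no running server, no mocks, and no deployment step, yet it exercises the real integration path (SQL against a schema).

```python
import sqlite3
import unittest

class CustomerRepositoryTest(unittest.TestCase):
    def setUp(self):
        # A fresh in-memory database per test: fast, isolated, disposable.
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

    def test_insert_and_find(self):
        self.conn.execute("INSERT INTO customers (name) VALUES ('ACME')")
        name = self.conn.execute(
            "SELECT name FROM customers WHERE id = 1"
        ).fetchone()[0]
        self.assertEqual(name, "ACME")

suite = unittest.defaultTestLoader.loadTestsFromTestCase(CustomerRepositoryTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Compare this with what it takes to stand up a Web Service endpoint or an RMI registry just to get one assertion to run.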
Both unit testing and productivity benefit from the fact that dealing with databases is something we’ve done for a long time. The tools and techniques for doing so are very mature compared to other methods of integration.
Secondly, there is the problem of reliability. If you use a single database, everything you do happens within one transaction: either all the work is committed, or all of it is rolled back. This vastly simplifies your logic if you care about correctness. For distributed systems, this is in theory solved by the two-phase commit protocol. However, my experience is that it adds so much complexity that the system can metaphorically collapse under its own weight. As a result, most solutions I’ve seen (and, I suspect, most solutions I haven’t) simply ignore the problem. This means that the odd resource error can have very unpredictable results.
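The all-or-nothing guarantee is easy to demonstrate. In this sketch (Python with sqlite3 for self-containment; the accounts schema and the simulated failure are invented), a transfer fails halfway through, and the database rolls back the half that did happen:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
conn.commit()

try:
    # The connection as context manager gives one transaction:
    # commit on success, rollback on any exception.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance - 100 WHERE name = 'a'")
        raise RuntimeError("simulated failure mid-transfer")
        conn.execute("UPDATE accounts SET balance = balance + 100 WHERE name = 'b'")  # never reached
except RuntimeError:
    pass

# The debit was rolled back along with everything else: no money vanished.
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

With two systems coordinating over a remote protocol, getting this same guarantee is exactly the two-phase-commit complexity described above; within one database, it comes for free.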
A remoting layer also introduces another place where things can go wrong. Many developers end up writing recovery routines for dealing with these kinds of errors, and in my experience, this is some of the most error-prone code you can write.
Third, performance-wise it is hard to beat the database. Most other methods eventually hit the database anyway, and as a general rule, adding more steps to a solution seldom makes it faster. There are some issues with scalability, however, which I will address in a later post.
Last, and maybe most importantly, I have never seen a standard interface for dealing with remote services. Solutions generally end up with half a dozen or more different policies for accessing different back-end systems. There is one thing we can always be sure of, though: there will always be a database among those back-end systems, no matter what else you have to talk to. Every extra communication mechanism you remove reduces the shoestring-and-paperclip factor of your system.
By using a single data source as the place for communicating with other systems, we reduce complexity and improve testability, performance, and reliability.
I hope this post has demonstrated why, in an ideal world, you would want to use a single database as your primary integration mechanism. The world, however, is rarely ideal: database schemas change, the load grows beyond what a single database can handle, you have to understand a forest of database schemas, and some applications should not be allowed to access all the data. In my next blog post, I will talk about how to solve these problems without giving up the single-database vision. Stay tuned for evolution, scalability, security, reuse, and understandability.