The One-Two Punch: Serialization and Databases

Duct Tape XDuct tape is a wonderful, versatile thing, as proven by hordes of people out there with everything from wallets and clothing made entirely of the sticky steel. But those duct tape wallets and shirts are also great examples of using something too much. The same concept applies to programming, and I see a lot of developers fall prey to over-using a shiny new toy that they just learned how to use.

In the past few months, I’ve taken over a new project – a real fixer-upper. Putting aside a UI that would make Jakob Nielsen cry, the program was horribly slow. The simplest of pages would take 5 to 15 seconds to load. A lot of this was due to poor project planning, resulting in things like 1000-iteration loops that each performed a small database query. Combine that with the fact that the database was located on a separate server, and you can see where we might have problems.

(Since then, I’ve put together an in-house development team of a few different guys, each with their own strengths, and we’ve been able to breathe a bit of life back into the product, but it’s a daily battle of optimizing code without accidentally taking out that one line that, for whatever reason, prevents some bit of code later on from dropping the entire database…)

Getting back to the point, we’re rebuilding the product, and a concept that I’ve been discussing with the developers is a better way of using serialized information: When, how, and why? The old team would use it sporadically, but there wasn’t any rhyme or reason to when, how, or why they would use it.

What Is Serialization?
Hopefully you already know this if you’re reading this entry. However, if you don’t know what serialization is, then the simplest way to think about it is to think of some sort of complex thing (maybe a database record, or a binary file), and then imagine a book that described how to re-create that thing. Serialization is just a way of saving that complex thing as something that’s more portable. For example, you might have a multi-dimensional array that would be nice to be saved somewhere and then loaded back into perfect existence at a later time. Serialization could take that array and turn it into one long string that completely and accurately represents that array. Later, you de-serialize it from the string back into the multi-dimensional array.

Why Not JUST Use a Database?
So the question presents itself, “Why not just use the database to store the different pieces of data that you need to be saved?” This isn’t a bad idea at all for many situations. With my above-mentioned project, the product is extremely dynamic, has many different plug-ins, and so could have hundreds of different configurations. All the different settings could be stored in their own fields, but then you run into the problem of inefficient, unwieldy database tables with potentially HUNDREDS of fields.

By having a single field contain a serialized array of information, you can consolidate some (possibly even most) of those fields, and make the database much simpler to manage, and much more efficient. It’s like having a miniature, dynamic database WITHIN your database! So how do we determine what should be serialized and what should be stored in a normal database field?

This was apparently the dilemma that the old development team couldn’t quite figure out. They would store some serialized information into the database, and other pieces of data would become columns in a database table, but they didn’t do it efficiently.

To Serialize or Not to Serialize: The Three Rules
I have three general rules for determining whether or not to serialize information:

1. If you’re going to need to search against that field later, don’t serialize it!
(It’s much more difficult and much slower to accurately search through a large bunch of serialized/encoded text.)

2. If you’re going to need to display that particular field in a list later, don’t serialize it!
(Looping through 200 records and de-serializing a field in each record can put an unnecessary load on your server.)

3. If someone else using a different programming language may need access to that information later, don’t serialize it!
(Just because YOU can unserialize the data doesn’t mean that ASP or ColdFusion or some other language will be able to!)

The new version of the product efficiently stores serialized information so that we can pull nearly all the information we need in one quick query, and after one or two de-serializations, the engine has all the variables it needs to perform its duties. Not only is the engine more powerful and more flexible, but it runs faster, is more scalable, and puts much less strain on the database and on the web server itself. Not a bad start at all. Thanks, serialization!