Jonathan Hilgeman

Everything complex is made up of simpler things.


Speed Up Your Application!

So your PHP application is running slow… no… scratch that – slow still implies that it seems like your application is doing something after 45 seconds of loading. No, your application is a crippled duckling, dragging itself slowly towards the shoreline so it can end it all. What do you do??? Here are a few quick steps to help:

Add Log Points
Create a function that writes a message to a file. Then go through the code and add calls to this function at strategic points (e.g. right after a particularly large query). In each message, dump the date and time, the __LINE__ constant (which holds the line number currently being executed), and a brief description of what has happened since the last message. If YOU can reproduce the speed problem, it also helps to make the function write to the log file only when your IP is the one visiting the application, so the file doesn't fill up too quickly or with other people's traffic.
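Here's a minimal sketch of such a function (the log path and the IP address are placeholders to adapt):

function log_point($line, $message)
{
    // Only log when the developer's own IP (placeholder) is visiting
    if ($_SERVER['REMOTE_ADDR'] != '203.0.113.5') { return; }

    // Append a timestamped entry to the log file
    $entry = date('Y-m-d H:i:s') . " line $line: $message\n";
    file_put_contents('/tmp/myapp_debug.log', $entry, FILE_APPEND);
}

// Pass __LINE__ from the call site, since inside the function it
// would just report the function's own line number
log_point(__LINE__, 'finished the big orders query');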

Once the log file has some data in it, you should be able to see the flow of the program and be able to determine chunks of code that are running slow. Continue to refine the locations of the function calls to drill down to the problem points.

Improve Your SQL Queries
In many cases, a slow application is due to slow queries. Often, slow queries can be DRAMATICALLY improved with some very minor and safe tweaks to the database table indexes. I can’t begin to count the number of queries I’ve seen that tried to join two large tables using fields that were not indexed. There are several things to do to improve performance, but simply indexing those fields can often make a HUGE difference in query speed. Some databases allow for more specific indexing options that can make additional improvements, but nearly every database has basic indexing.

Speaking of joining tables, data types can also play a large part in performance. Joining tables on numeric field types like INT is usually much faster than joining on VARCHAR fields (although you should be VERY careful about a decision to change a VARCHAR to a numeric field). This is why it’s a good habit to add auto-incrementing, numeric ID fields to the tables you create. However, data types aren’t just important when joining. Minor improvements can be made by making sure that you’re using the right data types to store things. There’s no reason to use a BLOB or TEXT field to store a Unix timestamp, a first name, or a tiny on/off flag (would you use a crate to hold a tiny pebble?).
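For example (the tables and fields here are made up), joining on indexed INT keys beats joining on a VARCHAR like an email address:

// Joining two tables on their numeric, indexed ID fields...
$fast = "SELECT o.order_id, c.name
         FROM orders o
         JOIN customers c ON c.customer_id = o.customer_id";

// ...is usually much faster than joining on a VARCHAR field
$slow = "SELECT o.order_id, c.name
         FROM orders o
         JOIN customers c ON c.email = o.customer_email";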

If you have a query with a WHERE clause that looks up more than one field, and is looking through a single, big table, then consider making a multi-field index that contains each of the fields used in the WHERE clause.
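In MySQL, for instance, that looks something like this (table and field names invented; assumes an open connection):

// A query that filters one big table on two fields at once:
$sql = "SELECT * FROM orders WHERE customer_id = 42 AND status = 'shipped'";

// A composite index covering both WHERE fields lets the database
// satisfy the lookup from a single index (run this once, not per request):
mysql_query("CREATE INDEX idx_customer_status ON orders (customer_id, status)");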

Some databases, like MySQL, have additional features that allow you to discover problematic queries. These features include things like automatically logging any queries that take longer than a certain number of seconds, or commands that will show details about the query that you’re running. For example, if you’re using MySQL, take a slow-running SELECT query and simply add the word EXPLAIN before the query. The result is a description of how MySQL runs the query, what indexes it uses (if any), and other useful information.
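A quick sketch of that (hypothetical query; assumes an open MySQL connection):

// Prefix the slow SELECT with EXPLAIN instead of running it normally
$result = mysql_query("EXPLAIN SELECT * FROM orders WHERE customer_email = 'bob@example.com'");
while ($row = mysql_fetch_assoc($result))
{
    // 'key' shows which index (if any) was used, and 'rows' estimates
    // how many rows MySQL had to examine
    print_r($row);
}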

There are too many tricks to list here, but it's not difficult to find even more simple ways of optimizing your queries and your database performance. If the simplest approaches don't fix the problem, then you may be facing a hardware issue or something more complex. Hiring a temp DBA may be a good idea at that point.

Use Datasets
In cases where you might be re-using a set of records from the database more than once, consider copying those records into a multi-dimensional array, using a primary key (or something else appropriate) as the index of that array. This essentially creates a “cached” version of that recordset that you can use throughout the rest of the script. When it comes time to loop through those records to generate a dropdown box or refer to a value, then you don’t need to go back to the database again. This can also help eliminate an additional JOIN from your queries if all the data you need is in that array. Datasets are most effective when they’re small so they don’t take up much memory and don’t take too much time to loop through.

An example of a good dataset would be a list of car manufacturers (not that many records, and possibly re-used multiple times throughout the rest of the page).

An example of a bad dataset would be an inventory of cars (probably too many records, and you probably wouldn’t re-use them on the same page).
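Here's a rough sketch of the good case above (the query and field names are assumed):

// Pull the small recordset once, keyed by its primary key
$Manufacturers = array();
$result = mysql_query("SELECT manufacturer_id, name FROM manufacturers");
while ($row = mysql_fetch_assoc($result))
{
    $Manufacturers[$row['manufacturer_id']] = $row['name'];
}

// Later on the same page: build a dropdown with no extra query
echo "<select name=\"manufacturer\">";
foreach ($Manufacturers as $id => $name)
{
    echo "<option value=\"$id\">$name</option>";
}
echo "</select>";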

Reduce Output
I’ve seen a lot of scheduled jobs / cron job scripts that print out a lot of output, and some of it involves calculations and additional processing purely for the purpose of printing to the screen. But if the output isn't being seen by anyone or processed by anything, then why generate it at all? Output is especially draining when it's inside large loops, which brings us to the next topic.
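One easy pattern is to gate all of that output behind a flag (a sketch; the flag name and the loop are made up):

$Verbose = false; // flip to true only when running the job by hand

for ($i = 1; $i <= 500000; $i++)
{
    // ... the real work happens here ...

    if ($Verbose)
    {
        // The percentage math and printing only run when someone is watching
        $pct = round(($i / 500000) * 100, 1);
        echo "Processed $i rows ($pct%)\n";
    }
}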

Take Back Control Over Loops
Lots of scripts have processes with loops that run tens of thousands, hundreds of thousands, even millions of iterations. This means that every improvement you make is multiplied by the number of times that loop runs. If you have some old debugging code that opens a log file, writes to it, and closes the file, then running that 100,000 times as fast as possible is going to be a serious system hit. Try as hard as possible NOT to run SELECT queries inside loops, because it often means loops within loops (multiplying the cost of every iteration). Even simple things like a substr(), in_array(), or strpos() call can take a bit of processing time when you run them a million times. But if you're performing the same function with the same arguments over and over again, then consider storing the result in a variable and checking that variable instead:

Before:
$MyText = "The Quick Brown Fox";
for ($i = 0; $i < 1000000; $i++)
{
    // strpos() runs on every single iteration
    if (strpos($MyText, "Quick") !== false)
    {
        // do something
    }
}

After:
$MyText = "The Quick Brown Fox";
// Do the check once, outside the loop, and store a true/false flag
// (compare against false, since strpos() returns the match position,
// which can be 0)
$QuickIsInMyText = (strpos($MyText, "Quick") !== false);
for ($i = 0; $i < 1000000; $i++)
{
    if ($QuickIsInMyText)
    {
        // do something
    }
}

I try to get into the habit of creating boolean flags like $QuickIsInMyText. If you name the variables well ($IsAdmin, $HasEditingPrivileges), they make the code easy to read and save you from recomputing the same check over and over again.

Install and Use XDebug
XDebug (http://xdebug.org) is a free extension for PHP that can be a godsend. It’s usually easy to install without recompiling PHP, and offers a slew of features for finding performance issues with your application (although it is best used in a development environment, NOT in a production environment).

One of the most valuable features of it is its profiler, which will basically attach a little homing device to PHP so when PHP goes to execute your application, the homing device follows it all the way through and logs everything to a file. You end up with a file that shows you the details of every line of code that was executed in your application, and how long each line took to run. Sounds useful but complicated, right? Well, it is… if you were to look at the file manually.
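With Xdebug 2.x, for example, turning the profiler on takes a couple of php.ini lines (the extension path varies by system):

; Load the extension (path is system-specific)
zend_extension="/usr/local/lib/php/xdebug.so"

; Turn on the profiler and choose where the output files go
xdebug.profiler_enable=1
xdebug.profiler_output_dir="/tmp"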

The file that gets generated is called a cachegrind file; it's pretty big and isn't meant to be read as-is. Instead, there are free programs out there like KCacheGrind (for Linux) and WinCacheGrind (for Windows) that will read a cachegrind file and display it in an easy-to-understand fashion. You can see a top-level view of the major points in your program, and drill down into the areas that are taking more processing power, down to the exact line. It's pretty much like a super-charged version of the Log Points I mentioned earlier.

Hopefully these tips will help you get on your way to making your application run faster. Good luck!

The One-Two Punch: Serialization and Databases

Duct tape is a wonderful, versatile thing, as proven by hordes of people out there with everything from wallets to clothing made entirely of the sticky stuff. But those duct tape wallets and shirts are also great examples of using something too much. The same concept applies to programming, and I see a lot of developers fall prey to over-using a shiny new toy that they just learned how to use.

In the past few months, I've taken over a new project – a real fixer-upper. Putting aside a UI that would make Jakob Nielsen cry, the program was horribly slow. The simplest of pages would take 5 to 15 seconds to load. A lot of this was due to poor project planning, resulting in things like 1000-iteration loops that ran a small database query on every pass. Combine that with the fact that the database was located on a separate server, and you can see where we might have problems.

(Since then, I’ve put together an in-house development team of a few different guys, each with their own strengths, and we’ve been able to breathe a bit of life back into the product, but it’s a daily battle of optimizing code without accidentally taking out that one line that, for whatever reason, prevents some bit of code later on from dropping the entire database…)

Getting back to the point, we’re rebuilding the product, and a concept that I’ve been discussing with the developers is a better way of using serialized information: When, how, and why? The old team would use it sporadically, but there wasn’t any rhyme or reason to when, how, or why they would use it.

What Is Serialization?
Hopefully you already know this if you're reading this entry. However, if you don't know what serialization is, the simplest way to picture it is to imagine some sort of complex thing (maybe a database record, or a binary file), and then imagine a book that describes how to re-create that thing. Serialization is just a way of saving that complex thing as something more portable. For example, you might have a multi-dimensional array that you'd like to save somewhere and then load back into perfect existence at a later time. Serialization can take that array and turn it into one long string that completely and accurately represents it. Later, you de-serialize the string back into the original multi-dimensional array.
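In PHP, that's the serialize() and unserialize() pair:

// A multi-dimensional array worth saving for later
$Settings = array(
    'theme'   => 'blue',
    'plugins' => array('calendar', 'reports'),
    'debug'   => false
);

// Turn it into one long, portable string...
$String = serialize($Settings);

// ...and later, turn the string back into the identical array
$Settings = unserialize($String);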

Why Not JUST Use a Database?
So the question presents itself, “Why not just use the database to store the different pieces of data that you need to be saved?” This isn’t a bad idea at all for many situations. With my above-mentioned project, the product is extremely dynamic, has many different plug-ins, and so could have hundreds of different configurations. All the different settings could be stored in their own fields, but then you run into the problem of inefficient, unwieldy database tables with potentially HUNDREDS of fields.

By having a single field contain a serialized array of information, you can consolidate some (possibly even most) of those fields, making the database much simpler to manage and much more efficient. It's like having a miniature, dynamic database WITHIN your database!
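Concretely, reusing the $Settings array from above, it might look something like this (the table, column, and record ID are all invented; assumes an open connection):

// One TEXT column holds what might otherwise be dozens of columns
$Data = mysql_real_escape_string(serialize($Settings));
mysql_query("UPDATE accounts SET settings = '$Data' WHERE account_id = 42");

// Later: one quick query and one unserialize() brings it all back
$result = mysql_query("SELECT settings FROM accounts WHERE account_id = 42");
$row = mysql_fetch_assoc($result);
$Settings = unserialize($row['settings']);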

So how do we determine what should be serialized and what should be stored in a normal database field? This was apparently the dilemma the old development team couldn't quite figure out. They would store some serialized information in the database, and turn other pieces of data into columns in a database table, but they didn't do it with any consistent logic.

To Serialize or Not to Serialize: The Three Rules
I have three general rules for determining whether or not to serialize information:

1. If you’re going to need to search against that field later, don’t serialize it!
(It’s much more difficult and much slower to accurately search through a large bunch of serialized/encoded text.)

2. If you’re going to need to display that particular field in a list later, don’t serialize it!
(Looping through 200 records and de-serializing a field in each record can put an unnecessary load on your server.)

3. If someone else using a different programming language may need access to that information later, don’t serialize it!
(Just because YOU can unserialize the data doesn’t mean that ASP or ColdFusion or some other language will be able to!)

The new version of the product efficiently stores serialized information so that we can pull nearly all the information we need in one quick query, and after one or two de-serializations, the engine has all the variables it needs to perform its duties. Not only is the engine more powerful and more flexible, but it runs faster, is more scalable, and puts much less strain on the database and on the web server itself. Not a bad start at all. Thanks, serialization!