Jonathan Hilgeman

Everything complex is made up of simpler things.

Archive for the ‘programming’ Category

Use filter_var to Strip Out Non-Digit Characters

Here’s a fun tidbit – how many times have you used a simple regular expression like this:

$x = preg_replace("/[^0-9]/","", $y);

…to quickly strip out all non-digit characters in a string? I’ve done it probably a hundred times over the years, but I ran across a script today that was packed FULL of code that used regular expressions where they didn’t need to be used, and it was taking about 100 milliseconds to run on each loop (and there were thousands of loops).

I know the character-by-character parsing in C# is extremely fast, so I first tried to do the same thing with a user function in PHP like this:

function stripNonDigits($str)
{
  $new = "";
  $len = strlen($str);
  for ($i = 0; $i < $len; $i++) {
    $c = $str[$i];
    if (($c >= '0') && ($c <= '9')) {
      $new .= $c;
    }
  }
  return $new;
}

I assumed that wasn’t going to work all that well since it wasn’t compiled code, and I was right. The overhead alone from calling a user function almost surpassed the regex performance times, and once I added in the character comparison and string concatenation, it was all over. It was way slower than regex (over 2x as slow)!

I then figured that PHP had to have a way to do this, and I remembered that filter_var() has a filter for integer numbers. It doesn't strip out "+" or "-" signs, though, so I set up another test where the filtering line was just:

$signs = array("-","+");
$x = str_replace($signs,"",filter_var($y, FILTER_SANITIZE_NUMBER_INT));

The result was blisteringly fast compared to preg_replace(). I ran the same code against a variety of data sources – from 64-byte strings to 256k strings – and filter_var() consistently outperformed all other methods. The only way I can think of to get better performance is to build a custom PHP extension that would strip out the plus and minus signs as well, but this is about as good as we can get it using standard PHP functionality.
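If you want to see the difference for yourself, here's a rough sketch of a timing harness (the sample string and iteration count are mine, not from the original benchmark):

```php
<?php
// Sample input -- any string with mixed digits and other characters.
$y = "Call (555) 123-4567 ext. 89";

// Method 1: regular expression.
$start = microtime(true);
for ($i = 0; $i < 10000; $i++) {
    $regexResult = preg_replace("/[^0-9]/", "", $y);
}
$regexTime = microtime(true) - $start;

// Method 2: filter_var() plus str_replace() to drop the +/- signs it keeps.
$signs = array("-", "+");
$start = microtime(true);
for ($i = 0; $i < 10000; $i++) {
    $filterResult = str_replace($signs, "", filter_var($y, FILTER_SANITIZE_NUMBER_INT));
}
$filterTime = microtime(true) - $start;

// Both methods should produce the same digits-only string.
echo "regex:  {$regexTime}s -> {$regexResult}\n";
echo "filter: {$filterTime}s -> {$filterResult}\n";
```

Run it a few times; the absolute numbers will vary by machine, but the relative ranking should hold.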

So next time you reach for regex, check out filter_var() first!

Using Sessions Securely


When using sessions, usually your biggest concern is cross-site scripting (or XSS for short). Without getting into too much depth, XSS is basically when one of your users can steal the cookies of other users. The malicious user (call him Bob) is able to write a script that is displayed to other users. That script (when viewed by other users) reads the cookie from the viewing user’s PC, and then transmits the cookie back to Bob. At that point, Bob can take the cookie and pretend to be any of the users whose cookies he stole.

Just for explanation purposes, here’s another analogy. Let’s say you want to break into John’s house. If you had a copy of John’s key to his front door, it’d be easy, right? So all you need to do is find a way to pickpocket John and copy his key. All the door cares about is that the key fits the lock – it doesn’t care who uses it.

The door is the session authentication mechanism in PHP, and the key is your session ID. The session ID is stored inside a cookie, so there is nothing that prevents you or anyone else from just editing the cookie and changing the session ID to whatever you want. Now, if you change the session ID to something that doesn’t match up to a valid session on the server, then nothing will happen. BUT, if you change your session ID to something that -is- valid on the server, then you’ll automatically be logged into that session, no questions asked.

The security of sessions is all about the complexity of session IDs. It'd be one thing if the session ID were just a number between 1 and 100, but guessing a long combination of letters and numbers is much, much harder.

That's where XSS comes in – most XSS attacks are about stealing valid session IDs so hackers don't have to guess which ones are valid. Now, XSS is just a concept. In practice, it's usually done with Javascript, because Javascript can read cookies (there are some minor exceptions). It's easy to write Javascript that will read your OWN cookies, because you can run the Javascript on your OWN computer. The trick is to get OTHER people to run your cookie-stealing Javascript on THEIR computers (especially without them knowing about it). So how do hackers do this?

Take a message board for example. I’m sure you’ve been on message boards where people have their own special “signatures” with images and favorite quotes and stuff. That’s all custom HTML / code that the users have provided after they’ve signed up. If the message board program doesn’t do any security checks on the signature, then someone could put their cookie-stealing Javascript code into their signature. Now, it’s just a waiting game. As soon as someone else “sees” your signature, they’re unknowingly running your cookie-stealing Javascript. The Javascript reads that user’s cookie (which has their session ID), and transmits it back to the hacker.

So, the ultimate point of all this is that you should ALWAYS ALWAYS ALWAYS sanitize any data before allowing it to be saved or used in any way. Generally speaking, you should never use $_GET or $_POST or $_REQUEST (or any other $_…) variables without first running them through a function that erases characters that aren't applicable. For example, if someone's typing in their first name and sending it to your server, you should probably strip out any characters that don't appear in first names (letters, numbers, spaces, single/double quote marks, commas, and periods are usually okay for names), and then run addslashes() on the final value for good measure.
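As a small sketch of what that might look like (the function name and the exact whitelist are mine – adjust them to your own fields):

```php
<?php
// Keep only characters that plausibly appear in a name:
// letters, digits, spaces, single/double quotes, commas, and periods.
// Everything else (angle brackets, parentheses, etc.) is erased.
function sanitizeName($name)
{
    $clean = preg_replace("/[^A-Za-z0-9 '\",.]/", "", $name);
    // addslashes() on the final value for good measure.
    return addslashes($clean);
}

echo sanitizeName("O'Brien<script>alert(1)</script>");
```

A name like "Anne Marie" passes through untouched, while injected markup loses all of its special characters before it ever reaches storage.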

As long as you’re properly sanitizing your data before using it, you should take care of 99% of all potential XSS attacks.

ParosProxy is a good open-source tool for scanning web applications and checking for security problems. There’s also a commercial spin-off of ParosProxy called Burp Professional. It’s basically the same thing but has some better/easier reports, better recommendations, and scanning for more recent problems.

Using Nintendo for Web Site Performance

Even though the original Nintendo system is more than 2 decades old, it’s actually the source of inspiration for another way to increase web site performance!

I’m getting ahead of myself, though. Let’s start with the problem. Every web site has external files that it needs in order to be presented properly. These files are usually images, Javascript libraries, and CSS documents. When the browser goes to look at a web page containing references to these external files, it automatically goes and requests those files.

Each time it has to request a file, it goes through the entire process of:

1. Sending the request all the way through the internet to the web server
2. Waiting for the web server to look for the file and come back with an initial response (which is sent all the way back through the internet)
3. Downloading the data, saving and using it.

Normally, this process goes pretty quickly and you see a whole page displayed in 4 or 5 seconds.

In those 4 or 5 seconds, there's a good chance that 20% of the time is wasted in the first 2 steps by making the browser go back to the web server over and over again while it requests the 30 images that make up your pretty little web page. The time it takes for the browser to send ANY data to the web server (a request) and get data back from it (a response) is referred to here as network overhead. The farther away the server, the more overhead you have.

So the problem that we’re going to tackle is the overhead involved in a site that has a lot of external pieces. Now back to Nintendo:

One of the programming tricks of early video games was using sprites. In case you aren’t familiar with the term (besides the drink), here’s the idea: instead of having all the different images in the game be separate images that have to be accessed from storage (which is slow) each time that they need to be shown, the game designers would put all (or most) of the images right next to each other into one big image. Then, the game would load that one big image into memory (which is fast), and then use a simple coordinates system to display a small section of that already-loaded image, which ended up giving them the image they wanted.

Think of it as having a workbench with some tools on it. You’re working across the room on some project and you occasionally want access to different tools. Do you walk across the room each time you need a hammer, then use it, and then go back and put it back on the bench? No – that would be very slow and inefficient. Instead, you move the workbench over to the other side of the room so you can just reach to your left and pick up the hammer when you need it and put it down. All the tools you need are right there on one workbench, which is now right next to you!

So how do we apply this concept to web page performance?

Most web pages have user interfaces that are made up of several images that have been stitched together with code so they look like one seamless interface. However, if we take the “sprites” concept and put all the images together into one image, then we’ve instantly reduced:

1. The number of times the browser has to go request something from the web server.
2. The load on the web server.
3. The number of records in the web server log.
4. The time it takes to load each image on the browser’s side of things.

In terms of performance, we’ve dramatically reduced that network overhead. So now we have to figure out how to tell the browser to only show the correct parts of that “master” image. This is where CSS comes in.

The background property of CSS allows you to specify an image or a color as the background of just about any element. If you use an image, then you can also provide “offsets” which basically tell the background to start displaying at a certain position. The CSS code looks something like this:

background: transparent url("TheMasterImage.gif") -123px -456px no-repeat;

That tells the browser to use TheMasterImage.gif as the background, shifted so that the point 123 pixels from the left edge and 456 pixels from the top edge of the master image lands at the top-left corner of the element. In other words, the sub-image you want to display should have its top-left corner at (123, 456) inside TheMasterImage.gif. Now we're getting somewhere!

At this point, all you have to do is take whatever HTML element that has this background applied to it, and then make the size of the element be the width and height of the image that you want to display. Enough talk – let’s look at an example:

Super Mario Sprites

Hey, it'sa Mario! Suppose SuperMarioSprites.gif holds two Mario images side by side, each 46 pixels wide (the file name and the 64px height below are just for illustration):

<style>
  .mario { width: 46px; height: 64px; background: transparent url("SuperMarioSprites.gif") no-repeat; }
  .mario-left { background-position: 0px 0px; }    /* The left side (0px and 0px) */
  .mario-right { background-position: -46px 0px; } /* And the right side (-46px and 0px) */
</style>
<div class="mario mario-left"></div>
<div class="mario mario-right"></div>

Here we can see it in action! Even though you've only downloaded one image file, it looks just like two images with the magic of CSS!

Of course, with any trick, there are downsides. Being a background, there could be issues with printing the page if background printing is disabled. There may also be older browsers that are unable to properly display CSS backgrounds. Finally, assembling the master image and the list of coordinates takes a little bit of effort and time.

However, if your web page isn't really meant to be commonly printed out, doesn't have to conform to browser requirements from 10 years ago, and you actually care enough about performance, then put this trick to work for you! It's good for your visitors and for your web server!

The topic for next time will be optimizing Javascript and CSS files.

Speed Up Your Application!

So your PHP application is running slow… no… scratch that – slow still implies that it seems like your application is doing something after 45 seconds of loading. No, your application is a crippled duckling, dragging itself slowly towards the shoreline so it can end it all. What do you do??? Here are a few quick steps to help:

Add Log Points
Create a function that writes a message to a file. Then go through the code and add calls to this function at strategic points (e.g. right after a particularly large query). In the message, dump the date and time, the __LINE__ constant (which contains the line number currently being executed), and a brief description of what happened since the last message. If YOU can reproduce the speed problem, then it also helps to make the function only write to the log file when your IP is the one visiting the application, so your log file doesn't fill up too quickly or with other data.

Once the log file has some data in it, you should be able to see the flow of the program and be able to determine chunks of code that are running slow. Continue to refine the locations of the function calls to drill down to the problem points.
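A minimal version of such a log function might look like this (the file path, function name, and IP check are placeholders – adjust for your setup):

```php
<?php
// Append a timestamped message to a log file. Pass __LINE__ from the
// call site so the log shows where in the script you were.
function logPoint($line, $message, $file = "/tmp/debug.log")
{
    // Optionally only log your own visits so the file stays small:
    // if ($_SERVER['REMOTE_ADDR'] !== 'YOUR.IP.HERE') { return; }
    $entry = date("Y-m-d H:i:s") . " line $line: $message\n";
    file_put_contents($file, $entry, FILE_APPEND);
}

logPoint(__LINE__, "after the big orders query");
```

The gaps between consecutive timestamps tell you which chunk of code to drill into next.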

Improve Your SQL Queries
In many cases, a slow application is due to slow queries. Often, slow queries can be DRAMATICALLY improved with some very minor and safe tweaks to the database table indexes. I can’t begin to count the number of queries I’ve seen that tried to join two large tables using fields that were not indexed. There are several things to do to improve performance, but simply indexing those fields can often make a HUGE difference in query speed. Some databases allow for more specific indexing options that can make additional improvements, but nearly every database has basic indexing.

Speaking of joining tables, data types can also play a large part in performance. Joining tables on numeric field types like INT is usually much faster than joining on VARCHAR fields (although you should be VERY careful about a decision to change a VARCHAR to a numeric field). This is why it’s a good habit to add auto-incrementing, numeric ID fields to the tables you create. However, data types aren’t just important when joining. Minor improvements can be made by making sure that you’re using the right data types to store things. There’s no reason to use a BLOB or TEXT field to store a Unix timestamp, a first name, or a tiny on/off flag (would you use a crate to hold a tiny pebble?).

If you have a query with a WHERE clause that looks up more than one field, and is looking through a single, big table, then consider making a multi-field index that contains each of the fields used in the WHERE clause.

Some databases, like MySQL, have additional features that allow you to discover problematic queries. These features include things like automatically logging any queries that take longer than a certain number of seconds, or commands that will show details about the query that you’re running. For example, if you’re using MySQL, take a slow-running SELECT query and simply add the word EXPLAIN before the query. The result is a description of how MySQL runs the query, what indexes it uses (if any), and other useful information.
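For example, in MySQL the two moves described above look something like this (the table and column names here are hypothetical):

```sql
-- Index the column used in the JOIN / WHERE clause:
ALTER TABLE orders ADD INDEX idx_customer_id (customer_id);

-- Prefix a slow SELECT with EXPLAIN to see how MySQL will run it,
-- including which indexes (if any) it can use:
EXPLAIN SELECT o.id, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.status = 'shipped';
```

If the EXPLAIN output shows a full table scan on a large table, that query is a prime candidate for an index.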

There are too many tricks to list here, but it’s not difficult to find out even more simple ways of optimizing your queries and your database performance. If the simplest approaches don’t fix the problem, then you may be facing a hardware issue or something more complex. Hiring a temp DBA may be a good idea here.

Use Datasets
In cases where you might be re-using a set of records from the database more than once, consider copying those records into a multi-dimensional array, using a primary key (or something else appropriate) as the index of that array. This essentially creates a “cached” version of that recordset that you can use throughout the rest of the script. When it comes time to loop through those records to generate a dropdown box or refer to a value, then you don’t need to go back to the database again. This can also help eliminate an additional JOIN from your queries if all the data you need is in that array. Datasets are most effective when they’re small so they don’t take up much memory and don’t take too much time to loop through.

An example of a good dataset would be a list of car manufacturers (not that many records, and possibly re-used multiple times throughout the rest of the page).

An example of a bad dataset would be an inventory of cars (probably too many records, and you probably wouldn’t re-use them on the same page).
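Here's a sketch of the idea, using a hard-coded result set in place of a real query (the array shape mimics what a database fetch loop would give you):

```php
<?php
// Pretend these rows came back from: SELECT id, name FROM manufacturers
$rows = array(
    array("id" => 1, "name" => "Ford"),
    array("id" => 2, "name" => "Toyota"),
    array("id" => 3, "name" => "Honda"),
);

// Build the "dataset": a lookup array keyed by the primary key.
$manufacturers = array();
foreach ($rows as $row) {
    $manufacturers[$row["id"]] = $row["name"];
}

// Later in the page: no extra query or JOIN needed to resolve an ID.
echo $manufacturers[2]; // Toyota
```

Every dropdown, label, or report row on the rest of the page can now read from $manufacturers instead of hitting the database again.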

Reduce Output
I've seen a lot of scheduled jobs / cron job scripts that print out a lot of output, and some of it includes calculations and additional processing simply for the purposes of outputting to the screen. But if the output isn't being seen by anyone or processed by anything, then why send it? Output is especially draining when it's inside large loops, which brings us to the next topic.

Take Back Control Over Loops
Lots of scripts have processes with loops that have tens of thousands, hundreds of thousands, even millions of iterations. This means that every improvement you make is multiplied by the number of times that loop runs. If you have some old debugging code that opens a log file, writes to it, and closes the file, then running that 100,000 times as fast as possible is going to be a real big system hit. Try as hard as possible NOT to run SELECT queries inside loops, because it often means loops within loops (multiplying the speed hit). Even simple things like a substr(), in_array(), or strpos() call can take a bit of processing time when you run them a million times. But if you're performing the same function with the same arguments over and over again, then consider storing the result in a variable and checking that variable instead:

// Before: the strpos() check gets repeated everywhere it's needed
$MyText = "The Quick Brown Fox";
// do something with strpos($MyText, "Quick")

// After: do the check once and keep the result in a flag
$MyText = "The Quick Brown Fox";
$QuickIsInMyText = (strpos($MyText, "Quick") !== false);
// do something with $QuickIsInMyText

I try to get into the habit of creating boolean flags like $QuickIsInMyText. If you name the variables correctly ($IsAdmin, $HasEditingPrivileges), they make the code easy to read and eliminate the need to rewrite the same checks over and over again.

Install and Use XDebug
XDebug is a free extension for PHP that can be a godsend. It's usually easy to install without recompiling PHP, and it offers a slew of features for finding performance issues with your application (although it is best used in a development environment, NOT in a production environment).

One of the most valuable features of it is its profiler, which will basically attach a little homing device to PHP so when PHP goes to execute your application, the homing device follows it all the way through and logs everything to a file. You end up with a file that shows you the details of every line of code that was executed in your application, and how long each line took to run. Sounds useful but complicated, right? Well, it is… if you were to look at the file manually.

The file that gets generated is called a cachegrind file, and it’s pretty big, and is not meant to be read as-is. Instead, there are free programs out there like KCacheGrind (for Linux) and WinCacheGrind (for Windows) which will read a cachegrind file, and display it in an easy-to-understand fashion. You can see a top-level view of the major points in your program, and drill down into the areas that are taking more processing power, down to the exact line. It’s pretty much like a super-charged version of the Log Points I mentioned earlier.
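Turning the profiler on is just a couple of php.ini directives (these are the directive names from the Xdebug 2 series; the output directory is whatever writable path you choose):

```ini
; Generate a cachegrind file for every request:
xdebug.profiler_enable = 1
xdebug.profiler_output_dir = /tmp
```

Restart your web server, load the slow page once, and then open the resulting cachegrind file in KCacheGrind or WinCacheGrind.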

Hopefully these tips will help you get on your way to making your application run faster. Good luck!