Jaa


Performance work is slippery

… like a water snake.  Which is why we measure, measure and measure again.  Even the most obvious rule of thumb can be wrong.  I was optimizing some string parsing code (which is almost always ripe for optimization), and trying to reduce the cost of traversing a string. 

When iterating over the elements of an array there are two typical models:
index increment

     while (ptr[index] != 0)
    {
        index++;
        //  do stuff with ptr[index]
    }

pointer increment

     while (*ptr != 0)
    {
        ptr++;
        //  do stuff with *ptr
    }

And while they both generate reasonable code the index increment is typically ~2x the cost of the pointer increment.  For iterations that aren’t called much or that have expensive inner operations this is likely not noticeable.  This makes sense because it avoids having to recalculate the offset to the current character inside the loop (ptr[index]).  The incremented pointer is the current offset.  My rule of thumb  is to use array indexes when random access is required and use pointer increment for iteration.

As an example for this rule of thumb, I wrote the simplest of string iterations: calculating a string’s length.

 size_t slen_index(PCWSTR str)
{
    size_t index = 0;
    while (str[index] != 0)
    {
        index++;
    }
    return index;
}

size_t slen_ptr(PCWSTR str)
{
    PCWSTR ptr = str;
    while (*ptr != 0)
    {
        ptr++;
    }
    return ptr - str;
}

And then I measured to find out how much faster the ptr loop is (lower numbers are faster):

     slen_ptr   = 265
    slen_index = 188

Surprise, the intuitively “faster” version is actually slower.  Peeking at the disassembly provides a quick answer.  slen_index() keeps index in a register instead of a stack location, whereas slen_ptr()stores the incremented back into the stack variable.  This is a pathological case where the compiler is able to make a better optimization because we don’t do anything with the characters we are traversing. Let's check out functions that actually do something in the loop.

 #define IS_ALLOWED(ch) (((ch) > 0x20) && (((ch) < 0x80))) 

bool isallowed_index(PCWSTR str)
{
    size_t index = 0;
    while (str[index] != 0)
    {
        index++;
        if (!IS_ALLOWED(str[index]))
            return false;
    }
    return true;
}

bool isallowed_ptr(PCWSTR str)
{
    PCWSTR ptr = str;
    while (*ptr != 0)
    {
        ptr++;
        if (!IS_ALLOWED(*ptr))
            return false;
    }
    return true;
}

And now measure (lower numbers are faster):

     isallowed_ptr    = 250
    isallowed_index  = 390

That’s more like it!  My rule of thumb makes more sense, but the stronger rule is to measure, measure, and measure again!

PS - these numbers also show that for most functions it probably won't make a difference in your app.  But if the function is called a lot, it can (eg strlen()).

Comments

  • Anonymous
    August 13, 2008
    What compiler, what compile options? I'm sure I remember reading or discovering at some stage that the compiler would turn uses of indexing, where the loop counter variable isn't used for something else, into pointer manipulation. I've just compiled it (after correcting the bug in isallowed_index - it should read IS_ALLOWED(str[index])) with Visual Studio 2008 using its default Release build options and the generated code, as shown by the Assembler Output option Assembly and Source Code (/FAs), is 100% identical for the two functions. If there's a timing difference, there's some other cause. Build options as given in the output file: /Ogtpy. I tried with and without /GL, the only difference is that the parameter ends up in a register rather than on the stack. My preferred option is not /O2 but /Ox /Os (Full optimization, prefer small size) which I picked up years ago as it tends to reduce the number of pages for the code, reducing paging overall. The only difference in this case is that it doesn't pad before entering the loop (aligned loops are faster on some processors) and uses two 'inc' instructions rather than an 'add' to move the pointer. [Fixed the typo, thanks. The point of the post was not about optimizing any of this code in particular, but rather that whatever optimizations made need to be measured in order to prove their value. -ZekeL]