Thursday, October 20, 2011

c++ timing strategies compared (in Windows)

My coworkers always come to me with questions about the best way to time a block of C++ code.  I get asked, “I want a high-resolution timer – shouldn’t I use a timer based on CPU clock-cycles?”  I’m also asked, “I need to time something that is very short (microseconds) – so a high-resolution timer will work, right?”

I’ve followed my intuitions for years on this controversial matter.  My observation has been that the CRT function clock() was accurate enough - and I've steered away from timing "really small things" since I've seen too much instability in the timing result.

Instead of intuition, though, I’m gonna put a microscope on the matter.

the plan

I figure three things are important to determine regarding C++ timers in Windows:
1. What timer is most accurate?
2. What timer can measure the smallest time?
3. Are timing calls expensive?

I’d like to compare these three timing methods (a quick sketch of how each is called follows the list):
1. clock()
2. QueryPerformanceCounter()
3. GetProcessTimes()
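
Before digging in, here’s a minimal sketch of how each of these is read and converted to seconds – the helper names are mine, and this assumes Windows plus the usual headers:

#include <windows.h>
#include <ctime>

// clock(): CRT timer; divide the tick count by CLOCKS_PER_SEC for seconds.
double clockSeconds()
{
    return static_cast<double>( clock() ) / CLOCKS_PER_SEC;
}

// QueryPerformanceCounter(): high-resolution counter; divide by the tick
// rate reported by QueryPerformanceFrequency() for seconds.
double qpcSeconds()
{
    LARGE_INTEGER counts, frequency;
    QueryPerformanceFrequency( &frequency );
    QueryPerformanceCounter( &counts );
    return static_cast<double>( counts.QuadPart ) / frequency.QuadPart;
}

// GetProcessTimes(): CPU time charged to this process, reported as
// FILETIMEs in 100-nanosecond units; here I sum user + kernel time.
double processSeconds()
{
    FILETIME creationTime, exitTime, kernelTime, userTime;
    GetProcessTimes( GetCurrentProcess(),
                     &creationTime, &exitTime, &kernelTime, &userTime );
    ULARGE_INTEGER k, u;
    k.LowPart  = kernelTime.dwLowDateTime;
    k.HighPart = kernelTime.dwHighDateTime;
    u.LowPart  = userTime.dwLowDateTime;
    u.HighPart = userTime.dwHighDateTime;
    return ( k.QuadPart + u.QuadPart ) * 100e-9;  // 100 ns units -> seconds
}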


the strategy

I’ve written a for-loop that always takes exactly two CPU clock cycles per iteration.  Because I know the clock speed of my computer, I can predict how long the loop should take.  I will use this super simple for-loop to measure the accuracy, resolution, and overhead of the different timing methods.

for ( unsigned int i = 0; i < iterations; i++ )
{
    __asm { nop }
}

The __asm block is required so the compiler doesn’t optimize the loop away.  As it turns out, the nop is zero clock-cycles (I verified this by adding multiple nops, which had no effect on the overall time).  The assembly instructions implementing the for-loop are dec and jne – both of which are one clock-cycle.

I know how long my for-loop will take given the following equation:

                time = iterations * (clock cycles per iteration) / (machine clock speed)

I can now solve for iterations:

                iterations = (time) * (machine clock speed) / (clock cycles per iteration)

For example, solving for iterations given I want to run for one second:

                iterations = (1 sec) * (2.5 GHz) / (2 cycles) = 1 250 000 000
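
The same computation in code (a tiny sketch – the clock speed is my machine’s 2.5 GHz; substitute your own):

const double clockSpeedHz  = 2.5e9;  // machine clock speed, cycles per second
const double cyclesPerIter = 2.0;    // dec + jne, one cycle each
const double targetSeconds = 1.0;    // desired run time

const unsigned int iterations = static_cast<unsigned int>(
    targetSeconds * clockSpeedHz / cyclesPerIter );   // 1 250 000 000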


accuracy

For each target time, I run my simple for-loop for the number of iterations that equates to the expected time.  An outer loop then repeats this 100 to 10,000 times, so I can compute the mean and standard deviation – for all three timing methods.
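
Here’s a minimal sketch of that harness – runBenchmark and spin are names I’ve made up, timerFunc is any of the seconds-returning helpers sketched earlier, and __asm is Visual Studio inline assembly (x86 only):

#include <cmath>
#include <vector>

// The two-cycles-per-iteration nop loop from above.
void spin( unsigned int iterations )
{
    for ( unsigned int i = 0; i < iterations; i++ )
    {
        __asm { nop }
    }
}

// Times `runs` executions of the nop loop and reports the mean and
// standard deviation of the samples.
void runBenchmark( double (*timerFunc)(), unsigned int iterations,
                   unsigned int runs, double& mean, double& stddev )
{
    std::vector<double> samples( runs );
    for ( unsigned int r = 0; r < runs; r++ )
    {
        const double start = timerFunc();
        spin( iterations );
        samples[r] = timerFunc() - start;
    }

    mean = 0.0;
    for ( unsigned int r = 0; r < runs; r++ )
        mean += samples[r];
    mean /= runs;

    double variance = 0.0;
    for ( unsigned int r = 0; r < runs; r++ )
        variance += ( samples[r] - mean ) * ( samples[r] - mean );
    stddev = std::sqrt( variance / runs );
}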

mean times

Expected time (s)   clock()     QueryPerformanceCounter()   GetProcessTimes()
2.000000            2.001710    2.001693                    2.004769
1.000000            1.049360    1.049450                    1.004646
0.500000            0.521580    0.521670                    0.503415
0.250000            0.265080    0.265137                    0.255686
0.125000            0.129920    0.129856                    0.125737
0.062500            0.064680    0.064644                    0.063180
0.030000            0.032180    0.032244                    0.030732
0.001000            0.001069    0.001069                    0.001045
0.000100            0.000123    0.000123                    0.000125
0.000010            0.000019    0.000022                    0.000017


standard deviations

Expected time (s)   clock()     QueryPerformanceCounter()   GetProcessTimes()
2.000000            0.001663    0.001567                    0.011982
1.000000            0.068809    0.068873                    0.014298
0.500000            0.033658    0.033675                    0.010314
0.250000            0.019893    0.019949                    0.008796
0.125000            0.010623    0.010622                    0.004844
0.062500            0.004164    0.004179                    0.004053
0.030000            0.005066    0.005025                    0.005153
0.001000            0.000504    0.000473                    0.003962
0.000100            0.000424    0.000275                    0.001407
0.000010            0.000136    0.000010                    0.000562
Average std. dev.   0.0145      0.0145                      0.0065

accuracy conclusion
In almost every way, clock() and QueryPerformanceCounter() perform exactly the same.  Their results, both the means and standard deviations, are nearly equal.  One might even conclude they’re based on the same underlying logic.

On the other hand, GetProcessTimes() generates results that are closer to the expected values than the other two timing methods.  Its standard deviations are also better on average.

In all three cases, measuring sub-millisecond times became very difficult.  I had to move up to averaging 10,000 runs – and even then, do multiple trials.  While the standard deviations weren’t too bad for sub-millisecond times, all three timing methods missed the expected time value by a fair amount.

cost of timing calls

This will be a simpler experiment.  I’ll perform one million iterations of the above loop and make a call to each timing function within the loop (see the sketch below).  Remember, this is the cost of making one million calls using the above loop – the cost of the loop construct itself is insignificant compared to the cost of the timing calls.
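
Something like this sketch, with the elapsed time measured by the processSeconds() helper from earlier (kCalls is my own name):

const unsigned int kCalls = 1000000;

const double start = processSeconds();
for ( unsigned int i = 0; i < kCalls; i++ )
{
    __asm { nop }
    clock();   // the timing call under test; swap in QueryPerformanceCounter()
               // or GetProcessTimes() to measure the other two
}
const double overhead = processSeconds() - start;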

Note this data is averaged over ten runs, with elapsed time measured via GetProcessTimes() (since we just determined it’s the most accurate):

clock()      QueryPerformanceCounter()   GetProcessTimes()
4.963952 s   4.812631 s                  0.363482 s

cost conclusion
Again, clock() and QueryPerformanceCounter() are so well matched I’m nearly convinced they have the same underlying implementation.  On the other hand, GetProcessTimes() beats them both by an order of magnitude, making it the most lightweight timing call of the three.

overall conclusion

  • clock() and QueryPerformanceCounter() are almost certainly based on the same underlying implementation (at least for Visual Studio 2010).
  • GetProcessTimes() is the accuracy winner.  It gets closer to the expected time, and generates more consistent (lower standard deviation) results.
  • GetProcessTimes() is also the call-performance winner, being faster to call than the other two by an order of magnitude.
  • Below 1 ms, all three timers struggle to measure accurately – even with lots of averaging.  I don’t recommend trying to measure such small intervals.

While you may conclude you should use GetProcessTimes() for everything, I need to end with some important facts regarding it.

GetProcessTimes() only measures active CPU time – so if you call Sleep(), it doesn’t count the time spent sleeping.  Also, if the algorithm you’re benchmarking performs some type of I/O, GetProcessTimes() won’t measure the time spent waiting for I/O resources.

Finally, when multiple threads are working concurrently within the process, GetProcessTimes() reports the aggregate time spent across all threads.  Therefore, if you’re trying to benchmark the performance benefit of your new multithreaded algorithm, GetProcessTimes() is not the right choice.
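
A quick sketch that makes the Sleep() caveat concrete (again assuming the clockSeconds() and processSeconds() helpers sketched earlier):

const double wallStart = clockSeconds();
const double cpuStart  = processSeconds();

Sleep( 1000 );   // the process does no work for one second

// wallElapsed comes back around 1.0 s, but cpuElapsed comes back near
// 0.0 s, because the process consumed almost no CPU time while sleeping.
const double wallElapsed = clockSeconds() - wallStart;
const double cpuElapsed  = processSeconds() - cpuStart;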

These issues with GetProcessTimes() are glaring enough that I almost always use clock().  It is by far the easiest to call – and it simply measures “wall-clock” time – which, in almost every case, is exactly what I want.  (There are times for using GetProcessTimes() – just keep these caveats in mind.)

Regardless of the timing method you choose for your C++ code:
  • To ensure reasonable accuracy, focus on timing things that take more than 30 ms.
  • Do many runs and take an average.
  • Do your best to “quiet” the machine performing the benchmark by closing all unneeded programs and background processes.
  • Remember that calls to your timing method aren’t free.
  • Always benchmark fully optimized builds with no debugger attached (Ctrl-F5 in Visual Studio starts your program without debugging).

3 comments:

  1. Hi Kev

    This was a really helpful blog post. Working on a multithreaded project, I ran afoul of the process-time issue and incorrectly got the result that multithreading was slower.

    If you take requests for these technical blog entries: I am looking to find out more about how to resolve int64 vs. int compatibility issues. STL uses size_t (which seems to be a 64-bit unsigned int on 64-bit platforms), and pointer differences on 64-bit platforms are also 64-bit. Do you have any recommendations for dealing with this? I often run into cases where I know a particular size_t is small (i.e., < INT_MAX) based on context, but there doesn't seem to be a good way to convert.

    good work

    Eric Z.

  2. Thanks, Eric. Funny you're asking about size_t - that's definitely in my top 5 of things to blog about. I also have thoughts on int/int64, etc. Good suggestions, let me see what I can do!
