Wednesday, January 16, 2013

coding is like sculpting - but is it agile?

I've had a recent epiphany:  coding is like sculpting.  I guess it took me a while to see the connection (since I'm not a sculptor), but as I imagine it, sculpting goes something like this:

  • Get a big pile of clay.
  • Reduce the pile of clay to the size of the intended sculpture.
  • Start cutting the clay into a roughly hewn representation of the sculpture.
  • Examine your work - is the design going as intended?  If so, continue, else start over.
  • Refine your sculpture with finer cuts - yielding a less coarse result.
  • Examine your work again.  If on track, continue, else back up.
  • Continue refining your piece - refining, backing up, continuing, etc.
  • Repeat for as many steps as needed.

I've lived with Agile for many years - and my conclusion is that Agile is purposefully designed to support the idea that coding is like sculpting.

Every sprint is your opportunity to move your sculpture forward to its next level of refinement.  Said another way, at the end of each sprint, your sculpture should be a finer version of what you sculpted the prior sprint.

Agile gives this a name:  the vertical slice.  Every sprint should be a working model that moves (feature-wise) closer toward the end result.

I begrudgingly admit these facts, since I set about writing this post to bash Agile for not supporting the notion that coding is like sculpting - when in fact, I concluded the opposite.  Score one for Agile.

Thursday, December 27, 2012

intel's hyperthreading: super smart marketing of a super weak feature

Here's my beef:  Intel asks consumers to pay 50% more for their hyperthreaded chips, but those hyperthreaded chips will only ever, in the best case scenarios, be 10-15% faster than their non-hyperthreaded cousins.  Few people understand exactly what this means - so let's discuss.

Consider Intel’s latest Ivy Bridge series.  The i5-3570k is a quad-core clocked at 3.4 GHz and does not support hyperthreading.  The i7-3770 is a quad-core clocked at 3.4 GHz and supports hyperthreading.  When you bring up the Windows Task Manager with the i5-3570k, you will see four cores.  When you bring up the Windows Task Manager with the i7-3770, you will see eight cores (four logical cores plus four hyperthreaded cores).  As a result, you, the consumer, might be led to believe the i7 is twice as fast as the i5 – it has twice as many cores, right?!  Sorry, not even close.  Here’s the real story.

If you understand parallel computing, you know that a perfectly parallelized algorithm will continue to improve its performance based on the number of available cores - it's called scaling.  The computation is simple.  If a perfectly parallelizable algorithm takes 840 s to compute using a single core, it will take 420 s using two cores (twice as fast, or 2x), and 210 s using four cores (four times as fast, or 4x).  But what happens if we perform this scalability test on an i7-3770?  Here are the numbers (assuming the best case that each hyperthreaded core performs at 15% the speed of a logical core):

i7-3770 Cores employed   Time (s)
1                        840
2                        420 (2x)
3                        280 (3x)
4                        210 (4x)
5                        202 (4.15x)
6                        195 (4.3x)
7                        189 (4.45x)
8                        182 (4.6x)

Did you see what happened after four cores?  The scaling tanked.  Even using all of the i7-3770’s eight cores is only about 15% faster than using four cores – and that’s assuming the best case scenario that each hyperthreaded core is 15% as fast as a logical core.  In a system with eight real, logical cores, the performance should have scaled to 105 s (eight times as fast as one core, or 8x).
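
If you want to play with the numbers yourself, here is a small sketch that reproduces the table above (give or take rounding) from the stated assumptions - an 840 s single-core baseline, four real cores, and each hyperthreaded core worth 15% of a real one:

#include <cstdio>

int main()
{
   const double baseline = 840.0;   // single-core time, in seconds
   for ( int cores = 1; cores <= 8; cores++ )
   {
      // the first four cores scale perfectly; each hyperthreaded core adds 0.15x
      double speedup = ( cores <= 4 ) ? cores : 4.0 + ( cores - 4 ) * 0.15;
      printf( "cores: %d   time: %.0f s   speedup: %.2fx\n", cores, baseline / speedup, speedup );
   }
   return 0;
}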

That's the ideal world – but as it turns out, few real world algorithms will ever get close to seeing the 10-15% hyperthreaded improvement hypothesized.  While purely scientific applications will sometimes approach this ideal, the real world is riddled with constraints:  in particular, threads/processes share memory, disk, and oftentimes data-structure/resources within the parallel application itself.  Shared resources prevent ideal scaling, as concurrently executing threads/processes are forced to wait for one another to acquire/use these shared resources.  This means there is just a fraction of room left for hyperthreading to make its little voice heard above the loud, real world din of shared resources.

Here’s a real world algorithm that I deal with every day:  compiling my C/C++ source code.  Visual Studio (and other third-party build systems) support multithreaded/multiprocess builds.  This means that the compilation phase is done in parallel.  When compiling, each thread/process is reading a source file from disk and writing an object file to disk – concurrently.  What’s the shared resource?  The disk.  What can happen to disks?  Fragmentation.  What exacerbates fragmentation?  Lots of threads/processes writing to a disk concurrently.  Which is exactly what is happening!  As much as I love the fact that multithreaded/multiprocess builds speed up my compilation phase, fragmentation causes the “win” to diminish over time – and eventually, the subsequent link phase, which is particularly disk intensive and single-threaded, becomes a disappointing drag.  What happens with hyperthreading?  Well, even if hyperthreading improves the compilation performance by 10% (a best case given the disk is shared), disk fragmentation is increasing at twice its normal rate - and whatever win was witnessed during compilation is quickly lost over time – and can sometimes lead to reverse scaling - when using more cores actually slows things down.  (And don't even get me started on how Windows manages its NTFS partitions and how performance can degrade over time, despite the fact they've led consumers to believe that "defragging" is all they have to do... the cake is a lie.)

I have to give credit to the Intel marketing department.  They spun the hyperthreading message and have made tons of money as a result.  Even human intuition is on their side – when I see eight cores in Task Manager I’m wayyyyy faster than four cores, right?!  Here’s what’s crazy – even though there are tons of independent benchmarks to back up my statements (that i7's are only fractionally faster than i5's) – people still dole out the big bucks for the i7-3770.  Please stop.  Please use your common sense.  Invest your money sensibly.  Buy an i5-3570k and use the savings to buy a bad-ass solid-state disk (ssd).  I wager you’ll never notice the difference between an i5-3570k and an i7-3770 – but I guarantee your life will be shockingly improved when you upgrade from a mechanical disk to a killer ssd.

Tuesday, January 24, 2012

using size_t pragmatically

size_t is awesome - but I rarely see it employed correctly. You may think I'm here to lay down some dogmatic "type correct" philosophy, but instead I'd like to give a pragmatic treatment for using size_t. First, though, some background.

Background

Predominantly, size_t is used to represent the size of objects. For example, malloc is prototyped as:

void * malloc( size_t );

As such, size_t needs to be scalable. On 32-bit platforms, size_t is typically an unsigned 32-bit int. On 64-bit platforms, size_t is typically an unsigned 64-bit int.

The next most predominant usage of size_t is the "length" of strings and containers. For example, strlen is prototyped as:

size_t strlen( const char * );

Also, many STL containers (e.g., vector, queue, etc.) have methods for returning their "length". These "length" and "size" methods (when boiled down to primitives) also return size_t.

The pragmatic view of size_t

It's pretty simple:  know your application's needs.

For example, if your app has a ton of classes/structs for storing fixed-length strings (I'm talking char[]) that are programmatically limited to 255 characters or less, storing their length with size_t is overkill - especially on 64-bit systems. In this case, you can get away with using an unsigned char for storing your strings' lengths. (You just saved seven bytes:  sizeof(size_t) - sizeof(unsigned char). Multiplied across many instances, this savings adds up quickly.)
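
A hypothetical sketch of what that looks like (the struct and field names are mine):

// fixed-length string, programmatically limited to 255 characters
struct ShortString
{
   char          text[ 256 ];   // 255 characters + null terminator
   unsigned char length;        // 1 byte instead of 8 (sizeof(size_t) on 64-bit)
};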

In contrast, if your application places no restriction on the length of strings stored - you'll definitely want to use size_t for storing string lengths.

If you're rolling your own container objects (e.g., vector, dynamic array, hash table), my preference is to use size_t for storing your element size and your element count. As an author of a "generic container", it is difficult to predict its usage - and size_t provides the maximum headroom. That said, if you're watching your memory usage closely - and you can reliably predict your application's usage of your container - you can choose a smaller built-in for storing your element size and/or element count. You would do this to save memory, of course - and depending on the container's usage, you could see a big savings.
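
As a rough sketch of what I mean - a bare-bones dynamic array, with names and growth policy of my own choosing, and the usual copy/assignment plumbing omitted:

template <typename T>
class DynArray
{
public:
   DynArray() : m_data( 0 ), m_count( 0 ), m_capacity( 0 ) {}
   ~DynArray() { delete [] m_data; }

   size_t size() const { return m_count; }

   void push_back( const T & value )
   {
      if ( m_count == m_capacity )
         grow();
      m_data[ m_count++ ] = value;
   }

private:
   void grow()
   {
      size_t newCapacity = m_capacity ? m_capacity * 2 : 8;
      T * newData = new T[ newCapacity ];
      for ( size_t i = 0; i < m_count; i++ )
         newData[ i ] = m_data[ i ];
      delete [] m_data;
      m_data = newData;
      m_capacity = newCapacity;
   }

   T *    m_data;
   size_t m_count;      // size_t gives the element count maximum headroom
   size_t m_capacity;
};

Swapping the two size_t members for, say, unsigned short is then a one-line change if you know the container will never hold more than 65535 elements.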

My pet peeve for size_t is when I see it casted away needlessly. For example:

int length = (int)strlen( psz );
for ( int i = 0; i < length; i++ )
   ...

The above is just lazy coding. Why downcast the length of psz? (Probably: "to quiet the compiler 'downcast' warning".) On 64-bit systems, you're adding needless down-cast instructions. In all cases, stack is cheap. In these cases, I always use size_t:

size_t length = strlen( psz );
for ( size_t i = 0; i < length; i++ )
   ...
The above is cleaner, faster, and in my opinion, smarter.

A couple final items of note:
  • size_t is completely portable. Use it with confidence between Windows and Unix. As mentioned, on 32-bit systems it is typically an unsigned int (4 bytes); on 64-bit systems it is typically an unsigned 64-bit int (8 bytes).
  • size_t is not a built-in type - it is a typedef for one of the built-in types mentioned. When name mangled, it is reduced to that underlying type.  Therefore, when you dumpbin/nm your library, you will see your usages of size_t appear as unsigned int (32-bit systems) or unsigned 64-bit int (64-bit systems).

In a future post, I'd like to get into portable usage of integral built-ins like int, long, long long, 64-bit ints, and pointers. But if you have questions on size_t - let's have 'em.

Monday, January 16, 2012

why software projects are like progress meters

Why is it so hard to make a decent progress meter?  How many times have you seen a progress meter do something ridiculous like:

  • Get stuck at 71% for a long time.
  • Get stuck at 100% for a long time.
  • Get to 100%, then start over at 0% and go to 100% again.
  • Repeat the prior bullet five times.
  • See a time estimate of 324518 hours, then two seconds later, see an estimate of two minutes.

I'm being rhetorical when I ask "why can't we make better progress meters?"  I expect answers like, "how can I account for a network blip?" or "what should I do if the user suddenly loads the machine?" or "the work units being measured aren't of equal complexity".

Funnily enough, executing a software project is quite similar to watching a progress meter.  Low-effort planning, inexperience, failing to account for interruptions, and other combinations of factors can cause your software project to veer wildly from its estimate, get stuck at 71%, and in some cases, spawn off entirely new projects.

It all starts with the estimation process. A good estimate is the prescription for a smoothly run project.  Sadly, though, I've heard more than once, "the software process can't be accurately estimated".  This matter-of-fact resignation is bogus.  Authoring quality estimates isn't easy - but it allows your development and project managers to more accurately plan upcoming work.  So let's get to it:  my suggestions.

My first suggestion:  for those of you who write software estimates - try harder.  Study the existing code intensively - you may find logic that (nearly) meets the need.  Also, consider writing experimental logic "outside the app" to test your approach.  You won't be able to author a decent estimate until you've "cut a mental path" through the challenging parts of the project.  Once your estimate is done and the work begins, there should only be small mysteries remaining.

Second, know your "load" - your meeting schedule, how much time you spend on email, whether you'll be taking vacation, etc.  This is an Agile concept - knowing how many hours per day/week you can work on your projects.

The third suggestion may be obvious: let experience be your guide. However, you may be asked to work in unfamiliar code.  In these cases, spend a couple hours studying the code - then review your thoughts with one or more of your team members.  When no one is available to consult, you may need to spend more than a few days studying the code in order to provide a reasonable estimate.

Finally, make every effort to break your estimate into sub-one-week tasks. Admittedly, I struggle with this.  I argue with myself that I can call on my experience - that "I've done something similar before".  Invariably, though, my three week estimate either takes a week or five weeks.  In other words, I goofed my estimate because I didn't break it down.

If you put some effort into it, you should be able to author estimates that allow you to execute your software project with reasonable predictability - which is a heck of a lot more than I can say about most progress meters.  :)

Sunday, January 8, 2012

interviewing ftw

The interview process - what a mess.

Sometimes, superstar candidates get passed over.  Sometimes, less-than-stellar candidates get hired.  I've pondered this at length... it's frustrating.

As an interviewee, you're at the mercy of an unpredictable, inconsistent filtering process.  As such, I suggest the following conventional wisdom:
  • Your resume and cover letter should present your best efforts and qualities.  Sell your "wins".
  • Enumerate your skills, e.g., C++, SQL Server, Perl, etc.  It might seem trivial, especially if you think of yourself as more than just a set of skills, but if you don't, you'll be at a competitive disadvantage.
  • During the entire interview process - from the first contact, the first phone call, and the first interview - speak calmly and purposefully.  Answer questions directly, don't speak wastefully, and communicate as eloquently as possible.
  • Whether asked or not, when you're posed with a "thought" question, speak through it.  Reveal yourself, your manner of thinking, and above all, be honest - in particular, when you don't know an answer, say so.
  • Know something about the position and company for which you're interviewing.  Think of pointed questions to ask your interviewer(s).
The frustrating part is this:  even if you're at 100%, your interviewers might not be.  They might be facing deadlines of their own, interviewing other candidates that day, or just having an "off" day.  You simply cannot control their perspective on that particular day, regardless of how prepared you are.

For interviewees, that's the reality of it:  despite your best efforts and best qualities, your interviewers may fail to expose how awesome you are.  If your interviewers decide to pass on hiring you, that's their loss, unless in fact, you're not awesome.  But that's another issue altogether.

So let's assume you are awesome.  Now we need to examine the process from the interviewer's standpoint - because if they choose not to hire you, the failure is theirs.

As an interviewer, you're faced with a huge responsibility.  Being one of three to five interviewers (typically), you share the responsibility of whether to spend tens to hundreds of thousands of dollars on a new hire.  Will they be worth more than their salary?  Will they add to the social glue of the team?  Will they motivate their fellow team members, junior and senior, to rise to new levels?  Holy crap this is a big deal.

Here's the weird thing about hiring:  typically, senior staff will conduct interviews.  Being senior, they'll have biases.  Their opinions on hiring will be based on 1) whether the candidate shares similar personal/educational growth, and 2) the statistical pattern bias of their hiring history.  For example, if the interviewer has spent ten+ years becoming a master at recursion, they will be biased against the candidate if the candidate doesn't firmly grasp recursion - regardless of how much recursion is actually used in their products.

If you're an interviewer, here's my advice:  erase your biases.  Think of every candidate as being great until you've proven otherwise.  Your job is to discover the candidate's qualities - not eliminate them because they don't have the same technical "chops" as you.  This becomes the interviewer's challenge...  so how do you do it?

I have an approach to interviewing.  Of course, it is biased toward my personal viewpoints and past interview experience.  But I strive desperately to avoid succumbing to my biases.  I honestly believe that, as an interviewer, I am burdened with the challenge of discovering the candidate's greatness in 45 minutes or less.  So let's get to the meat of it - my approach to interviewing:
  • At some point during the interview, I like to take a negative stance to something the candidate expresses a belief in.  I'm trying to evoke a response - because work isn't all positive.
  • All good workers care deeply about the quality of their work.  So I might ask them to explain, comprehensively, how they ensured the quality of a particular project they implemented.
  • As I interview, I listen to every word choice - and every sentence.  Communication skills are paramount.  This is a potential teammate - they need to be articulate.
  • Also, as I listen, I am measuring the candidate's passion.  I am seeking natural, unforced passion.  The best workers are naturally motivated.
  • Finally, I want to measure the candidate's talents for the position.  Being technically savvy is crucial - and I usually pose a handful of technical questions to measure their technical aptitude.
That's just a flattened perspective on conducting a decent interview.  The fact is, I cannot write a recipe on how to conduct the perfect interview.  I will say that, as an interviewer, I am always prepared:  I've read and researched the candidate and their past few employers, I've written my questions on paper, and I am mentally prepared to make the most of the few minutes I'll have with the candidate.

Regardless of what side you're on, I suggest putting forth a strong effort.  Distinguish yourself, whether interviewee or interviewer - your (potential) employer will recognize the effort.

Thursday, October 20, 2011

c++ timing strategies compared (in Windows)

My coworkers always come to me with questions about the best way to time a block of C++ code.  I get asked, “I want a high-resolution timer – shouldn’t I use a timer based on cpu clock-cycles?”  I’m also asked, “I need to time something that is very short (microseconds) – so a high-resolution timer will work, right?”

I’ve followed my intuitions for years on this controversial matter.  My observation has been that the CRT function clock() was accurate enough - and I've steered away from timing "really small things" since I've seen too much instability in the timing result.

Instead of intuition, though, I’m gonna put a microscope on the matter.

the plan

I figure three things are important to determine regarding C++ timers in Windows:
1. What timer is most accurate?
2. What timer can measure the smallest time?
3. Are timing calls expensive?

I’d like to compare these three timing methods:
1. clock()
2. QueryPerformanceCounter()
3. GetProcessTimes()


the strategy

I’ve written a for-loop that will always take exactly two cpu clock cycles per iteration.  Because I know the clock-speed of my computer, I can predict how long the loop should take.  I will use this super simple for-loop to measure accuracy, resolution, and overhead of the different timing methods.

for ( unsigned int i = 0; i < iterations; i++ )
{
   __asm { nop }   // keeps the compiler from optimizing the loop away
}

The asm block is required so the compiler doesn’t optimize the loop away.  As it turns out, the nop is zero clock-cycles (I verified this by adding multiple nop’s, which had no effect on the overall time).  The assembly instructions implementing the for-loop are dec and jne – both of which are one clock-cycle.

I know how long my for-loop will take given the following equation:

                time = iterations * (number of clock cycles) / (machine clock speed)

I can now solve for iterations:

                iterations = (time) * (machine clock speed ) / (number of clock cycles)

For example, solving for iterations given I want to run for one second:

                iterations = (1 sec ) * (2.5 GHz) /  (2 cycles) = 1 250 000 000


accuracy

For each target-time, I run my simple for-loop for the number of iterations that would equate to the expected time.  Then I have an outer loop that runs 100-10000 times, so I can compute the mean and standard deviation – for all three timing methods.
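
For reference, here is a minimal sketch of how each timer can be read back as floating-point seconds (Windows only, error handling omitted, and the helper names are mine):

#include <windows.h>
#include <ctime>

// wall-clock seconds via the CRT
double secondsClock()
{
   return (double)clock() / CLOCKS_PER_SEC;
}

// wall-clock seconds via the high-resolution performance counter
double secondsQPC()
{
   LARGE_INTEGER frequency, now;
   QueryPerformanceFrequency( &frequency );
   QueryPerformanceCounter( &now );
   return (double)now.QuadPart / (double)frequency.QuadPart;
}

// cpu seconds (kernel + user) consumed by this process
double secondsProcess()
{
   FILETIME creationTime, exitTime, kernelTime, userTime;
   GetProcessTimes( GetCurrentProcess(), &creationTime, &exitTime, &kernelTime, &userTime );
   ULARGE_INTEGER k, u;
   k.LowPart = kernelTime.dwLowDateTime;   k.HighPart = kernelTime.dwHighDateTime;
   u.LowPart = userTime.dwLowDateTime;     u.HighPart = userTime.dwHighDateTime;
   return (double)( k.QuadPart + u.QuadPart ) * 100.0e-9;   // FILETIME ticks are 100 ns
}

Each measurement then just brackets the nop loop with one of these calls and takes the difference.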

mean times

Expected time (s)   clock()     QueryPerformanceCounter()   GetProcessTimes()
2.000000            2.001710    2.001693                    2.004769
1.000000            1.049360    1.049450                    1.004646
0.500000            0.521580    0.521670                    0.503415
0.250000            0.265080    0.265137                    0.255686
0.125000            0.129920    0.129856                    0.125737
0.062500            0.064680    0.064644                    0.063180
0.030000            0.032180    0.032244                    0.030732
0.001000            0.001069    0.001069                    0.001045
0.000100            0.000123    0.000123                    0.000125
0.000010            0.000019    0.000022                    0.000017


standard deviations

Expected time (s)   clock()     QueryPerformanceCounter()   GetProcessTimes()
2.000000            0.001663    0.001567                    0.011982
1.000000            0.068809    0.068873                    0.014298
0.500000            0.033658    0.033675                    0.010314
0.250000            0.019893    0.019949                    0.008796
0.125000            0.010623    0.010622                    0.004844
0.062500            0.004164    0.004179                    0.004053
0.030000            0.005066    0.005025                    0.005153
0.001000            0.000504    0.000473                    0.003962
0.000100            0.000424    0.000275                    0.001407
0.000010            0.000136    0.000010                    0.000562
Average std.dev.    0.0145      0.0145                      0.0065

accuracy conclusion
In almost every way, clock() and QueryPerformanceCounter() perform exactly the same.  Their results, both the means and standard deviations, are nearly equal.  One might even conclude they’re based on the same underlying logic.

On the other hand, GetProcessTimes() generates results that are closer to the expected values than the other two timing methods.  Also, GetProcessTimes() standard deviations are better on average.

In all three cases, measuring sub-millisecond times became very difficult.  I had to move up to averaging 10000 runs – and even then, do multiple runs of the whole experiment.  While the standard deviations weren’t too bad for sub-millisecond times, all three timing methods missed the expected time-value by a fair amount.

cost of timing calls

This will be a simpler experiment.  I’ll perform one million iterations of the above loop and make a call to each timing function within the loop.  Remember, this is the cost of making one million calls using the above loop – the cost of the loop construct is insignificant compared to the cost of the timing calls.
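
Here's a sketch of the measurement for the clock() case (the QueryPerformanceCounter() and GetProcessTimes() variants just swap the call inside the loop; elapsed time is read with the secondsProcess() helper sketched earlier, and the function name is mine):

// cost of one million clock() calls, measured in process cpu time
double costOfClockCalls()
{
   const unsigned int iterations = 1000000;

   double start = secondsProcess();
   for ( unsigned int i = 0; i < iterations; i++ )
   {
      clock();         // the call whose overhead we're measuring
      __asm { nop }
   }
   return secondsProcess() - start;
}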

Note this data is averaged over ten runs and based on calls to GetProcessTimes() (since we just determined it’s the most accurate):

clock()      QueryPerformanceCounter()   GetProcessTimes()
4.963952 s   4.812631 s                  0.363482 s

cost conclusion
Again, clock() and QueryPerformanceCounter() are so well matched I’m nearly convinced they have the same underlying implementation.  On the other hand, GetProcessTimes() beats them both by an order of magnitude, making it the most lightweight timing call of the three.

overall conclusion

  • clock() and QueryPerformanceCounter() are almost definitely based on the same underlying implementation (at least for Visual Studio 2010).
  • GetProcessTimes() is the accuracy winner.  It gets closer to the expected time, and generates more consistent (lower standard deviation) results.
  • GetProcessTimes() is also the call performance winner, being faster to call than the other two by an order of magnitude.
  • Below 1 ms, all three timers struggle to measure accurately - even with lots of averaging.  I don’t recommend trying to measure such small intervals.

While you may conclude you should use GetProcessTimes() for everything, I need to end with some important facts regarding it.

GetProcessTimes() only measures active CPU time – so if you call Sleep(), it doesn’t count the time spent sleeping.  Also, if the algorithm you’re benchmarking performs some type of I/O, GetProcessTimes() won’t measure the time spent waiting for I/O resources.

Finally, when multiple threads are working concurrently within the process, GetProcessTimes() reports the aggregate amount of time spent across all threads.  Therefore, if you’re trying to benchmark the performance benefit of your new multithreaded algorithm, GetProcessTimes() is not the right choice.
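
A quick sketch makes the Sleep() caveat obvious (assuming <cstdio> and the helpers from the earlier sketch are in scope; the function name is mine):

void demoSleepCaveat()
{
   double wallStart = secondsClock();
   double cpuStart  = secondsProcess();

   Sleep( 1000 );   // sleep for one second

   printf( "wall: %.3f s   cpu: %.3f s\n",
           secondsClock() - wallStart,        // ~1.000 s
           secondsProcess() - cpuStart );     // ~0.000 s - sleeping consumes no cpu time
}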

These issues with GetProcessTimes() are so glaring that, for me, I always use clock().  It is by far the easiest to call – and simply measures “wall-clock” time – which, in all cases, is exactly what I want.  (There are times for using GetProcessTimes() - just keep these caveats in mind.)

Regardless of the timing method you choose for your C++ code:
  • To ensure reasonable accuracy, focus on timing things that take more than 30 ms.
  • Do many runs; take an average.
  • Do your best to “quiet” the machine that is performing the benchmark by closing all unneeded programs and background processes.
  • Remember that calls to your timing method aren’t free.
  • Always benchmark using fully optimized builds outside the debugger (or Ctrl-F5 within the debugger).

Wednesday, October 19, 2011

maybe I've figured out the AMD Bulldozer strategy

In my prior post, I came down pretty hard on AMD's recent release of their Bulldozer cpu.  I had concluded that AMD was poisoned from the inside - that they put all their money on an engineer's long-shot idea.  That still might be right - but after giving this more thought, there is another possibility - and I hope I'm right.

I'm a big believer in approaching problems laterally (my first post discusses a lateral approach to goal setting).  What if AMD, instead of tackling Intel head-on, is taking a lateral approach?

Let's face it, AMD and Intel are in a longstanding war.  Intel clearly has more troops (staff), better equipment (their own fabs), and have honed their talents in frontal assault (making insanely fast chips).  Being smaller and without equally good weaponry, AMD has to think laterally.

As a frontal assault, the Bulldozer wasn't a total flop.  AMD made a strong effort - not as strong as the market leader - but strong nonetheless.  Now, remember the big picture - the Bulldozer is just one volley in a longstanding war.  AMD, we have to assume, has something up its sleeve.

The market for cpus goes beyond the desktop.  In fact, many argue the desktop is a dying breed.  Where are the fastest growing trends for cpus?  Massive servers (many cores, low power) for virtualization, cloud computing, and databases.  (Not to mention tablets and smart phones.)  In AMD's case, guess what?  They actually kick ass in the server space.  Check out the CPU Mark for multi-cpu systems - AMD has three systems in the top five.

Battling over the desktop is like battling over a huge land mass that is becoming less and less tactical for the overall war strategy.  My prediction is that AMD will, within the next year, deliver a brutal blow to Intel in the server cpu space.  Intel will be caught off guard, because their troops are largely amassed for maintaining desktop supremacy.

Do you believe in the future that many pundits predict, of smart phones, tablets, and browser-based PCs?  If so, then complex computing is moving more and more to back-end servers.  And if that's the case, then AMD's server strategy is a pretty darn good one.