Tuesday, January 24, 2012

using size_t pragmatically

size_t is awesome, but I rarely see it employed correctly. I'm not here to lay down some dogmatic "type correct" philosophy; I'd rather give a pragmatic treatment of how to use size_t. First, though, some background.

Background

Predominantly, size_t is used to represent the size of objects. For example, malloc is prototyped as:

void * malloc( size_t );

As such, size_t needs to scale with the platform's address space. On typical 32-bit platforms, size_t is a 32-bit unsigned int. On typical 64-bit platforms, size_t is an unsigned 64-bit int.

The next most predominant usage of size_t is the "length" of strings and containers. For example, strlen is prototyped as:

size_t strlen( const char * );

Also, many STL containers (e.g., vector, queue) have methods for returning their "length". These "length" and "size" methods (when boiled down to primitives) also return size_t.
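To make the background concrete, here is a minimal sketch that checks these return types at compile time. The helper total_length is a hypothetical name made up for illustration; the vector check only asserts that size_type is unsigned, since the exact type is left to the implementation (in practice it is usually size_t).

```cpp
#include <cassert>
#include <cstddef>      // std::size_t
#include <cstring>      // std::strlen
#include <type_traits>  // std::is_same, std::is_unsigned
#include <vector>

// sizeof yields std::size_t by definition.
static_assert(std::is_same<decltype(sizeof(int)), std::size_t>::value,
              "sizeof yields size_t");

// strlen is declared to return std::size_t.
static_assert(std::is_same<decltype(std::strlen("")), std::size_t>::value,
              "strlen returns size_t");

// vector::size() returns the container's size_type, an unsigned integer type.
static_assert(std::is_unsigned<std::vector<int>::size_type>::value,
              "vector::size_type is unsigned");

// Hypothetical helper: sums string lengths without ever leaving size_t.
std::size_t total_length(const std::vector<const char*>& strings) {
    std::size_t total = 0;
    for (std::size_t i = 0; i < strings.size(); ++i) {
        assert(strings[i] != 0);  // strlen requires a valid pointer
        total += std::strlen(strings[i]);
    }
    return total;
}
```

Because every length stays in size_t from strlen to the return value, no cast is needed anywhere.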

The pragmatic view of size_t

It's pretty simple:  know your application's needs.

For example, if your app has a ton of classes/structs for storing fixed length strings (I'm talking char[]) that are programmatically limited to 255 characters or fewer, storing their length with size_t is overkill - especially on 64-bit systems. In this case, you can get away with using an unsigned char for storing your strings' lengths. (On a 64-bit system, you just saved seven bytes per string:  sizeof(size_t) - sizeof(unsigned char). Multiplied across many objects, these savings add up quickly, as long as struct padding doesn't eat them.)
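Here is a rough sketch of that trade-off. The struct names, kMaxChars, and the assign helper are all hypothetical, invented for this example; the exact byte counts in the comments assume a platform where size_t is 8 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical application limit: strings never exceed 255 characters.
const std::size_t kMaxChars = 255;

// Length stored as size_t: 8 + 256 = 264 bytes when size_t is 8 bytes.
struct WideLengthString {
    std::size_t length;
    char text[kMaxChars + 1];   // +1 for the terminating NUL
};

// Length stored as unsigned char (holds 0..255): 1 + 256 = 257 bytes.
struct NarrowLengthString {
    unsigned char length;
    char text[kMaxChars + 1];
};

// Copies src into dst; returns false (leaving dst untouched) if src is too long.
bool assign(NarrowLengthString& dst, const char* src) {
    assert(src != 0);
    const std::size_t n = std::strlen(src);
    if (n > kMaxChars)
        return false;
    dst.length = static_cast<unsigned char>(n);  // safe: n <= 255 was just checked
    std::memcpy(dst.text, src, n + 1);           // copy including the NUL
    return true;
}
```

Note that the narrowing cast happens exactly once, right after the range check, rather than being scattered through the code.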

In contrast, if your application places no restriction on the length of strings stored - you'll definitely want to use size_t for storing string lengths.

If you're rolling your own container objects (e.g., vector, dynamic array, hash table), my preference is to use size_t for storing your element size and your element count. As an author of a "generic container", it is difficult to predict its usage - and size_t provides the maximum headroom. That said, if you're watching your memory usage closely - and you can reliably predict your application's usage of your container - you can choose a smaller built-in for storing your element size and/or element count. You would do this to save memory, of course - and depending on the container's usage, you could see a big savings.
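One way to keep both options open is to make the count type a template parameter that defaults to size_t. The following is only a sketch under that assumption; SmallStack and its members are hypothetical names, not an existing library.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Hypothetical fixed-capacity stack. CountT is the type used to store the
// element count; it defaults to size_t for maximum headroom.
template <typename T, std::size_t Capacity, typename CountT = std::size_t>
class SmallStack {
    static_assert(Capacity <= std::numeric_limits<CountT>::max(),
                  "CountT is too small to count up to Capacity");
public:
    SmallStack() : count_(0) {}

    // Returns false when the stack is full.
    bool push(const T& value) {
        if (count_ == Capacity)
            return false;
        items_[count_++] = value;
        return true;
    }

    // The public interface still speaks size_t, whatever is stored inside.
    std::size_t size() const { return count_; }

    const T& operator[](std::size_t i) const {
        assert(i < count_);
        return items_[i];
    }

private:
    T items_[Capacity];
    CountT count_;
};
```

With T = char and a capacity of 16, choosing unsigned char for the count gives a 17-byte object instead of 16 + sizeof(size_t). With larger element types, alignment padding can swallow part of that difference, so it is worth checking sizeof before committing.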

My pet peeve with size_t is seeing it cast away needlessly. For example:

int length = (int)strlen( psz );
for ( int i = 0; i < length; i++ )
   ...

The above is just lazy coding. Why downcast the length of psz? (Probably to quiet the compiler's "downcast" warning.) On 64-bit systems, mixing a 32-bit int index with 64-bit sizes can cost extra conversion instructions, and the cast misreports any length larger than INT_MAX. In all cases, stack is cheap. In these cases, I always use size_t:

size_t length = strlen( psz );
for ( size_t i = 0; i < length; i++ )
   ...

The above is cleaner, faster, and in my opinion, smarter.

A couple final items of note:
  • size_t is completely portable. Use it with confidence between Windows and Unix. As mentioned, on typical 32-bit systems, it is an unsigned int (4 bytes); on typical 64-bit systems, it is an unsigned 64-bit int (8 bytes).
  • size_t is not a built-in type; it is a typedef. When name mangled, it is reduced to the underlying built-in type.  Therefore, when you dumpbin/nm your library, you will see your usages of size_t replaced by unsigned int (32-bit systems) or an unsigned 64-bit int (64-bit systems).
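If you're curious which built-in type size_t reduces to on your own platform, a small check along these lines will tell you. The function name size_t_builtin_name is hypothetical, and the list of candidates only covers the common cases.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// size_t is always some unsigned integer type.
static_assert(std::is_unsigned<std::size_t>::value, "size_t is unsigned");

// Hypothetical helper: names the built-in type that size_t is a typedef of
// on the platform this is compiled for.
const char* size_t_builtin_name() {
    if (std::is_same<std::size_t, unsigned int>::value)
        return "unsigned int";
    if (std::is_same<std::size_t, unsigned long>::value)
        return "unsigned long";
    if (std::is_same<std::size_t, unsigned long long>::value)
        return "unsigned long long";
    return "some other unsigned type";
}
```

The answer differs between 64-bit platforms: 64-bit Linux typically reports unsigned long, while 64-bit Windows reports unsigned long long. That is also what shows up in mangled names. For example, under the Itanium C++ ABI used by GCC and Clang, unsigned int is encoded as j and unsigned long as m, so a function taking size_t gets a different symbol in 32-bit and 64-bit builds.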

In a future post, I'd like to get into portable usage of integral built-ins like int, long, long long, 64-bit ints, and pointers. But if you have questions on size_t - let's have 'em.

2 comments:

  1. Do you have any idea why they chose to make size_t unsigned? It is inconvenient for backwards loops:

    std::vector<int> v;
    // fill v ...
    for( size_t ii=v.size()-1; ii>=0; --ii ) // bad idea
    {
    }

    Is there an idiom for doing this better?

    Replies
    1. My guess is that size_t is unsigned for a couple of reasons:
      * it measures a dimensional quantity
      * importantly, it can scale to the extents of addressable space

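      Your example is a good one. For those wondering what is going on: because size_t is unsigned, the condition ii >= 0 is always true. When ii finally gets to 0 (zero), then decrements one final time, instead of becoming -1, it wraps around to the largest size_t value (4294967295 on a 32-bit system, 18446744073709551615 on a 64-bit one). The loop fails to terminate, accessing memory outside the vector, eventually crashing. An empty vector fails the same way right from the start, since v.size() - 1 wraps too.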
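      The wraparound itself is easy to see in isolation. This is just an illustrative sketch, and wrapped_zero is a made-up name:

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Decrements an unsigned zero and returns the result.
std::size_t wrapped_zero() {
    std::size_t ii = 0;
    --ii;   // well-defined for unsigned types: wraps modulo 2^N, never goes negative
    return ii;
}
```

      The returned value equals std::numeric_limits<std::size_t>::max(), whatever the width of size_t happens to be.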

      When using primitives for iterating, this case does lead to a dilemma. However, you've kindly decided to use an stl vector in your example - which means you can use a reverse iterator:

      typedef std::vector<int> vint;
      vint v;
      // fill v ...
      // rbegin()/rend() walk the vector back to front, so no index arithmetic is needed
      for ( vint::reverse_iterator vi = v.rbegin(); vi != v.rend(); ++vi )
         printf( "%d\n", *vi );

      This is the "smoothest" way I can think of for the backwards traversal of a vector. However, when stl isn't around and such conveniences aren't provided, I think we're stuck with inelegant solutions such as:

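      typedef std::vector<int> vint;
      vint v;
      // fill v ...
      if ( !v.empty() )   // v.size() - 1 would wrap around on an empty vector
      {
         size_t ii;
         for ( ii = v.size() - 1; ii > 0; ii-- )
            printf( "%d\n", v[ii] );
         printf( "%d\n", v[ii] );   // element 0, handled outside the loop
      }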

      Can you or anyone else think of other (elegant) solutions?
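      One commonly seen idiom that fits this question is to start the index at size() and decrement it inside the loop condition, so the test happens before the wraparound can matter. A minimal sketch, with reversed_copy as a hypothetical helper that collects the elements instead of printing them:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the elements of v back to front,
// using an unsigned index the whole way.
std::vector<int> reversed_copy(const std::vector<int>& v) {
    std::vector<int> out;
    out.reserve(v.size());
    // ii is compared against 0 first, then decremented, so the body sees
    // size()-1 down to 0. On an empty vector the body never runs.
    for (std::size_t ii = v.size(); ii-- > 0; )
        out.push_back(v[ii]);
    return out;
}
```

      After the loop ends ii has wrapped, but it is never used as an index in that state, so the empty vector and element 0 need no special handling.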
