Hey all,

I was wondering how one can fix an upper limit for the length of a string (in C++) for a given platform.

I looked through a lot of libraries, and most of them define it arbitrarily. The GNU C++ standard library (libstdc++, the version with experimental C++0x features) has quite a definition:

size_t npos = size_t(-1); /*!< The maximum value that can be stored in a variable of type size_t */
size_t _S_max_size = ((npos - sizeof(_Rep_base))/sizeof(_CharT) - 1) / 4; /*!< Where _CharT is a template parameter; _Rep_base is a structure which encapsulates the allocated memory */

Here's how I understand the formula:

  • The size_t type must hold the count of units allocated to the string (where each unit is of type _CharT)
  • Theoretically, the maximum value that a variable of type size_t can hold is the total number of 1-byte units (i.e., of type char) that may be allocated
  • The previous value minus the overhead required to keep track of the allocated memory (_Rep_base) is therefore the maximum number of units in a string. Divide this value by sizeof(_CharT) as _CharT may require more than a byte
  • Subtract 1 from the previous value to account for a terminating character
  • Finally, that leaves the division by 4. I have absolutely no idea why!
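
To make the steps above concrete, here is a minimal sketch of the same arithmetic. RepBase is a hypothetical stand-in for libstdc++'s _Rep_base (its real layout is an implementation detail and may differ), and max_len is an illustrative name, not a library function.

```cpp
#include <cstddef>

// Hypothetical stand-in for the bookkeeping header stored in front of the
// characters; the real _Rep_base layout is an implementation detail.
struct RepBase {
    std::size_t length;
    std::size_t capacity;
    int refcount;
};

// Largest value a size_t can hold, i.e. size_t(-1).
constexpr std::size_t npos = static_cast<std::size_t>(-1);

// Mirrors the quoted formula: strip the header, convert bytes to
// characters, reserve one terminator, then quarter the result.
template <typename CharT>
constexpr std::size_t max_len() {
    return ((npos - sizeof(RepBase)) / sizeof(CharT) - 1) / 4;
}
```

Assuming a typical 64-bit target where sizeof(RepBase) is 24, max_len<char>() works out to 4611686018427387897, and wider character types give proportionally smaller limits.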

I looked at a lot of places for an explanation, but couldn't find a satisfactory one anywhere (that's why I've been trying to make up something for it! Please correct me if I'm wrong!!).

A: 

You could create a small wrapper class that contains a std::string. Expose the interface functions you care about. If any function call would increase your string beyond your desired maximum length, you could throw an exception or otherwise trigger an error.

This is intended as a way to achieve your goal (fix a max length on your string) without digging into the mess of deciphering the standard library implementation.
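
A minimal sketch of such a wrapper might look like the following. BoundedString and its members are hypothetical names chosen for illustration, and only append is exposed here; a real wrapper would forward whichever std::string operations you need.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical wrapper that enforces a caller-chosen maximum length.
class BoundedString {
public:
    explicit BoundedString(std::size_t max_len) : max_len_(max_len) {}

    // Append only if the result stays within the limit; otherwise throw
    // and leave the stored string unchanged.
    BoundedString& append(const std::string& s) {
        // data_.size() never exceeds max_len_, so the subtraction is safe.
        if (s.size() > max_len_ - data_.size()) {
            throw std::length_error("BoundedString: maximum length exceeded");
        }
        data_ += s;
        return *this;
    }

    std::size_t size() const { return data_.size(); }
    std::size_t max_size() const { return max_len_; }
    const std::string& str() const { return data_; }

private:
    std::size_t max_len_;
    std::string data_;
};
```

With a limit of 5, appending "abc" succeeds, and a further append of "def" throws std::length_error without modifying the contents.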

Kristo
Ya, sure, that's absolutely fine. In fact, libstdc++ has an internal helper just for that (I think it is std::__throw_length_error or something)! The thing is, I know how I can handle it, but I want to know why the limit is what it is.
themoondothshine
A: 

If you don't mind checking at runtime, you can call std::string::max_size(), which returns the maximum possible length of a string. This won't give you any reasons for its result (and I've no idea what the /4 is for in the GNU code, I'm afraid), but it will at least give you something definite to work with.

This is not a static member function, though, so determining the right value for every string might require a bit of care and/or a spot of system-specific code. (The VC++ string appears to defer to its allocator for this information, for example. That means that different strings could have different maximum sizes if they use different allocators, I suppose.)
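
As a rough sketch of the runtime check, the helper below queries max_size() on the string object itself, so it works for any allocator. The name fits is hypothetical, and a true result only means the length is representable, not that the allocation would succeed.

```cpp
#include <string>

// Hypothetical helper: true if a string of `requested` characters is
// representable for this particular string object. max_size() is a
// non-static member, so the answer is taken from the object itself.
template <typename String>
bool fits(const String& s, typename String::size_type requested) {
    return requested <= s.max_size();
}
```

For example, fits(std::string(), 100) is true, while fits(std::string(), std::string::npos) is false on the common implementations, where max_size() is smaller than npos.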

brone
Hmmm... That could be possible. With GCC however, this value is somewhat independent of the allocator. It depends on the typedef allocator::size_type, which almost always resolves to the standard size_t type.
themoondothshine
A: 

The practical limit is likely to be much smaller than the absolute limit. Memory allocation will fail, for example. The practical limits can't really be known ahead of time.
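
One way to see the difference between the two limits is to try reserving storage and note which exception, if any, comes back. This is only a sketch; try_reserve and ReserveResult are hypothetical names, and what counts as "too much" depends entirely on the system it runs on.

```cpp
#include <cstddef>
#include <new>
#include <stdexcept>
#include <string>

enum class ReserveResult { Ok, TooLong, OutOfMemory };

// Hypothetical probe: attempt to reserve room for n characters.
ReserveResult try_reserve(std::size_t n) {
    try {
        std::string s;
        s.reserve(n);  // throws std::length_error if n > s.max_size()
        return ReserveResult::Ok;
    } catch (const std::length_error&) {
        return ReserveResult::TooLong;      // beyond the absolute limit
    } catch (const std::bad_alloc&) {
        return ReserveResult::OutOfMemory;  // within the limit, but no memory
    }
}
```

A request above max_size() reports TooLong, while a request at or just below max_size() will usually report OutOfMemory instead, which is the practical limit showing up long before the absolute one.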

Mark Ransom
So I gather that the divisor 4 is arbitrarily chosen, or perhaps based on practical experience, as a safeguard against overflow?
themoondothshine
A: 

The comments in basic_string.h from GCC 4.3.4 state:

    // The maximum number of individual char_type elements of an
    // individual string is determined by _S_max_size. This is the
    // value that will be returned by max_size().  (Whereas npos
    // is the maximum number of bytes the allocator can allocate.)
    // If one was to divvy up the theoretical largest size string,
    // with a terminating character and m _CharT elements, it'd
    // look like this:
    // npos = sizeof(_Rep) + (m * sizeof(_CharT)) + sizeof(_CharT)
    // Solving for m:
    // m = ((npos - sizeof(_Rep))/sizeof(CharT)) - 1
    // In addition, this implementation quarters this amount.

In particular, note the last line, "In addition, this implementation quarters this amount." I take that to mean that the division by four is in fact entirely arbitrary.

I tried to find more information in the check-in log for basic_string.h, but it only goes back to October 5, 2000, and this comment was already present in that revision. I'm not familiar enough with that code base to know where the file might have lived in the source tree before it was moved to its current location.

Eric Melski
@Eric: Thanks for the info! Apparently, this particular comment is missing from GCC 4.4.0's `basic_string.h` (I wonder why!). 'Arbitrary' really explains away some of the questions I had about basic_string.
themoondothshine