ansaurus

Question

vector <unsigned char> vs string for binary data

Answer 1

A:

If you just want to store your binary data, you can use bitset which optimizes for space allocation. Otherwise go for vector, as it's more appropriate for your usage.

Jacob 2009-10-12 18:47:17

bitset is not a good choice. How are you going to get the data back out without casting? How do you easily read a byte out of a bitset? This isn't the right application for bitset.

Brian Neal 2009-10-12 20:04:44

Hence, "if you just want to store your binary data". This is important in some memory-intensive processes; e.g., when working with binary images, you'd want to store them temporarily and then reuse them later.

Jacob 2009-10-12 20:26:20

How often do you actually "just store data" though? If I were going to store it I would use a file or just an array or vector. What advantages does bitset have for storage? How do you even get your binary data into a bitset? Bitset has really lousy constructors for that purpose. Have you actually tried to do this? Bitset has a default constructor, a constructor that takes an unsigned long, and one that takes a string. Not real convenient for this purpose.

Brian Neal 2009-10-12 23:18:26

Storing it in an array or a vector would defeat the purpose of storage since we're using bitset for its optimized allocation of *bits*. Passing a string of bits is not that difficult. As for applications, binary images are one: an RGB 1024x768 is 2.25MB stored as uchars - imagine storing a small batch of frames (which is **not** unrealistic). Also, r/w to files is much slower than storing it temporarily as a bitset. Additionally, I did mention that if storage wasn't the prime motivation, `vector` is better.

Jacob 2009-10-13 00:15:36

Bitset is not optimized for storage of bits. In fact, the standard makes no guarantees on how the bits are actually stored. Bitset is used when you need, what else, a set of bits, as for example, flag manipulation. Please tell me how you are going to store a binary image 2.25 MB in size in a bit set. There is nothing more optimized for space allocation than an array of unsigned char.

Brian Neal 2009-10-13 00:42:16

Read the line about optimizing space allocation: http://www.cplusplus.com/reference/stl/bitset/

Jacob 2009-10-13 02:53:19

Jacob, this is silly. You claim that bitset is useful for storing binary data. This is absurd. Bitset is not a container, and it has no suitable constructors for being initialized from raw data, unlike vector or string. Are you seriously telling me you would construct a string of ASCII 1's and 0's from 2.25 MB of binary data in order to construct a bitset??? That's a pretty big string. Think about it. Bitset was not meant for this purpose. The C++ standard does not even specify how bitset internally stores data, unlike vector, which the standard guarantees to be contiguous.

Brian Neal 2009-10-13 03:01:58

There is no more compact way to store data in memory in C++ than with an array of unsigned char. The standard guarantees that you can treat the memory inside of vector<unsigned char> as a contiguous array. You cannot (portably) do that with bitset. You can't (portably) memcpy raw data into a bitset either.

Brian Neal 2009-10-13 03:11:16

`bitset` is efficient at storing binary data - I never said `bitset` was an STL container. And creating that "pretty big string" (which would use `unsigned char`, btw) is trivial. Also, everything I've seen till now (sample code on my compiler, Googling and Effective STL (pg.70)) indicates that bitset **does** store binary data effectively. And yes, there *is* a better way to store binary data, and it's `bitset` - have you tried it out on your compiler? It's only two lines of code.

Jacob 2009-10-13 03:56:35

To initialize a 2.25 MB bitset, you need a 10 MB string; each *character* in the string represents just *one bit* in the bitset. Also, you need to know how many bits you'll need *at compile time*. There are just two ways of extracting a bitset's contents en masse: to_ulong is useless if you have more bits than fit in a long, and to_string returns a string of zeroes and ones that can't easily be used in any other data type. So, yes, if all you want to do is *store* a preset amount of data, bitset might be OK. If you want the data back, or if the size is uncertain, then it's a lousy choice.

Rob Kennedy 2009-10-13 07:07:31

Agreed, if the size is uncertain, it's lousy, but getting the data back is `not`, since it's the same as storing the data: you can use `bitset::to_string`. And yes, you need a 10 MB string - that's the whole point of using bitset. Suppose you have an array of bits which you've obtained as unsigned chars, after some logical operation perhaps, and it's 10MB and you want to store it in memory - what do you do? `bitset`!

Jacob 2009-10-13 11:50:08

Ha-ha, you keep messing with your 10 MB string and I'll use my 2 MB vector<unsigned char>. I still have absolutely no clue why you feel bitset is good for "storing" data. Why is it better than vector? And what the heck are you supposed to do with it while it is in bitset? And yes I have tried to use bitset for binary data. I actually wrote my own implementation of bitset and gave it constructors and accessors to get the raw data in and back out for embedded systems. But I needed it because I was using it as it was intended, as a set of bit flags, not storage.

Brian Neal 2009-10-13 23:19:46

The fact that bitset doesn't provide (begin, end) constructors and raw data accessors makes it absolutely terrible for storing data. Your only way in or out for large numbers of bits is string? You also cannot say it is optimized for storage. As I have said several times, the standard does not guarantee how bitset should store data, unlike vector. For all you know, your bitset may store 1 bit in every byte for speed. I know of no implementation that actually does this, but that's why you can't count on it or portably memcpy it around. P.S. Don't rely on cplusplus.com for everything.

Brian Neal 2009-10-13 23:36:25

I don't think you understand what I'm saying. Your 2MB vector<unsigned char> which is supposed to represent 2Mbits can be more efficiently stored on *most implementations* (could you point out an implementation which performs so poorly? I can't find one!) using bitset. How? You throw it into the constructor and poof! you get a bitset which has stored your data more compactly, possibly by a factor of 8. Also, all I've said, *repeatedly*, is **storage**. Nothing about accessors, etc. etc.

Jacob 2009-10-14 05:49:53

@Jacob: I think you have a communications problem here with Brian. If you read a 1024x768@24 bit raw image you will have 2.25MBytes of information. The most that a bitset can pack the data is one bit for each element, and at that level it will require exactly 2.25MBytes of memory, just as a vector of bytes. Bitset will be an advantage if each of your original elements is a bit (at this point you can note that `std::vector<bool>` is a specialization that is optimized for space, not that the standards committee is happy about it, so even then a vector need not take more memory than a bitset).

David Rodríguez - dribeas 2009-10-14 06:32:11

... Now, if your intended use is testing flags, using a vector of bytes will be more cumbersome as it will require extracting each byte and then testing each bit for reading, extracting the byte, setting the bit and inserting the result back for setting a bit. At that point using a bitset or vector<bool> will simplify user code. But the thing is that if the elements you work with are not bits but rather bytes, then a vector is more efficient cpu wise than a bitset and is not less efficient memory wise. In most cases, when people talk about storing binary data they refer to bytes, not bits.

David Rodríguez - dribeas 2009-10-14 06:35:38
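
To make the bits-versus-bytes point in this thread concrete: a 1024x768 RGB image is 1024 * 768 * 3 = 2,359,296 bytes (2.25 MB), which is 18,874,368 bits, so a bitset holding all of it needs at least as many bits as the byte vector already uses. The following is only a rough sketch, assuming C++11 and 8-bit bytes; `store_bytes` and `store_bits` are hypothetical helper names, not anything from the standard library. It shows that the vector takes a raw buffer in one call, while the bitset needs its size at compile time and has to be filled bit by bit.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: copy a raw byte buffer into a vector in one call,
// using the (first, last) range constructor.
std::vector<unsigned char> store_bytes(const unsigned char* buf, std::size_t n) {
    assert(buf != nullptr);
    return std::vector<unsigned char>(buf, buf + n);
}

// Hypothetical helper: unpack N bytes into a bitset of 8 * N bits.
// The size is a template argument, so it must be known at compile time,
// and std::bitset has no constructor taking raw bytes, so every bit is
// set individually. Assumes 8-bit bytes.
template <std::size_t N>
std::bitset<8 * N> store_bits(const unsigned char (&buf)[N]) {
    std::bitset<8 * N> bits;
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t b = 0; b < 8; ++b)
            bits.set(8 * i + b, ((buf[i] >> b) & 1u) != 0);
    return bits;
}
```

Either way the same number of bits ends up in memory; the bitset only saves space when each original element was a single bit stored in a whole byte.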

Answer 2

A:

pure binary data: std::vector or something else, but not string. If your data contains a 0, string regards it as a terminator, hence you cannot store/access any data after it.

stijn 2009-10-12 18:48:13

std::string copes well with \0.

liori 2009-10-12 18:49:32

really? I have to admit I had no idea..

stijn 2009-10-12 18:55:14

It does, but you're still basically right. Don't use a string for non-string data.

jalf 2009-10-12 22:54:54

You had no idea but you answered anyway. Got to love stackoverflow.

Brian Neal 2009-10-13 23:20:54

well I thought strlen on binary data would mess up things, and the string class we use here uses strlen, so I incorrectly assumed std::string would mess that up as well..

stijn 2009-10-14 06:50:58
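
As the comments above note, std::string itself copes with embedded '\0' bytes; it is the C string functions such as strlen that stop at the first one. A small sketch of the difference, assuming C++11 (the helper names `from_buffer` and `c_length` are made up for illustration):

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: build a std::string with the (pointer, length)
// constructor, which copies exactly n chars, zero bytes included.
std::string from_buffer(const char* buf, std::size_t n) {
    assert(buf != nullptr);
    return std::string(buf, n);
}

// Length as seen by C string functions, which stop at the first '\0'.
std::size_t c_length(const std::string& s) {
    return std::strlen(s.c_str());
}
```

For a five-byte buffer containing 'a', '\0', 'b', '\0', 'c', `from_buffer(raw, 5).size()` is 5, whereas `c_length` of the same string, or constructing with `std::string(raw)` and no length, sees only the leading "a".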

Answer 3

+8 A:

Both are correct and equally efficient. Using one of those instead of a plain array is only to ease memory management and passing them as argument.

I use vector because the intention is more clear than with string.

Edit: C++03 standard does not guarantee std::basic_string memory contiguity. However from a practical viewpoint, there are no commercial non-contiguous implementations. C++0x is set to standardize that fact.

fnieto 2009-10-12 18:49:37

There's no way std::string is "correct" for unqualified "binary data".

Dan 2009-10-13 03:02:38

From SGI: "The basic_string class represents a Sequence of characters. It contains all the usual operations of a Sequence, and, additionally, it contains standard string operations such as search and concatenation.". Why is that incorrect? I agree it is not the best approach (as I state in my answer) but it is not incorrect.

fnieto 2009-10-13 08:47:18

So string works just as well as the vector because it in a sense extends the functionality of a vector yet the only functionality I will need ([] or the like) is contained in both? (Yes I realize that string doesn't actually inherit from vector.)

kalaxy 2009-10-13 19:26:07

Yes, but conceptually it is a worse option, and it has methods that make no sense for a buffer. If you only want memory management and operator[], why use a class as complex as std::string?

fnieto 2009-10-13 21:24:39
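
The contiguity guarantee mentioned in the edit above is what lets either class be handed to C-style code. The sketch below assumes C++11 or later, where std::string storage is guaranteed contiguous as well; `c_fill` is a hypothetical stand-in for some C API that writes into a caller-supplied buffer, and the other names are likewise illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for a C API that fills a caller-supplied buffer.
void c_fill(unsigned char* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = static_cast<unsigned char>(i);
}

// Let the "C API" write straight into the vector's contiguous storage.
std::vector<unsigned char> read_into_vector(std::size_t n) {
    std::vector<unsigned char> v(n);
    if (!v.empty())
        c_fill(&v[0], v.size());
    return v;
}

// Copy the same bytes into a std::string with memcpy.
// Writing through &s[0] relies on the C++11 contiguity guarantee.
std::string copy_to_string(const std::vector<unsigned char>& v) {
    std::string s(v.size(), '\0');
    assert(s.size() == v.size());
    if (!v.empty())
        std::memcpy(&s[0], &v[0], v.size());
    return s;
}
```

Both versions work; the vector version simply needs no cast and no reasoning about character types.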

Answer 4

+3 A:

Is one more efficient than the other?

This is the wrong question.

Is one a more 'correct' usage?

This is the correct question.
It depends. How is the data being used? If you are going to use the data in a string-like fashion then you should opt for std::string, as using a std::vector may confuse subsequent maintainers. If on the other hand most of the data manipulation looks like plain maths or is vector-like, then a std::vector is more appropriate.

Martin York 2009-10-12 18:57:44

Answer 5

A:

Compare these two and choose for yourself which is more suitable for you. Both are very robust and work with STL algorithms... Choose for yourself which is more effective for your task.

Davit Siradeghyan 2009-10-12 19:30:27

Answer 6

+11 A:

You should prefer std::vector over std::string. In common cases both solutions can be almost equivalent, but std::strings are designed specifically for strings and string manipulation and that is not your intended use.

std::string offers methods that you will probably not want to use and that just make the interface cumbersome for plain data storage. Some of the methods will have 'strange' behavior. For example, std::string::compare will treat two strings that are not bitwise identical as equal if they differ only in characters that the character traits consider equivalent. This makes the free function operator== return true for strings that are not bitwise equal. Say that the default character traits determine that 'a' and 'á' are equivalent; then std::string("a") == std::string("á"). While this may be sensible for strings, it certainly is not for binary data.

While I cannot recall a real example where this happens, there is no reason to prefer a more complex solution to a problem that can be solved by another container and that could fail in some weird, hard to debug way.

David Rodríguez - dribeas 2009-10-12 20:23:33

"Say that the default character traits determine that 'a' and 'á' are equivalent" That is a bad assumption. See the answer I wrote as continuation to this comment.

fnieto 2009-10-13 07:59:38

I rechecked, and you are right in that the standard does define the specialization `char_traits<char>` and with the standard specialization, assignment, comparisons and ordering are defined as the equivalent for the built-in char type.

David Rodríguez - dribeas 2009-10-13 08:25:52

So with default char_traits std::string would compare no differently than the corresponding std::vector?

kalaxy 2009-10-13 22:57:30

@kalaxy: correct. Anyway, each class was meant for a purpose, and `std::vector` better suits what you want from a buffer, so if only because the intention is clearer (as fnieto points out in his answer) I would prefer `std::vector`

David Rodríguez - dribeas 2009-10-14 06:20:25
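
The conclusion reached in these comments, that with the default char_traits a std::string compares for equality just like the corresponding vector, can be checked with a short sketch. This assumes C++11 and the default `std::char_traits<char>`; `bytes_equal` is a hypothetical helper, not a library function.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: compare two byte sequences both as strings and as
// vectors, and check that the two notions of equality agree.
bool bytes_equal(const std::string& a, const std::string& b) {
    const std::vector<char> va(a.begin(), a.end());
    const std::vector<char> vb(b.begin(), b.end());
    const bool as_string = (a == b);
    const bool as_vector = (va == vb);
    assert(as_string == as_vector);
    return as_string;
}
```

For example, `bytes_equal("a", "\xC3\xA1")` is false (the second argument is the UTF-8 encoding of 'á'): the default traits compare code units, not "equivalent" characters.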

Answer 7

A:

Personally I prefer std::string because string::data() is much more intuitive for me when I want my binary buffer back in C-compatible form. I know that vector elements are guaranteed to be stored contiguously, but exercising this in code feels a little bit unsettling.

This is a style decision that an individual developer or a team should make for themselves.

Oleg Zhylin 2009-10-12 21:47:31

You prefer using a string for non-string data? Rather than using the container *designed* for contiguous storage of data of any type?

jalf 2009-10-12 22:54:21

Let's not forget that this is a matter of style. Perfectly workable and standard-compliant code for binary buffers can be created with either of these classes. I would argue that vector is not designed to be a binary buffer either. It is compatible, but you will have to resort to algorithms or C tricks to get the job done. Not all string operations are safe, but some of them are quite useful to make the code cleaner and more maintainable.

Oleg Zhylin 2009-10-13 00:32:08

Brian Neal 2009-10-13 00:46:33

No, s.assign(BinaryBuffer, BinaryBufferSize); ?

Oleg Zhylin 2009-10-13 00:58:39

vector<unsigned char> v; v.assign(BinaryBuffer, BinaryBuffer + BinaryBufferSize);

Brian Neal 2009-10-13 01:16:08

Of course vector has a constructor explicitly for that purpose too: vector<unsigned char> v(first, last);

Brian Neal 2009-10-13 01:17:02

Thus you have to explicitly parametrize vector with unsigned char and make sure pointer arithmetic works correctly in BinaryBuffer + BinaryBufferSize. Looks like more pitfalls than the string option to me. As I said in the beginning, this is clearly a style issue. There's no such thing as "universal style". Teams or individual developers should decide which option they like better and adhere to that.

Oleg Zhylin 2009-10-13 11:30:23

Um, string is already parameterized by char, did you notice? So typedef your vector<unsigned char> if that makes you feel weird. String is meant for strings of characters, not raw binary data. String is a much more heavyweight solution.

Brian Neal 2009-10-13 23:22:44

And what do you mean by making sure pointer arithmetic works correctly? Vector uses the 2-iterator (begin, end) idiom like the rest of the STL (and string). Hardly more pitfalls than string.

Brian Neal 2009-10-13 23:24:36

Pointer arithmetic may play tricks if BinaryBuffer is not (unsigned char*). Could you please elaborate on what makes string _much_ more heavyweight?

Oleg Zhylin 2009-10-14 00:38:02
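
The two styles argued over in this thread, and the pointer-arithmetic pitfall Oleg raises, can be put side by side. This is only a sketch under assumptions: C++11, a `ByteBuffer` typedef as Brian suggests, and hypothetical helper names `to_vector` and `to_string_buffer`. Casting to a byte pointer before adding the size is what keeps `first + size` counting bytes when the source buffer has a wider element type.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

typedef std::vector<unsigned char> ByteBuffer;

// Hypothetical helper: copy size_in_bytes bytes into a vector.
ByteBuffer to_vector(const void* buf, std::size_t size_in_bytes) {
    assert(buf != nullptr);
    // Cast first, so that "+ size_in_bytes" advances by bytes and not by
    // elements of whatever wider type the caller's buffer has.
    const unsigned char* first = static_cast<const unsigned char*>(buf);
    return ByteBuffer(first, first + size_in_bytes);
}

// Hypothetical helper: the std::string equivalent, using the
// (pointer, length) constructor.
std::string to_string_buffer(const void* buf, std::size_t size_in_bytes) {
    assert(buf != nullptr);
    return std::string(static_cast<const char*>(buf), size_in_bytes);
}
```

Given an `unsigned int words[2]`, both helpers called with `sizeof(words)` copy the same bytes; writing `words + sizeof(words)` without the cast would instead point far past the end of the array.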

Answer 8

+1 A:

This is a comment on dribeas's answer. I write it as an answer to be able to format the code.

This is the char_traits compare function, and the behaviour is quite healthy:

static bool
lt(const char_type& __c1, const char_type& __c2)
{ return __c1 < __c2; }

template<typename _CharT>
int
char_traits<_CharT>::
compare(const char_type* __s1, const char_type* __s2, std::size_t __n)
{
  for (std::size_t __i = 0; __i < __n; ++__i)
    if (lt(__s1[__i], __s2[__i]))
      return -1;
    else if (lt(__s2[__i], __s1[__i]))
      return 1;
  return 0;
}

fnieto 2009-10-13 08:01:30

Is this behavior well defined in the standard?

gnud 2009-10-13 08:03:30

+1: @gnud: Not in general, but fnieto is right (I just checked it) in that the standard defines the specialization of traits for char, where `assign`, `eq` and `lt` must be defined as builtin operators =, == and < for type `char`.

David Rodríguez - dribeas 2009-10-13 08:22:04
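
The standard traits can also be called directly, which is a quick way to see the behaviour discussed above for yourself. A minimal sketch, assuming C++11; `compare_bytes` and `same_byte` are hypothetical wrapper names around the real `std::char_traits<char>::compare` and `std::char_traits<char>::eq`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Three-way comparison of the first n chars of two buffers via the
// standard traits; unlike strcmp, it does not stop at a zero byte.
int compare_bytes(const char* a, const char* b, std::size_t n) {
    assert(a != nullptr && b != nullptr);
    return std::char_traits<char>::compare(a, b, n);
}

// Equality of two single chars as the default traits define it.
bool same_byte(char a, char b) {
    return std::char_traits<char>::eq(a, b);
}
```

For instance, `compare_bytes("a\0b", "a\0c", 3)` is negative: the comparison continues past the embedded zero and is decided by the third byte.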
