views:

391

answers:

10

I know it's ridiculous but I need it for storage optimization. Is there any good way to implement it in C++ ?

It has to be flexible enough so that I can use as normal data type e.g Vector< int20 >, operator overloading etc..

+5  A: 

No... you can't do that as a single value-semantic type... any class data must be a multiple of the 8-bit character size (inviting all the usual quips about CHAR_BITS etc).

That said, let's clutch at straws...

Unfortunately, you're obviously handling very many data items. If this is more than 64k, any proxy object into a custom container of packed values will probably need a >16 bit index/handle too, but still one of the few possibilities I can see worth further consideration. It might be suitable if you're only actively working with and needing value semantic behaviour for a small subset of the values at one point in time.

struct Proxy
{
    Int20_Container& container_;  // might not need if a singleton
    Int20_Container::size_type index_;
    ...
};

So, the proxy might be 32, 64 or more bits - the potential benefit is only if you can create them on the fly from indices into the container, have them write directly back into the container, and keep them short-lived with few concurrently. (One simple way - not necessarily the fastest - to implement this model is to use an STL bitset or vector as the Int20_Container, and either store 20 times the logical index in index_, or multiply on the fly.)

It's also vaguely possible that although your values range over a 20-bit space, you've less than say 64k distinct values in actual use. If you have some such insight into your data set, you can create a lookup table where 16-bit array indices map to 20-bit values.

Tony
"value-semantic type"? Does that exclude bitfields or something?
Potatoswatter
@Potatoswatter: No, bitfields have value semantics. It means a type behaves as you'd expect for say an int or std::string in terms of observable behaviour when copying, assigning, operator==, going out of scope etc.. As distinct from e.g. pointer-to-value semantics. See http://www.parashift.com/c++-faq-lite/value-vs-ref-semantics.html if interested.
Tony
Well, since bitfields don't have to be a multiple of 8 bits, maybe you should change the answer.
Potatoswatter
I think I've lost you :-). The question: "It has to be flexible enough so that I can use as normal data type e.g Vector< int20 >, operator overloading etc..". That's another way of saying it needs value semantics. You could use a 20 bit field, but it won't prevent the object containing it from being padded out to 32 bits.
Tony
@Tony: see my answer.
Potatoswatter
+1  A: 

You can use the union keyword to create a bit field (see this site for some nice samples). I've used it way back when bit fields were a necessity. Otherwise, you can create a class that holds 3 bytes, but through bitwise operations exposes just the most significant 20.

Traveling Tech Guy
But 2 bytes is just 16 bit, how to obtain 20 bits from it.
Gunner
WTF? an 2 bytes value can only hold 16 bit (assuming 8bit bytes).
BatchyX
Sorry, meant 3 bytes. Corrected.
Traveling Tech Guy
but then it cost 3bytes storage... I ended up using 24-bit integer
iKid
So use the union... (hint: it would still cost you 3 bytes, and maybe even 4, since the compiler aligns memory allocation in 4-byte blocks - depending, of course on the compiler and architecture). But why do you care so much? What environment are you developing for, where 4 bits are at a premium?
Traveling Tech Guy
+4  A: 

Use a class. As long as you respect the copy/assign/clone/etc... STL semantics, you won't have any problem.

But it will not optimize the memory space on your computer. Especially if you put in in a flat array, the 20bit will likely be aligned on a 32bit boundary, so the benefit of a 20bit type there is useless.

In that case, you will need to define your own optimized array type, that could be compatible with the STL. But don't expect it to be fast. It won't be.

BatchyX
A: 

As far as I know that isn't possible.

The easiest option would be to define a custom type, that uses an int32_t as the backing storage, and implements appropriate maths as override operators.

For better storage density, you could store 3 int20 in a single int64_t value.

Douglas Leeder
+8  A: 

If storage is your main concern, I suspect you need quite a few 20-bit variables. How about storing them in pairs? You could create a class representing two such variables and store them in 2.5+2.5 = 5 bytes.

To access the variables conveniently you could override the []-operator so you could write:

int fst = pair[0];
int snd = pair[1];

Since you may want to allow for manipulations such as

pair[1] += 5;

you would not want to return a copy of the backing bytes, but a reference. However, you can't return a direct reference to the backing bytes (since it would mess up it's neighboring value), so you'd actually need to return a proxy for the backing bytes (which in turn has a reference to the backing bytes) and let the proxy overload the relevant operators.

As a metter of fact, as @Tony suggest, you could generalize this to have a general container holding N such 20-bit variables.

(I've done this myself in a specialization of a vector for efficient storage of booleans (as single bits).)

aioobe
I thought of this solution before. But still not figure out how to represent it so that it's easy to perform operation like a+b, or Vector<int20> ..
iKid
Updated my answer :-)
aioobe
Once you start thinking about proxies, seems to me you're better off having a custom container of an arbitrary number of 20-bit values and a proxy into that without the intermediate `pair` representation complicating things.
Tony
This is basically about manually coding bitfields…
Potatoswatter
Vector of booleans is already specialized (by STL developers) to store values in bits :)
n0rd
@n0rd: and most people agree it was a mistake :-)
Matthieu M.
@n0rd, this was for a C++ lab course :-)
aioobe
(much of the discussion of Potatoswatter's solution is relevant to this one too)
Tony
@Matthieu: it was a mistake that it's a specialisation of `vector`, because it doesn't really satisfy the requirements of `vector` or even of `sequence`. There's nothing wrong with the principle of a variable-length `bitset` with an iterator.
Steve Jessop
@Steve: I don't think I ever said anything against that :), a `bit_vector` would have been very useful.
Matthieu M.
A: 

Just an idea: use optimized storage (5 bytes for two instances), and for operations, convert it into 32-bit int and then back.

alxx
+1  A: 

Your not going to be able to get exactly 20 bits as a type(even with a bit packed struct), as it will always be aligned (at smallest grainularity) to a byte. Imo the only way to go, if you must have 20 bits, is to create a bitstream to handle the data(which you can overload to accept indexing etc)

Necrolis
+3  A: 

You can use C++ std::bitset. Store everything in a bitset and access your data using the correct index (x20).

Ugo
A: 

While its possible to do this a number of ways. One possibilty would be to use bit twidling to store them as the left and right parts of a 5 byte array with a class to store/retrieve which converts yoiur desired array entry to an array entry in byte5[] array and extracts the left ot right half as appropriate.

However on most hardware requires integers to be word aligned so as well as the bit twiddling to extract the integer you would need some bit shifiting to align it properly.

I think it would be more efficient to increase your swap space and let virtual memory take care of your large array (after all 20 vs 32 is not much of a saving!) always assuming you have a 64 bit OS.

James Anderson
20 vs 32 is a 37.5% savings. If his program is memory-bandwidth constrained chugging through this array, he will pretty much get that much speed boost. I would call 37% faster significant.
Potatoswatter
+2  A: 

Use a bitfield. (I'm really surprised nobody has suggested this.)

struct int20_and_something_else {
    int less_than_a_million : 20;
    int less_than_four_thousand : 12; // total 32 bits
};

This only works as a mutual optimization of elements in a structure, where you can spackle the gaps with some other data. But it works very well!

If you truly need to optimize a gigantic array of 20-bit numbers and nothing else, there is:

struct int20_x3 {
    int one : 20;
    int two : 20;
    int three : 20; // 60 bits is almost 64

    void set( int index, int value );
    int get( int index );
};

You can add getter/setter functions to make it prettier if you like, but you can't take the address of a bitfield, and they can't participate in an array. (Of course, you can have an array of the struct.)

Use as:

int20_x3 *big_array = new int20_x3[ array_size / 3 + 1 ];

big_array[ index / 3 ].set( index % 3, value );
Potatoswatter
The poster's basically wanting `int20_t data[HUGE_NUM]`. You're offering `Twenty_Twelve data[HUGE_NUM]`. That won't save memory, or allow `X n = data[20] + data[50]; my_vector_of_X.push_back(n)`. What will all the "twelves" be used for? One of us is confused, as I can't see the relevance of this.
Tony
@Tony: I offer two solutions. One stores twenty bits in an easily-addressed structure (provided there are other scraps of data, which is often the case), and the other stores 3x20 bits in a structure that demands you divide the array index by 3. Really, you can't see how this solves the problem?
Potatoswatter
Not, I can't. It's basically like aioobe's solution. The client code has to keep providing that index % 3 value all over the place which prevents you thinking in terms of a discrete value-semantic variable. .set isn't "=". Having to put x.get(2) + x.get(1) differs from x + y. How does your suggestion really improve on directly and repeatedly indexing into a single "Container big_array"?
Tony
@Tony: Except I've provided a concrete solution which can be used immediately. Additional abstraction can get rid of all that syntactic ugliness, but C provides the framework and a bitfield is already a limited proxy object for a range of bits. In practical use, most people choose to deal with code as in my answer, as fully capable proxies are more trouble than extra keystrokes in a couple inner loops. Anyway, to say it's impossible is just incorrect.
Potatoswatter
@Potatoswatter: good point - you've got a simple working and workable solution there in a couple lines, whereas a value-semantic proxy is probably more like 50 lines and in this case will necessarily make functional trade-offs, and a custom container's maybe 20 lines if backed by an STL bit container. No good answer here, and I've no problem with yours except that the whole question was about striving for those value semantics, examining how close we could come and at what cost.
Tony
@Tony: I do like to use this site for programming exercise, but this is one case where I'd say that it's just too much trouble. Premature abstraction is the flip side of premature optimization. Also, this being applicable to inner loops, you would have to code the proxy to a high performance standard. Pretty code and micro-optimized code (use of bitfields is micro-optimization) are mutually exclusive.
Potatoswatter
If i'm not wrong, this struct still take 64bit instead of 60bit. But that come really close to what I need. Many thanks to those devote your time to this question!
iKid
The reason I want it to be more generic like a + b because like that I would not need to refactor the code which using primitive data like integer 32-bit, 64-bit and 24-bit that I have implemented using template
iKid
@iKid: that's a solid reason. C++ does generally provide the facilities for drop-in replacement with arbitrary user-defined-types, just unfortunate about the data size issue....
Tony