Hello, I have this project I'm working on. The following conditions apply

  1. In this project I need to create one huge array (hopefully I will be able to create one as big as ~7.13e+17, but this target is still up ahead.)
  2. Each cell inside the array can contain one of the three values: 0, 1, 2.2
  3. I'm using C++ as my language.

I tried using the normal dynamic array command

int *p;
int i;
i = [size]; // This is calculated somewhere else.
p = new (nothrow) int[i];

But as far as I understand, this caps the number of elements in the array at the maximum value of int. If I change my code and use the following code

long long *p;
long long i;
i = [size]; // This is calculated somewhere else.
p = new (nothrow) long long[i];

Then each cell in the array is of type "long long", making the array very memory-heavy. Is there any way to create an array using long long to determine the number of cells in the array and have every cell of size int?

Thanks a lot, Uriel.

EDIT: for further information.

  1. This problem is mainly theoretical; it is part of my Master's thesis. I would still like this program to work as well as it can.
  2. My current step is to make this work for an array with 2.56e+09 items; a quick calculation shows we are talking about an array of at least 0.6 gigabytes, something my system should be able to cope with. Yet I cannot achieve this goal with my current coding solution, even though the space actually required is about 4.5 GB.
+1  A: 

As all of your values are smaller than 255, you might want to make this an array of char. In any case, the element type does not dictate the maximum allocatable size of the array.
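For instance, a minimal sketch along those lines (keeping the question's [size] placeholder):

#include <new>   // for std::nothrow

long long i = [size];                  // calculated somewhere else
char *p = new (std::nothrow) char[i];  // one byte per cell is plenty for three values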

Hasturkun
+1  A: 

Since there is a finite list of values, it might be possible to just use a char array. One byte can hold the three different values very easily.

Values:
0 -> 0
1 -> 1
2.2 -> 2

Storing values:

char values[array_size];   // array_size chosen at run time, one byte per cell
values[0] = 0;
values[1] = 1;
values[2] = 2;  // really the 2.2 value

Retrieving values:

int stored = values[i];    // 0, 1 or 2
double result = stored;
if (stored == 2)
    result = 2.2;          // decode the stored 2 back into 2.2

A little extra care is needed to decode the last value, but each array cell stays small (1 byte).

Array Allocation:

#include <iostream>

int main ()
{
    // static allocation requires a const size
    const int static_array_size = 100;
    char static_array[static_array_size];
    std::cout << "static array size is:" << sizeof(static_array) << std::endl;

    // heap allocation can vary in size (i.e. non-const heap_array_size variable)
    int heap_array_size = 200;
    char* heap_array = new char[heap_array_size];
    std::cout << "heap array size is:" << heap_array_size << std::endl;

    delete[] heap_array;   // release the heap allocation
}
skimobear
If I use "char values[i]" to initialize a char array, don't I need to know the size of the array before I compile my program?
Urielnam
Just added allocation example above
skimobear
+5  A: 

Is there any way to create an array using long long to determine the number of cells in the array and have every cell of size int?

There is no reason the type of the array has to be the same as the type of the variable used to specify the size. So use long long for the variable that specifies the size and then int for the type of the array.

int *p;
long long i;
i = [size]; // This is calculated somewhere else.
p = new (nothrow) int[i];

However, I am concerned when you say you need to create an array "as big as ~7.13e+17". I don't know whether you mean bytes or elements, but either way that is incredibly huge for a straight array. That's getting into the realm of petabytes of data.

In a 32-bit program, that's simply not possible. In theory you could have an array of up to a couple of gigabytes (though in practice considerably less, most of the time).

In a 64-bit program you could in theory allocate an array that large, as far as I know. However, I'm skeptical that most machines could actually handle it. Since this amount of data would far exceed the RAM in the machine, the operating system would be forced to push most of this array into a page file. But a petabyte-sized page file would far exceed the hard drive space on most typical machines right now.

Either way, you will probably need to seriously consider a different scheme than just allocating that whole huge array at once.

TheUndeadFish
+4  A: 

Since you want to maximize packing density, you'd probably be best off using bit fields:

struct item_pack { 
    char a:2;
    char b:2;
    char c:2;
    char d:2;
};

Then you can create an array of these and use proxy objects to support reading and writing individual items -- with the proviso that there are some limitations on how much you can do with proxy objects, so you'll have to be a bit careful about how you try to use this. A bit of looking at some of the articles about vector<bool> should provide some reasonable hints -- most of its characteristics stem from this general type of implementation. Despite the shortcomings as a general purpose container, this can work within limits, and provides much tighter packing of information than most of the obvious alternatives.
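For illustration, a minimal sketch of such a proxy-based wrapper (the class and member names here are made up for this example):

#include <cstddef>
#include <cstdint>
#include <vector>

// packed_array: each byte holds four 2-bit items, accessed through a proxy
class packed_array {
    std::vector<uint8_t> data_;
public:
    explicit packed_array(std::size_t n) : data_((n + 3) / 4) {}

    class proxy {
        uint8_t &byte_;
        unsigned shift_;
    public:
        proxy(uint8_t &b, unsigned s) : byte_(b), shift_(s) {}
        operator int() const { return (byte_ >> shift_) & 3; }   // read a 2-bit field
        proxy &operator=(int v) {                                // write a 2-bit field
            byte_ = uint8_t((byte_ & ~(3u << shift_)) | ((v & 3u) << shift_));
            return *this;
        }
    };

    proxy operator[](std::size_t i) {
        return proxy(data_[i / 4], unsigned(i % 4) * 2);
    }
};

With that, packed_array a(16); a[3] = 2; int v = a[3]; reads and writes two bits at a time, much as vector<bool> does for single bits.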

Jerry Coffin
+1  A: 

But as far as I understand, this caps the number of elements in the array at the maximum value of int. If I change my code and use the following code

That's absolutely wrong! The size of the array is completely independent of the maximum value of the type of the array.

So there is no need to make it a long long array. You could instead make it a char array, or use even less space than that.

If you need to store only three different values, you should play with bits inside a char or any other type. Then make an array of these.

A char is typically 1 byte, so 8 bits. To store 3 values, you need 2 bits; you can therefore store 4 values in a char.

Using binary masks and shifts, you can work out a packing like that, as sketched below.
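A minimal sketch of that idea (the helper names are hypothetical), packing four 2-bit cells into each char:

#include <cstddef>

// read the i-th 2-bit cell
unsigned char get2(const unsigned char *a, std::size_t i) {
    return (unsigned char)((a[i / 4] >> ((i % 4) * 2)) & 3);
}

// write v (0..3) into the i-th 2-bit cell
void set2(unsigned char *a, std::size_t i, unsigned char v) {
    unsigned shift = (unsigned)(i % 4) * 2;
    a[i / 4] = (unsigned char)((a[i / 4] & ~(3u << shift)) | ((v & 3u) << shift));
}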

Cedric H.
+2  A: 

In this project I need to create one huge array (hopefully I will be able to create one as big as ~7.13e+17, but this target is still up ahead.)

That calls for a dedicated structure, à la digital tree (or B-tree), with the key being the index, to avoid doing large allocations.

Large allocations, and especially reallocations, might cause unnecessary memory fragmentation. If you split the large array into smaller chunks, then not only does extending the array become easy, but representing a sparse array also becomes possible.

N.B. ~7.13e+17 needs about 60 bits. Do you even have hardware which can support that much RAM? It's not that I'm following the industry closely, but I have heard briefly about NUMA archs with a 58-bit address bus - but nothing about 60+ bit archs.

Each cell inside the array can contain one of the three values: 0, 1, 2.2.

If a cell might contain only 3 values (2.2 can be represented as 2), that makes 2 bits of information. That means you can pack 16 values into a uint32_t and 32 values into a uint64_t.

You can try to find some existing digital tree implementation (or roll your own) and use as the key upper bits of index. Remaining bits of the original index would be an index into the tree leaf which would be an array with packed values. To exemplify using std::map in place of the trie, untested:

#include <stdint.h>   // for uint64_t
#include <map>

enum {
   LS_BITS = 16,
   MS_BITS = 64-LS_BITS
};

enum {
   VALUE_BITS = 2,
   VALUE_MASK = ((1<<VALUE_BITS)-1)
};

// this represents an array of `1<<LS_BITS` values; std::map's operator[]
// value-initializes it, so packed_data starts out zeroed
struct leaf_node {
   uint64_t packed_data[ ((1<<LS_BITS)*VALUE_BITS) / (sizeof(uint64_t)*8) ];
};

// that should be a trie, to provide faster look-up
typedef std::map< uint64_t, leaf_node > big_array_type;

void
big_array_set_value( big_array_type &b, uint64_t index, uint64_t value )
{
   leaf_node &n = b[index >> LS_BITS];
   uint64_t li = index & ((1<<LS_BITS)-1);
   li *= VALUE_BITS;   // convert into bit offset
   uint64_t &x = n.packed_data[ li / (sizeof(uint64_t)*8) ];
   li %= (sizeof(uint64_t)*8);
   // clear the old 2-bit field, then OR in the new value
   x = (x & ~((uint64_t)VALUE_MASK << li)) | (value << li);
}

int
big_array_get_value( big_array_type &b, uint64_t index )
{
   leaf_node &n = b[index >> LS_BITS];
   uint64_t li = index & ((1<<LS_BITS)-1);
   li *= VALUE_BITS;   // convert into bit offset
   uint64_t x = n.packed_data[ li / (sizeof(uint64_t)*8) ];
   li %= (sizeof(uint64_t)*8);
   return (int)((x >> li) & VALUE_MASK);
}
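Usage would then look something like this:

big_array_type big;
big_array_set_value( big, 123456789ULL, 2 );        // leaf node is created (zeroed) on demand
int v = big_array_get_value( big, 123456789ULL );   // v == 2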

This way one still wastes about 0.4 bits of information per value, since the 2 bits of storage allow 4 values yet only 3 are used (a three-valued cell carries log2(3) ≈ 1.58 bits). That can also be improved, but at a much higher access-performance cost.

Dummy00001
+1  A: 

The value used to specify the size of an array needs to be of type size_t. The type used in the new expression is the type of the array elements. Regardless of the type of i in your example, it will be converted to size_t to create the array.

Now on a 32-bit machine, the maximum size_t is around 4e+9, so making an array of size 1e+17 is right out. On a 64-bit machine, size_t can theoretically go up to around 1e+19, but there's no way you can have anywhere near that amount of memory, so the allocation will fail.

So instead you need some kind of sparse data structure, as others have discussed. The key here is to decide which of your 3 values is the most common, and only store values where the array holds one of the other 2 values. You can use a std::map to hold these values (it even supports the [index] syntax), or a variety of other structures, depending on exactly what you're trying to do and the details of your data.
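A minimal sketch of that idea (the names are made up; it assumes 0 is the most common value):

#include <stdint.h>
#include <map>

const int DEFAULT_VALUE = 0;        // the most common value, never stored
std::map<uint64_t, int> sparse;     // holds only the exceptional cells

int get_cell(uint64_t index) {
    std::map<uint64_t, int>::const_iterator it = sparse.find(index);
    return it == sparse.end() ? DEFAULT_VALUE : it->second;
}

void set_cell(uint64_t index, int value) {
    if (value == DEFAULT_VALUE)
        sparse.erase(index);        // keep the map sparse
    else
        sparse[index] = value;
}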

Chris Dodd