Assuming Visual C/C++ 6, I have a complex data structure of 22399 elements that looks like this:

{
{ "(SAME", "AS", "U+4E18)", "HILLOCK", "OR", "MOUND"},
{ "TO", "LICK;", {1, 1, 0}, "TASTE,", "A", "MAT,", "BAMBOO", "BARK"},
{ "(J)", "NON-STANDARD", "FORM", "OF", "U+559C", ",", {1, 1, 0}, "LIKE,", "LOVE,", "ENJOY;", {1, 1, 4}, "JOYFUL", "THING"},
{ "(AN", "ANCIENT", {1, 2, 2}, {1, 2, 3}, "U+4E94)", "FIVE"}, 
...
}

What's the best way to declare this? I've tried things like

char * abbrevs3[22399][] = { ... };

and

char * abbrevs3[22399][][] = { ... };

but the compiler whinges something chronic.

EDIT: The data is a database of descriptions of certain Unihan characters. I've been exploring various ways of compacting the data. As it stands you have 22399 entries, each of which may contain a varying number of strings, or triplets of { abbrev marker, line where last seen, element of that line where last seen }.

By the way Greg's talking, I may need to have each line contain the same number of elements, even if some of them are empty strings. Is that the case?

EDIT #2: And it occurs to me that some of the numeric values in the triplets are way outside the limits of char.

+3  A: 

In C, you can only leave out the first dimension when declaring an array:

char * abbrevs3[][22399] = { ... };

This is because the compiler wants to know how big each "row" is, so that it can lay out the "columns" properly. I put the dimensions in quotes because you are free to interpret the dimensions in whatever way you wish, but that is the usual convention for a two-dimensional array.

That said, it is unclear what your data structure actually is or what you're trying to initialise it to. Your sample data doesn't seem to have any kind of pattern to it.

Greg Hewgill
+4  A: 

I would look into storing the data in XML or some other structured form, then reading and parsing it instead of making the initialization in the code. The penalty that you pay at initialization will more than be made up in the ease of understanding and increase in maintainability of your code. I'd also consider designing a specific data structure to hold each entry.

[EDIT] The example below attempts to replicate your subsequent description:

enum EntryType { string = 0, triple = 1 };

typedef struct {
   enum EntryType entry_type;
   union {
      char **string;
      int triple[3];
   } value;
} Entry;

typedef struct {
   Entry *entries;
} Abbreviation;

Abbreviation *abbrevs3;

abbrevs3 = parseAbbreviationData("path-to-abbreviations/abbrevs.xml");
tvanfosson
I agree - an explicit structure definition plus file initialisation is the way to go. I also agree with both answers so far in their remarks that we can't tell what your actual data structure is supposed to be, so it's hard to give a more specific answer.
Greg Whitfield
Not just me then - OP's description seems unrelated to the example.
fizzer
Seriously, make a data structure and populate it at run-time. Making enormous obtusely-initialized data structures is bad coding style.
Nick
+1  A: 

I think the question here is whether you can statically declare a multi-dimensional array of C style strings where there are a different number of strings on each row. So, something like this:

const char * arr[][3] =
    {
    {"bla", "bla", "bla"},
    {"bla", "bla" }
    };

In some languages this is referred to as a "jagged array." In C and C++ you can do this, though the compiler will allocate space as though every row were the full length, so you end up not initializing the 3rd item of the second row. When I tested this with gcc the third item was set to NULL, and you can in fact rely on that: elements of a partially initialized aggregate that lack an explicit initializer are zero-initialized, so the missing pointer is guaranteed to be a null pointer.

I don't think you'll be able to get the compiler to accept arrays declared like {1,2,3} as C style strings. Even if it did, and you treated these as strings, you'd have a problem since they're not null terminated.

I'd agree with the other posters, a better approach is probably to store this data in XML, yaml, or possibly in the database you're taking them from, and access them there. If you do need to create these statically in a source file you'll probably be better off declaring a structure that makes sense for your data and initializing an array of those. Something like:

typedef struct
{
  const char * somestring;
  const char * someotherstring;
  const unsigned int triple[3];
} Abbreviation;

const Abbreviation abb[] =
  {
    {"First Thing", "Second String", {1,2,3} },
    {"Other Thing", "Some String", {4,5,6} }
  };
ryan_s
The first example compiles fine. You're right that each line has three entries, and if any line needs to go to, say, 100 entries, then more space will be wasted on the others.
Jonathan Leffler
Aaah, good point. I'll make a correction.
ryan_s
A: 

Ryan_s's answer is the best for me. Yes, I was wanting to do a ragged array. And all the other comments about putting the data somewhere external are all appropriate. Thanks everyone.

boost
A: 

The saga is not over yet, it seems. I eventually ended up turning everything into a ragged array of int. But that loses the notion of items within a line, which the self-referential mechanism behind the triplets depended on.

Am now looking into using Euphoria rather than C, because of its superior support for ragged arrays. One can build standard DLLs with Euphoria and, once I figure out how to hand back a variant array of BSTR and write a Typelib ...

Mind you, I suppose I could stick with C and store triplets as just three ints in a row, and store the strings as pointers cast as integers. And that would save me a rather large rewrite of the VBScript that built the self-referential dictionary in the first place.

boost
+2  A: 

I just read your new posts and re-read the original post, and I think I just fully understood the goal here. Sorry it took so long, I'm kind of slow.

To paraphrase the question, on line 4 of the original example:

{ "(AN", "ANCIENT", {1, 2, 2}, {1, 2, 3}, "U+4E94)", "FIVE"},

You'd want to translate the triples into references to strings used earlier, in an attempt to compress the data. That line becomes:

{ "(AN", "ANCIENT", "FORM", "OF", "U+4E94)", "FIVE"},

If the goal is compression I don't think you'll see much gain here. The self-referencing triples are each 3 bytes, but the strings that are being substituted out are only 8 bytes total, counting null terminators, and you only save 2 bytes on this line. And that's for using chars. Since your structure is so big that you're going to need to use ints for references, your triple is actually 12 bytes, which is even worse. In this case you'll only ever save space by substituting for words that are 12 ascii characters or more.

If I'm totally off base here then feel free to ignore me, but I think the approach of tokenizing on spaces and then removing duplicate words is just kind of a poor man's Huffman compression. Huffman where the alphabet is a list of longest common substrings, or some other standard text compression method would probably work well for this problem.

If for some reason this isn't an option though, I think I would get a list of all unique words in your data and use that as a lookup table. Then store all strings as a list of indexes into that table. You'd have to use two tables, but in the end it might be simpler, and it would save you the space being used by the leading 1's you're using as the "abbrev marker" now. Basically, your abbreviation markers would become a single index instead of a triplet.

So,

const char * words[] = {
    "hello", "world", "goodbye", "cruel"
    };

const int line1[] = { 0, 1 };
const int line2[] = { 2, 3, 1 };
const int * strings[] = { line1, line2 };

You'd still lose a lot of space if your strings aren't of roughly uniform length though.

ryan_s
+1  A: 

The original data is about 1.7MB which was derived from 2 other files, one from my employer and the other (Unihan.txt, at about 30MB) from the Unicode Consortium. Using the dictionary look-up technique, using a dictionary of the top 128 longest and most frequently occurring words, only brings the data size down to 1.5MB. I could probably improve that by being more intelligent with my word detection, which currently is just a VBScript Split() on space.

I don't have any figures for how small I get with the quasi-Huffman approach, but my guess is that it's slightly less than 1MB. I was wanting to have all of this in the binary rather than as a separate file (despite what others may say about bad practice etc.) As it stands, however, it's all getting just a bit too hard, at least in C. If I can figure out how to create variant arrays of BSTR in Euphoria ...

EDIT: I have used the dictionary lookup with respect to standard UCNs and that works well due to the repetitive nature of glyph descriptions. The problem with the Unihan is that you end up with a description of what the glyph means; there's a qualitative (and quantitative!) difference between "VULGAR FRACTION ONE QUARTER" and "A KIND OF PUNISHMENT IN HAN DYNASTY, NAME OF CHESSMEN IN CHINESE CHESS GAME(SIMPLIFIED FORM, A VARIANT U+7F75) TO CURSE; TO REVILE; TO ABUSE, TO SCOLD"

Thus the move away from the dictionary look-up and toward some more-powerful "compression" technique.

(And before anyone says, "so what's the big deal with 1.7MB?", I come from an era where 16K RAM was a lot. And I have space constraints in any case.)

boost