ansaurus

Question

Best data structure in C for these two situations?

Answer 1

+1 A:

For searchability across multiple keys, store the data in any convenient form, and provides fast lookup indexes on the key(s).

This could be as simple as keeping the data in an array (or linked list, or ...) in the order of creation, and keeping a bunch of {hashtables|sorted arrays|btrees} of maps (key, data*) for all the interesting keys (SSN, name, ...).

If you had more time, you could even work out how to not have a different struct for each different map...

I think this solution probably applies to both your problems.

Good luck.

For clarity:

First we have a simple array of student records

typedef
struct student_s {
   char ssn[10]; // nul terminated so we can use str* functions 
   char name[100];
   float GPA;
   ...
} student;
student slist[MAX_STUDENTS];

which is filled in as you go. It has no order, so search on any key is a linear time operation. Not a problem for 1,000 entries, but maybe a problem for 10,000, and certainly a problem for 1 million. See dirkgently's comments.

If we want to be able to search fast we need another layer of structure. I build a map between a key and the main data structure like this:

typedef
struct str_map {
   char* key;
   student *data;
} smap;
smap skey[MAX_STUDENTS]

and maintain skey sorted on the key, so that I can do fast lookups. (Only an array is a hassle to keep sorted, so we probably prefer a tree, or a hashmap.)

This complexity isn't needed (and should certainly be avoided) if you will only want fast searches on a single field.

dmckee 2009-03-22 04:18:01

I'm not sure I understand what you are suggesting...

Nazgulled 2009-03-22 04:49:39

Answer 2

+3 A:

The right answer lies anywhere between a balanced search tree and an array.

The situation you have mentioned here and else-thread misses out on a very important point: The size of the data you are handling. You choose your data structure and algorithm(s) depending on the amount of data you have to handle. It is important that you are able to justify your choice(s). Using a less efficient general algorithm is not always bad. Being able to back up your choices (e.g: choosing bubble-sort since data size < 10 always) shows a) greater command of the field and b) pragmatism -- both of which are in short supply.

dirkgently 2009-03-22 04:36:47

The right answer to "How do I get fast searches?" lies...

dmckee 2009-03-22 04:40:13

@dmckee: Not sure I understand the point of your comment.

dirkgently 2009-03-22 04:41:52

He has to be able to search a single data set on multiple keys.That complicates the problem if his data set is big enough for linear time to be a problem. Which does nothing to detract for your point about the size of the data set being an important consideration.

dmckee 2009-03-22 04:49:27

@dmckee: I could figure that much out. And I left my answer vague intentionally.

dirkgently 2009-03-22 05:01:13

Answer 3

+1 A:

Outside of a homework question, you'd use a relational database for this. But that probably doesn't help you…

The first thing you need to figure out, as others have already pointed out, is how much data you're handling. An O(n) brute-force search is plenty fast as long a n is small. Since a trivial amount of data would make this a trivial problem (put it in an array, and just brute-force search it), I'm going to assume the amount of data is large.

Storing Cities

First, your search requirements appear to require the data sorted in multiple ways:

Some city unique identifier (name?)
Number of visitors

This actually isn't too hard to satisfy. (1) is easiest. Store the cities in some array. The array index becomes the unique identifier (assumption: we aren't deleting cities, or if we do delete cities we can just leave that array spot unused, wasting some memory. Adding is OK).

Now, we also need to be able to find most & fewest visits. Assuming modifications may happen (e.g., adding cities, changing number of visitors, etc.) and borrowing from relational databases, I'd suggest creating an index using some form of balanced tree. Databases would commonly use a B-Tree, but different ones may work for you: check Wikipedia's article on trees. In each tree node, I'd just keep a pointer (or array index) of the city data. No reason to make another copy!

I recommend a tree over a hash for one simple reason: you can very easily do a preorder or reverse order traversal to find the top or bottom N items. A hash can't do that.

Of course, if modifications may not happen, just use another array (of pointers to the items, once again, don't duplicate them).

Linking Cities to Profiles

How to do this depends on how you have to query the data, and what form it can take. The most general is that each profile can be associated with multiple cities and each city can be associated with multiple profiles. Further, we want to be able to efficiently query from either direction — ask both "who visits Phoenix?" and "which cities does Bob visit?".

Shamelessly lifting from databases again, I'd create another data structure, a fairly simple one along the lines of:

struct profile_city {
    /* btree pointers here */
    size_t profile_idx; /* or use a pointer */
    size_t city_idx;    /* for both indices */
};

So, to say Bob (profile 4) has visited Phoenix (city 2) you'd have profile_idx = 4 and city_idx = 2. To say Bob has visited Vegas (city 1) as well, you'd add another one, so you'd have two of them for Bob.

Now, you have a choice: you can store these either in a tree or a hash. Personally, I'd go with the tree, since that code is already written. But a hash would be O(n) instead of O(log*n*) for lookups.

Also, just like we did for the city visit count, create an index for city_idx so the lookup can be done from that side too.

Conclusion

You now have a way to look up the 5 most-visited cities (via an in-order traversal on the city visit count index), and find out who visits those cities, by search for each city in the city_idx index to get the profile_idx. Grab only unique items, and you have your answer.

Oh, and something seems wrong here: This seems like an awful lot of code for your instructor to want written in several hours!

derobert 2009-03-22 08:44:27

I had 2 weeks for this but I though had more, the fault is mine... :/

Nazgulled 2009-03-22 14:59:25

Answer 4

A:

The time to do this is almost up but this was the first phase, like a chekpoint, I still have to do it for the next one...

This probably sounds very stupid but I still don't know what to do. :(

Nazgulled 2009-03-22 23:18:14

ansaurus

tags:

views:

answers:

Best data structure in C for these two situations?

Storing Cities

Linking Cities to Profiles

Conclusion

related questions