views: 816

answers: 4
I kind of need to decide on this quickly to see if I can achieve it in the couple of hours before my school project's deadline, but I don't understand much about data structures and I need suggestions...

There are two things I need to do, and they will probably use different data structures.

  1. I need a data structure to hold profile records. The profiles must be searchable by name and by social security number. The SSN is unique, so I can probably use that to my advantage? I suppose a hash map is the best bet here? But how do I use the SSN in a hash map to speed up looking for a specific profile? A basic and easy-to-understand explanation would be much appreciated.

  2. I need a data structure to hold records about cities. I need to know which cities have the most visitors, which have the fewest, and which clients (the profile is pulled from the data structure in #1 for data about the clients) visit a specific city.

This is the third data structure I need for my project, and it's the one I have no idea where to begin with. Suggestions as to which type of data structure to use are appreciated, if possible with examples of how to hold the data above.

As a note:
The first data structure is already done (I talked about it in a previous question). The second one is described in #1 above, and although the other group members are taking care of it, I just need to know whether what we are trying to do is the "best" approach. The third one is #2, and it's the one I need the most help with.

+1  A: 

For searchability across multiple keys, store the data in any convenient form, and provide fast lookup indexes on the key(s).

This could be as simple as keeping the data in an array (or linked list, or ...) in the order of creation, and keeping a bunch of {hashtables|sorted arrays|btrees} of maps (key, data*) for all the interesting keys (SSN, name, ...).

If you had more time, you could even work out how to not have a different struct for each different map...

I think this solution probably applies to both your problems.

Good luck.


For clarity:

First we have a simple array of student records

typedef
struct student_s {
   char ssn[10];  // 9 digits plus nul terminator, so we can use the str* functions
   char name[100];
   float GPA;
   /* ... */
} student;
student slist[MAX_STUDENTS];

which is filled in as you go. It has no order, so search on any key is a linear time operation. Not a problem for 1,000 entries, but maybe a problem for 10,000, and certainly a problem for 1 million. See dirkgently's comments.

If we want to be able to search fast we need another layer of structure. I build a map between a key and the main data structure like this:

typedef
struct str_map {
   char* key;
   student *data;
} smap;
smap skey[MAX_STUDENTS];

and maintain skey sorted on the key, so that I can do fast lookups. (But an array is a hassle to keep sorted, so we would probably prefer a tree or a hash map.)

This complexity isn't needed (and should certainly be avoided) if you will only want fast searches on a single field.

dmckee
I'm not sure I understand what you are suggesting...
Nazgulled
+3  A: 

The right answer lies anywhere between a balanced search tree and an array.

The situation you have mentioned here and else-thread misses a very important point: the size of the data you are handling. You choose your data structure and algorithm(s) depending on the amount of data you have to handle, and it is important that you are able to justify your choice(s). Using a less efficient general algorithm is not always bad. Being able to back up your choices (e.g., choosing bubble sort because the data size is always < 10) shows a) greater command of the field and b) pragmatism -- both of which are in short supply.

dirkgently
The right answer to "How do I get fast searches?" lies...
dmckee
@dmckee: Not sure I understand the point of your comment.
dirkgently
He has to be able to search a single data set on multiple keys. That complicates the problem if his data set is big enough for linear time to be a problem. Which does nothing to detract from your point about the size of the data set being an important consideration.
dmckee
@dmckee: I could figure that much out. And I left my answer vague intentionally.
dirkgently
+1  A: 

Outside of a homework question, you'd use a relational database for this. But that probably doesn't help you…

The first thing you need to figure out, as others have already pointed out, is how much data you're handling. An O(n) brute-force search is plenty fast as long as n is small. Since a trivial amount of data would make this a trivial problem (put it in an array, and just brute-force search it), I'm going to assume the amount of data is large.

Storing Cities

First, your search requirements appear to require the data sorted in multiple ways:

  1. Some city unique identifier (name?)
  2. Number of visitors

This actually isn't too hard to satisfy. (1) is easiest. Store the cities in some array. The array index becomes the unique identifier (assumption: we aren't deleting cities, or if we do delete cities we can just leave that array spot unused, wasting some memory. Adding is OK).

Now, we also need to be able to find most & fewest visits. Assuming modifications may happen (e.g., adding cities, changing number of visitors, etc.) and borrowing from relational databases, I'd suggest creating an index using some form of balanced tree. Databases would commonly use a B-Tree, but different ones may work for you: check Wikipedia's article on trees. In each tree node, I'd just keep a pointer (or array index) of the city data. No reason to make another copy!

I recommend a tree over a hash for one simple reason: you can very easily do an in-order or reverse in-order traversal to find the top or bottom N items. A hash can't do that.

Of course, if modifications may not happen, just use another array (of pointers to the items, once again, don't duplicate them).

Linking Cities to Profiles

How to do this depends on how you have to query the data, and what form it can take. The most general is that each profile can be associated with multiple cities and each city can be associated with multiple profiles. Further, we want to be able to efficiently query from either direction — ask both "who visits Phoenix?" and "which cities does Bob visit?".

Shamelessly lifting from databases again, I'd create another data structure, a fairly simple one along the lines of:

struct profile_city {
    /* btree pointers here */
    size_t profile_idx; /* or use a pointer */
    size_t city_idx;    /* for both indices */
};

So, to say Bob (profile 4) has visited Phoenix (city 2) you'd have profile_idx = 4 and city_idx = 2. To say Bob has visited Vegas (city 1) as well, you'd add another one, so you'd have two of them for Bob.

Now, you have a choice: you can store these either in a tree or a hash. Personally, I'd go with the tree, since that code is already written. A hash averages O(1) lookups, but its worst case is O(n), versus a guaranteed O(log n) for a balanced tree.

Also, just like we did for the city visit count, create an index for city_idx so the lookup can be done from that side too.

Conclusion

You now have a way to look up the 5 most-visited cities (via an in-order traversal of the city visit count index), and to find out who visits those cities, by searching for each city in the city_idx index to get the profile_idx values. Grab only the unique items, and you have your answer.

Oh, and something seems wrong here: This seems like an awful lot of code for your instructor to want written in several hours!

derobert
I had 2 weeks for this but I thought I had more; the fault is mine... :/
Nazgulled
A: 

The time to do this is almost up, but this was just the first phase, like a checkpoint. I still have to do it for the next one...

This probably sounds very stupid but I still don't know what to do. :(

Nazgulled