Now I am trying to come up with a data structure for implementing a telephone adreess book. Say there are two infos : name, telephone numbers. The same name can have more than one number. Now i would like to come up with a data structure that will help me in fetching the number given a name and fetching the name, given a number. I have to do this in the least possible time. I was thinking of maintaining a map (name to number) and then for a given number somehow reverse index into this map and then find the name. I am not really sure whats the best way to do this. Can you suggest an ideal data structure. Will using some reverse index help. If yes then how do i use it in this context.
You could use a pair of Dictionaries (aka maps): One for looking up numbers using names (name -> List<number>
) and one for looking up names using numbers (number -> name
).
Assuming that prefix matches are desired, I'd suggest using a Patricia trie. Assuming that a name and a phone number can never collide - i.e. you won't have anyone named 435-9876 in your directory - then you can insert pairs with the phone number being indexed by the trie as well as pairs indexed by name. If for some strange reason name-number collisions are possible then you could add a special character to prefix phone numbers so that it is longer possible for them to collide.
Insertions will cost O(s)
Lookups will cost O(s)
where s is the length of the search string/inserted key.
This solution will also allow you to support linux shell style auto-complete where you can determine how many entries there are that match a particular name as its typed, or even display all the matches on demand.
EDIT:
A patricia trie is like a regular tree except that branches in a patricia trie occur only when there are at least two different keys sharing a prefix that haven't been separated by a previous branch.
- Internal nodes indicate the index at which keys differ, while leaf nodes store actual keys.
- All children of a node share the parent's prefix.
- Branches from a node are 'labeled' with the character which occurs in the position specified by the parent.
- A child cannot specify a smaller index than its parent.
Thus if Jane and Jamie are the only keys stored in the trie it has three nodes - the parent represents the shared prefix 'Ja', one branch leads to the sub-tree containing all strings starting with 'Jan' - and since there is only a single one it is stored in the leaf node 'Jane' while the other branch leads to the sub-tree containing all strings starting with 'Jam', again with their being only one.
[2]
n m
'Jane' 'Jamie'
If you then added James, you'd get
[2]
n m
'Jane' [3]
e i
'James' 'Jamie'
And then adding Jamelia
[2]
n m
'Jane' [3]
e i
[4] 'Jamie'
l s
'Jamelia' 'James'
Searching is straightforward. Starting at the root, the index specified is checked. If no child is labeled with the actual value specified, in our example for instance if Jasper is being searched for, as the character in position 2 is s and not n or m the search returns that the value isn't there. Otherwise the appropriate branch is followed, until either the value is found, its proven that it is not there, or multiple matches remain. E.g. a search for Jam stops at the node labeled [3]. There are several node matching, but the key Jam is not present. In this case you can traverse the sub-tree rooted at [3] to build a list of partial matches. Note that this only applies if you are doing partial matches, if the entire key was specified, the result of searching for 'Jam' would be no match.
The cost of searching is in the order of the length of the key, s because as you can see there are at most s levels before the key is either found or shown to not be there. Similarly, the cost of insertion is also in the order of the length of the key to be inserted as insertion is carried out by doing a search, and then adding a node branching at the longest prefix in the tree.
You can find a java implementation here and what appears to be a C++ implementation here.
I've never actually used any of the implementations I've pointed you too, so can't vouch for their veracity. If you decide to implement the data structure yourself, I'd advise you to use a binary alphabet, that way its a bit easier! Here is an article describing the algorithm.
I would maintain two (hash) maps.
The first map would have Names as the keys, and Sets (or lists or vectors) as the values, this list would contain all of the phone numbers for a specific person.
Map<String,Set<String>> NamesToNumbers;
The second map would have Phone Numbers as the keys, and Names as the values.
Map<String,String> NumbersToNames;
Additionally I would create a simple class with these two maps as private members, the purpose of this class would be to keep the two maps in sync, and pass all put(), remove(), etc. calls to both maps in the correct Key/Value format.
Psuedo code for how id write a put() method in the simple class:
public void put(String Name,String PhoneNumber)
{
Set NumbersForName = NamesToNumbers.get(Name);
if (NumbersForName == null)
{
NumbersForName = new Set();
NamesToNumbers.put(Name,NumbersForName);
}
NumbersForName.put(PhoneNumber);
NumbersToNames.put(PhoneNumber,Name);
}
The cost of a look up will be O(1), and the cost of an insertion will be O(1) unless there is a hash collision.
If you are using Java 5+ you should check out Google Collections, they have a nifty Bi-map class which is essentially what I am trying to describe.
I'd set up a conceptual database with Name x Number pairs. Each pair gets an id. Unless you can guarantee John Smith won't live in two different places, you need a unique id.
so you have (in perl)
$db{$id} = [name, number]
Then overlay that with two hashes that return sets of ids.
$name_index[$name] = [$id1, $id2, $id3]
$number_index[$number] = #as above
This will require a certain amount of fiddlyness for updating, but will return results fast.
You can do this in whichever language that has a hash/map construct.