views:

1005

answers:

14

I was asked to stay away from HashMap or any sort of Hashing.

The question went something like this -

Lets say you have PRODUCT IDs of up to 20 decimals, along with Product Descriptions. Without using Maps or any sort of hashing function, what's the best/most efficient way to store/retrieve these product IDs along with their descriptions?

Why is using Maps a bad idea for such a scenario?

What changes would you make to sell your solution to Amazon?

+6  A: 

A B-Tree in my opinion. Does that still count as a Map?

Mostly because you can have many items loaded at once in memory. Searching these items in memory is very fast.

Edison Gustavo Muenz
BST was my answer too but the interviewer told me that everyone guesses BST when asked not to use maps. Is there really a better solution than BST or was I just BSed ?
maximus
I vote for BSed. BST works extremely well with numbers, because all it's doing is less than/greater than. I think he's just got a favorite algorithm and wants you to guess it, since BST is a perfectly valid (and real world) answer. Fun fact: std::map is usually implemented as a BST. So does that make it a "Map?"
Toji
Oh, and what in the world is he looking for when asking: "What changes will you do to sell your solution to Amazon?" Does he mean "Would you do it differently if selling it to a third party?" or "Would you do it differently if selling to a third party with extreme needs?" If the latter, the answer is: Don't have me implement it, use a database instead! :) I'm not sure I really follow this guy's train of logic.
Toji
He did not even answer this for me .. :( Simply Left it at there .. the only hint he gave me was -- how many product IDs could you have if you have only 2 decimal Product Ids - on which I replied 100 (0 - 99) .
maximus
edit: Decimal Ids upto 2 Digits (0-99)
maximus
BST (Binary Search Tree) != B-Tree
FogleBird
I agree with Toji when he told he guesses the interviewer wanted maximus to avoid to focus on map, or any other kind of Binary search tree to find the solution he tough was the best. SGI's STL Map are implemented using red black tree as far as I know, which is a BST amongst others.
yves Baumes
@Toji: to be precise STL Maps are R-D trees (otherwise know as balanced BST's) which have most operations guaranteed to be O(log N)
ldog
A: 

The hashmaps work really well if the hashing function gives you a very uniform distribution of the hashvalues of the existing keys. With really bad hash function it can happen so that hash values of your 20 values will be the same, which will push the retrieval time to O(n). The binary search on the other hand guaranties you O(log n), but inserting data is more expensive.

All of this is very incremental, the bigger your dataset is the less are the chances of a bad key distribution (if you are using a good, proven hash algorithm), and on smaller data sets the difference between O(n) and O(log n) is not much to worry about.

mfeingold
A: 

20 decimal PRODUCT IDs, along with Product Description

Simple linear search would be very good...

I would create one simple array with ids. And other array with data.

Linear search for small amount of keys (20!) is much more efficient then any binary-tree or hash.

Artyom
And when is 2432902008176640000 = 20! considered small amount of keys for a linear search? :) Thats a total overkill.
AlexKR
@Alex Yep I guess too.
yves Baumes
+2  A: 

Best/efficient for what? Would have been my answer.

E.g. for storing them, probably the fast thing to do are two arrays with 20 elements each. One for the ids, on for the description. Iterating over those is pretty fast to. And it is efficient memory wise.

Of course the solution is pretty useless for any real application, but so is the question.

Jens Schauder
+1  A: 

There is an interesting alternative to B-Tree: Radix Tree

Kamarey
Kamarey your solution is close to mine. The wikipedia page you are pointing out explain well why a patricia trie (or radix tree) is welcomed here, provided that the product ID is bounded to **20 characters** . See the section titled "Comparison to other data structures" on the wikipedia's radix tree page.
yves Baumes
A: 

If the size is limited sometimes it's faster to use a sorted list.

When you use Hash-anything, you first have to calculate a hash, then locate the hash bucket, then use equals on all elements in the bucket. So it all adds up.

On the other hand you could use just a simple ArrayList ( or any other List flavor that is suitable for the application), sort it with java.util.Collections.sort and use java.util.Collections.binarySearch to find an element.

But as Artyom has pointed out maybe a simple linear search would be much faster in this case.

On the other hand, from maintainability point of view, I would normally use HashMap ( or LinkedHashMap ) here, and would only do something special here when profiler would tell me to do it. Also collections of 20 have a tendency to become collections of 20000 over time and all this optimization would be wasted.

Alexander Pogrebnyak
A: 

There's nothing wrong with hashing or B-trees for this kind of situation - your interviewer probably just wanted you to think a little, instead of coming out with the expected answer. It's a good sign, when interviewers want candidates to think. It shows that the organization values thought, as opposed to merely parroting out something from the lecture notes from CS0210.

Incidentally, I'm assuming that "20 decimal product ids" means "a large collection of product ids, whose format is 20 decimal characters".... because if there's only 20 of them, there's no value in considering the algorithm. If you can't use hashing or Btrees code a linear search and move on. If you like, sort your array, and use a binary search.

But if my assumption is right, then what the interviewer is asking seems to revolve around the time/space tradeoff of hashmaps. It's possible to improve on the time/space curve of hashmaps - hashmaps do have collisions. So you might be able to get some improvement by converting the 20 decimal digits to a number, and using that as an index to a sparsely populated array... a really big array. :)

Selling it to Amazon? Good luck with that. Whatever you come up with would have to be patentable, and nothing in this discussion seems to rise to that level.

CPerkins
I also told him about converting the Decimal number (upto 20 digits) to something smaller and then storing them in an array index on which the response I got was - "How is this not hashing" ?
maximus
Because converting a decimal number to a binary number is reversible whereas hash functions lose information and thus are not reversible.
Paul
+7  A: 

A map is good to use when insert/remove/lookup operations are interleaved. Every operations are amortized in O(log n).

In your exemple you are only doing search operation. You may consider that any database update (inserting/removing a product) won't happen so much time. Therefore probably the interviewer want you to get the best data structure for lookup operations.

In this case I can see only some as already proposed in other answers:

  • Sorted array (doing a binary search)
  • Hasmap
  • trie

With a trie , if product ids do not share a common prefix, there is good chance to find the product description only looking at the first character of the prefix (or only the very first characters). For instance, let's take that product id list , with 125 products:

  • "1"
  • "2"
  • "3"
    ...
  • "123"
  • "124"
  • "1234567"

Let's assume you are looking for the product id titled "1234567" in your trie, only looking to the first letters: "1" then "2" then "3" then "4" will lead to the good product description. No need to read the remaining of the product id as there is no other possibilities. Considering the product id length as n , your lookup will be in O(n). But as in the exemple explained it above it could be even faster to retreive the product description. As the procduct ID is limited in size (20 characters) the trie height will be limited to 20 levels. That actually means you can consider the look up operations will never goes beyond a constant time, as your search will never goes beyong the trie height => O(1). While any BST lookups are at best amortized O(log N), N being the number of items in your tree .

While an hashmap could lead you to slower lookup as you'll need to compute an index with an hash function that is probably implemented reading the whole product id length. Plus browsing a list in case of collision with other product ids.

Doing a binary search on a sorted array, and performance in lookup operations will depends on the number of items in your database.

yves Baumes
Notice that in spite of the correct points you raise in favor of the trie, the main reason hashes are chosen over tries are cache locality.
Anna
@Anna true, I expect an hashtable not to have too much collision thus having very short list. In practice it is worth to test both.
yves Baumes
+1  A: 

I think what he wanted you to do, and I'm not saying it's a good idea, is to use the computer memory space.

If you use a 64-bit (virtual) memory address, and assuming you have all the address space for your data (which is never the case) you can store a one-byte value.

You could use the ProductID as an address, casting it to a pointer, and then get that byte, which might be an offset in another memory for actual data.

I wouldn't do it this way, but perhaps that is the answer they were looking for.

Asaf

Asaf R
Thats an interesting approach.
maximus
I don't see how this could work reliably in general... what if the product ID, when cast to a pointer, points to some other critical piece of (non-product) data that your app is using? If you then write product info to that location, your app will be corrupted; if you don't, you can't store information for that product ID.
Jeremy Friesner
@Jeremy - It's not a very feasible solution, but with 20 decimal digits and a 64bit address space you are left with some "extra space". Make a handcrafted process that fulfills requests for ID resolution and have no other app activity in that process.It's possible to create hierarchies to have less memory in each process. I wouldn't do it if I were paid to, maybe if my life depended on it. Maybe.
Asaf R
+4  A: 

Consecutive integer numbers give perfect choice for the hash map but it only has one problem, as it does not have multithreaded access by default. Also since Amazon was mentioned in your question I may think that you need to take into account concurency and RAM limitation issues.

What you might do in the response to such question is to explain that since you are dissallowed to use any built-in data storage schemes, all you can do is to "emulate" one.

So, let's say you have M = 10^20 products with their numbers and descriptions. You can partition this set to the groups of N subsets. Then you can organize M/N containers which have sugnificantly reduced number of elements. Using this idea recursively will give you a way to store the whole set in containers with such property that access to them would have accepted performance rate.

To illustrate this idea, consider a smaller example of only 20 elements. I would like you to imagive the file system with directories "1", "2", "3", "4". In each directory you store the product descriptions as files in the following way:

folder 1: files 1 to 5
folder 2: files 6 to 10
...
folder 4: files 16 to 20

Then your search would only need two steps to find the file. First, you search for a correct folder by dividing 20 / 5 (your M/N). Then, you use the given ID to read the product description stored in a file.

This is just a very rough description, however, the idea is very intuitive. So, perhaps this is what your interviewer wanted to hear.

As for myself, when I face such questions on interview, even if I fail to get the question correctly (which is the worst case :)) I always try to get the correct answer from the interviewer.

AlexKR
Thats a good take too. So locating the correct folder will be in constant time but locating the file will be o(n) where n represents the number of files stored in every folder.
maximus
@Alex your idea is close to the Trie data structure (see my answer). But according to me using a Trie is better choice: in your exemple you deepen the folder level up to a point there is only one file per folder. Have a look to the wikipedia page for Trie implementation. Nice reading :-)
yves Baumes
The idea of my posting was to introduce the way of storing data without use of any data structure at all. Trie is good but as far as I understand the original poster had restriction of not able to use any data structures. The disadvantage of the Trie as well as any Map or HashTable is that you need to store this in a distributed way (I think thats why Amazon was mentioned). In my post you see that all you need to have is a distributed file system. So, about the O(n) in the folder, you are right, you can partition your set as deep as possible until you have 1 file only.
AlexKR
+1: This is the first solution that came to mind when I read the problem, and probably what I would give top credit for in an interview -- use the file system.
John Pirie
It's kind of a B-Tree using folders instead of a data structure
DaClown
the interviewer probably wanted to have a algorithmic-kind-of answer. I wouldn't expect such answer if I were an interviewer for a developer job. But who knows.
yves Baumes
A: 

I have a feeling based on their answer about product ids and two digits the answer they were looking for is to convert the numeric product ids into a different base system or packed form.

They made a point to indicate the product description was with the product ids to tell you that a higher base system could be used within the current fields datatype.

Einstein
A: 

Your interviewer might be looking for a trie. If you have a [small] constant upper bound on your key, then you have O(1) insert and lookup.

Ellery Newcomer
+1  A: 

I wonder if they wanted you to note that in an ecommerce application (such as Amazon's), a common use case is "reverse lookup": retrieve the product ID using the description. For this, an inverted index is used, where each keyword in a description is an index key, which is associated with a list of relevant product identifiers. Binary trees or skip lists are good ways to index these key words.

Regarding the product identifier index: In practice, B-Trees (which are not binary search trees) would be used for a large, disk-based index of 20-digit identifiers. However, they may have been looking for a toy solution that could be implemented in RAM. Since the "alphabet" of decimal numbers is so small, it lends itself very nicely to a trie.

erickson
A: 

I think what he wanted you to do, and I'm not saying it's a good idea, is to use the computer memory space.

If you use a 64-bit (virtual) memory address, and assuming you have all the address space for your data (which is never the case) you can store a one-byte value.

Unfortunately 2^64 =approx= 1.8 * 10^19. Just slightly below 10^20. Coincidence?

log2(10^20) = 66.43.

Here's a slightly evil proposal.

OK, 2^64 bits can fit inside a memory space.

Assume a bound of N bytes for the description, say N=200. (who wants to download Anna Karenina when they're looking for toasters?) Commandeer 8*N 64-bit machines with heavy RAM. Amazon can swing this.

Every machine loads in their (very sparse) bitmap one bit of the description text for all descriptions. Let the MMU/virtual memory handle the sparsity.

Broadcast the product tag as a 59-bit number and the bit mask for one byte. (59 = ceil(log2(10^20)) - 8)

Every machine returns one bit from the product description. Lookups are a virtual memory dereference. You can even insert and delete.

Of course paging will start to be a bitch at some point!

Oddly enough, it will work the best if product-id's are as clumpy and ungood a hash as possible.

Matt Kennel