tags:

views:

76

answers:

4

Hi,

I have a collection of strings that I want to filter. They'll be in this pattern:

xxx_xxx_xxx_xxx

so always a sequence of letters or numbers separated by three underscores. The max length of each string will be 60 characters. I might have a few million of these in my collection.

What data structure could I use to efficiently do something like this:

Get all strings starts with: "abc_123_456"

Get all strings starts with: "def_999_888"

etc..

for example, I could do this:

List<String> matched = new ArrayList<String>();
for (String it : strings) {
    if (it.startsWith(match)) {
        matched.add(it);
    }
}

but that would take a long time if my collection is on the order of millions of strings, and worse yet if the number of matched strings is also high.

The high-level problem is that I want to answer the following question for an app I'm writing: "which of my friends have recommended product A for product B?". I could store this information in a sql table and run the following statement:

select recommender from recs where username='me' and prodIdA='a' and prodIdB='b';

I'm curious if something custom in java/C/C++ could run faster, using encoded flat strings like I have above:

myusername_prodIdA_prodIdB_recommenderusername

The idea being that you could do a starts-with operation on the whole collection of encoded strings to get your answer.

I know trying to implement a custom solution like this is most likely not usable in a production environment, so some sql db would be better, just curious though,

Thanks

+2  A: 

To do that in Java, you can use a Trie structure.

That being said, I don't think it's a good idea. Dumping "a few million" records in the memory won't always work.

That's what databases are for; with the right design and proper indexing you can have very good performance with the DB alone.

NullUserException
Yeah was just thinking about it because I was wondering if a specific custom implementation could be better than using mysql etc.
A: 

I think you are looking for a SortedMap.

"headMap(K toKey) Returns a view of the portion of this map whose keys are strictly less than toKey."

Sean McCauliff
A: 

I know trying to implement a custom solution like this is most likely not usable in a production environment, so some sql db would be better, just curious though

If only for the sake of curiosity, you can put all existing different "myusername_prodIdA_prodIdB" combinations in hashtable. And for each combination store a list of relevant results.

So, the structure would look like Map<String, List<String>> and used like hash.get("def_999_888"). Constant time (O(1))

You can get rid of inner list and optimize it in many ways, but this is the idea.

Nikita Rybak
I think he/she needs to pull many of these records.
NullUserException
@NullUserException That's why map has List<String> values, and not String.
Nikita Rybak
A: 

The first thing that comes to mind for me is pre-processing the strings into some sort of data structure so that they could be searched for efficiently. If you're going to be calling the search function many times, I think it'd be good for you to put all of the strings into a hash table for a constant-time look up. It'd take more processing power to construct your array of strings, but it'd trivialize the task of searching for them.

T.K.