views:

60

answers:

3

I have a large (more than 100K objects) collection of Java objects like below.

public class User
{
   //declared as public in this example for brevity...
   public String first_name;
   public String last_name;
   public String ssn;
   public String email;
   public String blog_url;
   ...
}

Now, I need to search this list for an object where at least 3 (any 3 or more) attributes match those of the object being searched.

For example, if I am searching for an object that has

 first_name="John",
 last_name="Gault",
 ssn="000-00-0000",
 email="[email protected]", 
 blog_url="http://myblog.wordpress.com"

The search should return me all objects where first_name,last_name and ssn match or those where last_name, ssn, email and blog_url match. Likewise, there could be other combinations.

I would like to know what's the best data-structure/algorithm to use in this case. For an exact search, I could have used a hashset or binary search with a custom comparator, but I am not sure what's the most efficient way to perform this type of search.

P.S.

  • This is not a homework exercise.

  • I am not sure if the question title is appropriate. Please feel free to edit.

EDIT Some of you have pointed out the fact that I could use ssn (for ex.) for the search as it is more or less unique. The exmaple above is only illustrative of the real scenario. In reality, I have several objects where some of the fields are null so I would like to search on other fields.

+1  A: 

Basically, search for results where ANY of the attributes matches the attribute in the query. This should narrow down the search space to quite a small number of entries. From those results, look for entries that match your criteria. This means you need to go through and count how many attributes match, if this is more than 3 then you've got a match. (This process is relatively slow and you wouldn't want to do it over your whole database.)

In this case, a potential optimisation would be to remove first_name and last_name from the initial filter phase, since they are much more likely to get you multiple results for a query than the other attributes (e.g. a lot of people called "John").

Since three attributes are required to match, removing two from the filter phase won't affect the final outcome.

Artelius
+2  A: 

I don't think that there are any specific data structures to make this kind of matching / comparison fast.

At the simple level of comparing two objects, you might implement a method like this:

public boolean closeEnough(User other) {
    int count = 0;
    count += firstName.equals(other.firstName) ? 1 : 0;
    count += lastName.equals(other.lastName) ? 1 : 0;
    count += ssn.equals(other.ssn) ? 1 : 0;
    count += email.equals(other.email) ? 1 : 0;
    ...
    return count >= 3;
}

To do a large scale search, the only way I can think of that would improve on a simple linear scan (using the method above) would be

  1. create a series of multimaps for each of the properties,
  2. populate them with the User records

Then each time you want to do a query:

  1. query each multimap to get a set of possible candidates,
  2. iterate all of the sets using closeEnough() to find the matches.

You could improve on this by treating the SSN, email address and blog URL properties differently to the name properties. Multiple users with matches on the first three properties should be a rare occurrence, compared with (say) finding multiple users called "John". The way that you have posed the question requires at least 1 of SSN, email or URL to match (to get 3 matches), so maybe you could not bother indexing the name properties at all.

Stephen C
A: 

Just a thought; if you are searching for someone with SSN, you should be able to narrow it down really quickly with that, since only one person is supposed to have one specific SSN.

moowiz2020
Email and blog_url are also fairly unlikely to be shared between several people.
Artelius
@moowiz2020 and @Artelius, good points. But this is just an example to illustrate the problem. In reality, the items I am searching for are not so unique or always available (e.g., for some users, ssn is null). Perhaps I should have chosen a better example.
Rahul