tags:

views:

47

answers:

2

In a script I am working on, I calculate how relevant every item in one array is to each item in another array by comparing similarities in keywords and keyphrases. End the end, I select the top 4 most relevant items for each item in that second array.

I know this is a very vague background, but is there any way to avoid making the algorithm O(n^2) (comparing every item in one array for every item in another), or if there are more efficient ways of calculating relevancy?

+3  A: 

Maybe you can group your job titles / job opening in category.

Use a list of the most frequent words ans only search matches among items having these words.

I mean no need to compare a "Java programmer" with a "C++ job opening" but among the "java" keyword you can still compare "programmer" and "project leader".

Do you see what I mean ?

But please, give us an example, it easier to answer when we know what we are talking about.

Loïc Février
+1 for "it easier to answer when we know what we are talking about." .. -1 for answering when not knowing "what we are talking about." ;)
belisarius
@belisarius : Well he has given some insight in the comment to his question, so I know it is job titles/job openings. More details can help but the answer is still valid, trying to group things together is always good to reduce complexity. + this was a little to long for a comment, easier to read this way
Loïc Février
Was kidding. +1 for the categorisation. As I said in a previous comment <i>"If no other property is known, comparing n to m requires O(n m). If you need to do better than that you need something else. Example: hierarchical arrangements of keywords."</i> OR categories
belisarius
I apologize for the lack of details. I'm calculating how relevant it is by matching keywords and keyphrases from multiple people's past job titles against job opening titles. Keywords are weighted by how often they appear throughout each person's work history, and keyphrases are matched by matching two keywords with possible wildcard words in between. If two past job titles were "Sales Representative" and "Director of Sales", it's organized in an array as (keywords => ("sales" => 2, "director" => 1, "representative" => 1)). I then go through each job opening and match keywords in the titles.
Nick
The big issue that I'm wondering I can avoid is the process of matching jobs though. If I'm matching 10 people to 100 jobs, I go to the first person, then go through all 100 jobs and assign a relevance score, sort by which one is the most relevant, and then pick the top 4. Then repeating the process for the other 9 people. I'm wondering if the "selecting the top 4" part will prevent me from ever being better than O(n^2).
Nick
Please edit your question to add these informations and then you can delete these 2 comments. Can you give your exact formula ? Or a readl example with, let's say 3 peoples and 5 jobs opening or something like that.
Loïc Février
+1  A: 

Use an inverted index (Hash Table) to get it down to O(n). Put all the items in the first list in one hash table. Then iterate through all the items in the second list, looking up each item in the hash table.

What I don't know is how you are defining similar. If similarity is simply that the items in the two lists are equal, then this will work. However if similarity is more complex, then you may need to build multiple hash tables for each type of similarity possible. For example, you could have one hash table that keys off of the phonetic spelling of a word, and one that keys off of the exact string of the word.

If you have one list that is large like a Job Openings list, and you want to query the list for candidate skills, you should really use a search engine. A search engine is just a set of hash tables keyed off of keywords. There is no sense rebuilding a search engine when you can use one that has already been built. First you index all the job openings, then you query the search engine using words from a candidate's resume. A popular open source search engine that you may want to look into is Solr.

Jon Snyder