ansaurus

Question

Document Similarity: Comparing two documents efficiently

Answer 1

+2 A:

A Python point: adict.has_key(k) is obsolete in Python 2.X and vanished in Python 3.X. k in adict as an expression has been available since Python 2.2; use it instead. It will be faster (no method call).

An any-language practical point: iterate over the shorter dictionary.

Combined result:

if len(doca_dic) < len(docb_dict):
    short_dict, long_dict = doca_dic, docb_dic
else:
    short_dict, long_dict = docb_dic, doca_dic
similarity = 0
for x in short_dict:
    if x in long_dict:
        #calculate the similarity by summing the products of the tf-idf_norm 
        similarity += short_dict[x] * long_dict[x]

And if you don't need the two dictionaries for anything else, you could create only the A one and iterate over the B (key, value) tuples as they pop out of your B query. After the docb = cursor2.fetchall(), replace all following code by this:

similarity = 0
for b_token, b_value in docb:
    if b_token in doca_dic:
        similarity += doca_dic[b_token] * b_value

Alternative to the above code: This is doing more work but it's doing more of the iterating in C instead of Python and may be faster.

similarity = sum(
    doca_dic[k] * docb_dic[k]
    for k in set(doca_dic) & set(docb_dic)
    )

Final version of the Python code

# Doc A
cursor1.execute("SELECT token, tfidf_norm FROM index WHERE doc_id = %s", (docid[i][0]))
doca = cursor1.fetchall()
# Doc B
cursor2.execute("SELECT token, tfidf_norm FROM index WHERE doc_id = %s", (docid[j][0]))
docb = cursor2.fetchall()
if len(doca) < len(docb):
    short_doc, long_doc = doca, docb
else:
    short_doc, long_doc = docb, doca
long_dict = dict(long_doc) # yes, it should be that simple
similarity = 0
for key, value in short_doc:
    if key in long_dict:
        similarity += long_dict[key] * value

Another practical point: you haven't said which part of it is slow ... working on the dicts or doing the selects? Put some calls of time.time() into your script.

Consider pushing ALL the work onto the database. Following example uses a hardwired SQLite query but the principle is the same.

C:\junk\so>sqlite3
SQLite version 3.6.14
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> create table atable(docid text, token text, score float,
    primary key (docid, token));
sqlite> insert into atable values('a', 'apple', 12.2);
sqlite> insert into atable values('a', 'word', 29.67);
sqlite> insert into atable values('a', 'zulu', 78.56);
sqlite> insert into atable values('b', 'apple', 11.0);
sqlite> insert into atable values('b', 'word', 33.21);
sqlite> insert into atable values('b', 'zealot', 11.56);
sqlite> select sum(A.score * B.score) from atable A, atable B
    where A.token = B.token and A.docid = 'a' and B.docid = 'b';
1119.5407
sqlite>

And it's worth checking that the database table is appropriately indexed (e.g. one on token by itself) ... not having a usable index is a good way of making an SQL query run very slowly.

Explanation: Having an index on token may make either your existing queries or the "do all the work in the DB" query or both run faster, depending on the whims of the query optimiser in your DB software and the phase of the moon. If you don't have a usable index, the DB will read ALL the rows in your table -- not good.

Creating an index: create index atable_token_idx on atable(token);

Dropping an index: drop index atable_token_idx;

(but do consult the docs for your DB)

John Machin 2010-03-13 12:05:13

+1 John it was a big mistake! :)

TheMachineCharmer 2010-03-13 12:28:03

"And if you don't need the two dictionaries for anything else, you could create only the A one and iterate over the B (key, value) tuples as they pop out of your B query."I don't. How do I iterate over them as they come out of my query? (sorry if this is obvious) Thank you so much for your help. I must say this site and the people on it are awesome.

seanieb 2010-03-13 12:45:20

@seanieb: see my updated answer.

John Machin 2010-03-13 13:20:23

Wow, thank you. One question, "it's worth checking that the database table is appropriately indexed (e.g. one on token by itself)", I don't understand this. sorry, yet again its probably very basic.

seanieb 2010-03-13 13:32:07

@seanieb: again, see my updated answer, which also has a better response to the "if you don't need the dictionaries" issue.

John Machin 2010-03-13 14:01:40

Answer 2

+1 A:

What about pushing some of the work on the DB?

With a join you can have a result that is basically

    Token    A.tfidf_norm B.tfidf_norm
-----------------------------------------
    Apple      12.2          11.00
       ...
    Word       29.87         33.21
    Zealot      0.00         11.56
    Zulu       78.56          0.00

And you just have to scan the cursor and do your operations.

If you don't need to know if one word is in one document and missing in the other one you don't need an outer join, and the list will be the intersection of the two sets. The example I included above assigns automatically a "0" for words missing from one of the two documents. See what your "matching" functions requires.

p.marino 2010-03-13 12:09:44

Answer 3

A:

One sql query can do the job:

SELECT sum(index1.tfidf_norm*index2.tfidf_norm) FROM index index1, index index2 WHERE index1.token=index2.token AND index1.doc_id=? AND index2.doc_id=?

Just replace the '?' with the 2 document id respectively.

Xianhua 2010-08-12 13:09:09

ansaurus

tags:

views:

answers:

Document Similarity: Comparing two documents efficiently

related questions