ansaurus

Question

difflib.SequenceMatcher isjunk optional parameter query: how to ignore whitespace, tabs, empty lines?

Answer 1

A:

I haven't used difflib.SequenceMatcher, but have you considered pre-processing the files to remove all blank lines and whitespace (perhaps with regular expressions) and then comparing them?
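A minimal sketch of that pre-processing idea, assuming the two documents are already loaded as strings. The helper names `strip_whitespace` and `normalized_ratio` and the sample strings are made up for illustration; they are not part of difflib.

```python
import difflib
import re


def strip_whitespace(text):
    """Remove every whitespace character (spaces, tabs, newlines)."""
    return re.sub(r"\s+", "", text)


def normalized_ratio(a, b):
    """Compare two strings after discarding all whitespace from both."""
    matcher = difflib.SequenceMatcher(None, strip_whitespace(a), strip_whitespace(b))
    return matcher.ratio()


# Hypothetical sample inputs that differ only in whitespace.
doc_a = "def f(x):\n\treturn x + 1\n\n"
doc_b = "def f(x):\n    return x+1\n"

print(normalized_ratio(doc_a, doc_b))
```

Stripping all whitespace also glues neighbouring words together; if word boundaries matter, replacing each whitespace run with a single space (`re.sub(r"\s+", " ", text)`) may be a better fit.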

Ben Hoffstein 2008-09-29 03:47:19

Answer 2

A:

Using your sample strings:

>>> s=difflib.SequenceMatcher(lambda x: x == '\n', s1, s2)
>>> s.ratio()
0.94669848846459825

Interestingly if ' ' is also included as junk:

>>> s=difflib.SequenceMatcher(lambda x: x in ' \n', s1, s2)
>>> s.ratio()
0.7653142402545744

It looks like the newlines are having a much greater effect than the spaces.
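One thing that may explain numbers like these: `isjunk` does not delete characters from the comparison. Junk elements are only kept from seeding a match, and they still count towards the total length used by `ratio()`. A small sketch with made-up strings (not the sample strings from the question) shows the difference from actually removing the spaces:

```python
import difflib

# Hypothetical inputs that differ only by a single space.
a = "a b"
b = "ab"

# Marking the space as junk: it still counts in the length of `a`.
with_junk = difflib.SequenceMatcher(lambda ch: ch == " ", a, b).ratio()

# No junk predicate at all, for comparison.
without_junk = difflib.SequenceMatcher(None, a, b).ratio()

# Actually removing the space before comparing.
stripped = difflib.SequenceMatcher(None, a.replace(" ", ""), b).ratio()

print(with_junk, without_junk, stripped)
```

Here the junk predicate leaves the score below 1.0, while stripping the space first makes the strings compare as identical. So `isjunk` steers which matches are found, but it is not a way to ignore characters in the score.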

mhawke 2008-09-29 06:43:27

Answer 3

+2 A:

If you treat all whitespace as junk, the similarity is better:

difflib.SequenceMatcher(lambda x: x in " \t\n", doc1, doc2).ratio()

However, difflib is not ideal for this kind of problem: the two documents are nearly identical, but typos and similar small changes produce differences for difflib that a human reader would barely notice.

Try reading up on tf-idf, Bayesian probability, vector space models and w-shingling.

I have written an implementation of tf-idf that applies it to a vector space and uses the dot product as a distance measure to classify documents.
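For readers who want to see the general shape of that approach, here is a rough standard-library sketch. It is not the implementation mentioned above. The function names, the `\w+` tokenizer and the smoothed idf formula are all assumptions chosen for illustration. Vectors are normalised to unit length, so the dot product is the cosine similarity.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case word tokens; whitespace and punctuation are discarded."""
    return re.findall(r"\w+", text.lower())


def tfidf_vectors(docs):
    """Return one unit-length {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed idf (an assumption): log((1 + N) / (1 + df)) + 1.
        vec = {
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors


def dot(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(term, 0.0) for term, w in u.items())


# Hypothetical documents: the first two differ only in whitespace.
docs = [
    "the cat sat on the mat",
    "the cat  sat on\tthe mat\n",
    "stock prices fell sharply",
]
vecs = tfidf_vectors(docs)
print(dot(vecs[0], vecs[1]), dot(vecs[0], vecs[2]))
```

Because the comparison works on word tokens, differences in spaces, tabs and blank lines vanish at the tokenizing step, and a single typo only affects the weight of one term.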

Florian Bösch 2008-09-29 07:17:02

Answer 4

A:

Given the texts above, the test is indeed as suggested:

difflib.SequenceMatcher(lambda x: x in " \t\n", doc1, doc2).ratio()

However, to speed things up a little, you can take advantage of CPython's method-wrappers:

difflib.SequenceMatcher(" \t\n".__contains__, doc1, doc2).ratio()

This avoids many Python function calls.
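A quick way to check both claims, that the two predicates behave the same and that one is faster, is a small `timeit` comparison. The sample strings and the helper `ratio_with` below are made up for illustration, and the timings will vary by machine and Python version. Since `isjunk` appears to be evaluated only once per distinct element of the second sequence, any difference is likely to be small.

```python
import difflib
import timeit

WHITESPACE = " \t\n"

# Hypothetical inputs with different whitespace layout.
doc1 = "alpha beta\n\tgamma delta\n" * 20
doc2 = "alpha  beta\ngamma\tdelta\n\n" * 20


def ratio_with(isjunk):
    """Similarity ratio of doc1 and doc2 using the given junk predicate."""
    return difflib.SequenceMatcher(isjunk, doc1, doc2).ratio()


lambda_junk = lambda ch: ch in WHITESPACE   # plain Python function
bound_junk = WHITESPACE.__contains__        # method implemented in C

# Both predicates should produce the same ratio.
print(ratio_with(lambda_junk) == ratio_with(bound_junk))

# Rough timing; absolute numbers are not meaningful across machines.
print("lambda:      ", timeit.timeit(lambda: ratio_with(lambda_junk), number=20))
print("__contains__:", timeit.timeit(lambda: ratio_with(bound_junk), number=20))
```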

ΤΖΩΤΖΙΟΥ 2008-09-29 11:48:48
