Hi everyone, I'm wondering about the best way to approach this particular problem, and whether any libraries already handle it (Python preferably, but I can be flexible if need be).

I have a file with a string on each line. I would like to find the longest common patterns and their locations in each line. I know that I can use SequenceMatcher to compare lines one and two, one and three, and so on, and then correlate the results, but is there something that already does it?
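For reference, the pairwise SequenceMatcher approach mentioned above might look roughly like the sketch below. It assumes each line has been loaded as a `bytes` object; the names `pairwise_blocks` and `min_len` are made up for illustration, and the results would still need to be correlated across lines afterwards.

```python
from difflib import SequenceMatcher


def pairwise_blocks(lines, min_len=2):
    """Compare the first line against every other line.

    Returns {line_index: [(offset_in_first, offset_in_other, size), ...]},
    keeping only matching blocks of at least min_len elements.
    """
    base = lines[0]
    results = {}
    for idx, other in enumerate(lines[1:], start=1):
        # autojunk=False turns off the "popular element" heuristic,
        # which could otherwise hide frequent bytes in long inputs.
        matcher = SequenceMatcher(None, base, other, autojunk=False)
        results[idx] = [
            (m.a, m.b, m.size)
            for m in matcher.get_matching_blocks()
            if m.size >= min_len  # also drops the zero-size sentinel block
        ]
    return results
```

Note that SequenceMatcher only reports non-overlapping blocks in increasing order, so it can miss matches that appear in a different order on the two lines.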

Ideally these matches could appear anywhere on each line, but for starters I would be fine with them existing at the same offset in each line, and go from there. Something like a compression library that has a good API to access its string table might be ideal, but I have not found anything so far that fits that description.

For instance with these lines:

\x00\x00\x8c\x9e\x28\x28\x62\xf2\x97\x47\x81\x40\x3e\x4b\xa6\x0e\xfe\x8b
\x00\x00\xa8\x23\x2d\x28\x28\x0e\xb3\x47\x81\x40\x3e\x9c\xfa\x0b\x78\xed
\x00\x00\xb5\x30\xed\xe9\xac\x28\x28\x4b\x81\x40\x3e\xe7\xb2\x78\x7d\x3e

I would want to see that offsets 0-1 and 10-12 match in all lines at the same position, and that line1[4,5] matches line2[5,6] and line3[7,8].
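To make the expected result concrete, here is a rough sketch of a brute-force way to get it without any pairwise comparison. It assumes the lines are raw `bytes` with 0-based offsets; `same_offset_runs`, `common_ngrams`, and `LINES` are hypothetical names, and this is not an existing library API.

```python
from collections import defaultdict

LINES = [
    b"\x00\x00\x8c\x9e\x28\x28\x62\xf2\x97\x47\x81\x40\x3e\x4b\xa6\x0e\xfe\x8b",
    b"\x00\x00\xa8\x23\x2d\x28\x28\x0e\xb3\x47\x81\x40\x3e\x9c\xfa\x0b\x78\xed",
    b"\x00\x00\xb5\x30\xed\xe9\xac\x28\x28\x4b\x81\x40\x3e\xe7\xb2\x78\x7d\x3e",
]


def same_offset_runs(lines, min_len=2):
    """Return inclusive (start, end) ranges where all lines hold the same bytes."""
    width = min(len(line) for line in lines)
    runs, start = [], None
    for i in range(width + 1):
        # i == width acts as a sentinel that closes a run ending at the last byte.
        same = i < width and all(line[i] == lines[0][i] for line in lines)
        if same and start is None:
            start = i
        elif not same and start is not None:
            if i - start >= min_len:
                runs.append((start, i - 1))
            start = None
    return runs


def common_ngrams(lines, n=2):
    """Map each n-byte chunk found in every line to its offsets in each line."""
    positions = defaultdict(lambda: defaultdict(list))
    for idx, line in enumerate(lines):
        for i in range(len(line) - n + 1):
            positions[line[i:i + n]][idx].append(i)
    return {
        chunk: dict(per_line)
        for chunk, per_line in positions.items()
        if len(per_line) == len(lines)  # keep chunks present in all lines
    }


print(same_offset_runs(LINES))             # [(0, 1), (10, 12)]
print(common_ngrams(LINES)[b"\x28\x28"])   # {0: [4], 1: [5], 2: [7]}
```

The n-gram index grows with the total input size times the number of chunk lengths tried, so for large files a suffix-array or suffix-tree based approach would probably scale better.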

Thanks,

A: 

Is your problem performance?

How big is your input?

Is the minimum string length to match 2?

Note that your example is not correct, I think, as the results you expect do not match the sample strings you provided.

Philippe Ombredanne