ansaurus

Question

Comparing text blocks for similar content

Answer 1

+1 A:

If you need to compare these lists only once, I'd suggest converting the docs to plain text; then you'll be able to compare them using regex. Otherwise you'll need third-party software to access the info in the docs, perhaps as described here: http://stackoverflow.com/questions/188452/reading-writing-a-ms-word-file-in-php

kgb 2010-09-23 08:56:37

It's not a public app, so I would be only too happy to copy and paste into a form.

abel 2010-09-23 08:59:04

Answer 2

+2 A:

Create an array of both (possibly using the file() function, depending on the format of the text, or possibly just an explode() on content), and use array_diff().

Wrikken 2010-09-23 09:04:41
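A minimal sketch of the array_diff() suggestion above, assuming the two blocks arrive as pasted strings with one entry per line. The sample data and the toLines() helper are hypothetical, and array_diff() only catches exact matches, so near-duplicates will still show up as "new".

```php
<?php
// Hypothetical sample data standing in for the two pasted text blocks.
$oldText = "Acme Traders\nBlue Star Exports\nCity Motors";
$newText = "Blue Star Exports\r\nDelta Foods\r\nAcme Traders\r\nEvergreen Mills";

// Hypothetical helper: split a block into trimmed, non-empty lines.
function toLines(string $text): array {
    $lines = array_map('trim', preg_split('/\R/', $text));
    return array_values(array_filter($lines, 'strlen'));
}

// Entries that appear in the new block but not (exactly) in the old one.
$added = array_values(array_diff(toLines($newText), toLines($oldText)));
print_r($added);
```

With file() you could skip the splitting step by passing FILE_IGNORE_NEW_LINES, so that trailing newlines don't make otherwise identical lines differ.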

Answer 3

+1 A:

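$oldList = file('oldList.txt');
$newList = file('newList.txt');
// Caveat: array_udiff() expects an ordering callback (returning <0, 0 or >0)
// and may compare any two elements in either order, so results can be unreliable.
$list = array_udiff($newList, $oldList, 'compare');

// Returns 0 ("equal") when the two entries are at least 80% similar.
function compare($new, $old) {
    similar_text($old, substr($new, 3), $percent);
    return $percent >= 80 ? 0 : 1;
}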

This is my basic idea: find all texts that are at least 80% similar and remove them from $newList. You should adjust the percentage to suit your needs. The "M/s" prefix is removed by substr($new, 3).

nikic 2010-09-28 14:22:45
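The same 80% idea can also be sketched with an explicit loop instead of a diff callback, which makes it clear which string is "new" and which is "old". This is only a sketch under assumptions: stripPrefix() and hasCloseMatch() are hypothetical helper names, and the sample lists are made up.

```php
<?php
// Hypothetical helper: drop a leading "M/s " prefix if present.
function stripPrefix(string $name): string {
    $name = trim($name);
    return strncmp($name, 'M/s ', 4) === 0 ? substr($name, 4) : $name;
}

// True when $candidate is at least $threshold percent similar to any old entry.
function hasCloseMatch(string $candidate, array $oldList, float $threshold = 80.0): bool {
    foreach ($oldList as $old) {
        similar_text(stripPrefix($old), $candidate, $percent);
        if ($percent >= $threshold) {
            return true;
        }
    }
    return false;
}

// Made-up sample data.
$oldList = ['Acme Traders', 'Blue Star Exports'];
$newList = ['M/s Acme Traders', 'M/s Blue Star Export', 'M/s Zenith Pharma'];

// Keep only the new entries that have no close match in the old list.
$list = array_values(array_filter($newList, function (string $new) use ($oldList): bool {
    return !hasCloseMatch(stripPrefix($new), $oldList);
}));
print_r($list);
```

The cost is still one similar_text() call per new/old pair, so large blocks will remain slow; returning as soon as a match is found only helps for entries that do have a match.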

@nikic Thanks for the code. I get a 60s timeout when comparing the two blocks (each block is around 80 KB).

abel 2010-10-04 11:11:00

Add set_time_limit(500); to the beginning of the code.

Joyce Babu 2010-10-04 11:22:39

@nikic I added a var_dump($list); the output is posted in the original question.

abel 2010-10-04 11:32:00

Answer 4

+1 A:

If there are no key fields for uniquely identifying the records, I think you will have to use something like similar_text or levenshtein.

$arOld = file('olddata.txt');
$arNew = file('newdata.txt');
foreach ($arNew as $line) {
    $line = trim(substr($line, 3));
    foreach ($arOld as $old) {
        similar_text($line, $old, $percentage);
        if ($percentage < 60) {
            echo $line;
        }
    }
}

Joyce Babu 2010-10-04 09:42:30
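For the levenshtein() alternative mentioned above, the raw edit distance has to be turned into something comparable to a percentage. One possible normalization is sketched below; levenshteinPercent() is a hypothetical helper, and dividing by the longer string's length is just one common choice, not the only one.

```php
<?php
// Hypothetical helper: similarity in percent derived from the edit distance.
// 100 means identical (ignoring case), 0 means every character differs.
function levenshteinPercent(string $a, string $b): float {
    $a = strtolower(trim($a));
    $b = strtolower(trim($b));
    $maxLen = max(strlen($a), strlen($b));
    if ($maxLen === 0) {
        return 100.0; // two empty strings are treated as identical
    }
    return (1 - levenshtein($a, $b) / $maxLen) * 100;
}

printf("%.1f\n", levenshteinPercent('Acme Traders', 'Acme Trader'));
printf("%.1f\n", levenshteinPercent('Acme Traders', 'Zenith Pharma'));
```

Note that older PHP versions limited levenshtein() to strings of at most 255 characters, so it suits line-by-line comparison rather than whole blocks.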

Undefined variable new on line 6

abel 2010-10-04 11:15:46

It is $line, not $new. Sorry.

Joyce Babu 2010-10-04 11:19:31

No problem!

abel 2010-10-04 11:21:32

I ran the script using samples from the original post. The output is posted in the original question.

abel 2010-10-04 11:26:51

On second thought, it is not going to work as written; it needs a small modification. As it stands, it will print lots of lines.

Joyce Babu 2010-10-04 11:27:01

Yes, it does print out a lot of lines. The principle would be to match every word from one text block against all the words of the second block and then echo those that match. However, company names are multiple words...

abel 2010-10-04 11:35:02
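One way to handle multi-word company names with the word-matching idea is to compare the sets of words in two names rather than single words. The sketch below uses a Jaccard-style overlap (shared words divided by all distinct words); wordSet() and wordOverlap() are hypothetical helpers, and any threshold applied to the result would have to be tuned on the real data.

```php
<?php
// Hypothetical helper: lower-cased, de-duplicated words of a name.
function wordSet(string $name): array {
    $words = preg_split('/[^a-z0-9]+/', strtolower($name), -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_unique($words));
}

// Hypothetical helper: share of distinct words the two names have in common (0.0 to 1.0).
function wordOverlap(string $a, string $b): float {
    $wa = wordSet($a);
    $wb = wordSet($b);
    $union = count(array_unique(array_merge($wa, $wb)));
    if ($union === 0) {
        return 0.0; // neither name contains any word
    }
    return count(array_intersect($wa, $wb)) / $union;
}

var_dump(wordOverlap('Blue Star Exports', 'Star Blue Exports Ltd'));
var_dump(wordOverlap('Blue Star Exports', 'Zenith Pharma'));
```

Word order no longer matters with this measure, but misspelled words count as completely different, so it complements rather than replaces similar_text().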

Answer 5

+1 A:

Try this:

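set_time_limit(500);
$arOld = file('olddata.txt');
$arNew = file('newdata.txt');
foreach ($arNew as $line) {
    if (substr($line, 0, 4) === 'M/s ') {
        // 'M/s ' is 4 characters; an earlier revision compared only 3 (see the comments below).
        $line = trim(substr($line, 4));
        foreach ($arOld as $old) {
            similar_text($line, trim($old), $percentage);
            if ($percentage > 80) {
                // A close match exists in the old list: skip to the next new line.
                continue 2;
            }
        }
        // No old entry was more than 80% similar, so treat this line as new.
        echo $line, "\n";
    }
}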

Joyce Babu 2010-10-04 11:29:53

Updated the original post with the output.

abel 2010-10-04 11:37:22

Can you answer my comment on the original post?

Joyce Babu 2010-10-04 11:41:15

Updated the code to check only lines beginning with M/s. Also fixed an error.

Joyce Babu 2010-10-04 11:48:39

No output from the new code! I added an echo "Yes"; after line 5

abel 2010-10-04 12:35:02

... to check whether the 'if' condition ever matches, even though there is an M/s at the beginning of many lines.

abel 2010-10-04 12:42:37

Oops! 'M/s ' is 4 characters. You need to change substr($line, 0, 3) to substr($line, 0, 4).

Joyce Babu 2010-10-04 12:50:04
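The off-by-one above comes from counting the prefix length by hand. A small sketch of a way to avoid that is to derive the length from the prefix itself; startsWith() and stripPrefix() are hypothetical helper names.

```php
<?php
// Hypothetical helper: true when $line begins with $prefix, whatever its length.
function startsWith(string $line, string $prefix): bool {
    return strncmp($line, $prefix, strlen($prefix)) === 0;
}

// Hypothetical helper: remove $prefix from the start of $line if it is there.
function stripPrefix(string $line, string $prefix = 'M/s '): string {
    return startsWith($line, $prefix) ? substr($line, strlen($prefix)) : $line;
}

// A 3-character slice can never equal the 4-character literal 'M/s '.
var_dump(substr('M/s Acme Traders', 0, 3) === 'M/s ');
var_dump(startsWith('M/s Acme Traders', 'M/s '));
echo stripPrefix('M/s Acme Traders'), "\n";
```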

@joyce Babu Nice work. The work is not done yet, but enjoy the bounty!

abel 2010-10-05 11:51:17

Thanks. I can't think of a perfect solution without a unique field.

Joyce Babu 2010-10-05 15:07:54
