+1  A: 

If you need to compare these lists only once, I'd suggest converting the docs to plain text; you can then compare them using regex. Otherwise you'll need third-party software to access the information in the docs, perhaps as described here: http://stackoverflow.com/questions/188452/reading-writing-a-ms-word-file-in-php

kgb
It's not a public app, so I would be only too happy to copy and paste into a form.
abel
+2  A: 

Create an array of both (possibly using the file() function, depending on the format of the text, or possibly just an explode() on content), and use array_diff().
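
A minimal sketch of that suggestion, assuming the two lists arrive as pasted text blocks with one entry per line. The sample data and the helper name toLines() are hypothetical, and array_diff() only removes exact matches, so near-duplicates would still show up as new.

```php
<?php
// Hypothetical sample data standing in for the two pasted text blocks.
$oldText = "Acme Traders\nBharat Steel\nCity Motors";
$newText = "Acme Traders\nDelta Foods\nCity Motors\nEast Exports";

// Split a pasted block into trimmed, non-empty lines (hypothetical helper).
function toLines(string $text): array
{
    $lines = array_map('trim', preg_split('/\R/', $text));
    return array_values(array_filter($lines, 'strlen'));
}

$old = toLines($oldText);
$new = toLines($newText);

// array_diff() keeps the values of $new that do not occur in $old (exact string match).
$added = array_values(array_diff($new, $old));

print_r($added);
```

When reading from files instead, file('oldList.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) yields a comparable array without trailing newlines.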

Wrikken
+1  A: 
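// Read both lists, one entry per line, without trailing newlines.
$oldList = file('oldList.txt', FILE_IGNORE_NEW_LINES);
$newList = file('newList.txt', FILE_IGNORE_NEW_LINES);
// Keep the entries of $newList that compare() does not treat as equal to an old entry.
$list = array_udiff($newList, $oldList, 'compare');

// Treat two entries as equal (return 0) when they are at least 80% similar.
function compare($new, $old) {
    similar_text($old, substr($new, 3), $percent);
    return $percent >= 80 ? 0 : 1;
}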

This is my basic idea: find all entries in $newList that are at least 80% similar to an old entry and remove them. Adjust the percentage to suit your needs. The M/s is removed by substr($new, 3).
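
The same idea can also be sketched with an explicit loop. This variant is an assumption on my part, not the code above: array_udiff() documents its callback as an ordering comparison (less than, equal to, or greater than zero), so a callback that only returns 0 or 1 may not behave as intended. The names normalizeName() and filterNew() and the sample data are hypothetical.

```php
<?php
// Strip an optional leading "M/s" prefix, trim whitespace, and lowercase (hypothetical helper).
function normalizeName(string $name): string
{
    return strtolower(trim(preg_replace('~^\s*M/s\.?\s*~i', '', $name)));
}

// Return the entries of $newList for which no entry of $oldList
// is at least $threshold percent similar (hypothetical helper).
function filterNew(array $newList, array $oldList, float $threshold = 80.0): array
{
    $oldNorm = array_map('normalizeName', $oldList);
    $result = [];
    foreach ($newList as $candidate) {
        $needle = normalizeName($candidate);
        foreach ($oldNorm as $old) {
            similar_text($needle, $old, $percent);
            if ($percent >= $threshold) {
                continue 2; // close enough to a known entry: skip this candidate
            }
        }
        $result[] = $candidate;
    }
    return $result;
}

// Hypothetical sample data.
$oldList = ['Acme Traders', 'Bharat Steel Ltd'];
$newList = ['M/s Acme Traders', 'M/s Bharat Steel Limited', 'M/s Delta Foods'];

print_r(filterNew($newList, $oldList));
```

Every new entry is still compared against every old entry, so large lists remain slow; the threshold is the knob to tune.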

nikic
@nikic Thanks for the code. I get a 60-second timeout when comparing the two blocks (each block is around 80 KB).
abel
Add set_time_limit(500); to the beginning of the code
Joyce Babu
@nikic I added a var_dump($list); the output is posted in the original question
abel
+1  A: 

If there are no key fields for uniquely identifying the records, I think you will have to use something like similar_text or levenshtein.

$arOld = file('olddata.txt');
$arNew = file('newdata.txt');
foreach($arNew as $line){
    $line = trim(substr($line, 3));
    foreach($arOld as $old){
        similar_text($line, $old, $percentage);
        if ($percentage < 60){
            echo $line;
        }
    }
}
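
For the levenshtein option mentioned above, a rough sketch of turning the edit distance into a percentage comparable to the one similar_text reports. The helper name levenshteinPercent() and the sample strings are hypothetical, and dividing by the longer string's length is just one possible normalization.

```php
<?php
// Percentage similarity derived from the edit distance; 100 means identical (hypothetical helper).
function levenshteinPercent(string $a, string $b): float
{
    $maxLen = max(strlen($a), strlen($b));
    if ($maxLen === 0) {
        return 100.0; // two empty strings are treated as identical
    }
    return (1 - levenshtein($a, $b) / $maxLen) * 100;
}

// Hypothetical sample strings that differ by one trailing character.
echo levenshteinPercent('Acme Traders', 'Acme Trader'), "\n";
```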
Joyce Babu
Undefined variable new on line 6
abel
It is $line, not $new. Sorry.
Joyce Babu
No problem!
abel
I ran the script using samples from the original post. The output is posted in the original question.
abel
On second thought, it is not going to work. It requires a little modification to work. Now it will print lots of lines.
Joyce Babu
Yes, it does print out a lot of lines. The principle would be to match every word from one text block against all the words of the second text block and then echo those which match. However, company names consist of multiple words...
abel
+1  A: 

Try this

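set_time_limit(500);
$arOld = file('olddata.txt', FILE_IGNORE_NEW_LINES);
$arNew = file('newdata.txt', FILE_IGNORE_NEW_LINES);
foreach($arNew as $line){
    if(substr($line, 0, 4) === 'M/s '){ // 'M/s ' is 4 characters
        $line = trim(substr($line, 4));
        foreach($arOld as $old){
            similar_text($line, trim($old), $percentage);
            if ($percentage > 80){
                // A close match exists in the old list: skip to the next new line.
                continue 2;
            }
        }
        // No old entry was similar enough, so this line is new.
        echo $line, "\n";
    }
}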
Joyce Babu
Updated the original post with the output.
abel
Can you answer my comment on the original post?
Joyce Babu
Updated the code to check only lines beginning with M/s. Also fixed an error.
Joyce Babu
No output from the new code! I added an echo "Yes"; after line 5
abel
... to check whether the if condition ever matches, even though many lines begin with M/s.
abel
Oops! 'M/s ' is 4 characters. You need to change substr($line, 0, 3) to substr($line, 0, 4)
Joyce Babu
@Joyce Babu Nice work. The work is not done yet, but enjoy the bounty!
abel
Thanks. I can't think of a perfect solution without a unique field.
Joyce Babu