Hi, I've got two files: file a is around 5 MB and file b is around 66 MB. I need to find out if there are any occurrences of the lines in file a inside file b, and if so, write them to file c.

This is the way I'm currently handling it:

ini_set("memory_limit", "1000M");
set_time_limit(0);

$small_list = file("a.csv");              // every line of the small file as an array
$big_list   = file_get_contents("b.csv"); // the whole large file as one string
$new_list   = "c.csv";
$fh = fopen($new_list, 'a');

foreach ($small_list as $one_line) {
    // case-insensitive substring search over the entire 66 MB string, once per line
    if (stristr($big_list, $one_line) !== false) {
        fwrite($fh, $one_line);
        echo "record found: " . $one_line . "<br>";
    }
}

The issue is that it's been running (successfully) for over an hour and it's maybe 3,000 lines into the 160,000 in the smaller file. Any ideas?

A: 

Try sorting the files first (especially the large one). Then you only need to check the first few characters of each line in b, and you can stop (and move on to the next line in a) once you're past that prefix. You can even build an index of where lines starting with each character begin (lines starting with a begin at line 0, b at line 1337, c at line 13986, and so on).
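
A rough, untested sketch of that idea, assuming whole-line, case-insensitive matches are what's wanted:

// Sketch only: sort b once, index the sorted lines by first character,
// then scan just the matching block for each line of a.
$big = file('b.csv');
sort($big, SORT_STRING | SORT_FLAG_CASE);    // case-insensitive sort of the large file

$index = array();                            // first sorted position of each leading character
foreach ($big as $i => $line) {
    $first = strtolower(substr($line, 0, 1));
    if (!isset($index[$first])) {
        $index[$first] = $i;
    }
}

$out = fopen('c.csv', 'a');
$total = count($big);
foreach (file('a.csv') as $needle) {
    $first = strtolower(substr($needle, 0, 1));
    if (!isset($index[$first])) {
        continue;                            // nothing in b starts with this character
    }
    // Only the block of b sharing the first character needs checking; stop once past it.
    for ($i = $index[$first]; $i < $total; $i++) {
        if (strtolower(substr($big[$i], 0, 1)) !== $first) {
            break;
        }
        if (strcasecmp(rtrim($big[$i]), rtrim($needle)) === 0) {
            fwrite($out, $needle);
            break;
        }
    }
}
fclose($out);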

Emil Vikström
A: 

Try using ob_flush() and flush() in the loop.

foreach ($small_list as $one_line) {
    if (stristr($big_list, $one_line) !== false) {
        fwrite($fh, $one_line);
        echo "record found: " . $one_line . "<br>";
    }
    @ob_flush();
    @flush();
    @ob_end_flush();
}
Jet
How will that speed up the search?
Emil Vikström
A: 

Build arrays with hashes as indices:

Read in file a.csv line by line and store a_hash[md5($line)] = array($offset, $length)
Read in file b.csv line by line and store b_hash[md5($line)] = true

By using the hashes as indices you will automagically not wind up having duplicate entries.

Then, for every hash that has an index in both a_hash and b_hash, read in the contents of the file (using the offset and length you stored in a_hash) to pull out the actual line text. If you're paranoid about hash collisions, store offset/length for b_hash as well and verify with stristr.

This will run a lot faster and use up far, far, FAR less memory.
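
A rough, untested sketch of this first suggestion, using the same a.csv/b.csv/c.csv names as above:

// Sketch only: hash both files, then use the stored offset/length to re-read
// the matching lines from a.csv.
$a_file = fopen('a.csv', 'r');
$a_hash = array();

// Pass 1: hash each line of the small file, remembering where it starts and how long it is.
while (($offset = ftell($a_file)) !== false && ($line = fgets($a_file)) !== false) {
    $a_hash[md5($line)] = array($offset, strlen($line));
}
fclose($a_file);

// Pass 2: hash every line of the big file.
$b_hash = array();
$b_file = fopen('b.csv', 'r');
while (($line = fgets($b_file)) !== false) {
    $b_hash[md5($line)] = true;
}
fclose($b_file);

// Pass 3: for hashes present in both, pull the original line text back out of a.csv.
$a_file = fopen('a.csv', 'r');
$c_file = fopen('c.csv', 'w');
foreach ($a_hash as $hash => $pos) {
    if (isset($b_hash[$hash])) {
        fseek($a_file, $pos[0]);
        fwrite($c_file, fread($a_file, $pos[1]));
    }
}
fclose($a_file);
fclose($c_file);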

If you want to reduce the memory requirement further and don't mind checking for duplicates, then:

Read in file a.csv line by line and store in a_hash[md5($line)] = false
Read in file b.csv line by line, hash the line and check if exists in a_hash.
If a_hash[md5($line)] == false write to c.csv and set a_hash[md5($line)] = true

Some example code for the second suggestion:

$a_file = fopen('a.csv', 'r');
$b_file = fopen('b.csv', 'r');
$c_file = fopen('c.csv', 'w+');

if (!$a_file || !$b_file || !$c_file) {
    echo "Broken!<br>";
    exit;
}

$a_hash = array();

// Hash every line of the small file; the value marks whether it has been written out yet.
while (!feof($a_file)) {
    $a_hash[md5(fgets($a_file))] = false;
}
fclose($a_file);

// Walk the big file once; write out each line whose hash appears in a.csv and
// hasn't been written yet.
while (!feof($b_file)) {
    $line = fgets($b_file);
    $hash = md5($line);
    if (isset($a_hash[$hash]) && !$a_hash[$hash]) {
        echo 'record found: ' . $line . '<br>';
        fwrite($c_file, $line);
        $a_hash[$hash] = true;   // mark as written so duplicates in b.csv are skipped
    }
}

fclose($b_file);
fclose($c_file);
Mike
This went a bit above my head, do you know a good resource where I could learn how to do this properly?
Mike
Added an example for you. It seems to work fine, but I haven't exactly done extensive debugging. It should be enough to let you see what's happening, and it will run in teeny amounts of space compared to your original.
Mike
Wow, that managed to do the whole 65mb file in about 45 seconds... Thanks so much, you just saved me a really really late night. Also automagically is my new favorite word.
Mike