Hi!

I have a 1.2 GB file that contains a single-line string. I need to search the entire file to find the position of another string (currently I have a list of strings to search for). What I'm doing now is opening the big file and moving a pointer through it in 4 KB blocks: I read a block, move the pointer X positions back in the file, and read 4 KB more.

My problem is that the longer the string to search for, the longer it takes to find it.

Can you give me some ideas to optimize the script to get better search times?

This is my implementation:

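function busca($inici){
    $limit = 4096;

    $big_one    = fopen('big_one.txt','r');
    $options    = fopen('options.txt','r');

    while(($line = fgets($options)) !== false){
        $search = trim($line);
        if($search === ''){
            continue; // skip blank lines in options.txt
        }
        $retro  = strlen($search);//maybe setting this position absolute? (like 12 or 15)

        $punter = 0; // file offset where the current block starts
        while(!feof($big_one)){
            // fgets() returns at most $limit - 1 bytes
            $ara = fgets($big_one,$limit);
            if($ara === false){
                break; // nothing left to read: string not found
            }

            $pos = strpos($ara,$search);

            if($pos !== false){
                $ok_pos = $pos + $punter; // absolute position in the file
                echo "$pos - $punter - $search : $ok_pos <br>";
                break;
            }

            // consecutive blocks overlap by strlen($search) - 1 bytes
            $punter += $limit - $retro;
            fseek($big_one,$punter);
        }
        fseek($big_one,0);
    }
    fclose($big_one);
    fclose($options);
}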

Thanks in advance!

+1  A: 
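// loads the whole file into memory, so memory_limit must allow for it
$big_one    = file_get_contents('big_one.txt');
$options    = fopen('options.txt','r');

while(($line = fgets($options)) !== false)
{
  $option = trim($line);
  if($option === '')
    continue; // skip blank lines

  // strpos() returns the 0-based position, or false if not found
  $position = strpos($big_one,$option);

  if($position !== false)
  {
    echo "$option : $position\n";
    break; //exit loop
  }
}
fclose($options);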

The file is quite large, though. You might want to consider storing the data in a database instead. Or, if you absolutely can't, use the grep solution posted here.
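Since the file is this large, the search can also be done in bounded memory by reading bigger blocks and carrying a small overlap from one block to the next. Here is a minimal sketch of that idea, not a tuned implementation. The function name `find_in_file` and the 1 MB default block size are my own assumptions for illustration; nothing here comes from the original script.

```php
<?php
// Hypothetical helper (name and defaults are assumptions, not from the question).
// Returns the 0-based byte offset of the first occurrence of $needle in the
// file at $path, or false if it is not found or the file cannot be opened.
function find_in_file(string $path, string $needle, int $blockSize = 1048576)
{
    $len = strlen($needle);
    if ($len === 0) {
        return false; // nothing to search for
    }

    $fh = fopen($path, 'rb');
    if ($fh === false) {
        return false;
    }

    $tail = ''; // last ($len - 1) bytes of the previous buffer
    $base = 0;  // file offset of the first byte of $tail . $block

    while (!feof($fh)) {
        $block = fread($fh, $blockSize);
        if ($block === false || $block === '') {
            break; // end of file or read error
        }

        $buffer = $tail . $block;
        $pos = strpos($buffer, $needle);
        if ($pos !== false) {
            fclose($fh);
            return $base + $pos;
        }

        // Keep the last $len - 1 bytes so a match that spans two blocks
        // is still seen in the next iteration.
        $keep = min($len - 1, strlen($buffer));
        $tail = $keep > 0 ? substr($buffer, -$keep) : '';
        $base += strlen($buffer) - $keep;
    }

    fclose($fh);
    return false;
}
```

Calling it once per line of options.txt would give the position of each string. The overlap is kept in memory instead of seeking backwards, so the file is read strictly front to back.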

Sev
Maybe inserting it in blocks of 4 KB, for example? Is that foreach for splitting the string, or what?
Marc
+4  A: 

Why not use exec + grep -b?

exec('grep "new" ext-all-debug.js -b', $result);
// here we have looked for "new" substring entries in the extjs debug src file
var_dump($result);

sample result:

array(1142) {
    [0]=>  string(97) "3398: * insert new elements. Revisiting the example above, we could utilize templating this time:"
    [1]=>  string(54) "3910:var tpl = new Ext.DomHelper.createTemplate(html);"
    ...
}

Each item consists of the line's byte offset from the start of the file and the line itself, separated by a colon.
So after this you have to find the match inside that particular line and add its position to the line offset. E.g.:

[0]=>  string(97) "3398: * insert new elements. Revisiting the example above, we could utilize templating this time:"

This means that the occurrence of "new" is at byte 3408 (3398 is the line's offset and 10 is the position of "new" inside this line).
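The arithmetic above can be sketched in PHP like this. It's only an illustration: `match_offset` is a hypothetical helper name, and it assumes each entry of `$result` has the `offset:line` shape shown in the sample output.

```php
<?php
// Hypothetical helper (the name is an assumption for illustration).
// Takes one line of `grep -b` output ("<offset>:<line>") and returns the
// absolute byte offset of the first $needle in it, or false on failure.
function match_offset(string $grepLine, string $needle)
{
    // Split the numeric line offset from the line text at the first colon.
    if (!preg_match('/^(\d+):(.*)$/s', $grepLine, $m)) {
        return false; // not in "<offset>:<line>" form
    }

    $lineOffset = (int) $m[1];
    $inLine = strpos($m[2], $needle); // position of the match inside the line
    if ($inLine === false) {
        return false;
    }

    return $lineOffset + $inLine;
}

$sample = '3398: * insert new elements. Revisiting the example above, we could utilize templating this time:';
echo match_offset($sample, 'new'), "\n"; // prints 3408
```

Note that `strpos` is case-sensitive, like grep without `-i`, so both find the same first occurrence in the line.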

zerkms
+1. When you're dealing with files this large, it's better to leave this sort of work to tools that were built for the job.
Frank Farmer
I agree with the idea, but I need the correct way to launch grep. What's the correct command to search for a string inside a file with grep? Can it return just the position of the match? Thanks
Marc
@Marc: I've updated the answer
zerkms
Thank you very much, zerkms! I'm going to run benchmarks and tell you how much it improves the performance.
Marc
@zerkms The problem I have now is with grep's output: it gives me the entire line, and everything in this huge file is on one line, so it gives me the position number followed by a very large output that I can't manage. Do you know how to output just the first position and then quit grep (something like -q with output, or -m 1 without the entire line)? Thanks in advance
Marc
:-S don't know then :-(
zerkms