How can I search for text inside files such as PDF, DOC, DOCX, or TXT using PHP? I want something similar to full-text search in MySQL, but this time I'm searching the files directly, not a database.

The search has to cover many files located in one folder. Any suggestions, tips, or solutions for this problem?

I have also noticed that Google searches inside files like these.

A: 

If you are on a Linux server, you can use

grep -R "text to be searched for" ./   # searches everything under the current directory

called from PHP using exec():

$cmd = 'grep -R "text to be searched for" ./';
exec($cmd, $result);   // exec() returns only the last line; $result collects every output line
print_r($result);
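
If the search text comes from user input, the grep call should not be built from a raw string. A minimal sketch of a safer wrapper, assuming grep is on the server's PATH; the function name grepFiles() is made up for illustration:

```php
<?php
// Hypothetical helper: list the files under $dir that contain $needle.
function grepFiles(string $needle, string $dir): array
{
    $lines = [];
    // -r: recurse, -l: print only file names, -i: ignore case,
    // -F: treat the needle as a literal string, not a regular expression.
    // "--" ends option parsing so a needle starting with "-" is not read as a flag.
    $cmd = 'grep -rliF -- ' . escapeshellarg($needle) . ' ' . escapeshellarg($dir);
    exec($cmd, $lines, $status);

    // grep exits with 0 on a match, 1 on no match, and a higher value on error.
    if ($status > 1) {
        throw new RuntimeException("grep failed with status $status");
    }
    return $lines;
}
```

Note that this only finds text in plain-text files; binary formats such as PDF or Word need to be converted to text first.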
Thariama
+1  A: 

For searching PDFs you'll need a program like pdftotext, which converts the content of a PDF to plain text. For Word documents a similar tool may be available (you need one because of all the styling and encryption in Word files).

An example that searches through PDFs, copied from one of my scripts. It's a snippet, not the entire code, but it should give you some understanding: I extract keywords and store the matches in a PDF results array.

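// $i is the index of the current PDF in $pdfFiles (set by an enclosing loop that is not shown).
$file = ABSOLUTE_PATH_SITE."_uploaded/files/Transcripties/".$pdfFiles[$i];

// Convert the PDF once instead of once per keyword; escapeshellarg() quotes the path safely.
$content = strtolower((string) shell_exec('/usr/bin/pdftotext '.escapeshellarg($file).' -'));

foreach($keywords as $keyword)
{
    $result = substr_count($content, strtolower($keyword));

    if($result > 0)
    {
        // $matchesOnPDF holds arrays, so compare against its "pdfFile" column.
        if(!in_array($pdfFiles[$i], array_column($matchesOnPDF, "pdfFile")))
        {
            array_push($matchesOnPDF, array(
                    "matches"   => $result,
                    "type"      => "PDF",
                    "pdfFile"   => $pdfFiles[$i]));
        }
    }
}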
Ben Fransen
+2  A: 

Depending on the file type, you should convert the file to text and then search through it using, e.g., file_get_contents() and strpos(). To convert files to text you have, among others, the following tools available:

  • catdoc for Word files
  • xlhtml for Excel files
  • ppthtml for PowerPoint files
  • unrtf for RTF files
  • pdftotext for PDF files
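
The convert-then-search approach could be sketched roughly as follows. The function names fileToText() and searchFolder() are made up for illustration, and the command lines in the converter table are assumptions about how each tool is invoked, so check them against the versions installed on your server:

```php
<?php
// Hypothetical helper: return a file's contents as plain text, or null if unsupported.
function fileToText(string $path): ?string
{
    // Assumed command lines; %s is replaced by the shell-quoted file path.
    $converters = [
        'pdf' => 'pdftotext %s -',
        'doc' => 'catdoc %s',
        'rtf' => 'unrtf --text %s',
        'xls' => 'xlhtml %s',
        'ppt' => 'ppthtml %s',
    ];
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));

    if ($ext === 'txt') {
        $text = file_get_contents($path);
        return $text === false ? null : $text;
    }
    if (!isset($converters[$ext])) {
        return null;
    }
    $out = shell_exec(sprintf($converters[$ext], escapeshellarg($path)));
    if (!is_string($out)) {
        return null;
    }
    // xlhtml and ppthtml emit HTML, so strip the tags before searching.
    return in_array($ext, ['xls', 'ppt'], true) ? strip_tags($out) : $out;
}

// Hypothetical helper: names of the files directly inside $dir that contain $needle.
function searchFolder(string $dir, string $needle): array
{
    $hits = [];
    foreach (scandir($dir) as $name) {
        $path = $dir . '/' . $name;
        if (!is_file($path)) {
            continue; // skips ".", ".." and subfolders
        }
        $text = fileToText($path);
        // stripos() is the case-insensitive counterpart of strpos().
        if ($text !== null && stripos($text, $needle) !== false) {
            $hits[] = $name;
        }
    }
    return $hits;
}
```

Converting every file on every search gets slow for large folders, so caching the extracted text (for example in a database with a full-text index) is worth considering.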
cweiske
Nice answer, might come in handy someday for me too ;) I only knew about pdftotext (as you can see in my answer.. ;)) +1
Ben Fransen