What I would like to script: a PHP script to find a certain string in loads of files

Is it possible to read the contents of thousands of text files on another FTP server without actually downloading those files (via ftp_get)?

If not, would downloading them once (skip if the file already exists, redownload if the file size differs), then searching for the string, be the easiest option?

+3  A: 

If URL fopen wrappers are enabled, then file_get_contents can do the trick and you do not need to save the file on your server.

<?php

$find = 'mytext'; //text to find
$files = array('http://example.com/file1.txt', 'http://example.com/file2.txt');  //source files

foreach($files as $file)
{
     $data = file_get_contents($file); // returns FALSE on failure
     if($data !== FALSE && strpos($data, $find) !== FALSE)
           echo "found in $file".PHP_EOL;
}
?>

[EDIT]: If the files are accessible only by FTP:

In that case, you have to use FTP URLs, like this:

 $files = array('ftp://user:[email protected]/path/to/file', 'ftp://user:[email protected]/path/to/file2');
shamittomar
You are assuming the files are available through `http`, which the OP didn't specify or mention. :S
Andreas Bonini
@Andreas, updated the answer.
shamittomar
+1  A: 

If you are going to store the files after you download them, then you may be better served by simply downloading or updating all of the files and then searching through them for the string.

The best approach depends on how you will use it.

If you are going to delete the files after you have searched them, then you may also want to keep track of which ones you searched, along with their file date information, so that on a later search you won't waste time on files that haven't changed since the last time you checked them.

When you are dealing with so many files, try to cache any information that will help your program be more efficient the next time it runs, along the lines of the sketch below.
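
A minimal sketch of that caching idea, assuming the files live on an FTP server and the Memorized details (host, credentials, the cache.json file name, and the $files list) are placeholders, not part of the original answer:

<?php
// Skip files whose modification time is unchanged since the last run,
// using a local JSON file as the cache.
$cache = is_file('cache.json') ? json_decode(file_get_contents('cache.json'), true) : array();

$conn = ftp_connect('example.com');
ftp_login($conn, 'user', 'password');

foreach ($files as $file) {                    // $files: list of remote paths, gathered elsewhere
    $mtime = ftp_mdtm($conn, $file);           // -1 if the server lacks MDTM support
    if (isset($cache[$file]) && $cache[$file] === $mtime)
        continue;                              // unchanged since the last search
    // ... download (ftp_get) and search the file here ...
    $cache[$file] = $mtime;
}
ftp_close($conn);
file_put_contents('cache.json', json_encode($cache));
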

James Black
A: 

If you can connect via SSH to that server, and if you can install new PECL (and PEAR) modules, then you might consider using PHP SSH2. Here's a good tutorial on how to install and use it. This is a better alternative to FTP. But if that is not possible, your only solution is file_get_contents('ftp://domain/path/to/remote/file');.
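
The advantage is that nothing needs to be downloaded at all: you can run the search on the remote machine. A minimal sketch, assuming the PECL ssh2 extension is installed, grep is available on the server, and the host, credentials, and path are placeholders:

<?php
// Run grep remotely over SSH; only the matching file names come back.
$conn = ssh2_connect('example.com', 22);
if (!$conn || !ssh2_auth_password($conn, 'user', 'password'))
    die('SSH connection failed');

$stream = ssh2_exec($conn, 'grep -rl ' . escapeshellarg('mytext') . ' /path/to/files');
stream_set_blocking($stream, true);
echo stream_get_contents($stream);   // one matching file name per line
fclose($stream);
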

** UPDATE **

Here is a PHP-only implementation of an SSH client: SSH in PHP.

Yanick Rochon
+1  A: 

PHP's built-in file reading functions, such as fopen()/fread()/fclose() and file_get_contents(), do support FTP URLs, like this:

<?php
$data = file_get_contents('ftp://user:[email protected]/dir/file');
// The file's contents are stored in the $data variable

If you need to get a list of the files in the directory, you might want to check out opendir(), readdir() and closedir(), which I'm pretty sure support FTP URLs.
An example:

<?php
$dir = opendir('ftp://user:[email protected]/dir/');
if(!$dir)
  die;
while(($file = readdir($dir)) !== false)
  echo htmlspecialchars($file).'<br />';
closedir($dir);
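
Combining the two, here is a minimal sketch that scans every file in an FTP directory for the search string (the credentials, path, and search string are placeholders):

<?php
$base = 'ftp://user:[email protected]/dir/';
$find = 'mytext';

$dir = opendir($base);
while (($file = readdir($dir)) !== false) {
    if ($file === '.' || $file === '..')
        continue;                              // skip directory entries
    $data = file_get_contents($base . $file);
    if ($data !== false && strpos($data, $find) !== false)
        echo htmlspecialchars($file) . ' contains the string<br />';
}
closedir($dir);
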
Frxstrem
A: 

With FTP you'll always have to download the files in order to check them.

I do not know what kind of bandwidth you have or how big the files are, but this might be an interesting use case for running the job from the cloud, e.g. Amazon EC2 or Google App Engine (if you can download the files within the time limit).

In the EC2 case you would then spin up the server for an hour to check for updates in the files and shut it down again afterwards. This would cost a couple of bucks per month and save you from potentially having to upgrade your line or hosting contract.

Peter Tillemans
A: 

If this is a regular task, then it might be worth using a simple queue system so you can run multiple processes at once (which will hugely increase speed). This would involve three steps:

  1. Get a list of all files on the remote server.
  2. Put the list into a queue (you can use memcached for a basic message queuing system); see the sketch below.
  3. Use a separate script to get the next item from the queue.
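
A minimal sketch of steps 1 and 2, assuming the Memcached extension is available and using two atomic counters as the queue's head and tail pointers; the host, credentials, and key names are placeholders:

<?php
// Producer: list the remote files and push each one into a Memcached "queue".
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);
$mc->set('q:head', 0);
$mc->set('q:tail', 0);

$dir = opendir('ftp://user:[email protected]/dir/');
while (($file = readdir($dir)) !== false) {
    if ($file === '.' || $file === '..')
        continue;
    $slot = $mc->increment('q:tail');          // atomic, so producers never collide
    $mc->set('q:item:' . $slot, $file);
}
closedir($dir);
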

The processing script would then contain simple functionality in a loop, roughly as follows (note that the ftp:// stream wrapper opens and closes the FTP connection for you, so no explicit ftp_connect/ftp_close is needed):

<?php
// Worker: claim queue slots until the queue is drained.
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

while (true) {
    $slot = $mc->increment('q:head');          // atomically claim the next slot
    if ($slot === false || $slot > $mc->get('q:tail'))
        break;                                 // queue drained
    $file = $mc->get('q:item:' . $slot);
    $contents = file_get_contents('ftp://user:[email protected]/dir/' . $file);
    if ($contents !== false && preg_match('/mytext/', $contents))
        echo "found in $file" . PHP_EOL;
}

You could then, in theory, fork off multiple worker processes from the command line without needing to worry about race conditions, since the queue's increment operation is atomic.

This method is probably best suited to cron/batch processing; however, it might work in this situation too.

Kieran Allen