I want to synchronize two directories, and I use file_get_contents($source) === file_get_contents($dest) to compare two files. Is there any problem with doing this?

A: 

No, that should work

Kristoffer S Hansen
+8  A: 

Use sha1_file() instead. It's faster and works fine if you just need to see whether the files differ. If the files are large, comparing the whole strings to each other can be very heavy. As sha1_file() returns a 40-character representation of the file, comparing the results will be very fast.

You can also consider other methods, such as comparing filemtime() or filesize(), but hashing detects a change even if just one bit is different.
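
A minimal sketch of the hash approach described above, with a size check first so that hashing is skipped when the files cannot match. The function name same_by_hash() is made up for illustration; sha1_file() and filesize() are standard PHP functions.

```php
<?php
// Hypothetical helper: true when both files produce the same SHA-1 digest.
function same_by_hash(string $source, string $dest): bool
{
    // Files of different sizes can never have equal content, so skip hashing.
    if (filesize($source) !== filesize($dest)) {
        return false;
    }

    $a = sha1_file($source);
    $b = sha1_file($dest);

    // sha1_file() returns false on failure; treat unreadable files as not equal.
    if ($a === false || $b === false) {
        return false;
    }

    // Strict comparison avoids PHP's loose numeric-string matching.
    return $a === $b;
}
```

Note that both files are still read in full whenever their sizes match, so this saves memory rather than I/O.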

Tatu Ulmanen
Thanks a lot. I'll try it.
xdazzyy
`sha1_file()` has to hash the whole file. Surely it's no faster than a manual comparison.
Oli
@Oli, I assume that hashing the files and comparing 40 characters to other 40 characters is faster than comparing the whole file contents to each other.
Tatu Ulmanen
I'm not sure that's a fair assumption. Say you have two files, 2M chars long, and the first characters are different. Hashing would read 4M chars, build two hashes, then compare 1 to 40 chars (depending on the similarity of the hashes). Direct comparison would read 2 chars and return. It's an extreme case, but direct comparison never reads more data than hashing, and it reads less whenever the files differ early on.
Oli
+1  A: 

Seems a bit heavy. This will load both files completely as strings and then compare.

I think you might be better off opening both files manually and ticking through them, perhaps just doing a filesize check first.

Oli
+1  A: 

There isn't anything wrong with what you are doing here, except that it is a little inefficient. If you load the full contents of each file and compare them, you may run into problems, especially with larger files or binary data.

I would take a look at filemtime() (last modified time) and filesize(), and run some tests to see if that works for you. It should be all you need at a fraction of the computing cost.
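
A rough sketch of that metadata check. The name probably_unchanged() is hypothetical, and the sketch assumes your sync step copies the modification time to the destination (plain copy() does not preserve it).

```php
<?php
// Hypothetical helper: cheap metadata check; no file contents are read.
function probably_unchanged(string $source, string $dest): bool
{
    // PHP caches stat results, so drop them before checking again.
    clearstatcache();

    if (!is_file($source) || !is_file($dest)) {
        return false;
    }

    // Same size and same modification time: assume the content is the same.
    return filesize($source) === filesize($dest)
        && filemtime($source) === filemtime($dest);
}
```

For this to be useful, the sync would call something like touch($dest, filemtime($source)) after each copy. Matching metadata is only a hint, so a content comparison is still needed when you must be certain.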

Sam152
A: 

This will work, but it is inherently less efficient than calculating a checksum for both files and comparing those. Good candidates for checksum algorithms are SHA1 and MD5.

http://php.net/sha1_file

http://php.net/md5_file

// Use ===, since == can treat numeric-looking hex strings (e.g. "0e1" and "0e2") as equal
if (sha1_file($source) === sha1_file($dest)) {
    /* ... */
}
David Gonrab
+3  A: 
  • Memory: say you have a 32 MB memory limit and the files are 20 MB each. Loading both ends in an unrecoverable fatal error when PHP tries to allocate memory. This can be solved by comparing the files in smaller parts.
  • Speed: string comparisons are not the fastest thing in the world; calculating a sha1 hash should be faster. If you want to be 110% sure, you can compare the files byte by byte when the hashes match, but the hash check alone already rules out all the cases where both content and hash have changed (99%+ of cases).
  • Efficiency: do some preliminary checks first. For example, there's no point comparing two files if their sizes differ.
Piskvor
+1 for the file size check. Although, if these are text files, look out for automatic line ending conversion, which can be a hassle.
Boldewyn
@Boldewyn: Good point - Windows line endings are 2 bytes, UNIX/Mac line endings are 1 byte. However, in this case, it would mean that one of the files *has* been changed (with the exception of FTP, where all sorts of crazy things happen).
Piskvor
That's exactly the problem. If one of the folders is on the other end of a pipe that does line ending conversion, all text files will always be different.
Boldewyn
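
If line ending conversion is a concern, one option is to compare text files line by line and ignore the line terminator. This is only a sketch: text_files_match() is a hypothetical name, it only handles "\r\n" versus "\n", and it also ignores a missing newline at the very end of a file. A file size check does not apply here, because converted files legitimately differ in size.

```php
<?php
// Hypothetical helper: compares two text files line by line,
// treating "\r\n" and "\n" as the same line ending.
function text_files_match(string $a, string $b): bool
{
    $ah = fopen($a, 'rb');
    $bh = fopen($b, 'rb');
    if ($ah === false || $bh === false) {
        if ($ah !== false) fclose($ah);
        if ($bh !== false) fclose($bh);
        return false;
    }

    $same = true;
    while (true) {
        $la = fgets($ah);
        $lb = fgets($bh);

        if ($la === false || $lb === false) {
            // Equal only if both files ran out of lines at the same time.
            $same = ($la === $lb);
            break;
        }

        // Strip trailing CR/LF characters before comparing the line text.
        if (rtrim($la, "\r\n") !== rtrim($lb, "\r\n")) {
            $same = false;
            break;
        }
    }

    fclose($ah);
    fclose($bh);
    return $same;
}
```

This should only be used for files known to be text; for binary files a "\r\n" sequence is real data and must not be ignored.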
A: 

I would rather do something like this:

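function files_are_equal($a, $b)
{
  // Files of different sizes cannot have the same content
  if(filesize($a) !== filesize($b))
      return false;

  // Open both files in binary mode; treat unreadable files as not equal
  $ah = fopen($a, 'rb');
  $bh = fopen($b, 'rb');
  if($ah === false || $bh === false)
  {
    if($ah !== false) fclose($ah);
    if($bh !== false) fclose($bh);
    return false;
  }

  // Compare in 8 KB chunks and stop at the first difference
  $result = true;
  while(!feof($ah))
  {
    // Strict !== is needed: != treats numeric-looking strings such as "0e1" and "0e2" as equal
    if(fread($ah, 8192) !== fread($bh, 8192))
    {
      $result = false;
      break;
    }
  }

  fclose($ah);
  fclose($bh);

  return $result;
}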

This first checks whether the file sizes match and, if they do, reads both files chunk by chunk, stopping at the first difference.

  • Checking the modification time can be a quick way in some cases, but it only tells you whether the files were modified at different times. They might still have the same content.
  • Using sha1 or md5 might be a good idea, but creating the hash requires reading the whole file. If the hash can be stored and reused later, that's probably a different story.
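
As an illustration of the "stored hash" case, here is a minimal sketch that keeps a path => hash array between runs. The function name changed_since_last_run() and the manifest layout are assumptions made for this example, not something from the answers above.

```php
<?php
// Hypothetical helper: returns true when $file differs from the hash recorded
// in $manifest (a path => sha1 array), and updates the record either way.
function changed_since_last_run(string $file, array &$manifest): bool
{
    $hash = sha1_file($file);
    if ($hash === false) {
        // Unreadable file: report it as changed so the caller can deal with it.
        return true;
    }

    $changed = !isset($manifest[$file]) || $manifest[$file] !== $hash;
    $manifest[$file] = $hash;

    return $changed;
}
```

The manifest could be saved between runs with json_encode() and loaded again with json_decode($json, true). With stored hashes, each sync only has to hash one side per file instead of reading both.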
Svish