The data set is 97,984 files in 6,766 folders, totaling 2.57 GB. A lot of them are binary files.

To me this doesn't sound like much. The daily change rate is in the hundreds of KB across maybe 50 files. But I'm worried that Subversion will become extremely slow.

It was never fast anyway, and the last time I asked, around v1.2, the recommendation was to split it into multiple repositories. No, I don't like that.

Is there a way that I can tell Subversion, or any other free open source version control system, to trust the file modification time/file size to detect file changes instead of comparing all the files? With that, plus the data on a fast modern SSD, it should run fast - say, less than 6 seconds for a complete commit (that's 3x longer than getting the summary from the Windows Explorer properties dialog).

+3  A: 

I think the best way is to try it for yourself. Mercurial will work fine, since it doesn't compare file content if the mtime hasn't changed, which is exactly what you asked for.
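
The general idea is worth spelling out: cache each file's size and mtime, and only read content when the cached values no longer match. Here is a rough Python sketch of that heuristic (this is not Mercurial's actual dirstate code; the cache format and function name are made up for illustration):

    import os

    def changed_files(root, cache):
        """Yield paths that may have changed, using a size/mtime heuristic.

        `cache` maps relative paths to the (size, mtime) pair recorded at
        the last commit.  Files whose size and mtime both match the cache
        are assumed unchanged and their content is never read.
        """
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                rel = os.path.relpath(path, root)
                st = os.stat(path)
                if cache.get(rel) != (st.st_size, st.st_mtime):
                    yield rel   # new file or stat mismatch: do a real compare
                # otherwise: trust the stat data and skip the file entirely

Only the paths yielded here would ever need a real content comparison, which is why a status run over tens of thousands of unchanged files can finish in a few seconds.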

Here are the timings (not on an SSD):

Data size - 2.3 GB (84,000 files in 6,000 directories, random textual data)
Checkout time (hg update from the null rev to tip) - 1m 5s
Status time (after changing 1,800 files, ~35 MB) - 3s
Commit time (after the same change) - 11s

If you want to avoid a full tree scan during commit, you could try the inotify extension (use the "tip" version where all known bugs should be fixed).
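
The idea behind the extension is to have a daemon listen to filesystem change notifications so that status never has to walk the whole tree. Just to illustrate the principle (this is not the inotify extension itself, and it assumes the third-party watchdog package):

    import time
    from watchdog.observers import Observer
    from watchdog.events import FileSystemEventHandler

    class DirtyTracker(FileSystemEventHandler):
        """Remember every path touched since the last query."""
        def __init__(self):
            self.dirty = set()

        def on_any_event(self, event):
            if not event.is_directory:
                self.dirty.add(event.src_path)

    tracker = DirtyTracker()
    observer = Observer()
    observer.schedule(tracker, "path/to/working/copy", recursive=True)
    observer.start()
    try:
        while True:
            time.sleep(5)
            if tracker.dirty:
                print("changed since last check:", sorted(tracker.dirty))
                tracker.dirty.clear()
    finally:
        observer.stop()
        observer.join()

A status backed by a daemon like this only has to look at the files in the dirty set instead of scanning all 84,000.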

You need to be aware that cloning such a repo might be painful for your users since they will have to transfer quite a lot of data.

EDIT: I missed the (implicit) fact that you are running this on Windows, so inotify won't work (hopefully it will be ported to Windows in the future, but that's not the case right now).

EDIT 2: added timings

tonfa
+3  A: 

Is there a way that I can tell Subversion or any other free open source version control system to trust the file modification time/file size to detect file changes instead of comparing all the files?

I think Subversion already does this. Look at this piece of code in libsvn_wc/questions.c (r39196):

  if (! force_comparison)
    {
      svn_filesize_t translated_size;
      apr_time_t last_mod_time;

      /* We're allowed to use a heuristic to determine whether files may
         have changed.  The heuristic has these steps:


         1. Compare the working file's size
            with the size cached in the entries file
         2. If they differ, do a full file compare
         3. Compare the working file's timestamp
            with the timestamp cached in the entries file
         4. If they differ, do a full file compare
         5. Otherwise, return indicating an unchanged file.

I sampled a few places where this function is called, and the force_comparison parameter was always FALSE. I only spent a few minutes looking though.

Wim Coenen
+3  A: 

I've just done a benchmark on my machine to see what this is like:

Data size - 2.3 GB (84,000 files in 6,000 directories, random textual data)
Checkout time - 14m
Changed 500 files (14 MB of data changes)
Commit time - 50 seconds

To get an idea of how long it would take to compare all those files manually, I also ran a diff against two exports of that data (version 1 against version 2).

Diff time: 55m
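
For reference, that kind of full content comparison is easy to reproduce with Python's standard filecmp module; a quick sketch (the directory names are placeholders):

    import filecmp

    def report_differences(left, right):
        """Recursively compare two exported trees, reading file contents."""
        dcmp = filecmp.dircmp(left, right)
        # shallow=False forces a byte-by-byte comparison instead of the
        # default os.stat() shortcut
        _match, mismatch, _errors = filecmp.cmpfiles(
            left, right, dcmp.common_files, shallow=False)
        for name in mismatch:
            print("differs:", name)
        for name in dcmp.left_only + dcmp.right_only:
            print("only in one tree:", name)
        for sub in dcmp.common_dirs:
            report_differences(f"{left}/{sub}", f"{right}/{sub}")

    report_differences("export-version1", "export-version2")

Note the shallow=False: with the default shallow=True, filecmp would itself fall back to comparing os.stat() signatures and finish far faster - the same shortcut the version control tools take.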

I'm not sure whether an SSD would get that commit time down as much as you hope, but I was using a normal single SATA disk for both the 50-second and 55-minute measurements.

To me, these times strongly suggest that the contents of the files are not being checked by svn by default.

This was with svn 1.6.

Jim T
Thanks for this. 50 seconds for a commit with an empty file cache sounds nice. If you still have the data, can you check what an empty commit would take? Maybe measure the first run and an immediate second one when the file system cache is hot - the latter should be comparable to having the data on an SSD. One of the problems with Subversion and diff is that they are still single-threaded. I hope this changes in 2.0.
Lothar
svn ci for a null commit takes 30 seconds with a hot cache. The initial svn status just after checkout (cold) took 1m 21s with 500 changed files (14 MB of changes).
Jim T
It might also be worth noting that after 30 commits of 38 MB of changes each, commit times stayed linear (the 30th commit took the same amount of time as the 1st). Also, while the repo holds 38 revisions of these 84,000 files, the repository itself only contains 1,100 files.
Jim T
Make that 110 files - I had 990 files left over from a transaction I aborted.
Jim T
Hmm, on the downside, after a reboot (cold cache) a null commit took 4m 32s.
Jim T
The second null commit took 30 seconds again.
Jim T
Could you put your creation script online, so we can test with other SCMs?
tonfa
Good idea, although I'm sure any modern system will perform admirably. The commented-out values are the ones used to create the initial large data set: http://pastebin.com/f1570cb55
Jim T
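
In case the pastebin link ever goes away, a rough stand-in for such a generator is sketched below. This is not Jim T's actual script; the counts and file size are placeholders picked to land near 84,000 files in 6,000 directories at roughly 2.3 GB.

    import os
    import random
    import string

    def generate_tree(root, n_dirs=6000, n_files=84000, file_size=28 * 1024):
        """Create random textual data spread over a directory tree."""
        random.seed(42)
        dirs = []
        for i in range(n_dirs):
            d = os.path.join(root, "dir%03d" % (i // 100), "sub%05d" % i)
            os.makedirs(d, exist_ok=True)
            dirs.append(d)
        alphabet = string.ascii_letters + " \n"
        for i in range(n_files):
            path = os.path.join(random.choice(dirs), "file%06d.txt" % i)
            with open(path, "w") as fh:
                fh.write("".join(random.choices(alphabet, k=file_size)))

    generate_tree("testdata")
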
I just had a quick look at git: <1 second for a null commit to the local repo. A normal largish commit took 20 seconds, so still faster than svn.
Jim T