The data set is 97,984 files in 6,766 folders, totaling 2.57 GB. A lot of them are binary files.

To me this doesn't sound like much. The daily change rate is in the hundreds of KB across maybe 50 files. But I'm worried that Subversion will become extremely slow.

It was never fast anyway, and the last time I asked, around v1.2, the recommendation was to split it into multiple repositories. No, I don't like that.

Is there a way that I can tell Subversion, or any other free open source version control system, to trust the file modification time/file size to detect file changes instead of comparing all the files? With that, plus the data on a fast modern SSD, it should run fast - say, less than 6 seconds for a complete commit (that's 3x longer than getting the summary from the Windows Explorer properties dialog).

+3  A: 

I think the best way is to try it for yourself. Mercurial will work fine, since it doesn't compare file content if the mtime hasn't changed, which is exactly what you asked for.
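
The general idea is worth spelling out: cache each file's size and mtime, and only read content when the cached values no longer match. Here is a rough Python sketch of that heuristic (this is not Mercurial's actual dirstate code; the cache format and function name are made up for illustration):

    import os

    def changed_files(root, cache):
        """Yield paths that may have changed, using a size/mtime heuristic.

        `cache` maps relative paths to the (size, mtime) pair recorded at
        the last commit.  Files whose size and mtime both match the cache
        are assumed unchanged and their content is never read.
        """
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                rel = os.path.relpath(path, root)
                st = os.stat(path)
                if cache.get(rel) != (st.st_size, st.st_mtime):
                    yield rel   # new file or stat mismatch: do a real compare
                # otherwise: trust the stat data and skip the file entirely

Only the paths yielded here would ever need a real content comparison, which is why a status run over tens of thousands of unchanged files can finish in a few seconds.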

Here are the timings (not on an SSD):

Data size - 2.3 GB (84,000 files in 6,000 directories, random textual data)
Checkout time (hg update from the null rev to tip) - 1m 5s
Status time (after changing 1,800 files, ~35 MB) - 3s
Commit time (after the same change) - 11s

If you want to avoid a full tree scan during commit, you could try the inotify extension (use the "tip" version where all known bugs should be fixed).
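
The idea behind the extension is to have a daemon listen to filesystem change notifications so that status never has to walk the whole tree. Just to illustrate the principle (this is not the inotify extension itself, and it assumes the third-party watchdog package):

    import time
    from watchdog.observers import Observer
    from watchdog.events import FileSystemEventHandler

    class DirtyTracker(FileSystemEventHandler):
        """Remember every path touched since the last query."""
        def __init__(self):
            self.dirty = set()

        def on_any_event(self, event):
            if not event.is_directory:
                self.dirty.add(event.src_path)

    tracker = DirtyTracker()
    observer = Observer()
    observer.schedule(tracker, "path/to/working/copy", recursive=True)
    observer.start()
    try:
        while True:
            time.sleep(5)
            if tracker.dirty:
                print("changed since last check:", sorted(tracker.dirty))
                tracker.dirty.clear()
    finally:
        observer.stop()
        observer.join()

A status backed by a daemon like this only has to look at the files in the dirty set instead of scanning all 84,000.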

You need to be aware that cloning such a repo might be painful for your users since they will have to transfer quite a lot of data.

EDIT: I missed the (implicit) fact that you are running this on Windows, so inotify won't work (hopefully it will be ported to Windows in the future, but that's not the case right now).

EDIT 2: added timings

tonfa
+3  A: 

Is there a way that I can tell Subversion or any other free open source version control system to trust the file modification time/file size to detect file changes instead of comparing all the files?

I think Subversion already does this. Look at this piece of code in libsvn_wc/questions.c (r39196):

  if (! force_comparison)
    {
      svn_filesize_t translated_size;
      apr_time_t last_mod_time;

      /* We're allowed to use a heuristic to determine whether files may
         have changed.  The heuristic has these steps:


         1. Compare the working file's size
            with the size cached in the entries file
         2. If they differ, do a full file compare
         3. Compare the working file's timestamp
            with the timestamp cached in the entries file
         4. If they differ, do a full file compare
         5. Otherwise, return indicating an unchanged file.

I sampled a few places where this function is called, and the force_comparison parameter was always FALSE. I only spent a few minutes looking though.

Wim Coenen
+3  A: 

I've just done a benchmark on my machine to see what this is like:

Data size - 2.3 GB (84,000 files in 6,000 directories, random textual data)
Checkout time - 14m
Changed 500 files (14 MB of data changes)
Commit time - 50 seconds

To get an idea of how long it would take to compare all those files manually, I also ran a diff against two exports of that data (version 1 against version 2).

Diff time: 55m
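
For reference, that kind of full content comparison is easy to reproduce with Python's standard filecmp module; a quick sketch (the directory names are placeholders):

    import filecmp

    def report_differences(left, right):
        """Recursively compare two exported trees, reading file contents."""
        dcmp = filecmp.dircmp(left, right)
        # shallow=False forces a byte-by-byte comparison instead of the
        # default os.stat() shortcut
        _match, mismatch, _errors = filecmp.cmpfiles(
            left, right, dcmp.common_files, shallow=False)
        for name in mismatch:
            print("differs:", name)
        for name in dcmp.left_only + dcmp.right_only:
            print("only in one tree:", name)
        for sub in dcmp.common_dirs:
            report_differences(f"{left}/{sub}", f"{right}/{sub}")

    report_differences("export-version1", "export-version2")

Note the shallow=False: with the default shallow=True, filecmp would itself fall back to comparing os.stat() signatures and finish far faster - the same shortcut the version control tools take.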

I'm not sure whether an SSD would get that commit time down as much as you hope, but I was using a normal single SATA disk for both the 50-second and 55-minute measurements.

To me, these times strongly suggest that the contents of the files are not being checked by svn by default.

This was with svn 1.6.

Jim T
Thanks for this. 50 seconds for a commit with an empty file cache sounds nice. If you still have the data, can you check what an empty commit would take? Maybe measure the first run and an immediate second one when the file system cache is hot - the latter should be comparable to having the data on an SSD. One of the problems with Subversion and diff is that they are still single-threaded. I hope this changes in 2.0.
Lothar
svn ci for a null commit takes 30 seconds with a hot cache. The initial svn status just after checkout (cold) took 1m 21s with 500 changed files (14 MB of changes).
Jim T
It might also be worth noting that after 30 commits of 38 MB of changes each, commit times stayed linear (the 30th commit took the same amount of time as the 1st). Also, while the repo holds 38 revisions of these 84,000 files, the repository itself only contains 1,100 files.
Jim T
Make that 110 files - I had 990 files left over from a transaction I aborted.
Jim T
Hmm, on the downside, after a reboot (cold cache) a null commit took 4m 32s.
Jim T
The second null commit took 30 seconds again.
Jim T
Could you put your creation script online, so we can test with other SCMs?
tonfa
Good idea, although I'm sure any modern system will perform admirably. The commented-out values are the ones used to create the initial large data set: http://pastebin.com/f1570cb55
Jim T
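
In case the pastebin link ever goes away, a rough stand-in for such a generator is sketched below. This is not Jim T's actual script; the counts and file size are placeholders picked to land near 84,000 files in 6,000 directories at roughly 2.3 GB.

    import os
    import random
    import string

    def generate_tree(root, n_dirs=6000, n_files=84000, file_size=28 * 1024):
        """Create random textual data spread over a directory tree."""
        random.seed(42)
        dirs = []
        for i in range(n_dirs):
            d = os.path.join(root, "dir%03d" % (i // 100), "sub%05d" % i)
            os.makedirs(d, exist_ok=True)
            dirs.append(d)
        alphabet = string.ascii_letters + " \n"
        for i in range(n_files):
            path = os.path.join(random.choice(dirs), "file%06d.txt" % i)
            with open(path, "w") as fh:
                fh.write("".join(random.choices(alphabet, k=file_size)))

    generate_tree("testdata")
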
I just had a quick look at git: <1 second for a null commit to the local repo. A normal largish commit took 20 seconds, so still faster than svn.
Jim T