We use SVN for our source-code revision control and are experimenting with using it for non-source-code files.

We are working with a large set (300-500k) of short (1-4 kB) text files that will be updated on a regular basis and need to be version controlled. We tried SVN in flat-file mode, and it struggled to handle the first commit (500k files checked in), which took about 36 hours.

On a daily basis, we need the system to be able to handle 10k modified files per commit transaction in a short time (<5 min).

My questions:

  1. Is SVN the right solution for my purpose? The initial speed seems too slow for practical use.
  2. If yes, is there a particular SVN server implementation that is fast? (We are currently using the default GNU/Linux SVN server and command-line client.)
  3. If no, what are the best F/OSS or commercial alternatives?

Thanks


Edit 1: I need version control because multiple people will be concurrently modifying the same files and will be doing manual diff/merge/conflict-resolution in exactly the same way programmers edit source code. Thus I need a central repository to which people can check in their work and from which they can check out others' work. The workflow is virtually identical to a programming workflow, except that the users are not programmers and the file content is not source code.


Update 1: It turns out that the primary issue was more of a filesystem issue than an SVN issue. For SVN, committing a single directory with half a million new files did not finish even after 24 hours. Splitting the same files across 500 folders arranged in a 1x5x10x10 tree, with 1000 files per folder, brought the commit time down to 70 minutes. Commit speed drops exponentially over time for a single folder with a large number of files. Git seems a lot faster. Will update with times.
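For reference, a minimal sketch of the kind of sharded layout described above, assuming files are mapped into a fixed 5x10x10 tree by a hash of their name (the tree shape, helper names, and paths are illustrative, not the exact layout used):

```python
import hashlib
import os
import shutil

def sharded_path(root, filename, fanout=(5, 10, 10)):
    """Map a filename onto a fixed directory tree, e.g. root/3/7/2/filename,
    so no single folder ends up holding hundreds of thousands of entries."""
    digest = int(hashlib.md5(filename.encode("utf-8")).hexdigest(), 16)
    parts = []
    for width in fanout:
        parts.append(str(digest % width))
        digest //= width
    return os.path.join(root, *parts, filename)

def shard_tree(src_dir, dst_root):
    """Copy a flat directory of files into the sharded layout under dst_root."""
    for name in os.listdir(src_dir):
        target = sharded_path(dst_root, name)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.copy2(os.path.join(src_dir, name), target)

# Example (hypothetical paths):
# shard_tree("flat_files", "working_copy/data")
```

With the default fanout this yields 500 leaf folders, matching the 1000-files-per-folder arrangement that committed in 70 minutes.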

+11  A: 

As of July 2008, the Linux kernel git repo had about 260,000 files. (2.6.26)

http://linuxator.wordpress.com/2008/07/22/5-things-you-didnt-know-about-linux-kernel-code-metrics/

At that number of files, the kernel developers still say git is really fast. I don't see why it'd be any slower at 500,000 files. Git tracks content, not files.

jonescb
To reaffirm this: I just tested a commit which essentially rewrote all the contents of an enormous repository (26000 files, 5GB). It took about 6 minutes, mostly I/O-limited over a not-that-fast network mount. In your use case, the diffs are more like 50MB, so you should see much faster commit times. (Your initial commit could still take a while - wild guess five minutes to an hour depending on your system.)
Jefromi
Be aware. Git has a steep learning curve for programmers and can be baffling to non-coders. I now use git all the time and couldn't work without it, but it took me a few months to get comfy. Make sure you are ready to sink some hours into training your non-programmer colleagues if you commit to Git-- no pun intended :)
AndyL
@Andy Thanks for that valuable comment about Git's learning curve.
hashable
+3  A: 

Git is your best bet. You can check in an entire operating system (two gigabytes of code in a few hundred thousand files) and it remains usable, although the initial check-in will take quite a while, around 40 minutes.

Andrew McGregor
only 40 mins? Wow!
Lucas B
Presuming the system has a fast disk, yes. I suppose an SSD would be the way to go for ultimate speed in revision control systems.
Andrew McGregor
@Andrew Thanks for that tip. Yes, using an SSD as the SVN server's drive would speed things up.
hashable
@hashable: you'd have to research that. I think the hard disk on the client is more critical than the one in the server when using SVN.
Sander Rijken
The client would be more critical with git too.
Andrew McGregor
+3  A: 
  1. For SVN "flat-file mode", meaning FSFS I presume:

    • Make sure you're running the latest SVN. FSFS had sharding added in ~1.5 IIRC, which will be a night-and-day difference at 500k files. The particular filesystem you run on will also have a huge effect. (Don't even think about this on NTFS.)
    • You're going to be I/O-bound with that many file transactions. SVN is not very efficient here, having to stat files in .svn/ as well as the real files (a rough timing sketch follows this list).
  2. Git has far better performance than SVN, and you owe it to yourself to at least compare the two.
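A rough way to get a feel for that raw filesystem cost is to time plain stat() calls over the working copy, including the per-directory .svn metadata; this is only a lower bound on what the client actually does, and the paths are assumptions:

```python
import os
import time

def time_stats(working_copy):
    """Walk a working copy and time a stat() of every file it contains,
    including entries under the per-directory .svn metadata."""
    count = 0
    start = time.perf_counter()
    for dirpath, _dirnames, filenames in os.walk(working_copy):
        for name in filenames:
            os.stat(os.path.join(dirpath, name))
            count += 1
    elapsed = time.perf_counter() - start
    print(f"stat'ed {count} files in {elapsed:.1f}s "
          f"({count / elapsed:.0f} files/s)")

# Example (hypothetical path): time_stats("/path/to/working/copy")
```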

Nathan Kidd
@Nathan Yes. I believe we are using version 1.6.x of SVN.
hashable
And with that number of files, SVN 1.7 will have much better support, since it scraps the per-directory .svn directories that have a significant impact with a very large number of files. Of course, it isn't out yet.
gbjbaanb
Sharding will help you when you have a large number of revisions; it doesn't improve anything for the number of files. It's the revisions that are sharded in the repository.
Sander Rijken
@Sander: Right, good point. I guess I was imagining "updating on a regular basis" as individual commits, but that's not so likely with that number of files. The real slow-down is client side.
Nathan Kidd
A: 

Do you really need more than a filesystem with cheap snapshots, like ZFS? You could configure it to snapshot the filesystem every 5 minutes to give yourself some level of change history.
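If that route is of interest, a minimal sketch of scheduled snapshots, assuming a ZFS dataset named tank/docs (the dataset name and interval are placeholders; in practice this would live in cron or systemd rather than a Python loop):

```python
import subprocess
import time
from datetime import datetime

DATASET = "tank/docs"  # hypothetical dataset holding the text files

def snapshot_loop(interval_seconds=300):
    """Take a named ZFS snapshot every five minutes; each snapshot is a
    cheap, read-only view of the filesystem at that moment."""
    while True:
        stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
        subprocess.run(["zfs", "snapshot", f"{DATASET}@auto-{stamp}"],
                       check=True)
        time.sleep(interval_seconds)
```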

joeforker
Your answer sounds like a question (typo?). Anyway, good pointer!
paprika
It's called the Socratic method ;-)
joeforker
+3  A: 

For such short files, I'd look into using a database instead of a filesystem.
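A minimal sketch of what that could look like with SQLite, keeping a crude revision history per document (the schema and names are assumptions, not a drop-in replacement for a real VCS):

```python
import sqlite3

def open_store(path="documents.db"):
    """Open (or create) a small store of versioned text documents."""
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS doc_versions (
            name     TEXT NOT NULL,
            revision INTEGER NOT NULL,
            content  TEXT NOT NULL,
            PRIMARY KEY (name, revision)
        )""")
    return conn

def save(conn, name, content):
    """Store a new revision of a short text document and return its number."""
    cur = conn.execute(
        "SELECT COALESCE(MAX(revision), 0) FROM doc_versions WHERE name = ?",
        (name,))
    next_rev = cur.fetchone()[0] + 1
    conn.execute(
        "INSERT INTO doc_versions (name, revision, content) VALUES (?, ?, ?)",
        (name, next_rev, content))
    conn.commit()
    return next_rev

# Example:
# conn = open_store()
# save(conn, "records/item-0001.txt", "updated text")
```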

Javier
+2  A: 

Is SVN suitable? As long as you're not checking out or updating the entire repository, then yes, it is.

SVN is quite bad at committing very large numbers of files (especially on Windows), as all those .svn directories are written to in order to update a lock when you operate on the system. If you have a small number of directories, you won't notice, but the time taken seems to increase exponentially.

However, once everything is committed (in chunks, perhaps directory by directory), things become very much quicker. Updates don't take as long, and you can use the sparse-checkout feature (highly recommended) to work on sections of the repository. Assuming you don't need to modify thousands of files at once, you'll find it works quite well.

Committing 10,000 files all at once is, again, not going to be speedy, but 1,000 files ten times a day will be much more manageable.
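A sketch of that chunked approach, assuming the list of changed files is already known and the svn command-line client is on the PATH (the chunk size and file list are illustrative):

```python
import subprocess

def commit_in_chunks(changed_files, chunk_size=1000, message="batch update"):
    """Commit a long list of modified files in fixed-size chunks instead of
    one enormous transaction."""
    for i in range(0, len(changed_files), chunk_size):
        chunk = changed_files[i:i + chunk_size]
        subprocess.run(["svn", "commit", "-m", message] + chunk, check=True)

# Example (hypothetical file list):
# commit_in_chunks(open("changed.txt").read().split())
```

Note that splitting one logical change into chunks does give up the atomicity of a single commit.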

So try it once you've got all the files in there, and see how it works. All this will be fixed in 1.7, as the working-copy mechanism is being modified to remove those per-directory .svn directories (so keeping locks is simpler and much quicker).

gbjbaanb
It's not really the large number of files; it's the large number of directories that impacts performance the most.
Sander Rijken
@gbjbaanb @Sander Too many files in a single folder seems to be the problem. Please look at Update 1.
hashable
I was referring to the slowdowns described by @gbjbaanb caused by .svn directories. That slowdown is caused by having many directories, not by having many files. Even locking the working copy before the operation and unlocking it afterwards takes a lot of time if there are many directories.
Sander Rijken
Too many files in one directory... try your timing with the virus checker turned off. That .svn directory needs to be updated per file when you commit. Not good. Also, post your timings on the SVN dev mailing list - you may get some help there, or at least prompt someone to take a look at what's going on.
gbjbaanb
A: 

Is there any reason you need to commit 10k modified files per commit? Subversion would scale much better if every user checked in his/her own files right away. Then, the one time a day you need to 'publish' the files, you can tag them very quickly and run the published version from the tag.
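A sketch of that tagging step, assuming a conventional trunk/tags repository layout (the repository URL and tag name are placeholders):

```python
import subprocess
from datetime import date

REPO = "http://svn.example.com/repo"  # hypothetical repository URL

def publish_tag():
    """Create a server-side copy of trunk as today's published tag; svn
    copies are cheap on the server regardless of how many files exist."""
    tag_url = f"{REPO}/tags/publish-{date.today().isoformat()}"
    subprocess.run(
        ["svn", "copy", f"{REPO}/trunk", tag_url,
         "-m", "Publish daily snapshot"],
        check=True)

# Example: publish_tag()
```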

Sander Rijken
@Sander 10k is the upper bound. A user cannot check in just one file at a time due to inter-file dependencies.
hashable
Do you mean that by doing their work manually, they produce up to 10k files that need to be in one commit? That sounds pretty much impossible unless the files are generated, in which case it's generally better to store the source files in source control.
Sander Rijken
@Sander The manual work is not done at the file level. Small edits (to the information represented by all the files collectively) can result in several files being modified. Yes, for the upper-bound case of 10,000 file modifications, the changes are likely to be due to programmatic file modification. (There is both human and automatic editing of the files.)
hashable
+3  A: 

I recommend Mercurial, as it still leads git in the usability department (git's been getting better, but, eh).

bzr has made leaps forward in usability as well.

Paul Nathan