views: 153
answers: 2
I'd like to profile some VCS software, and to do so I want to generate a set of random files in randomly arranged directories. I'm writing the script in Python, but my question, briefly, is: how do I generate a random directory tree with a given average number of sub-directories per directory and some broad distribution of files per directory?

Clarification: I'm not comparing different VCS repo formats (eg. SVN vs Git vs Hg), but profiling software that deals with SVN (and eventually other) working copies and repos.

The constraints I'd like to specify are the total number of files (call it 'N', probably ~10k-100k) and the maximum depth of the directory structure ('L', probably 2-10). I don't care how many directories are generated at each level, but I don't want to end up with 1 file per dir, or 100k files all in one dir.

The distribution is something I'm not sure about, since I don't know whether VCSs (SVN in particular) would perform better or worse with a very uniform structure or a very skewed structure. Nonetheless, it would be nice if I could come up with an algorithm that didn't "even out" for large numbers.

My first thoughts were: generate the directory tree using some method, and then uniformly populate the tree with files (treating each dir equally, with no regard as to nesting). My back-of-the-envelope calcs tell me that if there are 'L' levels, with 'D' subdirs per dir, and about sqrt(N) files per dir, then there will be about D^L dirs, so N =~ sqrt(N) * D^L => D =~ N^(1/(2L)). So now I have an approximate value for 'D', how can I generate the tree? How do I populate the files?
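For concreteness, here's a rough sketch of that plan in Python (a full D-ary tree with files placed uniformly over all dirs; the function names, file names, and file contents are just placeholders):

    import os
    import random

    def build_tree(root, depth, branching):
        """Create a full tree with 'branching' subdirs per dir, 'depth'
        levels deep, and return a flat list of every directory made."""
        dirs = [root]
        if depth > 0:
            for i in range(branching):
                sub = os.path.join(root, "d%02d" % i)
                os.makedirs(sub)
                dirs.extend(build_tree(sub, depth - 1, branching))
        return dirs

    def generate(n_files, max_depth, root="testtree"):
        # D =~ N^(1/(2L)), from the estimate above
        branching = max(2, int(round(n_files ** (1.0 / (2 * max_depth)))))
        os.makedirs(root)
        dirs = build_tree(root, max_depth, branching)
        for i in range(n_files):
            target = random.choice(dirs)  # uniform, no regard to nesting
            with open(os.path.join(target, "f%06d" % i), "wb") as f:
                f.write(os.urandom(random.randint(0, 4096)))  # filler data

    generate(10000, 4)

With N = 10k and L = 4 this gives D = 3, so about 121 dirs and roughly 83 files per dir, close to the sqrt(N) target.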

I'd be grateful just for some pointers to good resources on algorithms I could use. My searching only found pretty applets/flash.

+4  A: 

Why not download some real open source repos and use those?

Have you thought about what goes into the files? Is that random data too?

gnibbler
I like this. For example, Keith Packard used the source code for Mozilla to test Subversion and Git: http://keithp.com/blogs/Repository_Formats_Matter/ He didn't just take the current checkout; he used an importer to get the whole history. Conclusion: "The Mozilla CVS repository was 2.7GB, imported to Subversion it grew to 8.2GB. Under Git, it shrunk to 450MB. Given that a Mozilla checkout is around 350MB, it’s fairly nice to have the whole project history (from 1998) in only slightly more space."
steveha
Files will contain header info plus random data (which will be "churned"). Two reasons for generating it (I freely admit this is no ironclad argument): 1. Practically speaking, I can't download more than a few hundred MB of data. Unless I find something within "free download" range of my ISP (possible, but unlikely), I'll hit my 2GB data limit pretty quickly, especially if I want to test on 10k-100k+ files. 2. Eventually I'll want to compare performance across VCS methods with the same data (which I guess is also possible with a real repo...). Still, certainly worth consideration.
detly
Find someone local with a copy of the source? I'm sure there are people in your area who develop on large projects... testing on fake data is not going to give you real results... you'll find anomalies that only show up when using real data.
Paul McMillan
Who is your ISP? Do they have Linux repos for free? You could download package sources. They won't have history though.
gnibbler
I can't see how I could find a ~100k file SVN repo hosted on some computer that happens to share my ISP without resorting to spamming everyone I know, and asking them to do the same... There may be "people in [my] area who develop on large projects", but I don't know their names. Look, I understand that I should test the software on real repos, and I do. But I know for a fact that there are certain problems that scale with the number of items in a repo, and it's these areas of the code I want to examine.
detly
gnibbler, yes. That's worth a shot. Download the entire Debian source repo and import it into SVN. I could still randomly modify and ignore certain files/globs.
detly
A: 

Your question is fairly long and involved, but I think it boils down to asking for a random number generator with certain statistical properties.

If you don't like Python's random number generator, you might look at some of the other statistical packages on PyPI, or if you want something a little more heavy-duty, perhaps the Python bindings for the GNU Scientific Library.

http://sourceforge.net/projects/pygsl/

http://www.gnu.org/software/gsl/
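That said, the distributions in the standard library's random module may already be enough to shape the tree; for example:

    import random

    # a few stdlib distributions that could drive the tree shape
    n_subdirs = int(random.expovariate(1.0 / 4))  # exponential, mean ~4
    n_files = int(random.lognormvariate(3, 1))    # skewed, long right tail
    weight = random.paretovariate(2)              # heavy-tailed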

Paul McMillan
It's more a question of what to do with the random numbers. Perhaps get a random number with, say, an exponential distribution and a mean of D, create that many directories, descend into each, and repeat until I reach L levels. Then populate with files.
detly
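For what it's worth, a rough sketch of that scheme (exponentially distributed branching with roughly mean D, capped at L levels; the names here are hypothetical, and files could then be spread over the result as in the earlier sketch):

    import os
    import random

    def make_dirs(root, depth, mean_subdirs):
        """Give each dir an exponentially distributed number of subdirs
        (roughly 'mean_subdirs' on average), recursing 'depth' levels."""
        dirs = [root]
        if depth > 0:
            n = int(random.expovariate(1.0 / mean_subdirs))
            for i in range(n):
                sub = os.path.join(root, "d%02d" % i)
                os.makedirs(sub)
                dirs.extend(make_dirs(sub, depth - 1, mean_subdirs))
        return dirs

    os.makedirs("tree")
    all_dirs = make_dirs("tree", 5, 3.0)
    # files can then be placed uniformly over all_dirs

Because some dirs draw zero subdirs, the tree stays skewed rather than evening out, which matches the constraint in the question.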