For purposes of testing compression, I need to be able to create large files, ideally in text, binary, and mixed formats.

  • The content of the files should be neither completely random nor uniform.
    A binary file with all zeros is no good. A binary file with totally random data is also not good. For text, a file with totally random sequences of ASCII is not good - the text files should have patterns and frequencies that simulate natural language, or source code (XML, C#, etc). Pseudo-real text.
  • The size of each individual file is not critical, but for the set of files, I need the total to be ~8gb.
  • I'd like to keep the number of files at a manageable level, let's say on the order of 10.

For creating binary files, I can allocate a large buffer and call System.Random.NextBytes followed by FileStream.Write in a loop, like this:

Int64 bytesRemaining = size;
byte[] buffer = new byte[sz];   // sz is the chunk size, e.g. 512k; _rnd is a System.Random field
using (Stream fileStream = new FileStream(Filename, FileMode.Create, FileAccess.Write))
{
    while (bytesRemaining > 0)
    {
        int sizeOfChunkToWrite = (bytesRemaining > buffer.Length) ? buffer.Length : (int)bytesRemaining;
        if (!zeroes) _rnd.NextBytes(buffer);    // refill with random bytes unless writing all zeroes
        fileStream.Write(buffer, 0, sizeOfChunkToWrite);
        bytesRemaining -= sizeOfChunkToWrite;
    }
    // no explicit Close() needed - the using block disposes the stream
}

With a large enough buffer, let's say 512k, this is relatively fast, even for files over 2 or 3gb. But the content is totally random, which is not what I want.

For text files, the approach I have taken is to use Lorem Ipsum, and repeatedly emit it via a StreamWriter into a text file. The content is non-random and non-uniform, but it does have many identical repeated blocks, which is unnatural. Also, because the Lorem Ipsum block is so small (<1k), it takes many loops and a very, very long time.

Neither of these is quite satisfactory for me.

I have seen the answers to Quickly create large file on a windows system?. Those approaches are very fast, but I think they just fill the file with zeroes, or random data, neither of which is what I want. I have no problem with running an external process like contig or fsutil, if necessary.

The tests run on Windows.
Rather than create new files, does it make more sense to just use files that already exist in the filesystem? I don't know of any that are sufficiently large.

What about starting with a single existing file (maybe c:\windows\Microsoft.NET\Framework\v2.0.50727\Config\enterprisesec.config.cch for a text file) and replicating its content many times? This would work with either a text or binary file.

Currently I have an approach that sort of works but it takes too long to run.

Has anyone else solved this?

Is there a much faster way to write a text file than via StreamWriter?

Suggestions?

EDIT: I like the idea of a Markov chain to produce a more natural text. Still need to confront the issue of speed, though.

+10  A: 

You could always code yourself a little web crawler...

UPDATE: Calm down guys, this would be a good answer if he hadn't said that he already had a solution that "takes too long".

A quick check here would appear to indicate that downloading 8GB of anything would take a relatively long time.

Benjol
That way you would probably get the most "natural" data.
0xA3
And you could download the images too.
Benjol
+1. This was my first thought, although I doubt this approach falls into the 'fast' category.
Kirschstein
This is a good approach for getting "natural" text, but if I have to download 8gb of web pages, it will be slower than the approach I have now. I like the idea though.
Cheeso
Yeah, 8Gb could take a while... You could multi-thread it, but even so.
Benjol
Yes, this isn't going to lend itself to speed very well. Not sure why it's been voted up so much! Really, the best approach here is some sort of machine learning one.
Noldorin
I doubt multi-threading would increase the bandwidth that much... ;)
E Dominique
I'm quite surprised about all the upvotes too, it was a joke really.
Benjol
That's a shame you meant it as a joke, because it's a great idea.
Matthew Whited
@Matthew: Benjol is right - this would be the best solution *if* time wasn't a problem.
Noldorin
+4  A: 

I think you might be looking for something like a Markov chain process to generate this data. It's stochastic (randomised), but also structured, in that it operates based on a finite state machine.

Indeed, Markov chains have been used for generating semi-realistic text in human languages. In general, they are not trivial things to analyse properly, but the fact that they exhibit certain properties should be good enough for you. (Again, see the Properties of Markov chains section of the page.) Hopefully you can see how to design one; to implement, it is actually quite a simple concept. Your best bet will probably be to create a framework for a generic Markov process and then analyse either natural language or source code (whichever you want your random data to emulate) in order to "train" your Markov process. In the end, this should give you very high quality data in terms of your requirements. Well worth the effort, if you need these enormous lengths of test data.
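For illustration, a minimal word-level version of such a generator might look something like this in C# (a sketch only; the seed file name, output file name, and word counts are placeholders, not anything from the question):

using System;
using System.Collections.Generic;
using System.IO;

// Minimal word-level Markov chain: record which words follow each word in the
// training text, then walk that table at random to emit pseudo-natural text.
class MarkovTextGenerator
{
    private readonly Dictionary<string, List<string>> _followers =
        new Dictionary<string, List<string>>();
    private readonly Random _rnd = new Random();
    private readonly string _firstWord;

    public MarkovTextGenerator(string trainingText)
    {
        string[] words = trainingText.Split(
            new char[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        _firstWord = words[0];
        for (int i = 0; i < words.Length - 1; i++)
        {
            List<string> list;
            if (!_followers.TryGetValue(words[i], out list))
                _followers[words[i]] = list = new List<string>();
            list.Add(words[i + 1]);
        }
    }

    // Emit approximately wordCount words to the writer.
    public void Generate(TextWriter output, int wordCount)
    {
        string current = _firstWord;
        for (int i = 0; i < wordCount; i++)
        {
            output.Write(current);
            output.Write(i % 15 == 14 ? "\n" : " ");   // occasional line breaks
            List<string> next;
            if (!_followers.TryGetValue(current, out next))
                current = _firstWord;                  // dead end: start over
            else
                current = next[_rnd.Next(next.Count)];
        }
    }
}

// Usage sketch - "seed.txt" and the word count are illustrative placeholders:
//   var gen = new MarkovTextGenerator(File.ReadAllText("seed.txt"));
//   using (var writer = new StreamWriter("generated.txt"))
//       gen.Generate(writer, 50 * 1000 * 1000);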

Noldorin
OK, I will research. It's interesting: 8gb of data *used to be* enormous, but these days with web traffic history stores, commodity multi-TB disk arrays, S3, and so on, 8gb is not really enormous any more.
Cheeso
Yeah, that's probably true. Still, in terms of computational and I/O time, it's significant even nowadays.
Noldorin
True. on Markov Chains - I don't think I want to write a new implementation. The impl I found, http://blog.figmentengine.com/2008/10/markov-chain-code.html, gave very good output, but was *very* slow.
Cheeso
@Cheeso: Looks like the overhead for that linked code is related to the use of IEnumerable<T> and Queues, which *are* going to be quite significant. If you write a stream-based approach, the performance ought to be very good.
Noldorin
Yes, that's exactly what I found. Much, much faster.
Cheeso
In the end, I wrote an implementation of a Markov Chain in C#. It works great.
Cheeso
+14  A: 

For text, you could use the Stack Overflow community dump; there is about 300MB of data there. It will only take about 6 minutes to load into a db with the app I wrote, and probably about the same time to dump all the posts to text files. That would easily give you anywhere between 200K and 1 million text files, depending on your approach (with the added bonus of having source and XML mixed in).

You could also use something like the wikipedia dump, it seems to ship in MySQL format which would make it super easy to work with.

If you are looking for a big file that you can split up, for binary purposes, you could either use a VM vmdk or a DVD ripped locally.

Edit

Mark mentions the Project Gutenberg download; this is also a really good source for text (and audio) which is available for download via BitTorrent.

Sam Saffron
I was going to mention looking into Project Gutenberg as well. Most of the plain text files are already compressed, so it would be a quick download. http://www.gutenberg.org/catalog/
Mark Rushakoff
@Mark, good point, I'll add a link, thanks!
Sam Saffron
There is already a compression benchmark which uses a portion of the Wikipedia dump: http://cs.fit.edu/~mmahoney/compression/textdata.html
CesarB
+1  A: 

For text files you might have some success taking an English word list and simply pulling words from it at random. This won't produce real English text, but I would guess it would produce a letter frequency similar to what you might find in English.

For a more structured approach you could use a Markov chain trained on some large free English text.
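A rough sketch of the word-list idea (the word-list file name, output name, and target size are all placeholders for illustration):

using System;
using System.IO;

// Fill a file with words drawn uniformly at random from a word list.
class RandomWordFile
{
    static void Main()
    {
        string[] words = File.ReadAllLines("english-words.txt");   // hypothetical word list
        const long targetBytes = 1L * 1024 * 1024 * 1024;          // ~1gb per file
        Random rnd = new Random();

        long written = 0;
        using (StreamWriter writer = new StreamWriter("random-words.txt"))
        {
            while (written < targetBytes)
            {
                string word = words[rnd.Next(words.Length)];
                writer.Write(word);
                writer.Write(' ');
                written += word.Length + 1;   // close enough for ASCII text
            }
        }
    }
}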

Jack Ryan
I took this approach, selecting single words at random from Lorem Ipsum, but it was excruciatingly slow to generate large text files this way. The Markov chain approach seems like it is tilted towards the strict "naturalness" of the text, which for me is less important than the speed of generation.
Cheeso
Markov chains are definitely the right way to go for this. They will produce both high quality output and do so *very* quickly.
Noldorin
+1  A: 

Why don't you just take Lorem Ipsum and build a long string in memory before your output? If you double the amount of text each time, reaching the target size only takes O(log n) append steps. You can even calculate the total length of the data beforehand, so you never have to copy contents to a new string/array.

Since your buffer is only 512k or whatever you set it to be, you only need to generate that much data before writing it, since that is the most you can push to the file at one time. You are going to be writing the same text over and over again, so just use the original 512k that you created the first time.
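Something along these lines, perhaps (the block size, target size, and file name are just illustrative):

using System;
using System.IO;
using System.Text;

// Build one ~512k block by repeatedly doubling the Lorem Ipsum text in memory,
// then write that block over and over until the file is big enough.
class LoremIpsumWriter
{
    static void Main()
    {
        const string lorem = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. ";
        const int blockSize = 512 * 1024;                     // 512k in-memory block
        const long targetBytes = 1L * 1024 * 1024 * 1024;     // ~1gb output file

        // Doubling means only O(log n) append operations to reach blockSize.
        StringBuilder sb = new StringBuilder(lorem, blockSize * 2);
        while (sb.Length < blockSize)
            sb.Append(sb.ToString());
        byte[] block = Encoding.ASCII.GetBytes(sb.ToString(0, blockSize));

        long written = 0;
        using (FileStream fs = new FileStream("lorem.txt", FileMode.Create, FileAccess.Write))
        {
            while (written < targetBytes)
            {
                fs.Write(block, 0, block.Length);
                written += block.Length;
            }
        }
    }
}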

Kevin
+3  A: 

I think the Windows directory will probably be a good enough source for your needs. If you're after text, I would recurse through each of the directories looking for .txt files and loop through them copying them to your output file as many times as needed to get the right size file.

You could then use a similar approach for binary files by looking for .exes or .dlls.
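A rough sketch of what that harvesting loop might look like (the output name and target size are placeholders; the recursion is manual so protected directories can simply be skipped):

using System;
using System.IO;

// Recurse through the Windows directory, appending every readable .txt file to the
// output over and over until the output reaches the target size.
class WindowsTextHarvester
{
    const long TargetBytes = 1L * 1024 * 1024 * 1024;   // ~1gb per output file (illustrative)

    static void Main()
    {
        string windir = Environment.GetEnvironmentVariable("windir") ?? @"C:\Windows";
        long written = 0;
        using (FileStream output = new FileStream("harvested.txt", FileMode.Create, FileAccess.Write))
        {
            // One pass may not be enough, so keep re-copying the same files.
            while (written < TargetBytes)
            {
                long before = written;
                Harvest(windir, output, ref written);
                if (written == before) break;   // nothing readable found; give up
            }
        }
    }

    static void Harvest(string dir, Stream output, ref long written)
    {
        string[] files, subdirs;
        try
        {
            files = Directory.GetFiles(dir, "*.txt");
            subdirs = Directory.GetDirectories(dir);
        }
        catch (UnauthorizedAccessException) { return; }   // skip protected directories
        catch (IOException) { return; }

        foreach (string path in files)
        {
            if (written >= TargetBytes) return;
            try
            {
                byte[] data = File.ReadAllBytes(path);
                output.Write(data, 0, data.Length);
                written += data.Length;
            }
            catch (IOException) { }   // skip files that are locked or unreadable
        }
        foreach (string sub in subdirs)
        {
            if (written >= TargetBytes) return;
            Harvest(sub, output, ref written);
        }
    }
}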

Kirschstein
+1  A: 

Wikipedia is excellent for compression testing of mixed text and binary. If you need benchmark comparisons, the Hutter Prize site can provide a high water mark for the first 100MB of Wikipedia. The current record is a ratio of about 6.26, compressing the 100MB down to roughly 16MB.

R Ubben
A: 

Thanks for all the quick input. I decided to consider the problems of speed and "naturalness" separately. For the generation of natural-ish text, I have combined a couple of ideas.

  • To generate text, I start with a few text files from the Project Gutenberg catalog, as suggested by Mark Rushakoff.
  • I randomly select and download one document from that subset.
  • I then apply a Markov Process, as suggested by Noldorin, using that downloaded text as input.
  • I wrote a new Markov Chain in C#, using Pike's economical Perl implementation as an example. It generates text one word at a time.
  • For efficiency, rather than use the pure Markov Chain to generate 1gb of text one word at a time, the code generates a random text of ~1mb and then repeatedly takes random segments of that and globs them together.

UPDATE: As for the second problem, speed: I took the approach of eliminating as much IO as possible, since this is all being done on my poor laptop with a 5400rpm mini-spindle. That led me to redefine the problem entirely - rather than generating a FILE with random content, what I really want is the random content itself. Using a Stream wrapped around a Markov Chain, I can generate text in memory and stream it to the compressor, eliminating 8gb of writes and 8gb of reads. For this particular test I don't need to verify the compression/decompression round trip, so I don't need to retain the original content. The streaming approach worked well to speed things up massively, cutting about 80% of the time required.
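For anyone curious, the streaming idea looks roughly like this - not the exact code, just a sketch of a read-only Stream that serves random segments of a ~1mb seed text and gets fed straight into a compressor (GZipStream stands in for the compressor under test; the seed file name and sizes are illustrative):

using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Read-only stream that serves random slices of a pre-generated seed text, so the
// "file" content never has to touch the disk.
class RandomSegmentStream : Stream
{
    private readonly byte[] _seed;     // e.g. ~1mb of Markov-generated text
    private readonly long _length;     // total bytes this virtual file pretends to hold
    private readonly Random _rnd = new Random();
    private long _position;

    public RandomSegmentStream(string seedText, long length)
    {
        _seed = Encoding.ASCII.GetBytes(seedText);
        _length = length;
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        if (_position >= _length) return 0;                            // virtual EOF
        if (_position + count > _length) count = (int)(_length - _position);

        int start = _rnd.Next(_seed.Length);                           // random segment start
        int copied = 0;
        while (copied < count)
        {
            int n = Math.Min(count - copied, _seed.Length - start);
            Array.Copy(_seed, start, buffer, offset + copied, n);
            copied += n;
            start = 0;                                                 // wrap to the beginning
        }
        _position += copied;
        return copied;
    }

    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return false; } }
    public override long Length { get { return _length; } }
    public override long Position
    {
        get { return _position; }
        set { throw new NotSupportedException(); }
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
}

class StreamingCompressionTest
{
    static void Main()
    {
        string seedText = File.ReadAllText("markov-seed.txt");    // hypothetical ~1mb seed file
        using (Stream source = new RandomSegmentStream(seedText, 8L * 1024 * 1024 * 1024))
        using (Stream gz = new GZipStream(File.Create("test.gz"), CompressionMode.Compress))
        {
            byte[] buf = new byte[512 * 1024];
            int n;
            while ((n = source.Read(buf, 0, buf.Length)) > 0)
                gz.Write(buf, 0, n);
        }
    }
}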

I haven't yet figured out how to do the binary generation, but it will likely be something analogous.

Thank you all, again, for all the helpful ideas.

Cheeso