For purposes of testing compression, I need to be able to create large files, ideally in text, binary, and mixed formats.

  • The content of the files should be neither completely random nor uniform.
    A binary file with all zeros is no good. A binary file with totally random data is also not good. For text, a file with totally random sequences of ASCII is not good - the text files should have patterns and frequencies that simulate natural language, or source code (XML, C#, etc). Pseudo-real text.
  • The size of each individual file is not critical, but for the set of files, I need the total to be ~8gb.
  • I'd like to keep the number of files at a manageable level, let's say on the order of 10.

For creating binary files, I can allocate a large buffer and call System.Random.NextBytes followed by FileStream.Write in a loop, like this:

Int64 bytesRemaining = size;
byte[] buffer = new byte[sz];   // sz is the chunk size, e.g. 512k; _rnd is a System.Random field
using (Stream fileStream = new FileStream(Filename, FileMode.Create, FileAccess.Write))
{
    while (bytesRemaining > 0)
    {
        int sizeOfChunkToWrite = (bytesRemaining > buffer.Length) ? buffer.Length : (int)bytesRemaining;
        if (!zeroes) _rnd.NextBytes(buffer);    // refill with random bytes unless writing all zeroes
        fileStream.Write(buffer, 0, sizeOfChunkToWrite);
        bytesRemaining -= sizeOfChunkToWrite;
    }
    // no explicit Close() needed - the using block disposes the stream
}

With a large enough buffer, let's say 512k, this is relatively fast, even for files over 2 or 3gb. But the content is totally random, which is not what I want.

For text files, the approach I have taken is to use Lorem Ipsum, and repeatedly emit it via a StreamWriter into a text file. The content is non-random and non-uniform, but it does have many identical repeated blocks, which is unnatural. Also, because the Lorem Ipsum block is so small (<1k), it takes many loops and a very, very long time.

Neither of these is quite satisfactory for me.

I have seen the answers to Quickly create large file on a windows system?. Those approaches are very fast, but I think they just fill the file with zeroes, or random data, neither of which is what I want. I have no problem with running an external process like contig or fsutil, if necessary.

The tests run on Windows.
Rather than create new files, does it make more sense to just use files that already exist in the filesystem? I don't know of any that are sufficiently large.

What about starting with a single existing file (maybe c:\windows\Microsoft.NET\Framework\v2.0.50727\Config\enterprisesec.config.cch for a text file) and replicating its content many times? This would work with either a text or binary file.

Currently I have an approach that sort of works but it takes too long to run.

Has anyone else solved this?

Is there a much faster way to write a text file than via StreamWriter?

Suggestions?

EDIT: I like the idea of a Markov chain to produce a more natural text. Still need to confront the issue of speed, though.

+10  A: 

You could always code yourself a little web crawler...

UPDATE: Calm down guys, this would be a good answer if he hadn't said that he already had a solution that "takes too long".

A quick check here would appear to indicate that downloading 8GB of anything would take a relatively long time.

Benjol
That way you would probably get the most "natural" data.
0xA3
And you could download the images too.
Benjol
+1. This was my first thought, although I doubt this approach falls into the 'fast' category.
Kirschstein
This is a good approach for getting "natural" text, but if I have to download 8gb of web pages, it will be slower than the approach I have now. I like the idea though.
Cheeso
Yeah, 8Gb could take a while... You could multi-thread it, but even so.
Benjol
Yes, this isn't going to lend itself to speed very well. Not sure why it's been voted up so much! Really, the best approach here is some sort of machine learning one.
Noldorin
I doubt multi-threading would increase the bandwidth that much... ;)
E Dominique
I'm quite surprised about all the upvotes too, it was a joke really.
Benjol
That's a shame you meant it as a joke, because it's a great idea.
Matthew Whited
@Matthew: Benjol is right - this would be the best solution *if* time wasn't a problem.
Noldorin
+4  A: 

I think you might be looking for something like a Markov chain process to generate this data. It's stochastic (randomised), but also structured, in that it operates based on a finite state machine.

Indeed, Markov chains have been used for generating semi-realistic text in human languages. In general, they are not trivial things to analyse properly, but the fact that they exhibit certain properties should be good enough for you. (Again, see the Properties of Markov chains section of the page.) Hopefully you can see how to design one; to implement, it is actually quite a simple concept. Your best bet will probably be to create a framework for a generic Markov process and then analyse either natural language or source code (whichever you want your random data to emulate) in order to "train" your Markov process. In the end, this should give you very high quality data in terms of your requirements. Well worth the effort, if you need these enormous lengths of test data.
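For illustration, a minimal word-level version of such a generator might look something like this in C# (a sketch only; the seed file name, output file name, and word counts are placeholders, not anything from the question):

using System;
using System.Collections.Generic;
using System.IO;

// Minimal word-level Markov chain: record which words follow each word in the
// training text, then walk that table at random to emit pseudo-natural text.
class MarkovTextGenerator
{
    private readonly Dictionary<string, List<string>> _followers =
        new Dictionary<string, List<string>>();
    private readonly Random _rnd = new Random();
    private readonly string _firstWord;

    public MarkovTextGenerator(string trainingText)
    {
        string[] words = trainingText.Split(
            new char[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        _firstWord = words[0];
        for (int i = 0; i < words.Length - 1; i++)
        {
            List<string> list;
            if (!_followers.TryGetValue(words[i], out list))
                _followers[words[i]] = list = new List<string>();
            list.Add(words[i + 1]);
        }
    }

    // Emit approximately wordCount words to the writer.
    public void Generate(TextWriter output, int wordCount)
    {
        string current = _firstWord;
        for (int i = 0; i < wordCount; i++)
        {
            output.Write(current);
            output.Write(i % 15 == 14 ? "\n" : " ");   // occasional line breaks
            List<string> next;
            if (!_followers.TryGetValue(current, out next))
                current = _firstWord;                  // dead end: start over
            else
                current = next[_rnd.Next(next.Count)];
        }
    }
}

// Usage sketch - "seed.txt" and the word count are illustrative placeholders:
//   var gen = new MarkovTextGenerator(File.ReadAllText("seed.txt"));
//   using (var writer = new StreamWriter("generated.txt"))
//       gen.Generate(writer, 50 * 1000 * 1000);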

Noldorin
OK, I will research. It's interesting: 8gb of data *used to be* enormous, but these days with web traffic history stores, commodity multi-TB disk arrays, S3, and so on, 8gb is not really enormous any more.
Cheeso
Yeah, that's probably true. Still, in terms of computational and I/O time, it's significant even nowadays.
Noldorin
True. on Markov Chains - I don't think I want to write a new implementation. The impl I found, http://blog.figmentengine.com/2008/10/markov-chain-code.html, gave very good output, but was *very* slow.
Cheeso
@Cheeso: Looks like the overhead for that linked code is related to the use of IEnumerable<T> and Queues, which *are* going to be quite significant. If you write a stream-based approach, the performance ought to be very good.
Noldorin
Yes, that's exactly what I found. Much, much faster.
Cheeso
In the end, I wrote an implementation of a Markov Chain in C#. It works great.
Cheeso
+14  A: 

For text, you could use the Stack Overflow community dump; there is about 300MB of data there. It will only take about 6 minutes to load into a db with the app I wrote, and probably about the same time to dump all the posts to text files. That would easily give you anywhere between 200K and 1 million text files, depending on your approach (with the added bonus of having source and XML mixed in).

You could also use something like the wikipedia dump, it seems to ship in MySQL format which would make it super easy to work with.

If you are looking for a big file that you can split up, for binary purposes, you could either use a VM vmdk or a DVD ripped locally.

Edit

Mark mentions the Project Gutenberg download; this is also a really good source for text (and audio) which is available for download via BitTorrent.

Sam Saffron
I was going to mention looking into Project Gutenberg as well. Most of the plain text files are already compressed, so it would be a quick download. http://www.gutenberg.org/catalog/
Mark Rushakoff
@Mark, good point, I'll add a link, thanks!
Sam Saffron
There is already a compression benchmark which uses a portion of the Wikipedia dump: http://cs.fit.edu/~mmahoney/compression/textdata.html
CesarB
+1  A: 

For text files you might have some success taking an English word list and simply pulling words from it at random. This won't produce real English text, but I would guess it would produce a letter frequency similar to what you might find in English.

For a more structured approach you could use a Markov chain trained on some large free English text.
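A rough sketch of the word-list idea (the word-list file name, output name, and target size are all placeholders for illustration):

using System;
using System.IO;

// Fill a file with words drawn uniformly at random from a word list.
class RandomWordFile
{
    static void Main()
    {
        string[] words = File.ReadAllLines("english-words.txt");   // hypothetical word list
        const long targetBytes = 1L * 1024 * 1024 * 1024;          // ~1gb per file
        Random rnd = new Random();

        long written = 0;
        using (StreamWriter writer = new StreamWriter("random-words.txt"))
        {
            while (written < targetBytes)
            {
                string word = words[rnd.Next(words.Length)];
                writer.Write(word);
                writer.Write(' ');
                written += word.Length + 1;   // close enough for ASCII text
            }
        }
    }
}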

Jack Ryan
I took this approach, selecting single words at random from Lorem Ipsum, but it was excruciatingly slow to generate large text files this way. The Markov chain approach seems like it is tilted towards the strict "naturalness" of the text, which for me is less important than the speed of generation.
Cheeso
Markov chains are definitely the right way to go for this. They will produce both high quality output and do so *very* quickly.
Noldorin
+1  A: 

Why don't you just take Lorem Ipsum and build a long string in memory before your output? If you double the amount of text each time, reaching the target size only takes O(log n) append steps. You can even calculate the total length of the data beforehand, so you never have to copy contents to a new string/array.

Since your buffer is only 512k or whatever you set it to be, you only need to generate that much data before writing it, since that is the most you can push to the file at one time. You are going to be writing the same text over and over again, so just use the original 512k that you created the first time.
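Something along these lines, perhaps (the block size, target size, and file name are just illustrative):

using System;
using System.IO;
using System.Text;

// Build one ~512k block by repeatedly doubling the Lorem Ipsum text in memory,
// then write that block over and over until the file is big enough.
class LoremIpsumWriter
{
    static void Main()
    {
        const string lorem = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. ";
        const int blockSize = 512 * 1024;                     // 512k in-memory block
        const long targetBytes = 1L * 1024 * 1024 * 1024;     // ~1gb output file

        // Doubling means only O(log n) append operations to reach blockSize.
        StringBuilder sb = new StringBuilder(lorem, blockSize * 2);
        while (sb.Length < blockSize)
            sb.Append(sb.ToString());
        byte[] block = Encoding.ASCII.GetBytes(sb.ToString(0, blockSize));

        long written = 0;
        using (FileStream fs = new FileStream("lorem.txt", FileMode.Create, FileAccess.Write))
        {
            while (written < targetBytes)
            {
                fs.Write(block, 0, block.Length);
                written += block.Length;
            }
        }
    }
}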

Kevin
+3  A: 

I think the Windows directory will probably be a good enough source for your needs. If you're after text, I would recurse through each of the directories looking for .txt files and loop through them copying them to your output file as many times as needed to get the right size file.

You could then use a similar approach for binary files by looking for .exes or .dlls.
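A rough sketch of what that harvesting loop might look like (the output name and target size are placeholders; the recursion is manual so protected directories can simply be skipped):

using System;
using System.IO;

// Recurse through the Windows directory, appending every readable .txt file to the
// output over and over until the output reaches the target size.
class WindowsTextHarvester
{
    const long TargetBytes = 1L * 1024 * 1024 * 1024;   // ~1gb per output file (illustrative)

    static void Main()
    {
        string windir = Environment.GetEnvironmentVariable("windir") ?? @"C:\Windows";
        long written = 0;
        using (FileStream output = new FileStream("harvested.txt", FileMode.Create, FileAccess.Write))
        {
            // One pass may not be enough, so keep re-copying the same files.
            while (written < TargetBytes)
            {
                long before = written;
                Harvest(windir, output, ref written);
                if (written == before) break;   // nothing readable found; give up
            }
        }
    }

    static void Harvest(string dir, Stream output, ref long written)
    {
        string[] files, subdirs;
        try
        {
            files = Directory.GetFiles(dir, "*.txt");
            subdirs = Directory.GetDirectories(dir);
        }
        catch (UnauthorizedAccessException) { return; }   // skip protected directories
        catch (IOException) { return; }

        foreach (string path in files)
        {
            if (written >= TargetBytes) return;
            try
            {
                byte[] data = File.ReadAllBytes(path);
                output.Write(data, 0, data.Length);
                written += data.Length;
            }
            catch (IOException) { }   // skip files that are locked or unreadable
        }
        foreach (string sub in subdirs)
        {
            if (written >= TargetBytes) return;
            Harvest(sub, output, ref written);
        }
    }
}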

Kirschstein
+1  A: 

Wikipedia is excellent for compression testing of mixed text and binary. If you need benchmark comparisons, the Hutter Prize site can provide a high water mark for the first 100MB of Wikipedia. The current record is a ratio of about 6.26, compressing the 100MB down to roughly 16MB.

R Ubben
A: 

Thanks for all the quick input. I decided to consider the problems of speed and "naturalness" separately. For the generation of natural-ish text, I have combined a couple of ideas.

  • To generate text, I start with a few text files from the Project Gutenberg catalog, as suggested by Mark Rushakoff.
  • I randomly select and download one document from that subset.
  • I then apply a Markov Process, as suggested by Noldorin, using that downloaded text as input.
  • I wrote a new Markov Chain in C#, using Pike's economical Perl implementation as an example. It generates text one word at a time.
  • For efficiency, rather than use the pure Markov Chain to generate 1gb of text one word at a time, the code generates a random text of ~1mb and then repeatedly takes random segments of that and globs them together.

UPDATE: As for the second problem, speed: I took the approach of eliminating as much IO as possible, since this is all being done on my poor laptop with a 5400rpm mini-spindle. That led me to redefine the problem entirely - rather than generating a FILE with random content, what I really want is the random content itself. Using a Stream wrapped around a Markov Chain, I can generate text in memory and stream it to the compressor, eliminating 8gb of writes and 8gb of reads. For this particular test I don't need to verify the compression/decompression round trip, so I don't need to retain the original content. The streaming approach worked well to speed things up massively, cutting about 80% of the time required.
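For anyone curious, the streaming idea looks roughly like this - not the exact code, just a sketch of a read-only Stream that serves random segments of a ~1mb seed text and gets fed straight into a compressor (GZipStream stands in for the compressor under test; the seed file name and sizes are illustrative):

using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Read-only stream that serves random slices of a pre-generated seed text, so the
// "file" content never has to touch the disk.
class RandomSegmentStream : Stream
{
    private readonly byte[] _seed;     // e.g. ~1mb of Markov-generated text
    private readonly long _length;     // total bytes this virtual file pretends to hold
    private readonly Random _rnd = new Random();
    private long _position;

    public RandomSegmentStream(string seedText, long length)
    {
        _seed = Encoding.ASCII.GetBytes(seedText);
        _length = length;
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        if (_position >= _length) return 0;                            // virtual EOF
        if (_position + count > _length) count = (int)(_length - _position);

        int start = _rnd.Next(_seed.Length);                           // random segment start
        int copied = 0;
        while (copied < count)
        {
            int n = Math.Min(count - copied, _seed.Length - start);
            Array.Copy(_seed, start, buffer, offset + copied, n);
            copied += n;
            start = 0;                                                 // wrap to the beginning
        }
        _position += copied;
        return copied;
    }

    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return false; } }
    public override long Length { get { return _length; } }
    public override long Position
    {
        get { return _position; }
        set { throw new NotSupportedException(); }
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
}

class StreamingCompressionTest
{
    static void Main()
    {
        string seedText = File.ReadAllText("markov-seed.txt");    // hypothetical ~1mb seed file
        using (Stream source = new RandomSegmentStream(seedText, 8L * 1024 * 1024 * 1024))
        using (Stream gz = new GZipStream(File.Create("test.gz"), CompressionMode.Compress))
        {
            byte[] buf = new byte[512 * 1024];
            int n;
            while ((n = source.Read(buf, 0, buf.Length)) > 0)
                gz.Write(buf, 0, n);
        }
    }
}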

I haven't yet figured out how to do the binary generation, but it will likely be something analogous.

Thank you all, again, for all the helpful ideas.

Cheeso