I'm about to release a FOSS data generator that can generate random yet meaningful data in CSV format. Rather belatedly, I guess, I need to poll the state of the art for such products - because if there is a well-known and useful existing tool, I can write my work off to experience. I am aware of a couple of SQL Server-specific tools, but mine is not database-specific.

So, links? And if you have used such a product, what features did you find it was missing?

Edit: To add a bit more info on my tool (Ooh, Matron!) it is intended to allow generation of any kind of random data from existing data files, and supports weighting. It is XML based (sorry, folks) and lets you say things like:

<pick distribute="20,80" >
  <datafile  file="femalenames.dat"/>
  <datafile  file="malenames.dat"/>
</pick>

to select female names about 20% of the time and male names 80% of the time.
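For readers unfamiliar with weighting, the idea behind that snippet can be sketched roughly in Python. This is only an illustration of weighted selection, not how the tool itself is implemented; the in-memory name lists stand in for femalenames.dat and malenames.dat, and the function names `pick` and `generate_csv` are made up for the example.

```python
import csv
import io
import random

# Hypothetical stand-ins for the contents of femalenames.dat and malenames.dat.
FEMALE_NAMES = ["Alice", "Beth", "Carol"]
MALE_NAMES = ["Dave", "Eric", "Frank"]


def pick(sources, weights, rng):
    """Choose one source according to the weights, then one value from it."""
    source = rng.choices(sources, weights=weights, k=1)[0]
    return rng.choice(source)


def generate_csv(rows, seed=42):
    """Return CSV text with a header and `rows` weighted random names."""
    rng = random.Random(seed)  # seeded so a run can be repeated
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["name"])
    for _ in range(rows):
        # Weights 20 and 80 mirror distribute="20,80" in the XML above.
        writer.writerow([pick([FEMALE_NAMES, MALE_NAMES], [20, 80], rng)])
    return out.getvalue()


if __name__ == "__main__":
    print(generate_csv(10), end="")
```

Over a large number of rows, roughly one name in five should come from the female list; any single short run can deviate from that noticeably.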

But the purpose of this question is not to describe my product but to get info on other tools.

Latest: If anyone is interested, they can get the alpha of my data generator at http://code.google.com/p/csvtest

+1  A: 

That can be a one-liner in R where I use the littler scripting front-end:

# generate the data as a one-liner from the command-line
# we set the RNG seed, and draw from a bunch of distributions
# indented just to fit the box here
edd@ron:~$ r -e'set.seed(42); write.csv(data.frame(y=runif(10), x1=rnorm(10),    
                x2=rt(10,4), x3=rpois(10, 0.4)), file="/tmp/neil.csv", 
                quote=FALSE, row.names=FALSE)'
edd@ron:~$ cat /tmp/neil.csv
y,x1,x2,x3
0.914806043496355,-0.106124516091484,0.830735621223563,0
0.937075413297862,1.51152199743894,1.6707628713402,0
0.286139534786344,-0.0946590384130976,-0.282485683052060,0
0.830447626067325,2.01842371387704,0.714442314565005,0
0.641745518893003,-0.062714099052421,-1.08008578470128,0
0.519095949130133,1.30486965422349,2.28674786332467,0
0.736588314641267,2.28664539270111,-0.73270267483628,1
0.134666597237810,-1.38886070111234,-1.45317770550920,1
0.656992290401831,-0.278788766817371,-1.01676025893376,1
0.70506478403695,-0.133321336393658,0.404860813371462,0
edd@ron:~$

You have not said anything about your data-generating process, but rest assured that R can probably cope with just about any requirement, including multivariate normal, t, skew-t, and more. The (six different) random-number generators in R are also of very high quality.

R can also write to DBs, or read parameters from them, and if it needs to be on Windoze then the Rscript front-end could be used instead of littler.

Dirk Eddelbuettel
I am aware of R - I've actually answered a couple of questions here on it. The aim of my product is to be much simpler than writing an R program.
anon
So why not take R as a given -- and let your request be reduced to one line of code? But if you don't want that, can you make it clearer why you need to re-invent / re-program subsets of what R already does for you?
Dirk Eddelbuettel
As I said - to make things simpler. I think it is fair to say that even R users don't find it too easy to use. And see my edit to my question.
anon
+2  A: 

I asked a similar question some months ago:

Tools for Generating Mock Data?

I got some sincere suggestions, but most were not suitable for my needs. Either expensive (non-free) software, or else not flexible enough w.r.t. data types and database structure, or range of mock data, or way too slow (e.g. the Rails ActiveRecord solution).

Features I was looking for were:

  • Generate mock data to fill existing database tables
  • Quick to generate > 1 million rows
  • Produce either SQL script format or flat file suitable for importing
  • Scriptable command-line interface, not a GUI
  • Not dependent on Microsoft Windows environment

Nice-to-have features:

  • Extensible/configurable
  • Open-source, free license
  • Written in a dynamic language like Perl/PHP/Python
  • Point it at a database and let it "discover" the metadata
  • Integrated with testing tools (e.g. DbUnit)
  • Option to fill directly into the database as it generates data

The answer I accepted was Databene Benerator. Since asking the question, though, I admit I haven't used it very much.

I was surprised that even when asking the community, the range of tools for generating mock data was so thin. This seems like a niche waiting to be filled! I'll be interested to see what you release.

Bill Karwin
Thanks - very useful
anon
Also, my tool will meet all of your prime requirements, with the possible exception of "quick to generate" (because I don't know what you mean by "quick"). But it won't (currently) do any of your "nice to haves" except for the FOSS requirement.
anon