views:

55

answers:

4

Is there a tool (for PostgreSQL, ideally) which can make a small, but consistent, sample of a big database?

The thing is, we need a testing database, but we don't want to fully copy the production one. First, because it is too big, and second, because the nature of testing implies that the testing database will be recreated several times in the process.

Obviously, you cannot simply take random rows from some tables, because this would violate the hell out of foreign keys and whatnot. So I wonder: is there a tool available that can do this?

A: 

You can use pg_dump --schema-only to dump only the schema of the database, then load that dump into a new database with pg_restore (for a custom-format dump) or psql (for a plain SQL dump). From there you have a few options:

  1. Create your data by hand; this will allow you to cover edge cases but will take a while if you want to test on a lot of data.

  2. Script a few queries to import random sections of each table of your database. As you said, this will violate foreign key constraints, but when it does, just ignore the failure. Keep track of the number of successes and keep going until you have as many data items as you want. Depending on your schema this may not work well, however: if you have very restrictive constraints, it might take too long to hit on data that succeeds. (A rough sketch of the whole approach follows this list.)
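
A rough sketch of the whole flow, assuming a bash shell; the database names proddb and testdb, the public schema, and the 1000-row sample size are placeholders made up for illustration, not something from the question:

    # Dump and restore the schema only (custom format so pg_restore can read it).
    pg_dump --schema-only --format=custom --file=schema.dump proddb
    createdb testdb
    pg_restore --dbname=testdb schema.dump

    # Option 2: copy a random slice of every table, ignoring chunks that
    # violate constraints.  COPY is all-or-nothing per statement, so a
    # failing table is simply skipped on this pass; keep retrying until
    # enough rows make it in.
    for tbl in $(psql -At -d proddb -c \
        "SELECT tablename FROM pg_tables WHERE schemaname = 'public'"); do
        psql -d proddb -c \
            "COPY (SELECT * FROM ${tbl} ORDER BY random() LIMIT 1000) TO STDOUT" \
            | psql -d testdb -c "COPY ${tbl} FROM STDIN" \
            && echo "sampled ${tbl}" \
            || echo "skipped ${tbl} (constraint failure)"
    done

Note that ORDER BY random() scans each whole table; on very large tables, recent PostgreSQL versions offer the TABLESAMPLE clause as a cheaper, if approximate, alternative.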

kerkeslager
A: 

I once built such a tool for the IDMS system.

I was in the process of making it work for SQL systems too when the managers of the company we had been merged into told me I could not continue wasting my time on such futile and unnecessary pieces of software.

Until this day, I have still neither seen nor heard of any commercially available thing that matches what I achieved way back then.

Erwin Smout
A: 

Back in my Oracle days we would have a test database with a very small, auto-generated set of data. At the time it was about a fifth of the production database's size. We would then copy the statistics from the production database into our test database to make it think it had billions of rows in its tables when in reality it only had 500,000. This allowed us to get the same explain plans in test that we would get in production. It has its value, but it doesn't answer your whole question, and I'm not sure how easy or even feasible it is to mess with PostgreSQL's stats.
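
For the table-size half of the trick, one hedged sketch (I haven't verified this against any particular PostgreSQL version, and the database name, table name, and figures are placeholders): the planner takes its row-count estimates from reltuples and relpages in pg_class, and a superuser can update those columns directly. This is unsupported, and the next ANALYZE or autovacuum run puts the real numbers back; faking the per-column statistics in pg_statistic is a lot hairier.

    # Run as a superuser: make the planner believe the "orders" table holds
    # about a billion rows spread over ten million pages.
    psql -d testdb -c \
        "UPDATE pg_class
            SET reltuples = 1000000000, relpages = 10000000
          WHERE relname = 'orders' AND relkind = 'r'"

    # See what plan the faked statistics produce (EXPLAIN does not run the query).
    psql -d testdb -c "EXPLAIN SELECT * FROM orders WHERE customer_id = 42"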

StarShip3000
+1  A: 

What about generating some mock data with a tool like Databene Benerator, just as much as you want, and storing it for reuse?

Pascal Thivent
Looks very promising. I think I'll try this approach, thanks!
maksymko
@maksymko You're welcome. I'm a very satisfied user of Benerator; I'm pretty sure you'll like it.
Pascal Thivent