For generating sample data, I use simple Python applications.
Considerations:
1. Simple to modify and configure.
2. A repeatable set of data, so you can rerun performance tests and get consistent results.
3. Follows all of the DB referential integrity rules and constraints.
4. Realistic data.
The first two indicate that you want to produce script files that will load your data. The third is tougher. There are ways to discover the database metadata and constraints. Looking at 3 and 4 together, you don't want simple reverse engineering -- you want something you can control to produce realistic values.
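If you do want to reverse-engineer the constraints rather than hand-build them, the standard information_schema views expose them. A minimal sketch, assuming a PostgreSQL database and the psycopg2 driver (the connection string, driver, and database are my assumptions, not part of the approach above):

import psycopg2

# Hypothetical connection string; substitute your own database.
conn = psycopg2.connect('dbname=mydb')
cur = conn.cursor()
# List every foreign-key column and the table it references.
cur.execute("""
    SELECT kcu.table_name, kcu.column_name, ccu.table_name AS referenced_table
    FROM information_schema.table_constraints tc
    JOIN information_schema.key_column_usage kcu
      ON tc.constraint_name = kcu.constraint_name
    JOIN information_schema.constraint_column_usage ccu
      ON tc.constraint_name = ccu.constraint_name
    WHERE tc.constraint_type = 'FOREIGN KEY'
""")
for table, column, referenced in cur.fetchall():
    print(table, column, '->', referenced)
conn.close()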
Generally, you want to build an entity model of your own so that you can be sure you have ranges and key relationships correct.
You can do this in three ways.
1. Generate CSV files of data which you can load manually. Nice repeatable test data; the worked example below takes this route.
2. Generate SQL scripts which you can run. Nice repeatable data, also; see the first sketch after this list.
3. Use an ODBC connection to generate data directly into the database. I actually don't like this as much, but you might; see the second sketch after this list.
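For option 2, the same row-generation logic can write INSERT statements instead of CSV rows. A minimal sketch; the table and column names (some_table, attr1, attr2) are placeholders:

import random

# Write one INSERT statement per generated row.
with open('load.sql', 'w') as script:
    for _ in range(10000):
        attr1 = random.randrange(1, 10)
        attr2 = random.randrange(100, 1000)
        script.write(
            "INSERT INTO some_table (attr1, attr2) "
            "VALUES ({}, {});\n".format(attr1, attr2)
        )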
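For option 3, any DB-API driver will do. A minimal sketch using the stdlib sqlite3 module as a stand-in for a real ODBC connection (with pyodbc the pattern is the same); table and column names are again placeholders:

import random
import sqlite3

conn = sqlite3.connect('sample.db')
conn.execute('CREATE TABLE IF NOT EXISTS some_table (attr1 INTEGER, attr2 INTEGER)')
# Batch the rows through executemany rather than one INSERT at a time.
rows = [(random.randrange(1, 10), random.randrange(100, 1000))
        for _ in range(10000)]
conn.executemany('INSERT INTO some_table (attr1, attr2) VALUES (?, ?)', rows)
conn.commit()
conn.close()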
Here's a stripped-down one-table-only version of a data generator that writes a CSV file.
import csv
import random

class SomeEntity(list):
    # Each instance is one row of sample data.
    titles = ('attr1', 'attr2')  # ... one title for each column

    def __init__(self):
        super().__init__()
        self.append(random.randrange(1, 10))      # attr1
        self.append(random.randrange(100, 1000))  # attr2
        # ... one append for each column

myData = [SomeEntity() for i in range(10000)]
# In Python 3, the csv module wants a text-mode file opened with newline=''.
with open('tmp.csv', 'w', newline='') as aFile:
    dest = csv.writer(aFile)
    dest.writerow(SomeEntity.titles)
    dest.writerows(myData)
For multiple entities, you have to work out the cardinality. Instead of generating random keys, make a random selection from the already-generated entities. So you might have ChildEntity pick a random element of ParentEntity, which guarantees the FK-PK relationship is correct. Use random.choice(someList) to pick a parent, and random.shuffle(someList) when each parent should be used exactly once, to assure referential integrity. A sketch follows.
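Here is a minimal sketch of the parent/child idea, extending the SomeEntity pattern above; the entity and column names are placeholders, and the parent's primary key is assumed to be its first column:

import random

class ParentEntity(list):
    titles = ('parent_id', 'name')
    def __init__(self, pk):
        super().__init__()
        self.append(pk)                    # primary key, first column
        self.append('name_{}'.format(pk))

class ChildEntity(list):
    titles = ('child_id', 'parent_id', 'value')
    def __init__(self, pk, parents):
        super().__init__()
        self.append(pk)
        parent = random.choice(parents)    # pick an existing parent row
        self.append(parent[0])             # so the FK always matches a real PK
        self.append(random.randrange(100, 1000))

parents = [ParentEntity(i) for i in range(100)]
children = [ChildEntity(i, parents) for i in range(10000)]

Both lists can then be written out with the same csv.writer calls as the single-table example.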