There are a number of problems you will face. Probably the biggest is that the different DBMS all have different preferred load formats - so any single data format will require some gymnastics for one or more of the DBMS - unless you generate INSERT statements directly.
Informix prefers a format that can be loosely characterized as 'delimited fields with backslash used as an escape and an (unescaped) newline indicating end of record'. The default delimiter is the pipe symbol '|', but that can be changed to suit. For example:
100|Some string|2008-09-12|23.145|wc -l \| sort -n|76:34:45.219
The date formats are reasonably flexible, fortunately. If you want the gruesome details, download the source code for SQLCMD (not the Microsoft usurper - the original one) from the IIUG source code archive and read the file unload.format. Loading CSV data is not a trivial exercise in Informix - though I do have a Perl script called csv2unl to largely automate the conversion from CSV to Informix UNLOAD format (that should be available at the IIUG web site too).
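If you want to roll your own, the core of the job is just escaping the delimiter, backslashes and embedded newlines. Here is a minimal Perl sketch (not the real csv2unl) that assumes the default '|' delimiter and CSV on standard input:

    #!/usr/bin/perl
    # Minimal sketch (not the real csv2unl): read CSV on stdin, write
    # Informix UNLOAD-style records on stdout, assuming the '|' delimiter.
    use strict;
    use warnings;
    use Text::CSV;

    my $csv = Text::CSV->new({ binary => 1 }) or die "Text::CSV->new failed";

    while (my $row = $csv->getline(\*STDIN))
    {
        my @fields = map {
            my $f = defined($_) ? $_ : '';
            $f =~ s/\\/\\\\/g;    # escape backslashes first
            $f =~ s/\|/\\|/g;     # then the delimiter
            $f =~ s/\n/\\\n/g;    # embedded newlines need escaping too
            $f;
        } @$row;
        print join('|', @fields), "\n";
    }

Run it as 'csv-sketch.pl < data.csv > data.unl' (names hypothetical); it deliberately glosses over details such as NULL handling and date reformatting, which the real script has to deal with.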
As @St3fan suggested, any of the major scripting languages could be used to generate the data. I'd use Perl but that's mainly because I learned Perl a long time ago and am therefore most comfortable with it.
Another issue to consider is whether you are generating data for a single table (or a set of unrelated tables) or for a set of related tables. For example, it is relatively easy to generate data for a single table; it is much harder to generate data for two tables that share a common field (for example, an Orders table and an OrderItems table).
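To make the related-tables case concrete, here is a hedged Perl sketch that writes two UNLOAD files for a hypothetical Orders / OrderItems pair, keeping the order number consistent between them. The column layouts and row counts are made up - adjust to your schema:

    #!/usr/bin/perl
    # Sketch: generate UNLOAD data for two related (hypothetical) tables,
    # sharing the order number as the common field.
    use strict;
    use warnings;

    my $num_orders = 1_000_000;          # scale to taste
    open my $orders, '>', 'orders.unl'      or die "orders.unl: $!";
    open my $items,  '>', 'order_items.unl' or die "order_items.unl: $!";

    for my $order_no (1 .. $num_orders)
    {
        my $cust_no    = 1 + int(rand(50_000));
        my $order_date = sprintf("2008-%02d-%02d", 1 + int(rand(12)), 1 + int(rand(28)));
        print {$orders} "$order_no|$cust_no|$order_date\n";

        # Each order gets 1-10 items; the shared key is $order_no.
        for my $line_no (1 .. 1 + int(rand(10)))
        {
            my $part_no  = 1 + int(rand(100_000));
            my $quantity = 1 + int(rand(20));
            my $price    = sprintf("%.2f", rand(500));
            print {$items} "$order_no|$line_no|$part_no|$quantity|$price\n";
        }
    }

    close $orders;
    close $items;

The same idea scales to deeper hierarchies (customers, orders, items, shipments) - generate the parent keys first and reuse them when writing the child rows.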
Even so, 2 TB is a moderately daunting task. Even if each row is 1 KB, you will need to generate about 2 billion rows of data. You will need to load the data in chunks - not all in a single transaction. You'll probably want to create the indexes after the load - but that places the onus on you to ensure that the data in the tables is valid (no inappropriate duplicates). If you are using a SERIAL column (Informix-speak for an autogenerated value), you will probably need to use BIGSERIAL (or perhaps SERIAL8 - it depends on the version of Informix you are using, but it should be IDS 11.50, in which case BIGSERIAL is the better choice).
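To make the chunking concrete, here is a rough Perl/DBI sketch that commits every 10,000 rows instead of running one enormous transaction. The DSN, table and columns are placeholders, and the plain split on '|' ignores escaped delimiters; for real volumes you would more likely use the native LOAD statement, dbload or the High-Performance Loader rather than row-by-row INSERTs:

    #!/usr/bin/perl
    # Sketch: chunked load via DBI/DBD::Informix - commit every 10,000 rows
    # so no single transaction grows unmanageably large.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:Informix:testdb', '', '',
                           { RaiseError => 1, AutoCommit => 0 });

    my $sth = $dbh->prepare(
        'INSERT INTO orders (order_no, cust_no, order_date) VALUES (?, ?, ?)');

    open my $fh, '<', 'orders.unl' or die "orders.unl: $!";
    my $count = 0;

    while (my $line = <$fh>)
    {
        chomp $line;
        my @fields = split /\|/, $line;   # naive: assumes no escaped delimiters
        $sth->execute(@fields);
        $dbh->commit if ++$count % 10_000 == 0;
    }

    $dbh->commit;
    $dbh->disconnect;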
@dotIN asks about timing...how long to load?
Let's review some fundamentals...what is a decent write rate for writing to disk? 100 MB/s sustained? Let's use it as a starting point.
At 100 MB/s of writing data, it is going to take:
2,000,000 MB / 100 MB/s = 20,000 s
which is approximately 6 hours.
I think that's a very high rate; further, you have to get the data to the DBMS (so you have to have statements executing at a rate corresponding to 100 MB/s), plus you have to worry about database logging of the activity, and so on and so forth. If the load can be spread effectively over multiple disks, you may be able to get close to that. However, it is very easy to be I/O bound, especially if your machine doesn't have multiple separately addressable disk drives (a single multi-terabyte RAID drive, for example, with a single I/O channel).
If each row is loaded via an individual INSERT statement, then you have to execute a vast number of statements per second - at 1 KB per row, sustaining 100 MB/s means something on the order of 100,000 INSERTs per second. That's another performance inhibitor. You've not stated exactly how you are doing the loading, but when handling vast quantities, you have to be extremely careful, and it takes skill and experience to get the utmost performance out of any one of the DBMS - let alone all of them. And note that configurations that speed up the loading of terabytes of data do not necessarily lead to good performance when you are no longer loading data but extracting information from it.
And there is mention of blobs; those have special constraints and require careful handling on each system, and generally complicate the story. (For example, in IDS, you would need a separate Smart BlobSpace for the smart blobs - BLOB or CLOB types - from the DBSpace where you store the data. If you are using old-fashioned BYTE or TEXT blobs, you probably want to use an appropriate BlobSpace - as distinct from a Smart BlobSpace - with a page size configured to suit the data you are storing. You probably don't want to be storing the BYTE or TEXT blobs IN TABLE - it works, but it hammers the logging system, which is why BlobSpaces are available as an option in the first place.)