I am new to Informix, and as part of my testing activity I need to create 2TB+ of data for Oracle, Informix & Sybase. Is there a database-neutral way of doing this?

I am looking for any freeware or open source tools as well; I can find a few for Oracle but hardly any for Informix & Sybase.

+1  A: 

I've done this sort of thing many times with a simple Python, Perl or Ruby script that either generates SQL statements or a CSV-style file that a database-specific tool can import.

Two terabytes is a lot though. You might want to do it in batches.
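
For illustration, here is a minimal Python sketch of that generate-a-file approach; the column names, value ranges and file names are invented and would need to match your real schema:

    import csv
    import random
    import string

    ROWS_PER_FILE = 1_000_000   # keep each file to a size your bulk loader handles comfortably

    def random_word(n=12):
        return ''.join(random.choices(string.ascii_lowercase, k=n))

    def write_batch(path, first_id, rows):
        # Write one batch of CSV rows that a database-specific loader can import.
        with open(path, 'w', newline='') as f:
            writer = csv.writer(f)
            for i in range(first_id, first_id + rows):
                writer.writerow([i, random_word(), random.randint(1, 99_999),
                                 round(random.uniform(0, 1000), 2)])

    for batch in range(100):
        write_batch('customers_%03d.csv' % batch, batch * ROWS_PER_FILE, ROWS_PER_FILE)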

St3fan
I tried doing it with Perl, but it's taking about 5-6 days to populate about 1TB of data. Then I used BLOBs for Oracle, but that is also taking 4-5 days for 2TB. Maybe I am not doing it the optimal way. I would like to do this in a day or two - is that possible?
dotIN
A: 

There are a number of problems you will face. Probably the biggest is that the different DBMS all have different preferred load formats - so any single data format will require some gymnastics for one or more of the DBMS - unless you generate INSERT statements directly.

Informix prefers a format that can be loosely characterized as 'delimited fields with backslash used as an escape and (unescaped) newline indicating end of record'. The default delimiter is the pipe symbol '|', but that can be changed to suit. For example:

100|Some string|2008-09-12|23.145|wc -l \| sort -n|76:34:45.219

The date formats are reasonably flexible, fortunately. If you want the gruesome details, download the source code for SQLCMD (not the Microsoft usurper - the original one) from the IIUG source code archive and read the file unload.format. Loading CSV data is not a trivial exercise in Informix - though I do have a Perl script called csv2unl to largely automate the conversion from CSV to Informix UNLOAD format (should be available at the IIUG web site too).
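
By way of illustration (this is not csv2unl, just a rough Python sketch of the escaping rule described above; the field values are taken from the example line):

    def unl_escape(value, delimiter='|'):
        # Backslash-escape the escape character itself, the delimiter and newlines.
        s = str(value)
        for ch in ('\\', delimiter, '\n'):
            s = s.replace(ch, '\\' + ch)
        return s

    def unl_record(fields, delimiter='|'):
        # One record per line, fields separated by the (unescaped) delimiter.
        return delimiter.join(unl_escape(f, delimiter) for f in fields) + '\n'

    with open('sample.unl', 'w') as out:
        out.write(unl_record([100, 'Some string', '2008-09-12', 23.145,
                              'wc -l | sort -n', '76:34:45.219']))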

As @St3fan suggested, any of the major scripting languages could be used to generate the data. I'd use Perl but that's mainly because I learned Perl a long time ago and am therefore most comfortable with it.

Another issue to consider is whether you are generating data for a single table (or a set of unrelated tables) or for a set of related tables. For example, it is relatively easy to generate data for a single table; it is much harder to generate data for two tables that share a common field (for example, an Orders table and an OrderItems table).

Even so, 2 TB is a moderately daunting task. Even if each row is 1 KB, you will need to generate about 2 billion rows of data. You will need to load the data in chunks - not all in a single transaction. You'll probably want to create the indexes after the load - but that places the onus on you to ensure that the data in the tables is valid (no inappropriate duplicates). If you are using a SERIAL column (Informix-speak for an autogenerated value), you will probably need to use BIGSERIAL (or perhaps SERIAL8 - it depends on the version of Informix you are using, but it should be IDS 11.50, in which case BIGSERIAL is the better choice).
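
As a rough sketch of what loading in chunks might look like, here is a hedged Python example using pyodbc (any DB-API driver would do); the DSN, table and column names are placeholders for your own setup, not a specific Informix API:

    import pyodbc

    conn = pyodbc.connect('DSN=test_db')   # hypothetical ODBC data source
    cur = conn.cursor()

    BATCH = 10_000             # rows per executemany() call
    ROWS_PER_COMMIT = 100_000  # keep each transaction (and the logical logs) bounded
    TOTAL_ROWS = 1_000_000     # per run; repeat runs to build up the full volume

    def make_row(i):
        # Replace with real data generation matching your schema.
        return (i, 'name_%d' % i, '2008-09-12')

    loaded = 0
    while loaded < TOTAL_ROWS:
        rows = [make_row(i) for i in range(loaded, loaded + BATCH)]
        cur.executemany('INSERT INTO big_table (id, name, created) VALUES (?, ?, ?)', rows)
        loaded += BATCH
        if loaded % ROWS_PER_COMMIT == 0:
            conn.commit()      # commit in chunks, never one multi-terabyte transaction
    conn.commit()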


@dotIN asks about timing...how long to load?

Let's review some fundamentals...what is a decent write rate for writing to disk? 100 MB/s sustained? Let's use it as a starting point.

At 100 MB/s of writing data, it is going to take:

2,000,000 MB / 100 MB/s = 20,000 s

which is approximately 6 hours.

I think that's a very high rate; further, you have to get the data to the DBMS (so you have to have statements executing at a rate corresponding to 100 MB/s), plus you have to worry about database logging of the activity, and so on and so forth. If the load can be spread effectively over multiple disks, you may be able to get close to that. However, it is very easy to be I/O bound, especially if your machine doesn't have multiple separately addressable disk drives (a single multi-terabyte RAID drive, for example, with a single I/O channel).

If each row is loaded via an individual INSERT statement, then you have to execute a vast number of statements per second. That's another performance inhibitor. You've not stated exactly how you are doing the loading, but when handling vast quantities, you have to be extremely careful, and it takes skill and experience to get the utmost performance out of any one of the DBMS - let alone all of them. And note that configurations that speed the load performance of terabytes of data do not necessarily lead to good performance when you are no longer seeking to load but to extract information.

And there is mention of blobs; those have special constraints and require careful handling on each system, and generally complicate the story. (For example, in IDS, you would need a separate Smart BlobSpace for the smart blobs - BLOB or CLOB types - from the DBSpace where you store the data. If you are using old-fashioned BYTE or TEXT blobs, you probably want to use an appropriate BlobSpace - as distinct from a Smart BlobSpace - with a page size configured to suit the data you are storing. You probably don't want to be storing the BYTE or TEXT blobs IN TABLE - it works, but it hammers the logging system, which is why BlobSpaces are available as an option in the first place.)

Jonathan Leffler
+1  A: 
  1. You must decide how repetitive the data can be.
  2. Load your distinct data into whatever tables are needed.
  3. Grow that data exponentially with

    INSERT INTO my_table SELECT * FROM my_table;

If you need unique primary key fields, replace those with sequences in relevant inserts for Oracle and whatever the equivalent is for other DBs.

If your hardware can't handle the load of doubling 100G+ of data, do it in smaller batches. Use WHERE rownum < 100000... for Oracle and whatever the equivalent is for other DBs.
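
A rough sketch of the batched variant, Oracle-flavored (using cx_Oracle; the connection details, table, column and sequence names are placeholders, and the table is assumed to have been seeded with distinct rows already):

    import cx_Oracle

    # Placeholder credentials/DSN; my_table, my_seq and the columns are examples.
    conn = cx_Oracle.connect('scott', 'tiger', 'localhost/XE')
    cur = conn.cursor()

    TARGET_ROWS = 2_000_000_000   # roughly 2 TB at ~1 KB per row
    BATCH_ROWS = 1_000_000

    cur.execute('SELECT COUNT(*) FROM my_table')
    current = cur.fetchone()[0]   # must be > 0: seed the table first (step 2 above)

    while current < TARGET_ROWS:
        # Copy up to BATCH_ROWS existing rows back into the table, drawing the
        # primary key from a sequence so it stays unique.
        cur.execute("""
            INSERT INTO my_table (id, col1, col2)
            SELECT my_seq.NEXTVAL, col1, col2
              FROM my_table
             WHERE ROWNUM <= :batch
        """, {'batch': BATCH_ROWS})
        current += cur.rowcount
        conn.commit()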

jva
+1  A: 

Working across multiple databases in this way is a non-trivial task. Frankly, 2TB is right at the very top end of what is achievable with these products (unless you are using Sybase IQ - which you didn't mention). If you are doing data warehousing or reporting from this data, then you might want to reconsider your product choices.

{It would be easier to give you advice if you explained why you want to load 2TB of test data. Also, why these databases? "Tricks" that work for loading in Oracle will be different for Sybase. Anyway, here's my generic advice…}

First, review your DDL and completely remove any and all constraints and auto incrementing values. The DB spends a lot of CPU and IO cycles checking these values when you're doing any kind of insert, so get rid of them. It will be faster to reapply them later anyway, if necessary.

Second, generate a 1 column table for each column that you want to have in your final table. For instance, if this is an address table, you might have:

First_Name, Last_Name, Address_1, etc.

Populate each of these tables with a small sample of the values you expect in the real data, say 10 rows per table.

Now for the magic: you cross join all of these 1-column tables together in a Cartesian product. This gives you one row for every possible combination of the 1-column tables and thus "inflates" them to the size you require.

Example Query: (syntax may vary per db)

SELECT *
  FROM      First_Name 
 CROSS JOIN Last_Name 
 CROSS JOIN Address_1 
       …
 CROSS JOIN Post_Code

You can calculate how much data will be generated by multiplying the row counts.

    10 tables w/ 10 rows 
  = 10^10 
  = 10,000,000,000 
  = 10 billion rows

Then multiply your row count by the average row size to get a total data volume, excluding DB overhead.

    (128 byte rows * 10 billion rows) / 1024^4 (Terabyte)
   = 1.164 Terabytes of sample data.
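
The same inflation can also be done outside the database when all you need is a flat file to bulk-load; here is a minimal Python sketch using itertools.product (the value lists are invented and far too small on their own - add more columns to multiply the row count up to your target):

    import itertools

    # Ten values per column; every extra 10-value column multiplies the row count by 10.
    first_names = ['Aaron', 'Beth', 'Carl', 'Dina', 'Eve',
                   'Fred', 'Gina', 'Hal', 'Iris', 'Jack']
    last_names  = ['Adams', 'Brown', 'Clark', 'Davis', 'Evans',
                   'Ford', 'Gray', 'Hill', 'Irwin', 'Jones']
    post_codes  = ['10001', '20002', '30003', '40004', '50005',
                   '60006', '70007', '80008', '90009', '90010']

    with open('addresses.unl', 'w') as out:
        for row in itertools.product(first_names, last_names, post_codes):
            out.write('|'.join(row) + '\n')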

The Easy Way

Download a trial copy of Benchmark Factory from Quest. This will allow you to push several different benchmark data sets into your database and run them. Not cheap though if you want to keep using it.

Joe Harris
For Oracle, there is a free 'Swingbench' data generator (http://www.dominicgiles.com/datagenerator.html)
Gary
A: 

This isn't a problem for any programming language that allows database access.

Geo
That is a gross over-simplification and isn't very helpful. Handling 2 TB of data is a non-trivial exercise anywhere.
Jonathan Leffler
What's the problem with keeping things simple?
Geo
The fastest way to load/generate terabytes of data will be with a database specific product. Anything that tries to pipe that much data through a general (eg JDBC) connection will be slow.
Gary
A: 

I have a 2TB+ (nrows=10M, rowsize=2048) ASCII test file (with pipe delimiters) which has unique full names, addresses, phone numbers and a variety of other data types, like DATE, SMALLINT, DECIMAL(9,2), etc. for testing/benchmarking purposes.

Problem is, how can I get it to you?

Frank Computer