I have a database which I regularly need to import large amounts of data into via some Python scripts. Compacted, the data for a single month's imports takes about 280 MB, but during the import the file size swells to over a GB.

Given the 2 GB size limit on MDB files, this is a bit of a concern. Apart from breaking the inserts into chunks and compacting in between each, are there any techniques for avoiding the increase in file size?

Note that no temporary tables are being created/deleted during the process: just inserts into existing tables.

And to forestall the inevitable comments: yes, I am required to store this data in Access 2003. No, I can't upgrade to Access 2007.

If it could help, I could preprocess in SQLite.

Edit:

Just to add some further information (some already listed in my comments):

  • The data is being generated in Python on a table-by-table basis, and then all of the records for that table are batch-inserted via ODBC (see the sketch after this list)
  • All processing is happening in Python: all the MDB file is doing is storing the data
  • All of the records being inserted are valid (none are being excluded due to unique key violations, etc.)
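For illustration, a minimal sketch of the kind of batch insert described above, using pyodbc against the Jet ODBC driver (the table, column names, and file path are placeholders, not the real schema):

    import pyodbc

    # Connect to the existing Access 2003 MDB through the Jet ODBC driver.
    conn = pyodbc.connect(
        r"DRIVER={Microsoft Access Driver (*.mdb)};DBQ=C:\data\imports.mdb"
    )
    cursor = conn.cursor()

    # rows is generated in Python, one tuple per record for this table.
    rows = [(1, "2009-10-01", 42.0), (2, "2009-10-01", 17.5)]

    # One parameterised INSERT executed for the whole batch.
    cursor.executemany(
        "INSERT INTO ImportTable (RecordID, ImportDate, Amount) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    conn.close()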

Given the above, I'll be looking into how to disable row-level locking via ODBC and considering presorting the data and/or removing and then reinstating indexes. Thanks for the suggestions.

Any further suggestions still welcome.

+3  A: 

A common trick, if feasible with regard to the schema and semantics of the application, is to have several MDB files with Linked tables.

Also, the way the insertions take place matters with regard to how the file size balloons... For example: batched vs. one/few records at a time, sorted (relative to particular indexes) vs. not, and the number of indexes in place (as you mentioned, readily dropping some during the insert phase)...

Tentatively, a pre-processing approach with, say, storing of new rows to a separate linked table, heap fashion (no indexes), then sorting/indexing this data in a minimal fashion, and "bulk loading" it to its real destination. Similar pre-processing in SQLite (as hinted in the question) would serve the same purpose. Keeping it "all MDB" is maybe easier (fewer languages/processes to learn, fewer inter-op issues [hopefully ;-)]...).
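A rough sketch of that staging idea in Python, using the standard sqlite3 module as the unindexed staging heap and pyodbc for the final bulk load (all table, column, and file names here are made up for illustration):

    import sqlite3
    import pyodbc

    generated_rows = [(1, "2009-10-01", 42.0), (2, "2009-10-01", 17.5)]

    # Stage rows in SQLite with no indexes (heap fashion).
    stage = sqlite3.connect("staging.db")
    stage.execute("CREATE TABLE IF NOT EXISTS stage_rows (id INTEGER, dt TEXT, amount REAL)")
    stage.executemany("INSERT INTO stage_rows VALUES (?, ?, ?)", generated_rows)
    stage.commit()

    # Read the staged data back, sorted to match the destination table's key order.
    sorted_rows = stage.execute(
        "SELECT id, dt, amount FROM stage_rows ORDER BY id"
    ).fetchall()

    # Bulk load into the real destination table in the MDB.
    mdb = pyodbc.connect(r"DRIVER={Microsoft Access Driver (*.mdb)};DBQ=C:\data\imports.mdb")
    mdb.cursor().executemany(
        "INSERT INTO ImportTable (RecordID, ImportDate, Amount) VALUES (?, ?, ?)",
        sorted_rows,
    )
    mdb.commit()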

EDIT: on why inserting records in a sorted/bulk fashion may slow down the MDB file's growth (question from Tony Toews)
One of the reasons for MDB files' propensity to grow more quickly than the rate at which text/data is added to them (and their counterpart ability to be easily compacted back down) is that as information is added, some of the nodes that constitute the indexes have to be re-arranged (for overflowing / rebalancing etc.). Such management of the nodes seems to be implemented in a fashion which favors speed over disk-space economy, and this approach typically serves simple applications / small data rather well. I do not know the specific logic in use for such management, but I suspect that in several cases node operations cause a particular node (or much of it) to be copied anew, with the old location simply being marked as free/unused but not deleted/compacted/reused. I do have "clinical" (if only a bit outdated) evidence that by performing inserts in bulk we essentially limit the number of opportunities for such duplication to occur and hence we slow the growth.

EDIT again: After reading and discussing things with Tony Toews and Albert Kallal, it appears that a possibly more significant source of bloat, in particular in Jet Engine 4.0, is the way locking is implemented. It is therefore important to open the database in single-user (exclusive) mode to avoid this. (Read Tony's and Albert's responses for more details.)

mjv
It is possible, but awkward as copies of the database are made regularly and moved to different file locations. I will go this route if required, but avoiding it is one of the major reasons for the question.
mavnn
Interesting point on pre-sorting in SQLite.
mavnn
Could you be more specific as to how batched vs. one/few records makes a difference to Access databases? As well as presorting.
Tony Toews
@Tony Toews. See edit. The explanation is an educated guess, certainly backed by hands-on experience. I'm sure we could find authoritative info online about the inner workings of the Jet Engine which would shed further light on the topic. Maybe part of the reason why there's relatively little documentation about this, and also why the engine is not improved in this area, is that it only affects a small number of applications. If you're "small data", no problem; if you're "big data", you use "real" DBMSes, leaving only medium-sized apps like this one to suffer ;-)
mjv
I see where you are coming from; however, I disagree with it, and I like Albert's explanation a lot better.
Tony Toews
@Tony Toews Agreed on Albert's excellent explanation about exclusive access/locking. Bloat happens in exclusive access mode too, at least it did in earlier Jet Engine versions. (What goes without saying also goes when we say it, and we/I should have stressed to the OP the value of accessing the database in exclusive / no-locks mode.) On your disagreement, is it about the technical merit of the explanation, or about the opinion that MDB is not a "real" DBMS and receives relatively little engineering effort (compared to, say, MSSQL)?
mjv
<smile> My disagreement was with the technical merits of your argument and not with your comment with respect to "real" DBMSes. Your attitude there was reasonably sane and expressed in a way similar to what I would've said, unlike others on this online forum. That said, as I stated in my reply, I saw a lot more bloat in Jet 4.0/Access 2000 systems than I did in Jet 3.51/Access 97. It was the same app which I was working with for about five years on and off.
Tony Toews
Tony, gotcha. I amended my prose accordingly. You and Albert seem to have more recent exposure to Jet/MDB; I still think that bulk processing could also provide a worthy improvement (but then again I saw the OP's added description of how individual rows are fed with ODBC!... so who cares...). In any case, if Albert or you feel that I'm off base, I'll be pleased to delete my post for clarity's sake. Cheers.
mjv
A: 

I find I am able to link from Access to SQLite and to run a make-table query to import the data. I used this ODBC driver: http://www.ch-werner.de/sqliteodbc/ and created a User DSN.
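If the SQLite tables are linked into the MDB, the make-table query itself is plain Jet SQL; a hedged sketch of running it from Python rather than the Access query designer (LinkedSqliteData and ImportedData are hypothetical table names, and SELECT ... INTO creates a new local table):

    import pyodbc

    conn = pyodbc.connect(r"DRIVER={Microsoft Access Driver (*.mdb)};DBQ=C:\data\imports.mdb")
    # Jet's make-table syntax: copy the linked SQLite data into a new local table.
    conn.cursor().execute("SELECT * INTO ImportedData FROM LinkedSqliteData")
    conn.commit()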

Remou
Unfortunately I don't have admin rights to install the driver. +1 for the help though.
mavnn
Why the down votes? This post relates to a comment by the OP: "if, for example, there is a way of bulk transferring from sqlite it might be more efficient. No processing is being done in Access."
Remou
+2  A: 

One thing to watch out for is records which are present in the append queries but aren't inserted into the table due to duplicate key values, required fields left null, etc. Access still allocates the space taken by the records which aren't inserted.

About the only significant thing I'm aware of is to ensure you have exclusive access to the database file, which might be impossible if doing this during the day. I noticed a change in behavior from Jet 3.51 (used in Access 97) to Jet 4.0 (used in Access 2000), when Access MDBs started getting a lot larger when doing record appends. I think that if the MDB is being used by multiple folks, then only one record is inserted per 4 KB page rather than as many as can be stuffed into a page, likely because this made index insert/update operations faster.

Now, compacting does indeed put as many records into the same 4 KB page as possible, but that isn't of help to you.
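If the imports are driven from Python over ODBC, one way to try for that exclusive access is the Access ODBC driver's Exclusive connection-string flag; a minimal sketch, assuming the driver name and file path shown (the open fails if anyone else has the file open):

    import pyodbc

    # Exclusive=1 asks Jet to open the MDB in single-user (exclusive) mode.
    conn = pyodbc.connect(
        r"DRIVER={Microsoft Access Driver (*.mdb)};"
        r"DBQ=C:\data\imports.mdb;Exclusive=1"
    )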

Tony Toews
+2  A: 

Are you sure row locking is turned off? In my case, turning off row locking reduced bloat by over 100 MB when working on a 5 MB file (in other words, with row locking off the file barely grew, to about 6 MB). With row locking on, the same operation results in a file well over 100 MB in size.

Row locking is a HUGE source of bloat during recordset operations since it pads each record to a page size.

Do you have ms-access installed here, or are you just using JET? (JET is the data engine that ms-access uses; you can use JET without Access.)

Open the database in ms-access and go:

Tools -> Options. On the Advanced tab, un-check the box: [ ] Open databases using record level locking.

This will not only make a HUGE difference in the file growth (bloat), it will also speed things up by a factor of 10 times.

There is also a registry setting that you can use here.

And are you using ODBC, or an OLEDB connection?

You can try:

    Set rs = New ADODB.Recordset
    With rs
        .ActiveConnection = RsCnn
        .Properties("Jet OLEDB:Locking Granularity") = 1
    End With
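For a Python process, the same recordset property can be reached through ADO via pywin32; a rough sketch of the equivalent of the snippet above (the connection string and table name are placeholders):

    import win32com.client

    conn = win32com.client.Dispatch("ADODB.Connection")
    conn.Open(r"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\data\imports.mdb")

    rs = win32com.client.Dispatch("ADODB.Recordset")
    rs.ActiveConnection = conn
    # Per the answer above, 1 selects page-level granularity instead of row-level.
    rs.Properties("Jet OLEDB:Locking Granularity").Value = 1
    # The recordset would then be opened on the destination table and rows
    # appended with AddNew/Update, as in the VBA snippet.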

Try the setting from Access (change the setting), exit, re-enter, and then compact and repair. Then run your test import… the bloat issue should go away.

There is likely no need to open the database using row locking. If you turn off that feature, then you should be able to reduce the bloat in file size down to a minimum.

For further reading and an example see here: http://stackoverflow.com/questions/1118219/does-acedao-support-row-level-locking/1121472#1121472

Albert D. Kallal
Ahh, now that could make a lot of difference too. And that was new in Jet 4.0, so that could correlate with my experience. Now, I thought that the first user accessing the BE MDB specified the record-level locking and all subsequent users then used that user's options. Or something like that.
Tony Toews
Thank you for this: it made a huge impact on both speed and file size during the imports. Compacting afterwards made almost no difference at all.
mavnn
+1  A: 

Is your script executing a single INSERT statement per row of data? If so, pre-processing the data into a text file of many rows that could then be inserted with a single INSERT statement might improve the efficiency and cut down on the accumulating temporary crud that's causing it to bloat.
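One hedged way to do that from Python is Jet's Text ISAM, which lets a whole delimited file be appended with a single statement (the folder, file name, and table name below are placeholders, and the exact connect-string syntax may need adjusting for your locale/format):

    import pyodbc

    conn = pyodbc.connect(r"DRIVER={Microsoft Access Driver (*.mdb)};DBQ=C:\data\imports.mdb")
    # Append the whole CSV (C:\temp\data.csv, referred to as data#csv) in one INSERT.
    conn.cursor().execute(
        "INSERT INTO ImportTable "
        "SELECT * FROM [Text;FMT=Delimited;HDR=Yes;DATABASE=C:\\temp].[data#csv]"
    )
    conn.commit()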

You might also make sure the INSERT is being executed without transactions. Whether or not that happens implicitly depends on the Jet version and the data interface library you're using to accomplish the task. By explicitly making sure it's off, you could improve the situation.
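With pyodbc, for example, whether statements run inside an explicit transaction is controlled by the connection's autocommit flag; a minimal illustration (the file path is a placeholder):

    import pyodbc

    # autocommit=True: each INSERT is executed and committed on its own,
    # with no explicit transaction wrapped around the batch.
    conn = pyodbc.connect(
        r"DRIVER={Microsoft Access Driver (*.mdb)};DBQ=C:\data\imports.mdb",
        autocommit=True,
    )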

Another possibility is to drop the indexes before the insert, compact, run the insert, compact, re-instate the indexes, and run a final compact.
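A hedged outline of that drop/compact/insert/compact/re-index sequence in Python (index, column, and table names are placeholders; the compact step assumes the JRO engine is available via pywin32, and since Jet can only compact into a new file, the copy is swapped back over the original):

    import os
    import pyodbc
    import win32com.client

    MDB = r"C:\data\imports.mdb"
    CONN = r"DRIVER={Microsoft Access Driver (*.mdb)};DBQ=" + MDB

    def compact(path):
        # Compact into a temporary file, then replace the original.
        tmp = path + ".compact"
        jro = win32com.client.Dispatch("JRO.JetEngine")
        jro.CompactDatabase(
            "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + path,
            "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + tmp,
        )
        os.remove(path)
        os.rename(tmp, path)

    # Drop the index, compact, insert, compact, re-create the index, compact.
    conn = pyodbc.connect(CONN)
    conn.cursor().execute("DROP INDEX idxImportDate ON ImportTable")
    conn.commit()
    conn.close()

    compact(MDB)
    # ... run the batch inserts here ...
    compact(MDB)

    conn = pyodbc.connect(CONN)
    conn.cursor().execute("CREATE INDEX idxImportDate ON ImportTable (ImportDate)")
    conn.commit()
    conn.close()

    compact(MDB)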

David-W-Fenton
No, records are batch inserted per table. (See edits to original question).
mavnn
When you say "batch inserted" do you mean a single INSERT statement, or multiple INSERT statements processed in a batch?
David-W-Fenton