views:

71

answers:

3

I have a large SQL server table that looks something like this:

ImageId int
Page int
FSPath varchar(256)
ImageFrame int
...

The table stores an entry for each page of a number of image files. This is done to enable the table to represent images where each page is represented by a different file, and multi-page image files that contain the pages within the same file. When I am dealing with a multi-page setup, the value of the FSPath column is exactly duplicated for each page within the same document which is eating up a lot of space (this table alone is currently ~5GB). It seems very wasteful to be duplicating the data in this way, but I haven't been able to find an alternate solution that I'm satisfied with.

The usage pattern for this table is dominated by lookups based on the primary key (ImageId/Page) for the path (and other columns) but I also need to be able to efficiently handle insertion of new data and occasional deletion.

If I create a lookup table for the path elements and insert an path element id into the pages table I would need to index it both by the Id and by the path, which would hurt the scenario where there is a distinct path piece for each page, and complicate the insertion of new data where the path may or may not exist in the lookup table. Furthermore, deletion of any row in the main pages table would require that I clean up the associated path entry if it is no longer used.

I had been hoping that I could create an updateable view of the joined tables and let SQL Server do the magic for me, but I get the message: View or function 'Scrap.dbo.PageView' is not updatable because the modification affects multiple base tables. Trying to perform an insert.

Is there a reasonable way to do this that I'm just missing, or am I out of luck?

+1  A: 

This doesn't attack the duplicate-entries problem, because I can't 100% get my head anound your schema, but here's one idea I had to cut down on potential size, assuming you're storing path information as well as filesize.

What's the file system look like? If it's a deep directory tree, is there any way you can abstract this to a separate lookup table, instead of storing the path information each time? For example, something like:

Table PATHS:

ID    PATHNAME        PARENT
int   varchar(128)    int, FK on PATHS.ID
---   ------------    --------------------
1     /               NULL
2     images          1
3     dir1            2
4     dir2            2

Or for even faster reconstruction of the path, you just store the whole thing, as long as you're just saving each path once. That way you don't have to worry about recursing back through to the root to assemble the path each time:

ID    PATHNAME
int   varchar(128)
---   ------------
1     /
2     /images
3     /images/dir1
4     /images/dir2

Then you can change your table's definition to be:

ImageId int
Page int
FileName varchar(256)
Path int, FK to PATHS.ID
ImageFrame int
...

and perhaps save a bit of space, especially if it's very deep.

lc
A: 

I think you are right - creating a new table just to store the unique paths may actually take more space than leaving the duplicated data in the original table. Plus it complicates the queries and updates. This depends on the amount of duplicates of course.

DJ
A: 

I am confused about the actual problem? Are you having problems with performance or is 5 gigs really that expensive? If performance is the problem a smaller table may not be the solution. I would investigate changing FSPath to char(256). It will take up more space but your data will line up better on the hard drive and should help performance. I would also support changing the schema as you described but it if it's not feasible because consumers can't/won't change code you may have to build some test to show that change is worth it.

jms
I'm having some performance issues due to the disk paging that occurs when I'm loading these entries in. I did investigate a fixed width char approach, but considering my average path length is 25 chars now, switching to char(256) would almost 10x my storage requirements and cause more disk faults.