I'm storing some very basic information about the "data sources" coming into my application. These data sources can be documents (e.g. PDF), audio (e.g. MP3) or video (e.g. AVI). Say, for example, that I am only interested in the filename of each data source. Thus, I have the following table:

DataSource
Id (PK)
Filename

For each data source, I also need to store some of its attributes. For a PDF, an example would be "number of pages"; for audio, "bit rate"; for video, "duration." Each DataSource will have different requirements for the attributes that need to be stored. So, I have modeled "data source attribute" this way:

DataSourceAttribute
Id (PK)
DataSourceId (FK)
Name
Value

Thus, I would have records like these:

DataSource->Id = 1
DataSource->Filename = 'mydoc.pdf'

DataSource->Id = 2
DataSource->Filename = 'mysong.mp3'

DataSource->Id = 3
DataSource->Filename = 'myvideo.avi'

DataSourceAttribute->Id = 1
DataSourceAttribute->DataSourceId = 1
DataSourceAttribute->Name = 'TotalPages'
DataSourceAttribute->Value = '10'

DataSourceAttribute->Id = 2
DataSourceAttribute->DataSourceId = 2
DataSourceAttribute->Name = 'BitRate'
DataSourceAttribute->Value = '16'

DataSourceAttribute->Id = 3
DataSourceAttribute->DataSourceId = 3
DataSourceAttribute->Name = 'Duration'
DataSourceAttribute->Value = '1:32'
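
For reference, here is a minimal sketch of the two tables and the sample records above, assuming SQLite (via Python's built-in sqlite3 module) as a stand-in for the real database. The column types and the in-memory connection are assumptions made for illustration only; the join at the end is the kind of lookup the question is about.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for the real server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DataSource (
    Id       INTEGER PRIMARY KEY,
    Filename TEXT NOT NULL
);
CREATE TABLE DataSourceAttribute (
    Id           INTEGER PRIMARY KEY,
    DataSourceId INTEGER NOT NULL REFERENCES DataSource(Id),
    Name         TEXT NOT NULL,
    Value        TEXT NOT NULL   -- every value is stored as text in this design
);
""")

# The three data sources and their attributes from the records above.
conn.executemany(
    "INSERT INTO DataSource (Id, Filename) VALUES (?, ?)",
    [(1, "mydoc.pdf"), (2, "mysong.mp3"), (3, "myvideo.avi")],
)
conn.executemany(
    "INSERT INTO DataSourceAttribute (Id, DataSourceId, Name, Value) "
    "VALUES (?, ?, ?, ?)",
    [(1, 1, "TotalPages", "10"), (2, 2, "BitRate", "16"), (3, 3, "Duration", "1:32")],
)

# Fetching one attribute per file means one join against the attribute table.
rows = conn.execute("""
    SELECT ds.Filename, a.Value AS TotalPages
    FROM DataSource AS ds
    JOIN DataSourceAttribute AS a
      ON a.DataSourceId = ds.Id AND a.Name = 'TotalPages'
    WHERE ds.Filename LIKE '%.pdf'
""").fetchall()
print(rows)  # [('mydoc.pdf', '10')]
```

Note that TotalPages comes back as the string '10', because the Value column has to hold every attribute type.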

My problem is that this doesn't seem to scale. For example, say I need to query for all the PDF documents along with their total number of pages:

Filename, TotalPages
'mydoc.pdf',  '10'
'myotherdoc.pdf', '23'
...

The JOINs needed to produce the above result are just too costly. How should I address this problem?

A: 

It seems like you want something a bit looser than a typical relational database. This sounds like a good candidate for something like Lucene or MongoDB. Lucene is an index engine which allows any type of document to be stored and indexed. MongoDB sits in the middle between an RDBMS and free-form document storage. JSON in some form or other (MongoDB is a good example) should fit nicely.

cofiem
@cofiem: I'm not sure if I am ready to introduce another technology into the application. Right now, I want to try to solve this via proper data modeling.
StackOverflowNewbie
A: 

This might work, but define too costly...

select
    datasource.id,
    datasource.filename,
    d2.id as d2id,
    d2.value as totalpages
from datasource
inner join datasourceattribute d2
    on datasource.id = d2.datasourceid and d2.name = 'TotalPages'
where datasource.filename like '%.pdf'
Zak
@Zak: My query was basically the same as yours. I ran 388 queries, which took 255.7976 seconds to execute. That's over 4 minutes for such a small number of records, no?
StackOverflowNewbie
what do your table indexes look like?
Zak
also, prefix this query with "explain" then run it in your command line, then post the results...
Zak
@Zak: does this help? http://pastebin.com/a8MZ22wE
StackOverflowNewbie
This should work, but you need some indexes. Perhaps ix_datasourceid_name, or just ix_datasourceid. Add those and repost your explain.
Gary
What Gary said :)
Zak
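
The composite index suggested in the comments can be sketched as follows. This is an illustration under assumptions, not the asker's actual setup: it uses SQLite through Python's sqlite3 module rather than MySQL, and EXPLAIN QUERY PLAN in place of MySQL's EXPLAIN. The index name ix_datasourceid_name is the one proposed above; the column types are guesses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DataSource (Id INTEGER PRIMARY KEY, Filename TEXT NOT NULL);
CREATE TABLE DataSourceAttribute (
    Id           INTEGER PRIMARY KEY,
    DataSourceId INTEGER NOT NULL REFERENCES DataSource(Id),
    Name         TEXT NOT NULL,
    Value        TEXT NOT NULL
);
-- Composite index so that one attribute of one data source can be found
-- without scanning the whole attribute table.
CREATE INDEX ix_datasourceid_name
    ON DataSourceAttribute (DataSourceId, Name);
""")

# Lookup of a single named attribute for a single data source.
lookup = """
    SELECT Value
    FROM DataSourceAttribute
    WHERE DataSourceId = ? AND Name = ?
"""

# Ask SQLite how it intends to run the lookup; the last column of each
# plan row is a human-readable description of that step.
plan = conn.execute("EXPLAIN QUERY PLAN " + lookup, (1, "TotalPages")).fetchall()
plan_details = [row[-1] for row in plan]
for detail in plan_details:
    print(detail)
```

If the index is being used, its name appears in the printed plan. The exact wording of the plan varies between SQLite versions, and MySQL's EXPLAIN output is laid out differently, so treat this only as a way to check the idea.
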
A: 

Scaling is one of the most common problems with EAV (Entity-Attribute-Value) data structures. In short, you have to query the metadata (i.e. locate the attributes) before you can get to the data. However, here is a query that you can use to get the data you want:

Select DataSourceId 
    , Min( Case When Name = 'TotalPages' Then Value End ) As TotalPages
    , Min( Case When Name = 'BitRate' Then Value End ) As BitRate
    , Min( Case When Name = 'Duration' Then Value End ) As Duration
From DataSourceAttribute
Group By DataSourceId 

In order to improve performance, you'll want an index on DataSourceId and perhaps Name as well. To get to the results you posted, you would do:

Select DataSource.FileName
    , Min( Case When DataSourceAttribute.Name = 'TotalPages' Then Value End ) As TotalPages
    , Min( Case When DataSourceAttribute.Name = 'BitRate' Then Value End ) As BitRate
    , Min( Case When DataSourceAttribute.Name = 'Duration' Then Value End ) As Duration
From DataSourceAttribute
    Join DataSource
        On DataSource.Id = DataSourceAttribute.DataSourceId
Group By DataSource.FileName
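
As a quick check of how this pivot behaves, here is a small sketch that runs the same Min( Case ... ) pattern against the sample rows from the question. It assumes SQLite through Python's sqlite3 module as a stand-in database; the column types and the Order By clause are additions made only so the output is predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DataSource (Id INTEGER PRIMARY KEY, Filename TEXT NOT NULL);
CREATE TABLE DataSourceAttribute (
    Id           INTEGER PRIMARY KEY,
    DataSourceId INTEGER NOT NULL REFERENCES DataSource(Id),
    Name         TEXT NOT NULL,
    Value        TEXT NOT NULL
);
INSERT INTO DataSource VALUES (1, 'mydoc.pdf'), (2, 'mysong.mp3'), (3, 'myvideo.avi');
INSERT INTO DataSourceAttribute VALUES
    (1, 1, 'TotalPages', '10'),
    (2, 2, 'BitRate',    '16'),
    (3, 3, 'Duration',   '1:32');
""")

# Each CASE yields the value for one attribute name and NULL otherwise;
# MIN() skips NULLs, so each group collapses to one row with one column
# per attribute.
pivot = conn.execute("""
    SELECT ds.Filename,
           MIN(CASE WHEN a.Name = 'TotalPages' THEN a.Value END) AS TotalPages,
           MIN(CASE WHEN a.Name = 'BitRate'    THEN a.Value END) AS BitRate,
           MIN(CASE WHEN a.Name = 'Duration'   THEN a.Value END) AS Duration
    FROM DataSourceAttribute AS a
    JOIN DataSource AS ds ON ds.Id = a.DataSourceId
    GROUP BY ds.Filename
    ORDER BY ds.Filename
""").fetchall()

for row in pivot:
    print(row)
# ('mydoc.pdf', '10', None, None)
# ('mysong.mp3', None, '16', None)
# ('myvideo.avi', None, None, '1:32')
```

Attributes that a data source does not have come back as NULL (None in Python), which is why a PDF row has no BitRate or Duration.
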
Thomas
A: 

Entity-Attribute-Value, just don't!

To be fair, it has its place on occasion, but in general I would try to avoid it. The solution in your case would be to do exactly what you mentioned in one of your comments: proper data modeling. In this case, that would mean modeling each type of data source individually.

Are there so many sources that creating tables for each is not an option?
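
One way to read "modeling each data source individually" is a table per source type, each keyed by the DataSource row it describes. The sketch below is only an illustration of that idea, using SQLite through Python's sqlite3 module. The table and column names (PdfDocument, AudioFile, VideoFile, DurationSeconds) and the column types are hypothetical and do not come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DataSource (
    Id       INTEGER PRIMARY KEY,
    Filename TEXT NOT NULL
);
-- One table per kind of data source (hypothetical names), each with
-- properly typed columns instead of generic Name/Value pairs.
CREATE TABLE PdfDocument (
    DataSourceId INTEGER PRIMARY KEY REFERENCES DataSource(Id),
    TotalPages   INTEGER NOT NULL
);
CREATE TABLE AudioFile (
    DataSourceId INTEGER PRIMARY KEY REFERENCES DataSource(Id),
    BitRate      INTEGER NOT NULL
);
CREATE TABLE VideoFile (
    DataSourceId    INTEGER PRIMARY KEY REFERENCES DataSource(Id),
    DurationSeconds INTEGER NOT NULL
);
INSERT INTO DataSource VALUES
    (1, 'mydoc.pdf'), (2, 'mysong.mp3'), (3, 'myvideo.avi'), (4, 'myotherdoc.pdf');
INSERT INTO PdfDocument VALUES (1, 10), (4, 23);
INSERT INTO AudioFile   VALUES (2, 16);
INSERT INTO VideoFile   VALUES (3, 92);   -- '1:32' stored as 92 seconds
""")

# All PDFs with their page counts: a single primary-key join, with no
# filtering on attribute names and no filename pattern matching.
pdfs = conn.execute("""
    SELECT ds.Filename, p.TotalPages
    FROM PdfDocument AS p
    JOIN DataSource AS ds ON ds.Id = p.DataSourceId
    ORDER BY ds.Filename
""").fetchall()
print(pdfs)  # [('mydoc.pdf', 10), ('myotherdoc.pdf', 23)]
```

The trade-off is that every new kind of data source needs a new table, which is why the number of source types matters.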

Mark Storey-Smith