ansaurus

Question

designing database to hold different metadata information

Answer 1

+2 A:

The product should be typed. e.g. include type_id in the product table, that points to the categories of products you will support, and lets you know which other tables to query against for the appropriate related attributes.

Randy 2010-07-13 20:09:03

Answer 2

+1 A:

You could go with the schema-less approach:

Hold the metadata in a TEXT column as a JSON object (or other serialization, but JSON is better for reasons soon explained).

Advantages to this technique:

Less queries: you get all the information in one query, no need for "in directional" queries (to get meta-meta-data) and joins.
You can add/remove any attributes you want at any time, no need to alter table (which is problematic in some databases, e.g Mysql locks the table, and it takes long time with huge tables)
Since it's JSON, you don't need extra processing on your backend. Your webpage (I assume it's a web application) just reads the JSON as is from your web service and that's it, you can use the JSON object with javascript however you like.

Problems:

Potentially wasted space, if you have 100 books with the same author, an author table with all the books having just the author_id is more economical space wise.
Need to implement indexes. since your metadata is a JSON object you don't have indexes straight away. But it's fairly easy to implement specific index for the specific metadata you need. e.g you want to index by author, so you create a author_idx table with author_id and item_id, when someone searches for author, you can look up this table and the items themselves.

Depending on the scale, this might be an overkill. on smaller scale joins would work just fine.

miedwar 2010-07-13 20:43:14

Answer 3

+8 A:

This is called the Observation Pattern.

alt text

Three objects, for the example

Book
Title = 'Gone with the Wind' 
Author = 'Margaret Mitchell'
ISBN   = '978-1416548898'

Cat
Name = 'Phoebe'
Color = 'Gray'
TailLength = 9 'inch'

Beer Bottle
Volume = 500 'ml'
Color = 'Green'

This is how tables may look like:

Entity
EntityID    Name            Description
   1        'Book'            'To read'
   2        'Cat'             'Fury cat' 
   3        'Beer Bottle'     'To ship beer in'

.

PropertyType
PropertyTypeID   Name        IsTrait         Description
   1            'Height'     'NO'       'For anything that has height' 
   2            'Width'      'NO'       'For anything that has width' 
   3            'Volume'     'NO'       'For things that can have volume'
   4            'Title'      'YES'      'Some stuff has title' 
   5            'Author'     'YES'      'Things can be authored' 
   6            'Color'      'YES'      'Color of things' 
   7            'ISBN'       'YES'      'Books would need this'
   8            'TailLength' 'NO'       'For stuff that has long tails'
   9            'Name'       'YES'      'Name of things'

.

Property
PropertyID   EntityID  PropertyTypeID      
    1           1              4     -- book, title
    2           1              5     -- book, author
    3           1              7     -- book, isbn
    4           2              9     -- cat, name
    5           2              6     -- cat, color
    6           2              8     -- cat, tail length
    7           3              3     -- beer bottle, volume
    8           3              6     -- beer bottle, color

.

Measurement
PropertyID     Unit       Value 
    6          'inch'       9          -- cat, tail length
    7          'ml'        500         -- beer bottle, volume

.

Trait
PropertyID         Value 
    1         'Gone with the Wind'     -- book, title
    2         'Margaret Mitchell'      -- book, author
    3         '978-1416548898'         -- book, isbn
    4         'Phoebe'                 -- cat, name
    5         'Gray'                   -- cat, color
    8         'Green'                  -- beer bottle, color

EDIT:

Jefferey raised a valid point (see comment), so I'll expand the answer.

The model allows for dynamic (on-fly) creation of any number of entites with any type of properties without schema changes. Hovewer, this flexibility has a price -- storing and searching is slower and more complex than in a usual table design.

Time for an example, but first, to make things easier, I'll flatten the model into a view.

create view vModel as 
select 
      e.EntityId
    , x.Name  as PropertyName
    , m.Value as MeasurementValue
    , m.Unit
    , t.Value as TraitValue
from Entity           as e
join Property         as p on p.EntityID       = p.EntityID
join PropertyType     as x on x.PropertyTypeId = p.PropertyTypeId
left join Measurement as m on m.PropertyId     = p.PropertyId
left join Trait       as t on t.PropertyId     = p.PropertyId
;

To use Jefferey's example from the comment

with 
q_00 as ( -- all books
    select EntityID
    from vModel
    where PropertyName = 'object type'
      and TraitValue   = 'book' 
),
q_01 as ( -- all US books
    select EntityID
    from vModel as a
    join q_00   as b on b.EntityID = a.EntityID
    where PropertyName = 'publisher country'
      and TraitValue   = 'US' 
),
q_02 as ( -- all US books published in 2008
    select EntityID
    from vModel as a
    join q_01   as b on b.EntityID = a.EntityID
    where PropertyName     = 'year published'
      and MeasurementValue = 2008 
),
q_03 as ( -- all US books published in 2008 not discontinued
    select EntityID
    from vModel as a
    join q_02   as b on b.EntityID = a.EntityID
    where PropertyName = 'is discontinued'
      and TraitValue   = 'no' 
),
q_04 as ( -- all US books published in 2008 not discontinued that cost less than $50
    select EntityID
    from vModel as a
    join q_03   as b on b.EntityID = a.EntityID
    where PropertyName     = 'price'
      and MeasurementValue < 50 
      and MeasurementUnit  = 'USD'
)
select
      EntityID
    , max(case PropertyName when 'title' than TraitValue else null end) as Title
    , max(case PropertyName when 'ISBN'  than TraitValue else null end) as ISBN
from vModel as a
join q_04   as b on b.EntityID = a.EntityID
group by EntityID ;

This looks complicated to write, but on a closer inspection you may notice a pattern in CTEs.

Now suppose we have a standard fixed schema design where each object property has its own column. The query would look something like:

select EntityID, Title, ISBN
from vModel
WHERE ObjectType       = 'book'
  and PublisherCountry = 'US'
  and YearPublished    = 2008
  and IsDiscontinued   = 'no'
  and Price            < 50
  and Currency         = 'USD'
;

Damir Sudarevic 2010-07-13 23:50:54

Thanks this is a very good answer and a lead. Let's add more to this discussion. In this sort of design how would one account for fast searching? I would imagine that would require a lot of joins ??

John Stewart 2010-07-14 13:23:17

This is a very bad idea, and it will lead to problems further down the road. Please don't do this.

Jeffrey L Whitledge 2010-07-14 15:23:10

For example, try getting all the books from US publishers of academic books published in 2008 that haven't been discontinued and cost less than $50. Good luck! In a properly designed relational database that's a 2-minute task.

Jeffrey L Whitledge 2010-07-14 15:31:25

@Jefferey, for that one uses data warehouse. The initial problem was having DB with "flexible" columns (properties) for different kind of objects, and ability to dynamically accomodate this without schema changes.

Damir Sudarevic 2010-07-14 17:06:12

@John, this would depend on search -- keep in mind that property here is in rows, not columns. To get a "standard table layout" for an entity, pivoting is needed -- which is slow. Essentially, each column that you would normally palace in the WHERE clause requires a sub-query. The easiest thing to do is to make a model, populate and try it out.

Damir Sudarevic 2010-07-14 19:56:07

This technique has pros and cons. The big "pro" is flexibility. The big "con" is that it is complicated to query and it is subject to severe entropy. By the latter I mean that if users can create new properties, you will almost surely get a lot of inconsistent properties and shear junk properties. Even if these are controlled by programmers, you will likely get a lot of junk. I'm not saying never do it, but I'd say it should be a last resort.

Jay 2010-07-15 13:06:10

Did you really mean "fury cat", or "furry cat". Well, perhaps you have a cat like that one my ex-girlfriend used to have, that could properly be called a fury cat.

Jay 2010-07-15 13:18:29

@Jay :) True, though avalanche of properties can be tamed on the application layer using "did you mean...". If there are only few object types and schema changes are ok, Jefferey's suggestion of Supertype-subtype is much better; but, if there are hundreds or thousands of object-types created on fly...

Damir Sudarevic 2010-07-15 13:40:06

@Damir: It also depends on who your users are. If it's 4 people in the marketing department, you have a much better chance to keep things under control that if its 10,000 customers.

Jay 2010-07-15 14:07:41

I'm still not a fan of this method, but your responses are an excellent contribution to this discussion. I have changed my down vote to an up vote.

Jeffrey L Whitledge 2010-07-15 18:20:40

Answer 4

+2 A:

I wasn't going to answer, but right now the accepted answer has a very bad idea. A relational database should never be used to store simple attribute-value pairs. That will cause a lot of problems down the road.

The best way to deal with this is to create a separate table for each type.

Product
-------
ProductId
Description
Price
(other attributes common to all products)

Book
----
ProductId (foreign key to Product.ProductId)
ISBN
Author
(other attributes related to books)

Electronics
-----------
ProductId (foreign key to Product.ProductId)
BatteriesRequired
etc.

Each row of each table should represent a proposition about the real world, and the structure of the tables and their constraints should reflect the realities that are being represented. The closer you can get to this ideal, the cleaner the data will be, and the easier it will be to do reporting and to extend the system in other ways. It will also run more effeciently.

Jeffrey L Whitledge 2010-07-14 15:22:07

Indeed, I do like super type-subtype too -- the problem is what happens when the number of sub-type tables goes into thousands? How to handle cases when new types are added on fly? What would would be recommended in such a case? Native XML storage or ...

Damir Sudarevic 2010-07-14 23:18:18

@Jeffrey L Whitledge: I'm very interested to see what you think about Damir's question above.

bukzor 2010-09-16 16:39:26

@bukzor , @Damir Sudarevic - Indeed, I should have answered that question, and I regret that have not done so. The answer is that the database should be designed to reflect the real-world situation that is known. If new "types" are to be added, then only the things that are consistant across those types can be relationally modeled. Some kind of attribute/value system may be called for, but only for those things that are "soft". Basically, if something can be modified by a user of the system, rather than a programmer, then it must be stored as data rather than as structure.

Jeffrey L Whitledge 2010-09-16 19:21:22

@Jeffrey L Whitledge: Thanks! Based on your feedback, I will store the very-stable metrics as measures, but keep an EAV format as well for unforseen additions. @Damir Have either of you seen prior art for this hybrid-type scheme?

bukzor 2010-09-16 20:23:29

I have a similar problem where I am creating a warehouse management system and different products have different attributes. With the structure you have described, would it be acceptable to allow the admin user of the site to add new product types?

andrew 2010-10-11 21:46:22

Answer 5

+1 A:

In this kind of problem, you have three choices:

Create a table with "generic" columns. For example, if you sell both books and toasters, it is likely that your toasters don't have an ISBN and a title, but they still have some kind of product identifier and description. So give the fields generic names like "product_id" and "description", and for books the product_id is an ISBN, for toasters its the manufacturer's part number, etc.

This works when the real-world entities are all being processed in the same way, at least for the most part, and so must have, if not the "same" data, at least analagous data. This breaks down when there are real functional differences. Like if for toasters we are calcualting watts = volts * amps, it is likely that there is no corresponding calculation for books. When you start creating a pages_volts fields that contains the page count for books and the voltage for toasters, things have gotten out of control.

Use a property/value scheme like Damir suggests. See my comment on his post for the pros and cons there.
What I'd usually suggest is a type/subtype scheme. Create a table for "product" that contains a type code and the generic fields. Then for each of the true types -- books, toasters, cats, whatever -- create a separate table that is connected to the product table. Then when you need to do book-specific processing, process the book table. When you need to do generic processing, process the product table.

Jay 2010-07-15 13:17:30

ansaurus

tags:

views:

answers:

designing database to hold different metadata information

related questions