I want to record what various sources have to say about a historical figure. For example:

  • The website Wikipedia says that Susan B. Anthony was born February 15, 1820 and her favorite color was blue
  • The book Century of Struggle says that Susan B. Anthony was born on February 12, 1820 and her favorite color was red
  • The book History of Woman Suffrage says that Susan B. Anthony was born on February 15, 1820, that her favorite color was red, and that she was the second cousin of Abraham Lincoln

I also want researchers to be able to express their confidence, for instance with a percentage, in the individual claims that these sources are making. For example:

  • User A is 90% confident that Susan B. Anthony was born on February 15, 1820; 75% confident that her favorite color was blue, and 30% confident that she was second cousins with Abraham Lincoln
  • User B is 30% confident that Susan B. Anthony was born on February 12, 1820; 60% confident that her favorite color was blue, and 10% confident that she was second cousins with Abraham Lincoln

I then want each user to have a view of Susan B. Anthony that shows the birthday, favorite color, and relationships that the user thinks are most likely to be true.

I also want to use a relational database as the datastore, and the only way I can think of to do this is to create a separate table for every individual type of atomic fact that I want the users to be able to express their confidence in. So for this example there would be eight tables in total, including three separate tables for the three separate types of atomic fact.

Source(id)
Person(id)

Claim(id, source, FOREIGN KEY(source) REFERENCES Source(id))
Alleged_birth_date(claim_id, person, birth_date, FOREIGN KEY(claim_id) REFERENCES Claim(id), FOREIGN KEY(person) REFERENCES Person(id))
Alleged_favorite_color(claim_id, person, color, FOREIGN KEY(claim_id) REFERENCES Claim(id), FOREIGN KEY(person) REFERENCES Person(id))
Alleged_kinship(claim_id, person, relationship_type, kin, FOREIGN KEY(claim_id) REFERENCES Claim(id), FOREIGN KEY(person) REFERENCES Person(id), FOREIGN KEY(kin) REFERENCES Person(id))

User(id)
Confidence_in_claim(user, claim, confidence, FOREIGN KEY(user) REFERENCES User(id), FOREIGN KEY(claim) REFERENCES Claim(id))

This feels like it gets very complicated very quickly, as I actually want to record a lot of types of atomic facts. Are there better ways to do this?

This is, I think, the same issue that Martin Fowler calls Contradictory Observations.

+2  A: 

RDF is great for this. It's usually described as a format for metadata, but in fact it's a graph model of 'assertions' expressed as triples (subject, predicate, object).

The whole 'semantic web' idea is to publish lots of facts in RDF, so that search engines can act as inference engines that traverse the unified graph to find relationships.

There are also mechanisms to refer to a triple, so you can say something about an assertion, like its origin (who says this?), when it was asserted (when was it said?), or how much you believe it to be true.
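As a rough illustration of "saying something about an assertion", here is a minimal sketch in plain Python. It does not use a real RDF library; the triples are just tuples in a set. Only the `rdf:Statement`, `rdf:subject`, `rdf:predicate` and `rdf:object` terms come from the RDF reification vocabulary. Every `ex:` identifier and every helper function (`add`, `reify`, `rate`, `objects`, `sources_for`, `confidence_of`) is made up for this example.

```python
# Plain-Python illustration of RDF-style reification; no RDF library is used.
# All "ex:" identifiers and helper names are hypothetical.
triples = set()

def add(s, p, o):
    triples.add((s, p, o))

def reify(stmt_id, s, p, o):
    """Describe the triple (s, p, o) as a resource that can itself be a subject."""
    add(stmt_id, "rdf:type", "rdf:Statement")
    add(stmt_id, "rdf:subject", s)
    add(stmt_id, "rdf:predicate", p)
    add(stmt_id, "rdf:object", o)

def rate(rating_id, user, claim, confidence):
    """One node per (user, claim) rating, so many users can rate the same claim."""
    add(rating_id, "ex:rater", user)
    add(rating_id, "ex:aboutClaim", claim)
    add(rating_id, "ex:confidence", confidence)

def objects(s, p):
    return sorted(o for (s2, p2, o) in triples if s2 == s and p2 == p)

# Two sources disagree about the birth date.
reify("ex:claim1", "ex:SusanBAnthony", "ex:birthDate", "1820-02-15")
add("ex:claim1", "ex:assertedBy", "ex:Wikipedia")
reify("ex:claim2", "ex:SusanBAnthony", "ex:birthDate", "1820-02-12")
add("ex:claim2", "ex:assertedBy", "ex:CenturyOfStruggle")

# Researchers attach their confidence to the claims, not to the person.
rate("ex:rating1", "ex:UserA", "ex:claim1", 0.90)
rate("ex:rating2", "ex:UserB", "ex:claim2", 0.30)

def sources_for(s, p, o):
    """Who asserted the triple (s, p, o)?"""
    statements = {t[0] for t in triples if t[1:] == ("rdf:type", "rdf:Statement")}
    found = []
    for stmt in statements:
        described = (objects(stmt, "rdf:subject"),
                     objects(stmt, "rdf:predicate"),
                     objects(stmt, "rdf:object"))
        if described == ([s], [p], [o]):
            found.extend(objects(stmt, "ex:assertedBy"))
    return sorted(found)

def confidence_of(user, claim):
    """Return the user's confidence in a claim, or None if they never rated it."""
    for rating in {s for (s, p, o) in triples if p == "ex:rater" and o == user}:
        if claim in objects(rating, "ex:aboutClaim"):
            return objects(rating, "ex:confidence")[0]
    return None
```

Calling `sources_for("ex:SusanBAnthony", "ex:birthDate", "1820-02-15")` walks the reified statements to find who made that claim. A real triple store would express the same lookup as a query rather than a Python loop.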

As a big example, the whole OpenCyc 'commonsense knowledge base' is queryable in RDF.

Javier
I looked into RDF, and I still might go that way, but there does not seem to be anything like an emerging consensus on how to do this. Reification seems precisely like what I want, but SPARQL does not like it. Named graphs seem a partial solution, but still do not seem quite there.
fgregg
Sure, where there's no consensus you have to make a choice; but not using any standard (even an incomplete one) means you have to make choices for _everything_. And if some consensus emerges, you might have to adapt, but not as much as if you had done it totally differently.
Javier
Maybe it would be easier for you to work directly with a graph model of the data? I wrote a response to the post by Fowler mentioned in the question along this line of thought: http://blog.nawroth.com/2009/03/flexibility-in-data-modeling.html
nawroth
@Javier, That's a fair point.
fgregg
@nawroth: RDF is a graph model. The triple store and the XML are just representations of that. Once you internalize them, you're working on graphs.
Javier
@Javier: I should have made myself more clear. There are other ways than RDF to work with a graph model, namely graph databases. Depending on what you want to do, the API of a graph database could be easier to work with than RDF. If it's a main concern not to tie the backend to a specific implementation, RDF is the way to go, I think.
nawroth
A: 

This feels like it gets very complicated very quickly

You're not kidding. Have a look at the work on ontology and knowledge representation.

Charlie Martin
+1  A: 

I think what you want to use is a "property bag". Instead of modeling each individual type of fact that you want to describe, you have one table that contains an ID, a "key" (the kind of information alleged, such as "kinship") and a "value" (the alleged value, such as "Abraham Lincoln"). Then you have a second table that ties your claimants to that table, along with the level of confidence they have in that information. That table would simply have the ID of the source, the ID of the property, and the confidence that the source has in the information. That way, a source can have a lot of information or a little; differing sources can have differing levels of confidence in a given attribute; and there is no limit on how many types of information you can store.

It's a pretty standard solution for situations such as yours where you have large amounts of optional information that you want to cross-reference.
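A minimal sketch of that two-table layout, using Python's built-in `sqlite3` so it can be run as-is. The table and column names, the extra `person` column, and the `most_likely` helper are assumptions made for this example, and the "claimant" is taken to be one of the question's users. The confidence values come from the question, except the one marked as illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE property (
    id     INTEGER PRIMARY KEY,
    person TEXT NOT NULL,   -- who the fact is about (assumed column)
    key    TEXT NOT NULL,   -- kind of fact, e.g. 'birth_date'
    value  TEXT NOT NULL    -- alleged value, stored as text
);
CREATE TABLE confidence (
    claimant    TEXT NOT NULL,  -- here: a user rating the property
    property_id INTEGER NOT NULL REFERENCES property(id),
    confidence  REAL NOT NULL CHECK (confidence BETWEEN 0 AND 1),
    PRIMARY KEY (claimant, property_id)
);
""")

conn.executemany("INSERT INTO property VALUES (?, ?, ?, ?)", [
    (1, "Susan B. Anthony", "birth_date", "1820-02-15"),
    (2, "Susan B. Anthony", "birth_date", "1820-02-12"),
    (3, "Susan B. Anthony", "favorite_color", "blue"),
])
conn.executemany("INSERT INTO confidence VALUES (?, ?, ?)", [
    ("User A", 1, 0.90),
    ("User A", 2, 0.05),  # illustrative value, not from the question
    ("User A", 3, 0.75),
    ("User B", 2, 0.30),
    ("User B", 3, 0.60),
])

def most_likely(conn, claimant, person):
    """For each key, return the value this claimant rates highest."""
    rows = conn.execute("""
        SELECT p.key, p.value
        FROM property p
        JOIN confidence c ON c.property_id = p.id
        WHERE c.claimant = ? AND p.person = ?
          AND c.confidence = (
              SELECT MAX(c2.confidence)
              FROM property p2
              JOIN confidence c2 ON c2.property_id = p2.id
              WHERE c2.claimant = c.claimant
                AND p2.person = p.person
                AND p2.key = p.key)
        ORDER BY p.key
    """, (claimant, person)).fetchall()
    return dict(rows)  # on a tie, the last row for that key wins

print(most_likely(conn, "User A", "Susan B. Anthony"))
```

The trade-off is that every value is stored as text, so typed checks (is this a valid date? does this kin exist as a person?) move out of the schema and into application code.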

McWafflestix
+3  A: 

You should try a Star Schema model, centered around a "Fact" table and several "Dimension" tables. This is a well-explored model, and there are many database optimizations for it.

claim_fact(source_id, person_id, user_id, details_id, weight)

source_dimension(id, name)

person_dimension(id, name)

user_dimension(id, name)

details_dimension(id, name NOT NULL, color NULLABLE, kinship NULLABLE, birthday NULLABLE)

Every claim would have a source, person, user, and details. The name column of the details dimension would hold values such as "kinship" or "birthday", indicating which of the nullable columns is filled in.
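To make the layout above concrete, here is a small sketch using Python's built-in `sqlite3`. The column types, the sample rows, and the `claims_for_user` helper are assumptions made for this example. SQLite is only a convenient stand-in and is not a data-warehouse engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source_dimension (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE person_dimension (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE user_dimension   (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE details_dimension (
    id       INTEGER PRIMARY KEY,
    name     TEXT NOT NULL,   -- 'color', 'kinship' or 'birthday'
    color    TEXT,
    kinship  TEXT,
    birthday TEXT
);
CREATE TABLE claim_fact (
    source_id  INTEGER NOT NULL REFERENCES source_dimension(id),
    person_id  INTEGER NOT NULL REFERENCES person_dimension(id),
    user_id    INTEGER NOT NULL REFERENCES user_dimension(id),
    details_id INTEGER NOT NULL REFERENCES details_dimension(id),
    weight     REAL NOT NULL  -- the user's confidence in the claim
);
INSERT INTO source_dimension VALUES (1, 'Wikipedia'), (2, 'Century of Struggle');
INSERT INTO person_dimension VALUES (1, 'Susan B. Anthony');
INSERT INTO user_dimension   VALUES (1, 'User A'), (2, 'User B');
INSERT INTO details_dimension VALUES
    (1, 'birthday', NULL, NULL, '1820-02-15'),
    (2, 'birthday', NULL, NULL, '1820-02-12'),
    (3, 'color', 'blue', NULL, NULL);
INSERT INTO claim_fact VALUES
    (1, 1, 1, 1, 0.90),   -- User A on Wikipedia's birth date
    (1, 1, 1, 3, 0.75),   -- User A on Wikipedia's favorite color
    (2, 1, 2, 2, 0.30),   -- User B on Century of Struggle's birth date
    (1, 1, 2, 3, 0.60);   -- User B on Wikipedia's favorite color
""")

def claims_for_user(conn, user_name, person_name):
    """List (source, detail name, value, weight) for one user, strongest first."""
    return conn.execute("""
        SELECT s.name, d.name,
               COALESCE(d.color, d.kinship, d.birthday) AS value,
               f.weight
        FROM claim_fact f
        JOIN source_dimension  s ON s.id = f.source_id
        JOIN person_dimension  p ON p.id = f.person_id
        JOIN user_dimension    u ON u.id = f.user_id
        JOIN details_dimension d ON d.id = f.details_id
        WHERE u.name = ? AND p.name = ?
        ORDER BY f.weight DESC
    """, (user_name, person_name)).fetchall()

for row in claims_for_user(conn, "User A", "Susan B. Anthony"):
    print(row)
```

Note that each new kind of fact still needs a new nullable column on `details_dimension`, so this layout reduces the number of tables but does not remove schema changes entirely.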

Keep in mind that this is an OLAP schema (rather than an OLTP structure), and as such it is not fully normalized. The benefits outweigh any problems you may come across due to redundancy, as queries against star schemas are highly optimized by DBMSs configured for data warehousing.

RECOMMENDED READING: The Data Warehouse Toolkit (Kimball, et al.)