views:

35

answers:

2

I wish to save some semantic information about data in a table. How can I save this information in MySQL, such that I can access data and also search for the articles using the semantic data.

For example, I have a article about Apple and Microsoft. The semantic data will be like
Person : Steve Jobs
Person : Steve Ballmer
Company : Apple
Company : Microsoft

I want to save the information without losing the info that Steve Jobs and Steve Ballmer are persons and Apple and Microsoft are companies. I also want to search for articles about Steve Jobs / Apple.

Person and Company are not the only possible types, hence adding new fields is not viable. Since the type of the data is to be saved, I cannot use FullText field type directly.

Update - These are two options that I am considering.

  1. Save the data in a full text column as serialized php array.
  2. Create another table with 3 columns

--

--------------------------------  
| id | subject | object        |
--------------------------------
|  1 | Person  | Steve Ballmer |
|  1 | Person  | Steve Jobs    |
|  1 | Company | Microsoft     |
|  1 | Company | Apple         |
|  2 | Person  | Obama         |
|  2 | Country | US            |
--------------------------------
+1  A: 

You're working on a hard and interesting problem! You may get some interesting ideas from looking at the Dublin Core Metadata Initiative.

http://dublincore.org/metadata-basics/

To make it simple, think of your metadata items as all fitting in one table.

e.g.

Ballmer employed-by Microsoft
Ballmer is-a Person
Microsoft is-a Organization
Microsoft run-by Ballmer
SoftImage acquired-by Microsoft
SoftImage is-a Organization
Joel Spolsky is-a Person
Joel Spolsky formerly-employed-by Microsoft
Spolsky, Joel dreamed-up StackOverflow
StackOverflow is-a Website
Socrates is-a Person
Socrates died-on (some date)

The trick here is that some, but not all, your first and third column values need to be BOTH arbitrary text AND serve as indexes into the first and third columns. Then, if you're trying to figure out what your data base has on Spolsky, you can full-text search your first and third columns for his name. You'll get out a bunch of triplets. The values you find will tell you a lot. If you want to know more, you can search again.

To pull this off you'll probably need to have five columns, as follows:

Full text subject  (whatever your user puts in)
Canonical subject (what your user puts in, massaged into a standard form)
Relation (is-a etc)
Full text object
Canonical object

The point of the canonical forms of your subject and object is to allow queries like this to work, even if your user puts in "Joel Spolsky" and "Spolsky, Joel" in two different places even if they mean the same person.

SELECT * 
  FROM relationships a
  JOIN relationships b (ON a.canonical_object = b.canonical_subject)
 WHERE MATCH (subject,object) AGAINST ('Spolsky')
Ollie Jones
That is an interesting solution Ollie. But my data is not that complicated. It will just have details like a list of people, organziations, events, locations etc mentioned in the article. Thanks.
Joyce Babu
@Joyce, you're right, my idea is overdesigned. But even if your data points simpler than what I mentioned, to make this extensible you need the two forms of each data point -- the canonical and the free-text form. And for extensibility you need a column giving the category. This lets you have Spolsky is-a Person, FogCreek is-a Company, and Stackoverflow is-a Website.
Ollie Jones
+1  A: 

You might want to normalize your data table by making 2 tables.

----------------
| id | subject |
----------------
|  1 | Person  |  
|  2 | Company |
|  3 | Country |
----------------

-----------------------------------  
| id | subject-id | object        |
-----------------------------------
|  1 |          1 | Steve Ballmer |
|  2 |          1 | Steve Jobs    |
|  3 |          2 | Microsoft     |
|  4 |          2 | Apple         |
|  5 |          1 | Obama         |
|  6 |          3 | US            |
-----------------------------------

This allows you to more easily see all the different subject types you have defined.

Gilbert Le Blanc
Performance-wise which is better - Normalized tables or single table?
Joyce Babu
Depends on your indexes. On the subject table, you need a subject unique index. On the object table, you need a subject_id, object, unique index. These indexes are in addition to the clustering indexes on id. With these indexes, the performance should be just as good with the normalized tables.
Gilbert Le Blanc