I'm using a single DB table to store various different entities, which means it can't have an arbitrary number of columns to cover every kind of entity. Instead I want to save just the most important fields (dates, reference IDs that act as a kind of foreign key to various other tables, the most important text fields, etc.) plus an additional text field where I'd store more complete object data.

The most obvious solution would be to use XML strings and store those. The second most obvious choice would be JSON, which is usually shorter and probably also faster to serialize/deserialize. But is it really? My objects also wouldn't need to be strictly serializable, because JsonSerializer is usually able to serialize almost anything, even anonymous objects, which might well be used here.
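
To illustrate the JSON option, here's a minimal sketch assuming Json.NET as the serializer (any JSON serializer that handles anonymous types would do):

    using Newtonsoft.Json;

    // An anonymous object: no [Serializable] attribute or contract needed.
    var metadata = new { Author = "Robert", Pages = 12, Tags = new[] { "db", "search" } };

    // Produces a compact string suitable for a text column:
    // {"Author":"Robert","Pages":12,"Tags":["db","search"]}
    string json = JsonConvert.SerializeObject(metadata);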

What would be the best solution to this problem?

Additional info

My DB is highly normalised, and I'm using Entity Framework, but for the purpose of having external, super-fast full-text search functionality I'm sacrificing a bit of normalisation. For the record, I'm using SphinxSE on top of MySQL. Sphinx would return row IDs that I would use to query my index-optimised conglomerate table, getting the most important data from it much, much faster than querying multiple tables all over my DB.

My table would have columns like:

  • RowID (auto increment)
  • EntityID (the ID of the actual entity; not a real foreign key, because it would have to point to different tables)
  • EntityType (so I would be able to get the actual entity if needed)
  • DateAdded (record timestamp when it's been added into this table)
  • Title
  • Metadata (serialized data related to particular entity type)
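
As a minimal sketch, that table as a C# entity (the class name and types are illustrative, not from the question):

    public class SearchIndexRow
    {
        public long RowId { get; set; }        // auto increment PK
        public long EntityId { get; set; }     // points into different tables, so not a real FK
        public string EntityType { get; set; } // used to locate the actual entity if needed
        public DateTime DateAdded { get; set; }
        public string Title { get; set; }
        public string Metadata { get; set; }   // serialized per-entity-type payload
    }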

This table would be indexed by the Sphinx indexer. When searching for data, I would provide a series of EntityIDs and a limit date; the indexer would return a small, paged set of RowIDs ordered by DateAdded (descending). I would then join these RowIDs back to my table to get the relevant results. So this won't actually be full-text search but a filtering search. Getting RowIDs this way would be very fast, and fetching the results back from the table would be much faster than filtering on EntityID and DateAdded directly, even though those columns would be properly indexed.
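
Roughly, the retrieval flow would look like this (the sphinxClient call is hypothetical shorthand for the Sphinx query; only the EF join over the SearchIndexRow entity sketched above is meant literally):

    // Step 1 (hypothetical API): Sphinx filters by EntityIDs and limit date,
    // returning one page of RowIDs already ordered by DateAdded descending.
    IList<long> rowIds = sphinxClient.FilterRows(entityIds, limitDate, 0, 20);

    // Step 2: a single indexed primary-key lookup hydrates the summary rows.
    var results = context.SearchIndexRows
                         .Where(r => rowIds.Contains(r.RowId))
                         .OrderByDescending(r => r.DateAdded)
                         .ToList();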

A: 

Have you looked at NoSql databases?

http://nosql-database.org/

Otherwise, I'd have to say that your data model could lead you into difficulties in the future...

Paddy
I'm using this entity conglomerate table for faster fulltext search on common fields, so I'm also bound to those technologies and I can't just use any data store...
Robert Koritnik
+1  A: 

Don't do it. It is a bad idea.

If you really must do it, I would use XML. SQL Server, for example, lets you query against the XML as long as the field is an XML type.

In a sense, you are doing what object-oriented databases were doing. They have fallen out of favor because, for the most part, ORM tools, like Hibernate and Microsoft's Entity Framework, allow you to have the best of both the OO and relational worlds.

Tom Cabanski
I use Hibernate every day, and I've used OODBs, and I wouldn't call ORMs the "best of both worlds" by any stretch of the imagination. It's a messy hack to get you a fraction of the benefit while still using all the existing RDBMS infrastructure. But if I could snap my fingers and upgrade all of our systems to actual OODBs, I would, in a heartbeat.
Ken
I like the theory of OODBs too. I love the way they make the programmer's job much, much easier. Unfortunately, they make things like reporting and business intelligence a PITA. That is why there is very little market for them. I guess I should have said that ORMs remove much of the pain associated with OO-to-relational mapping so that the programmer can focus more energy on other areas of the application.
Tom Cabanski
I don't really think object-oriented databases have fallen out of favor -- they're a different beast, and not as widely used as RDBMS + ORM.
Jim Schubert
+3  A: 

The only means of saving data in a SQL database that will not lead to long-term pain is to actually create a proper, normalized, indexed schema, and extend that schema as necessary when you add new properties to your domain objects.

Please do not attempt to "serialize" objects to a SQL database. If that is really what you want to do, you are better off using an object database such as db4o instead.
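
For contrast, a hedged sketch of what the db4o route looks like (classic Db4oFactory API; SearchEntry is an illustrative POCO with Title and DateAdded properties, not from the question):

    using System;
    using Db4objects.Db4o;

    using (IObjectContainer db = Db4oFactory.OpenFile("entities.db4o"))
    {
        // Objects are stored as objects; no mapping or serialization step.
        db.Store(new SearchEntry { Title = "Hello", DateAdded = DateTime.UtcNow });

        // Native query; db4o materializes matching objects directly.
        IList<SearchEntry> recent = db.Query<SearchEntry>(
            e => e.DateAdded > DateTime.UtcNow.AddDays(-7));
    }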


Update:

This is how I currently understand the problem space, based on comments and question updates:

  • The table structure is already normalized;
  • A fulltext engine (Sphinx) is being used to optimize certain searches;
  • The specific "serialized" data being discussed here will be used as summaries or previews for individual search results, and does not represent the complete object data.

My choices for this, in order of preference, would be:

  • Use the FTS engine's features.
    Almost every FTS engine, including Sphinx, allows custom attributes to be stored as part of each "document." Presently you say you are only storing the Row ID so you can join to this table. Your results will come back a lot faster if you don't join at all, and instead keep this information inside the fulltext index itself. There are some pretty hard limits to what you can put in here, but if you can work around the limitation, it's your best option.

  • Document-Oriented Database.
    You say you're not even really using the "Full-Text" part of Sphinx; you're just using it to optimize certain queries. Why not cut out the middleman, then? You suggested JSON as a serialization format; MongoDB (to cite just one option) supports BSON natively. You can still create indexes on the common columns, but unlike MySQL, it actually understands the BSON format and is able to store that data far more efficiently than a JSON or XML string in a relational database. If you're denormalizing anyway, you have the freedom to choose any repository you want; choose the one that's best optimized for your particular requirement.

  • Single-table inheritance.
    This is a common design that trades normalization away for mapping simplicity. In your case, the entire objective is denormalization, so it's a good trade. This isn't a good option if there will be hundreds of columns, but for 10 or 20 it will be fine; it keeps your data as "data" and shouldn't impact performance in any significant way.

  • XML columns.
    The advantage of this approach is that the data is not opaque; it's actually meaningful in the context of the database. If you must store this information in the MySQL database - presumably you expect some ad-hoc queries to be run - then you might as well store it in a format that MySQL can actually understand. On the other hand, if you're 100% positive that you'll never need to "deserialize" this data until it hits your application, then I would probably go with...

  • Custom binary serialization format.
    If you must store the data in your MySQL database and you know for sure that you'll never need to index it or even read its contents from within a query, then don't waste precious I/O on a bloated text encoding. Even JSON is bloated compared to binary, because JSON has to store all the property names; if you do your own serialization, you can use one or two bytes to determine the type and then deserialize the remaining fields/properties in a known order. Just the data, no metadata.

    I wouldn't even use the .NET BinaryFormatter here; I would create my own heavily-optimized version. After all, this needs to be fast, fast, fast! Every extra byte that goes into your table makes the query slower. You might even be able to GZip-compress some of the data, depending on exactly what's in there.
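
    A minimal sketch of that last idea, using BinaryWriter/BinaryReader from the BCL (the type tag and field layout are illustrative, not a prescribed format):

        using System;
        using System.IO;

        // One type-tag byte, then fields in a fixed order: no property names stored.
        static byte[] SerializeArticle(int authorId, DateTime published, string title)
        {
            using (var ms = new MemoryStream())
            using (var w = new BinaryWriter(ms))
            {
                w.Write((byte)1);               // 1 = "article" (hypothetical tag)
                w.Write(authorId);
                w.Write(published.ToBinary());  // DateTime packed into a single long
                w.Write(title);                 // length-prefixed UTF-8 string
                return ms.ToArray();
            }
        }

        static object DeserializeEntry(byte[] blob)
        {
            using (var r = new BinaryReader(new MemoryStream(blob)))
            {
                byte tag = r.ReadByte();        // dispatch on the type tag
                // Fields come back in the same fixed order they were written.
                int authorId = r.ReadInt32();
                DateTime published = DateTime.FromBinary(r.ReadInt64());
                string title = r.ReadString();
                return new { authorId, published, title };
            }
        }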


Unless I've still misunderstood your requirements, I wouldn't even consider any other options.

Aaronaught
I am using Entity Framework extensively. But I'm using this special (denormalised) table to gain fulltext search speed.
Robert Koritnik
I'm aware of this. I added additional info in the question that may be useful to all of you trying to help me. You can as well suggest something better.
Robert Koritnik
@Robert: If this is only for full-text search, do you really need *serialized* data, or do you just need keywords extracted from that data? In other words, are you ever going to *deserialize*?
Aaronaught
@Aaronaught: I was planning to include all common fields (ID, last change date, title/subject, content, image link, entity type, and an additional field where I would save a few other columns that are needed for the different types of entities stored here). The serialized column won't be included in the search, but will be there to avoid loading data from the normalized tables. The serialized column also won't have all the possible columns, only the required ones that I can't live without when displaying search result summaries. And these results have to work **very fast** and should be **paged** as well.
Robert Koritnik
@Robert: So the "serialized" field is not indexed (FTS or otherwise); it is just used to display a "summary" for the FTS result? Is that correct? What is actually being indexed, and do you need to be able to keep the contents of the "summary" field in sync with the other columns?
Aaronaught
@Aaronaught: Correct, the serialized field is not indexed. My denormalized table would have an auto-increment RowId, and it would store the actual entity ID related to each record. I would index just the actual entity IDs and search for results related to them, based on and ordered by change date. I will use Sphinx indexing just for the purpose of very fast paged row retrieval. It won't actually be FTS, because my search will be based on IDs only. My table won't update its records in any way; only new ones will be added. So it will be like some sort of activity auditing table. Does this answer your question?
Robert Koritnik
@Aaronaught: I've explained my situation a bit more in my question.
Robert Koritnik
A: 

I actually make use of text blobs in my RDBMS all the time. When used for the right purpose, they can have a positive impact on performance, and they save you from creating and maintaining many extra tables, as well as development time. They're ideal when you need to store frequently changing, non-relational metadata about a row.

Though I would only consider using text blobs for KVO objects (i.e. non-entities: objects that are only meaningful together with the row they're persisted with). Also, don't bother if you need to do any server-side manipulation of the data, i.e. querying etc.

For those that are interested I've developed a fast, resilient Type Serializer that is ideal for storing text-blobs in a compact, human-readable text-format.
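
Assuming this refers to ServiceStack.Text's TypeSerializer and its JSV format (an assumption on the editor's part), usage looks roughly like this:

    using System.Collections.Generic;
    using ServiceStack.Text;

    var metadata = new Dictionary<string, string>
    {
        { "Author", "Robert" },
        { "Pages", "12" }
    };

    // Compact, human-readable JSV text, suitable for a blob column.
    string jsv = TypeSerializer.SerializeToString(metadata);

    var restored = TypeSerializer.DeserializeFromString<Dictionary<string, string>>(jsv);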

mythz