views: 810
answers: 5

Hello

I am currently in the process of designing myself a database-driven website. The main reason is for learning purposes, but I won't lie, there is a small amount of vanity included!

While I believe that my database design is pretty good so far, I am still not entirely sure of the best way of storing articles or other large texts. I know most DBMSs have the TEXT datatype or equivalent, which can hold a massive amount of text. However, storing a full article as one long string makes for unhappy reading, so formatting is going to be needed.

Do I store the article text along with all of the HTML or BBCode tags, or is it better to simply create the page as an HTML or XML document and store the path to this file in the DB?

I quite like the idea of storing articles as XML documents, as I can easily mark up an article with custom tags and use PHP's XML and XSLT functions to transform the XML to HTML [or indeed, any other format]. It also allows the author to dictate where to place line/page breaks. This approach would of course require extra coding [which I am not afraid of], but it does present a problem with making articles searchable.
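The custom-tags-to-HTML transform described above can be sketched in a few lines. This is a minimal illustration in Python rather than the PHP/XSLT route the asker mentions, and the tag names (`article`, `title`, `para`) are invented for the example:

```python
import xml.etree.ElementTree as ET

# Hypothetical custom article markup; the tag names are invented for illustration.
ARTICLE_XML = """
<article>
  <title>Storing large texts</title>
  <para>First paragraph.</para>
  <para>Second paragraph.</para>
</article>
"""

# Map each custom tag to the HTML element it should become.
TAG_MAP = {"title": "h1", "para": "p"}

def article_to_html(xml_text: str) -> str:
    """Transform the custom markup into plain HTML, one element per child."""
    root = ET.fromstring(xml_text)
    parts = []
    for child in root:
        html_tag = TAG_MAP.get(child.tag, "div")
        parts.append(f"<{html_tag}>{child.text.strip()}</{html_tag}>")
    return "\n".join(parts)

print(article_to_html(ARTICLE_XML))
```

An XSLT stylesheet would do the same mapping declaratively; the point is only that the stored XML stays presentation-neutral while the transform decides the output format.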

I know MySQL, for example, has SQL syntax for searching for specific terms/phrases inside strings held in a text field. If I were to store text in separate files, how might I approach making these articles searchable?

There is quite a lot I have written here on such a simple question, so I will break it down:

1: Is there a "best" way of storing large amounts of formatted text directly in a database or
2: is it better to hold paths to that text in the form of HTML/XML/Whatever documents.

If 2, is there an elegant way of making that text searchable?

Thank you for your time :)

+2  A: 

Depending on how you have arranged and installed everything, it can be hard to access outside files from remote clients that can access the DB just fine -- so why not save all of the XML into one TEXT field instead? You can refactor things to optimize that later if the DB engine can't handle that load well, but that's the easiest way to get started.

Alex Martelli
+2  A: 

The TEXT, MEDIUMTEXT, LONGTEXT and similar data types were created to store large amounts of text (64 KB to 4 GB, depending on the RDBMS). The engine typically stores a pointer in the row and keeps the text itself outside the table, rather than inline. That is almost the same procedure as storing a path in a VARCHAR field to locate a document, but keeping the text in the database makes it easier to maintain: if you delete the row, the document disappears with it, with no separate cleanup step (as there would be if you stored it as a file). Naturally this makes your database bigger and sometimes harder to back up and transport, but transporting the documents one by one would be tedious and slower.

As you can see, it depends on the number of documents and rows in the database.
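The in-the-database approach described above can be demonstrated in miniature. A sketch using Python's sqlite3 (the table and column names are invented for illustration) shows the maintenance benefit: the text shares the row's lifecycle, so there is no orphaned file to clean up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The article body lives in a TEXT column, so it shares the row's lifecycle.
conn.execute("CREATE TABLE article (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO article VALUES (1, 'A long article body...')")

# Deleting the row removes the text with it -- no separate file-deletion step,
# unlike the store-a-path-in-VARCHAR approach.
conn.execute("DELETE FROM article WHERE id = 1")
count = conn.execute("SELECT COUNT(*) FROM article").fetchone()[0]
print(count)  # 0
```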

For the searching procedure, I recommend creating a separate "keywords" field to speed up your searches. You can also search within the first n characters of each document by casting it to CHAR or VARCHAR, and place the title and subtitle within that range if they don't already have fields of their own.

backslash17
+4  A: 

Store everything in one big text field, as Alex suggested. For searching, don't hammer your database; use Lucene or htdig to create an index of your output. This way searches are very fast. A side effect is that you make your pages a little more search-engine friendly: take your keywords field (as backslash17 suggested) and stick the keywords in the meta-keywords attribute.

Edit

Unless you are only searching keywords, having the DB do the searches will be horribly slow (ever searched a forum and it takes FOREVER?). There is no way for the database to index a

  SELECT ... WHERE fulltext_field LIKE '%cookies%'

It is frustrating to look for an article and have the search miss the results you are looking for because the terms weren't in the keyword field! Htdig lets you search the full text of the article efficiently. Your searches will come back instantly, and EVERY term in the article is fully searchable. Putting the keywords in the meta tags will push searches on those terms higher up the results page.
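The contrast above is between a linear scan (`LIKE '%term%'` cannot use an ordinary index) and a dedicated full-text index. A hedged sketch using Python's sqlite3 with an FTS5 virtual table (assuming your SQLite build includes FTS5; MySQL's `MATCH ... AGAINST` fulltext indexes are the analogous feature there, and all table and column names here are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# FTS5 builds an inverted index over the text, so term lookups
# do not scan every row the way LIKE '%cookies%' must.
conn.execute("CREATE VIRTUAL TABLE articles USING fts5(title, body)")
conn.executemany(
    "INSERT INTO articles (title, body) VALUES (?, ?)",
    [
        ("Baking", "A recipe that uses cookies and cream."),
        ("Networking", "HTTP cookies store session state."),
        ("Gardening", "Nothing relevant here."),
    ],
)

# MATCH consults the index; results can be ordered by relevance rank.
rows = conn.execute(
    "SELECT title FROM articles WHERE articles MATCH 'cookies' ORDER BY rank"
).fetchall()
print([r[0] for r in rows])
```

External indexers like Lucene or ht://Dig take the same inverted-index idea further, adding stemming and fuzzy matching on top.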

Another benefit is fuzzy matching. If you search for 'activate', htdig will match pages that contain active, activation, activity, etc. (configurable). And if the user misspells a word, it will still be matched. You want your users to have a Google-like experience, not an annoying one. :)

You do need a script to create a list of links to all your pages from your database. Have htdig crawl this automatically and you never have to think about it again.
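The link-list script mentioned above is simple to sketch: query every article ID and emit one page of links for the crawler to start from. This Python/sqlite3 version is an illustration only; the `/article.php?id=` URL scheme and table names are assumptions, not anything from the thread:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO article VALUES (?, ?)",
                 [(1, "First post"), (2, "Second post")])

# Emit a bare HTML page linking every article so the crawler can reach them all.
links = [
    f'<a href="/article.php?id={row[0]}">{row[1]}</a>'
    for row in conn.execute("SELECT id, title FROM article ORDER BY id")
]
page = "<html><body>\n" + "\n".join(links) + "\n</body></html>"
print(page)
```

Serve the generated page at a fixed URL, point ht://Dig's start URL at it, and re-crawl on a schedule.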

Also, htdig will crawl your non-database pages as well, so your whole site is searchable through the same simple interface.

As for the keyword field, you should have a separate table called keywords with the id of the article and a keyword column (one keyword per row). But for simplicity, having a single field in the DB isn't a terrible idea; it makes updating the keywords pretty easy if you put it in a form.

If you don't want to fuss with all that hassle, you can try using Google Custom Search. It is far less work, but you have no guarantee that all your pages will get indexed.

Good luck!

Byron Whitlock
Wow, thank you Byron. Your edit was a big help, and ht://Dig doesn't seem quite as restrictive as I thought. For a really small site, having the DB do the search isn't a huge deal, I'm sure, but I am hoping that if my project works for my site I can reuse it for others, and it would be nice if it were scalable. But that's for the future; for the present I need to actually code the thing :)
Peter Spain
A: 

First of all sorry for taking a while to get back to this topic - had some unrelated computer woes!

Thank you all for your input. I am going to raise a few points here rather than comment on each of your replies.

I should start by saying I will most likely put the text directly in the database. As has been said, holding a path to a separate document is not really different from what a DB would do anyway. The idea of a separate document was to speed up queries, but no matter where I do a search it will take up resources.

I took a look at the two indexing applications suggested by Byron, but I must admit I am not sure how I would implement them. Lucene seems to go out of its way to be horribly vague on how to use it, and ht://Dig appears not to support indexing a database directly. One contributor suggests having an index page that links to every record in the database, allowing the spider to index everything. See the ht://Dig FAQ.

As far as I understand it, that isn't really any different from having my DBMS carry out the searches, since it has its own indexes; something like ht://Dig et al. just adds more overhead and maintains a duplicate index. If I am wrong in my understanding then please correct me!

Coming to the keyword field idea, isn't having a comma-separated list [or any list for that matter] in a single field bad database design? I am by no means an expert, but from what I have read and been taught, the data in any single cell should be atomic and redundancy should be avoided. A list goes against both of those, I feel.
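The asker's instinct matches first normal form: one keyword per row in a separate table, rather than a comma-separated list. A minimal sqlite3 sketch of that normalized design (table and column names are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE article (
    id   INTEGER PRIMARY KEY,
    body TEXT NOT NULL
);
-- One keyword per row: each value stays atomic, and the composite
-- primary key prevents duplicate (article, keyword) pairs --
-- unlike a comma-separated list crammed into a single column.
CREATE TABLE article_keyword (
    article_id INTEGER NOT NULL REFERENCES article(id),
    keyword    TEXT NOT NULL,
    PRIMARY KEY (article_id, keyword)
);
""")
conn.execute("INSERT INTO article (id, body) VALUES (1, 'On storing large texts...')")
conn.executemany(
    "INSERT INTO article_keyword (article_id, keyword) VALUES (?, ?)",
    [(1, "database"), (1, "storage"), (1, "fulltext")],
)

# Find articles tagged with a given keyword via a plain join --
# no string-splitting or LIKE tricks needed.
rows = conn.execute(
    "SELECT a.id FROM article a "
    "JOIN article_keyword k ON k.article_id = a.id "
    "WHERE k.keyword = ?",
    ("storage",),
).fetchall()
print(rows)  # [(1,)]
```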

Again, thank you for your input and ideas. Any further help would greatly be appreciated.

Peter Spain
A: 

Take a quick look at native XML DBs. There are several, and some very good ones are free.

Search for eXist, Documentum xDB, and Oracle Berkeley DB XML.

If you are persisting, querying and updating semi-structured text, and the structure has any depth at all, you are almost certainly doing it the hard way if you stick with either the RDB-of-pointers or the stuff-it-in-a-blob technique -- though there are many external reasons why these architectures can be necessary and successful.

Do a little reading on XPath and XQuery before you commit to a design. Here's a good place to start: https://community.emc.com/community/edn/xmltech