I am maintaining a simple PHP-based in-house CMS. As articles are saved into the system, I'd like to search their text for what will eventually be tens of thousands of different tokens, in order to automatically apply links to those tokens and to record a relationship in an association table between the article and the entity each token represents.

What is the best way to do this? Is there a faster/more efficient way to do it than to retrieve a list of all of the tokens and their relevant entity/id every time an article is saved?

I'm less interested in the replacement of the tokens than in the best way to build the list of tokens to search for. They will come from several different tables, and I would think that querying the full data set on every request would be quite a burden on both the DB and the script's memory.

Edit: I think I've posed the question incorrectly.

Consider the following text:

Steve McMuffin ate seventeen Fabulous Furry Fajitas at The Stinking Bean, while Johnson Fatlumps ate thirty-two.

I've got two people in there who are both in the 'person' table, one restaurant which is in the 'restaurant' table and one restaurant menu item which is in the 'restaurant_menu_item' table.

I want to know the best way, after that text is saved, to automatically go through and identify what is a person, what is a restaurant, and what is a restaurant menu item without resorting to custom markup, as the intended audience has virtually no chance of ever getting that right.

A: 

We had a similar situation. We ended up using regular expressions for the parsing and replacement of the tokens. Because the original article was a template from which we generated new articles with the tokens replaced, we cached each generated article, so an unchanged template meant no new parsing.

Joshua Belden
+1  A: 

This is always going to be difficult (computationally, anyway) unless you can get some guarantee of the token format. Without markup or a format it can be taught to recognize, the computer has no way of knowing that a particular string of characters has special meaning.

The "simple" answer is to loop through the text for each token, see if it's there, and handle it. But you'll have two issues: computation time, and collisions (as Chad pointed out in his comment).
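
To make both issues concrete, here is a minimal sketch of that loop, written in Python purely for illustration. The function name, the shape of the token table and the "longest match wins" rule for collisions are my assumptions; other collision rules are possible:

```python
def find_tokens(text, tokens):
    """Naive scan: look for every token in turn, then resolve overlaps.

    tokens maps token text -> (entity type, id); the shape is hypothetical.
    Returns a list of (start, end, token) tuples sorted by position.
    """
    hits = []
    for token in tokens:  # one full pass over the text per token
        start = text.find(token)
        while start != -1:
            hits.append((start, start + len(token), token))
            start = text.find(token, start + 1)

    # Collision handling: consider longer matches first and drop any hit
    # that overlaps one already accepted.
    hits.sort(key=lambda h: (-(h[1] - h[0]), h[0]))
    accepted = []
    for hit in hits:
        if all(hit[1] <= a[0] or hit[0] >= a[1] for a in accepted):
            accepted.append(hit)
    return sorted(accepted)
```

The cost grows with (number of tokens) x (length of the text), which is where the computation time goes once you have tens of thousands of tokens. The overlap check is what stops a short token such as "Steve" from being linked inside "Steve McMuffin".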

Is there a very simple markup you can enforce? MediaWiki only creates internal links if a phrase is surrounded by [[brackets]]. Lots of wiki software will only make links if you CamelCaseThePhrase.

I can't think of a way for the application to automagically know certain character groups have meaning without checking every defined token or enforcing some kind of format.

Are you sure your audience can't handle something like

SteveMcMuffin ate seventeen FabulousFurryFajitas at
TheStinkingBean, while JohnsonFatlumps ate thirty-two.

or

[[Steve McMuffin]] ate seventeen [[Fabulous Furry Fajitas]] at
[[The Stinking Bean]], while [[Johnson Fatlumps]] ate thirty-two.
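
If bracket markup like the second example were acceptable, the parsing side becomes cheap: you only look up the handful of names the author marked, not every defined token. A minimal sketch in Python, where the entity lookup, the URL layout and the function name are all hypothetical:

```python
import html
import re

# Matches [[Some Name]] and captures the name between the brackets.
LINK = re.compile(r"\[\[([^\[\]]+)\]\]")


def extract_links(text, entities):
    """Return (linked_text, associations) for [[Name]] markup.

    entities maps an exact name -> (entity type, id); in practice this
    would be a database lookup for just the names found in the text.
    """
    associations = []

    def replace(match):
        name = match.group(1).strip()
        entity = entities.get(name)
        if entity is None:
            return html.escape(name)  # unknown name: drop the brackets
        associations.append(entity)
        kind, ident = entity
        # The /type/id URL layout is made up for this example.
        return '<a href="/{}/{}">{}</a>'.format(kind, ident, html.escape(name))

    return LINK.sub(replace, text), associations
```

The associations list is what you would write to the association table after saving the article.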
James Socol
Almost certain, unfortunately. If they could handle stuff like that though, I'd have a lot less work to do :)
Shabbyrobe