Hi
I want to process a String and find multiple phrases in it; I am trying to build a highlighter for HTML text in Java.
Example:
Find and process the phrases "table", "row" and "primary key" in "Each table row contains a primary key column".
The text is HTML with tags like <b>, <img..>, ...
If there is an ignorable tag in the middle of a phrase, e.g. primary <b>key</b>, the phrase can still be replaced. (An ignorable tag is one that does not interrupt the meaning of the text, like <b> or <i>; a tag like <div>, on the other hand, does interrupt it.)
If one phrase is a subphrase of another phrase, the longer one has higher priority. E.g. when searching for "table row" and "row contains" in the text above, the second one should be processed.

My first pseudocode was something like this:

for (each phrase)  
  while(text.hasNext(phrase)) do
    processPhraseInText(text,phrase)
  end-while
end-for

It worked, but the text was traversed phrases.count() times, and I am looking for a faster way to process all the phrases at once.

I want to try regular expressions with Pattern and Matcher for this.
I came up with two ways: 1. create one regular expression for all the phrases, looking something like regex1|regex2|..|regexN, or 2. create one regex (and one Matcher object) for every phrase.

Which would be the better way? Or is there a totally different way, or an existing library, for this? Isn't the second way, with many matchers, the same as the solution I already have?

A: 

If you make a regex for each phrase, you still have to loop over the text once per phrase.

If you make one regex regex1|regex2|..|regexN, you can search through the text in one pass. This would be faster with many phrases.
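
A minimal sketch of the single-pattern idea in Java could look like the following. The class name PhraseHighlighter, the <mark> wrapper and the case-insensitive flag are my own assumptions, not something taken from your code. Phrases are sorted longest first so that, when two phrases start at the same position, the longer one wins. It does not handle tags inside a phrase, and it does not resolve overlaps that start at different positions (for "table row" vs. "row contains" it would pick the leftmost match).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PhraseHighlighter {

    // Builds one pattern of the form p1|p2|...|pN from literal phrases.
    // Longer phrases come first, so they are tried first at each position.
    static Pattern buildPattern(List<String> phrases) {
        List<String> sorted = new ArrayList<>(phrases);
        sorted.sort(Comparator.comparingInt(String::length).reversed());
        StringBuilder sb = new StringBuilder();
        for (String p : sorted) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append(Pattern.quote(p)); // treat the phrase as literal text
        }
        return Pattern.compile(sb.toString(), Pattern.CASE_INSENSITIVE);
    }

    // Wraps every match in <mark>...</mark> in a single pass over the text.
    static String highlight(String text, List<String> phrases) {
        Matcher m = buildPattern(phrases).matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = "<mark>" + m.group() + "</mark>";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Called with the phrases "table", "row" and "primary key" on your example sentence, this wraps each of the three phrases once. Whether it is actually faster than repeated indexOf calls depends on the number of phrases and the text size, so it is worth measuring.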

Sjoerd
Thanks, you reassured me :) Could you tell me whether the one-matcher approach will be faster in general than the solution I already have? I currently use indexOf for each phrase; the simple pseudocode is posted, I know it's not much.
Zavael
You should profile if it's about performance. I think I read somewhere that alternations in regexes are pretty slow, but I'm no regex guru.
atamanroman
A: 

You can easily do it in one pass. You don't need keywords, since HTML is a tag based language, but let's say you wanted to colour based on keywords anyway.

Store all your keywords in a Trie

Foreach character
  If character is not < send to output 
  If character is <
      Read until you get > (or ' ' if you want to deal with attributes too)
      If prefix is in Trie, colour appropriately and add to output 
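
A rough Java sketch of the trie idea follows. Note that it is an adaptation rather than a literal translation of the pseudocode above: it copies tags through unchanged and looks up the text between tags in the trie, on the assumption that the keywords are phrases in the text. The names TrieScanner, Node and the <mark> wrapper are made up for illustration. It matches plain substrings (no word boundaries) and does not handle tags in the middle of a phrase.

```java
import java.util.HashMap;
import java.util.Map;

public class TrieScanner {

    static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean terminal; // true if a keyword ends at this node
    }

    private final Node root = new Node();

    void add(String keyword) {
        Node n = root;
        for (char c : keyword.toCharArray()) {
            n = n.next.computeIfAbsent(c, k -> new Node());
        }
        n.terminal = true;
    }

    // Length of the longest keyword starting at text[from], or 0 if none.
    int longestMatch(String text, int from) {
        Node n = root;
        int best = 0;
        for (int i = from; i < text.length(); i++) {
            n = n.next.get(text.charAt(i));
            if (n == null) {
                break;
            }
            if (n.terminal) {
                best = i - from + 1;
            }
        }
        return best;
    }

    String highlight(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            if (text.charAt(i) == '<') {
                // Copy the whole tag verbatim (or the rest of the text if '>' is missing).
                int end = text.indexOf('>', i);
                if (end < 0) {
                    end = text.length() - 1;
                }
                out.append(text, i, end + 1);
                i = end + 1;
                continue;
            }
            int len = longestMatch(text, i);
            if (len > 0) {
                out.append("<mark>").append(text, i, i + len).append("</mark>");
                i += len;
            } else {
                out.append(text.charAt(i++));
            }
        }
        return out.toString();
    }
}
```

Because longestMatch keeps walking after the first hit, a longer keyword beats a shorter one that starts at the same position, and the text is still read in one pass from left to right.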
Winston Smith
Maybe I didn't understand, but if by keywords you meant my phrases, then I do need them, because I want to highlight the phrases, not the HTML tags.
Zavael