I want to implement searching for and highlighting multiple phrases in HTML files in a Java desktop application, the way web browsers do it: most HTML tags (everything between < and >), such as <b>, are ignored, but some tags, such as <p>, are not. For example, when searching for "each table", the text ...each <b>table</b> has name... should be highlighted, but the text ...has each</p><p> Table is... should not be, because the <p> tag interrupts the meaning of the text.
Web browsers implement this somehow. How can I get at that implementation, or is there some source for it on the net? I tried Google, but without success :(

+2  A: 

Browsers do not search inside the actual HTML file; they search the rendered output of that HTML.

Get a suitable HTML renderer and obtain its output as text. Then search that text output with an ordinary string-searching algorithm.

In the second example from your question, the </p><p> boundary would become a newline character in the rendered output, so a normal string-searching algorithm will behave as you expect.
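As a rough illustration of the idea, here is a minimal sketch that fakes the "rendered output" with two regular expressions instead of a real renderer. The list of flow-breaking tags and the class and method names are my own assumptions, and regexes will not cope with scripts, comments, or entities, so treat it as a starting point only.

```java
import java.util.Locale;
import java.util.regex.Pattern;

public class RenderedTextSearch {
    // Assumption: a hand-picked set of tags that interrupt the text flow.
    private static final Pattern BREAKING_TAG = Pattern.compile(
            "(?i)</?(p|div|br|li|ul|ol|tr|td|th|table|h[1-6])\\b[^>]*>");
    // Every remaining tag (for example <b>) is simply dropped.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]*>");

    /** Crude stand-in for rendering: breaking tags become newlines, other tags vanish. */
    public static String toText(String html) {
        String text = BREAKING_TAG.matcher(html).replaceAll("\n");
        return ANY_TAG.matcher(text).replaceAll("");
    }

    /** Case-insensitive phrase search on the flattened text. */
    public static boolean contains(String html, String phrase) {
        String text = toText(html).toLowerCase(Locale.ROOT);
        return text.contains(phrase.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // prints true: <b> is ignored
        System.out.println(contains("...each <b>table</b> has name...", "each table"));
        // prints false: </p><p> turns into newlines between "each" and "Table"
        System.out.println(contains("...has each</p><p> Table is...", "each table"));
    }
}
```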

Faisal Feroz
+1 Thanks, this is the best answer so far, but I want an algorithm that does this in a desktop app... I can't believe nobody has ever tried this :)
Zavael
A: 

This seems pretty easy.

1) Search for the last word of the phrase.
2) Look at what comes before that word.
3) Decide whether what comes before it constitutes an interruption (<p>, <br />, <div>).
4) If it is an interruption, this occurrence is not a match; continue with the next one.
5) Otherwise, compare the previous word against the search query.

I don't know if this is how browsers perform this operation, but this approach should work.
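A possible sketch of this, with one change I should name plainly: it scans forward once instead of walking backwards from the last word. Ignored tags are skipped, interrupting tags become a barrier character, and every text character remembers its offset in the original HTML so that a match can be highlighted there. The tag set and all names are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class HtmlPhraseFinder {
    // Assumption: tags that count as an interruption; all other tags are ignored.
    private static final Set<String> INTERRUPTING =
            Set.of("p", "br", "div", "li", "tr", "td", "th", "h1", "h2", "h3");

    /** Returns {start, end} offsets into the original HTML for each match (end exclusive). */
    public static List<int[]> find(String html, String phrase) {
        StringBuilder text = new StringBuilder();
        List<Integer> source = new ArrayList<>(); // source.get(i) = HTML offset of text char i
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            int close = (c == '<') ? html.indexOf('>', i) : -1;
            if (close >= 0) {
                if (INTERRUPTING.contains(tagName(html.substring(i + 1, close)))) {
                    text.append('\n'); // barrier that no phrase can match across
                    source.add(i);
                }
                i = close + 1;         // skip the whole tag
            } else {
                // Lowercase per character so text and source stay the same length.
                text.append(Character.toLowerCase(c));
                source.add(i);
                i++;
            }
        }
        List<int[]> hits = new ArrayList<>();
        String needle = phrase.toLowerCase(Locale.ROOT);
        if (needle.isEmpty()) {
            return hits;
        }
        for (int at = text.indexOf(needle); at >= 0; at = text.indexOf(needle, at + 1)) {
            int last = at + needle.length() - 1;
            hits.add(new int[] { source.get(at), source.get(last) + 1 });
        }
        return hits;
    }

    /** Extracts the lowercase tag name from the text between '<' and '>'. */
    private static String tagName(String inside) {
        String s = inside.trim().toLowerCase(Locale.ROOT);
        if (s.startsWith("/")) {
            s = s.substring(1);
        }
        int end = 0;
        while (end < s.length() && Character.isLetterOrDigit(s.charAt(end))) {
            end++;
        }
        return s.substring(0, end);
    }
}
```

For ...each <b>table</b> has name... and the phrase "each table" this returns one pair, {3, 16}, which covers each <b>table in the source. Note that such a span can start or end inside an element, so the highlighting code still has to deal with tag nesting itself.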

babbitt
So you suggest "splitting" the HTML text into pure-text parts and then searching within those parts? Or did I misunderstand you?
Zavael
A: 

Try using the javax.swing.text.html package in Java.
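For instance, something along these lines might work: the Swing parser reports text and tags through a callback, and HTML.Tag.breaksFlow() says whether a tag interrupts the text flow. This is only a sketch under that assumption. The class name and the choice to emit a newline for every flow-breaking tag are mine, and the Swing parser only understands fairly old HTML.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class SwingHtmlText {
    /** Extracts the text of an HTML string, adding a newline at each flow-breaking tag. */
    public static String extract(String html) {
        StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                out.append(data);
            }

            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                mark(t);
            }

            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                mark(t);
            }

            @Override
            public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                mark(t); // e.g. <br>
            }

            // Tags such as <p> or <div> break the flow; <b> or <i> do not.
            private void mark(HTML.Tag t) {
                if (t.breaksFlow()) {
                    out.append('\n');
                }
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Not expected for a StringReader; rethrown unchecked to keep callers simple.
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

The returned string can then be searched with String.indexOf or any other plain string search. The pos arguments passed to the callback are offsets into the source, so they could be recorded if the match has to be mapped back to the HTML for highlighting.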

Kuri