You're asking for several related, but discrete, abilities:
- Search and Replace content
- Search and Replace formatting
- Search and Replace similar (i.e., ignore trivial differences in whitespace)
You should take this in steps. Otherwise it becomes overwhelming, and a single search algorithm won't be able to do all three without intense effort and difficult-to-maintain code.
First, look at the "similar" problem. Make a search that ignores whitespace and case. You might want to get into Lucene or another search engine technology if you also need to deal with "bowl" vs "bowls" and "intelligent" vs "smart", though I expect this is beyond your current needs.
Once you have that working, it becomes one layer in your stack of searches.
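As a rough illustration of that first layer, here is a minimal Python sketch. The function names `similar_pattern` and `similar_replace` are made up for this example, and it assumes "similar" means only "case-insensitive, with any run of whitespace treated as equivalent". It does nothing about stemming or synonyms.

```python
import re

def similar_pattern(needle: str) -> re.Pattern:
    """Build a regex that matches needle regardless of case and of how
    much whitespace separates its words."""
    words = needle.split()
    if not words:
        raise ValueError("search text must contain at least one word")
    # Escape each word so regex metacharacters in the needle stay literal,
    # then allow any run of whitespace between consecutive words.
    return re.compile(r"\s+".join(re.escape(w) for w in words), re.IGNORECASE)

def similar_replace(text: str, needle: str, replacement: str) -> str:
    """Replace every 'similar' occurrence of needle in text."""
    # A function replacement keeps backslashes in `replacement` literal.
    return similar_pattern(needle).sub(lambda m: replacement, text)

print(similar_replace("The  Quick\nbrown fox", "quick brown", "slow red"))
```

Because the pattern is matched against the original text, the match positions can be used directly for the replace step, which is harder if you normalize the text first and search the normalized copy.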
Second, look at the formatting search. This is typically done using tokens or tags, which you already have in the form of HTML. However, you have to be able to deal with tags out of sequence, so <b><i>text</i></b> needs to be caught in a search for <i><b>text</b></i>, as does the malformed representation where tags aren't nested properly, such as <b><i>text</b></i>.
One method is to pre-parse the string and apply the formatting styles to each character. So you'd have a t that's bold and italic, an e that's bold and italic, etc. To make this easier and faster, use a hash to represent each style combination: read the first character and figure out what style it has (keep track of this by turning styles on and off as you find tags). If that style combination already exists in the hash, assign its hash number to the letter. If it doesn't, get a new hash number and assign that.
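A sketch of that pre-parse step in Python, using the standard library's `html.parser`. The names `STYLE_TAGS`, `StyleAnnotator`, and `annotate` are invented for illustration, and the set of tags treated as formatting is an assumption you would adjust. Styles are tracked as a set of currently open tags, so the order of nesting (and even improper nesting) doesn't change the result.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumption: only these tags count as formatting; extend as needed.
STYLE_TAGS = {"b", "i", "u", "em", "strong"}

class StyleAnnotator(HTMLParser):
    """Tag every character with a number identifying its style combination."""

    def __init__(self, style_ids):
        super().__init__(convert_charrefs=True)
        self.style_ids = style_ids    # shared table: frozenset of tags -> int
        self.open_counts = Counter()  # how many times each tag is currently open
        self.chars = []               # resulting (character, style number) pairs

    def handle_starttag(self, tag, attrs):
        if tag in STYLE_TAGS:
            self.open_counts[tag] += 1

    def handle_endtag(self, tag):
        # Ignore stray closing tags that were never opened.
        if self.open_counts[tag] > 0:
            self.open_counts[tag] -= 1

    def handle_data(self, data):
        combo = frozenset(t for t, n in self.open_counts.items() if n > 0)
        # Reuse the number for a known combination, otherwise hand out the next one.
        style_id = self.style_ids.setdefault(combo, len(self.style_ids))
        self.chars.extend((ch, style_id) for ch in data)

def annotate(html, style_ids):
    """Return a list of (character, style number) pairs for an HTML fragment."""
    parser = StyleAnnotator(style_ids)
    parser.feed(html)
    parser.close()
    return parser.chars

table = {}  # pass the same table to every call so the numbers are comparable
a = annotate("<b><i>text</i></b>", table)
b = annotate("<i><b>text</b></i>", table)
c = annotate("<b><i>text</b></i>", table)
print(a == b == c)
```

Sharing one `table` between the document and the search string is what lets you compare style numbers directly instead of comparing tag sets for every character.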
Now you can compare the letter and its style hash against your search and get format and content matches. Stack that on top of your similar match and you have what you need.
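To show how the layers might stack, here is a small Python sketch that searches a sequence of (character, style number) pairs while still ignoring case and whitespace differences. `normalize` and `find_styled` are hypothetical names, and ignoring the style of whitespace is a design assumption of this sketch, not something your requirements necessarily imply.

```python
def normalize(pairs):
    """Fold case and collapse whitespace runs, remembering original positions."""
    out = []  # (character, style, index in the original sequence)
    for i, (ch, style) in enumerate(pairs):
        if ch.isspace():
            if out and out[-1][0] == " ":
                continue  # skip the rest of a whitespace run
            # Assumption: the style of whitespace is ignored when matching.
            out.append((" ", None, i))
        else:
            out.append((ch.casefold(), style, i))
    return out

def find_styled(haystack, needle):
    """Return original start indexes where needle matches in content and style."""
    hay = normalize(haystack)
    keys = [(c, s) for c, s, _ in hay]
    want = [(c, s) for c, s, _ in normalize(needle)]
    n = len(want)
    if n == 0:
        return []
    # Naive scan; fine for a sketch, swap in a faster algorithm if needed.
    return [hay[i][2] for i in range(len(keys) - n + 1) if keys[i:i + n] == want]

PLAIN, BOLD_ITALIC = 0, 1  # example style numbers
doc = ([(c, PLAIN) for c in "see "]
       + [(c, BOLD_ITALIC) for c in "Big  Bowl"]
       + [(c, PLAIN) for c in " here"])
print(find_styled(doc, [(c, BOLD_ITALIC) for c in "big bowl"]))
print(find_styled(doc, [(c, PLAIN) for c in "big bowl"]))
```

The first search finds the bold-italic phrase despite the case and spacing differences; the second finds nothing because the content matches but the style doesn't.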