I am working on a WordPress site where one of the pages lists excerpts about corporate clients.
Let's say I have a web page where the visible text looks like this:
"SuperAmazing.com, a subsidiary of Amazing, the leading provider of integrated messaging and collaboration services, today announced the availability of an enhanced version of its Enterprise Messaging Service (CMS) 2.0, a lower cost webmail alternative to other business email solutions such as Microsoft Exchange, GroupWise and LotusNotes offerings."
But let's say there can be an HTML link or image in this text, so the raw HTML might look like this:
"<img src="/images/corporate/logos/super_amazing.jpg" alt="Company logo for SuperAmazing.com" /> SuperAmazing.com, a subsidiary of <a href="http://www.amazing.com/">Amazing</a>, the leading provider of integrated messaging and collaboration services, today announced the availability of an enhanced version of its Enterprise Messaging Service (CMS) 2.0, a lower cost webmail alternative to other business email solutions such as Microsoft Exchange, GroupWise and LotusNotes offerings."
Here is what I need to do: find out if there is a link inside of the first 20 visible words.
These are the first 20 visible words:
"SuperAmazing.com, a subsidiary of Amazing, the leading provider of integrated messaging and collaboration services, today announced the availability of an"
I need to get the character count, including the HTML, out to the 20th visible word, which in this case would be "an", though of course it'll be different for each excerpt on the page.
(I'm willing to count "SuperAmazing.com" as 2 words if that makes things easier.)
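To make the number I'm after concrete, here is a rough sketch of the logic in Python (not the PHP I'd actually need inside WordPress). The function names `offset_after_visible_words` and `has_link_in_first_words` are made up for illustration. It assumes that tags never contain a literal ">" inside an attribute value, that words are separated by whitespace, and that entities such as `&nbsp;` don't need special handling; a tag with no whitespace around it does not split a word.

```python
import re

def offset_after_visible_words(html, n=20):
    """Return the raw-HTML character offset just past the n-th visible word.

    Characters inside <...> are skipped when counting words but still
    contribute to the offset. Returns len(html) if there are fewer
    than n visible words.
    """
    count = 0
    in_tag = False
    in_word = False
    for i, ch in enumerate(html):
        if in_tag:
            # Inside a tag: nothing here is visible text.
            if ch == ">":
                in_tag = False
            continue
        if ch == "<":
            in_tag = True
            continue
        if ch.isspace():
            # Whitespace ends the current word; stop once the n-th word ends.
            if in_word and count == n:
                return i
            in_word = False
        elif not in_word:
            # First visible character of a new word.
            in_word = True
            count += 1
    return len(html)

def has_link_in_first_words(html, n=20):
    """True if an <a ...> tag opens before the n-th visible word ends."""
    prefix = html[:offset_after_visible_words(html, n)]
    return re.search(r"<a\b", prefix, re.IGNORECASE) is not None
```

For the excerpt above, the offset should land right after "an", and the link check should come back true because the `<a href=...>` around "Amazing" falls inside that prefix.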
I tried a number of regular expressions for counting words, but they all count the words inside the HTML tags as well, not just the visible words.
So what would be the correct regular expression for finding the full character count, including the HTML, for the first 20 visible words?