I would like to extract all the text (displayed or not) from a general HTML page.
I would like to remove
- any HTML tags
- any JavaScript
- any CSS styles
Is there a regular expression (one or more) that will achieve that?
Using Perl syntax for the regexes, a start might be:
!<body.*?>(.*)</body>!smi
Then apply the following replacements, in order, to the contents of that group:
!<script.*?</script>!!smi
!<[^>]+/[ \t]*>!!smi
!</?([a-z]+).*?>!!smi
/<!--.*?-->//smi
This of course won't format things nicely as a text file, but it will strip out most of the HTML (there are a few cases where it might not work quite right, and style blocks are not handled). A better idea, though, is to use a proper HTML/XML parser in whatever language you are using to parse the HTML and extract the text from the result.
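For readers not using Perl, the same sequence can be sketched in Python. This is only a rough translation of the patterns above, not a robust solution; the function name strip_html and the sample page are made up for illustration.

```python
import re

# Perl's /s and /i flags; /m is not needed because no pattern uses ^ or $.
FLAGS = re.DOTALL | re.IGNORECASE

def strip_html(page):
    # Keep only what is inside <body>...</body>, if a body element is present.
    match = re.search(r"<body.*?>(.*)</body>", page, FLAGS)
    text = match.group(1) if match else page
    # Apply the replacements in the order listed above.
    for pattern in (
        r"<script.*?</script>",  # script elements, including their content
        r"<[^>]+/[ \t]*>",       # self-closing tags such as <br />
        r"</?([a-z]+).*?>",      # remaining opening and closing tags
        r"<!--.*?-->",           # comments
    ):
        text = re.sub(pattern, "", text, flags=FLAGS)
    return text

page = ("<html><head><title>T</title></head><body class='x'>"
        "<p>Hello <b>world</b><br /></p>"
        "<script>var a = 1;</script><!-- note --></body></html>")
print(strip_html(page))
```

As with the Perl version, the content of style elements would survive, because only its surrounding tags are removed.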
If you're using PHP, try Simple HTML DOM, available at SourceForge.
Otherwise, Google html2text, and you'll find a variety of implementations for different languages that basically use a series of regular expressions to suck out all the markup. Be careful here, because tags without endings can sometimes be left in, as well as special characters such as &amp; (which is &).
Also, watch out for comments and JavaScript. I've found them particularly annoying to deal with using regular expressions, which is why I generally prefer to let a free parser do all the work for me.
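On the special-character point: if the stripping happens in Python, the standard library's html.unescape can decode leftover entities. A small sketch, with a made-up sample string; it is probably safest to decode only after the tags are gone, so that a decoded < is not mistaken for markup.

```python
import html

# Text as it might look after the tags have been stripped (sample data).
raw = "Fish &amp; chips &lt;today&gt; &copy; 2008"

# html.unescape converts named and numeric character references.
decoded = html.unescape(raw)
print(decoded)
```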
Contemplating doing this with regular expressions is daunting. Have you considered XSLT? The XPath expression to extract all of the text nodes in an XHTML document, minus script & style content, would be:
//body//text()[not(ancestor::script)][not(ancestor::style)]
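Python's built-in xml.etree.ElementTree does not support text() or ancestor:: in its XPath subset, so the sketch below walks the tree by hand to get roughly the same result as the expression above. It assumes the input is well-formed XHTML with a body element; the helper names are invented for this example.

```python
import xml.etree.ElementTree as ET

SKIP = {"script", "style"}

def local_name(tag):
    # "{http://www.w3.org/1999/xhtml}p" -> "p"
    return tag.rsplit("}", 1)[-1]

def collect_text(elem, out):
    # Ignore script/style elements and everything inside them.
    if local_name(elem.tag) in SKIP:
        return
    if elem.text:
        out.append(elem.text)
    for child in elem:
        collect_text(child, out)
        # The tail is text that follows the child but belongs to this element.
        if child.tail:
            out.append(child.tail)

def xhtml_text(document):
    root = ET.fromstring(document)
    # Assumes a body element exists; next() raises StopIteration otherwise.
    body = next(e for e in root.iter() if local_name(e.tag) == "body")
    out = []
    collect_text(body, out)
    return out

doc = ('<html xmlns="http://www.w3.org/1999/xhtml"><head><title>T</title>'
       '<style>p {color: red}</style></head>'
       '<body><p>Hello <b>world</b>!</p><script>var a = 1;</script>'
       '<p>Bye</p></body></html>')
print(xhtml_text(doc))
```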
Remove JavaScript and CSS:
<(script|style).*?</\1>
Remove tags:
<.*?>
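Applied in Python, the two patterns above might look like this. It is a sketch only: the function name and sample input are hypothetical, and the DOTALL flag is an assumption added so that .*? can span line breaks.

```python
import re

def strip_markup(html_text):
    # First drop <script> and <style> elements together with their content.
    without_code = re.sub(r"<(script|style).*?</\1>", "", html_text,
                          flags=re.DOTALL | re.IGNORECASE)
    # Then drop every remaining tag.
    return re.sub(r"<.*?>", "", without_code, flags=re.DOTALL)

sample = "<style>p{}</style><p>Hi <i>there</i></p><SCRIPT>x=1</SCRIPT>"
print(strip_markup(sample))
```

The order matters: removing tags first would leave the script and style content behind as plain text.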
You can't really parse HTML with regular expressions. It's too complex. RE's won't handle <![CDATA[ sections correctly at all. Further, some kinds of common HTML things like &lt;text&gt; will work in a browser as proper text, but might baffle a naive RE.
You'll be happier and more successful with a proper HTML parser. Python folks often use something like Beautiful Soup to parse HTML and strip out tags and scripts.
Also, browsers, by design, tolerate malformed HTML. So you will often find yourself trying to parse HTML which is clearly improper, but happens to work okay in a browser.
You might be able to parse bad HTML with RE's. All it requires is patience and hard work. But it's often simpler to use someone else's parser.
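To illustrate the parser route without installing anything, here is a sketch that uses Python's standard html.parser module rather than Beautiful Soup. The class and function names are made up for this example; it skips script and style content, decodes entities, and tolerates the unclosed tags in the sample.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; in text data.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-blank text that is not inside script/style.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def extract_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.parts)

markup = ("<p>Fish &amp; chips<br>unclosed <b>bold"
          "<script>if (a < b) { x(); }</script>"
          "<style>p { color: red }</style><!-- hidden --></p>")
print(extract_text(markup))
```

Comments are dropped because the default handle_comment does nothing.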
I believe you can just do
document.body.innerText
Which will return the content of all text nodes in the document, visible or not.
[edit (olliej): sigh, never mind, this only works in Safari and IE, and I can't be bothered downloading a Firefox nightly to see if it exists in trunk :-/ ]
I use iMacros for Firefox for extracting stock quotes. It includes a useful general-purpose text extraction feature; see the "Text Extraction" page on the iMacros wiki. https://addons.mozilla.org/en-US/firefox/addon/3863
The simplest way for simple HTML (example in Python):
import re
text = "<p>This is my> <strong>example</strong>HTML,<br /> containing tags</p>"
" ".join(t.strip() for t in re.findall(r"<[^>]+>|[^<]+", text) if "<" not in t)
Returns this:
'This is my> example HTML, containing tags'