ansaurus

Question

Regex: Extracting readable (non-code) text and URLs from HTML documents

Answer 1

+2 A:

what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group.

That's what one would normally do. Or even simpler, replace every match of the markup pattern with an empty string and what you've got left is the stuff you're looking for.

It kind of works, but there seems to be a string here and there that gets picked up that shouldn't be.

Well yeah, that's because your expression—and regex in general—is inadequate to parse even valid HTML, let alone the horrors that are out there on the real web. First tip to look at, if you really want to chase this futile approach: attribute values (as well as text content in general) may contain an unescaped > character.
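
To make the `>` point concrete, here is a minimal sketch in Python (used only for illustration, since the thread itself is about VB.Net; `NAIVE_TAG` and `AttrCollector` are made-up names). It compares a naive `<[^>]*>` tag pattern with the standard library's `html.parser` on a tag whose attribute value contains an unescaped `>`.

```python
import re
from html.parser import HTMLParser

# Naive "anything between < and >" tag pattern (illustrative, not from the thread).
NAIVE_TAG = re.compile(r"<[^>]*>")


class AttrCollector(HTMLParser):
    """Collects (name, value) attribute pairs from every start tag."""

    def __init__(self):
        super().__init__()
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        self.attrs.extend(attrs)


markup = '<a title="a > b" href="/x">link</a>'

# The regex stops at the first '>', which sits inside the quoted value.
naive_first = NAIVE_TAG.match(markup).group(0)

# The parser tracks quoting, so it sees both attributes intact.
collector = AttrCollector()
collector.feed(markup)
collector.close()

print(naive_first)
print(collector.attrs)
```

The regex cuts the tag short and never sees the `href`, while the parser reports both attributes with the full `title` value.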

I would like to once again suggest the benefits of HTML Agility Pack.

ETA: since you seem to want it, here are some examples of markup that look like they'll trip up your expression.

<a href=link></a> - unquoted
<a href= link></a> - unquoted, space at front matched but then required at back
<a href="~/link"></a> - very common URL char missing in group
<a href="link$!*'link"></a> - more URL chars missing in group
<a href=lïnk></a> - IRI
<a href
    ="link"> - newline (or tab)
<div style="background-image: url(link);"> - unquoted
<div style="background-image: url( 'link' );"> - spaced
<div style="background-image: u&#114;l('link');"> - html escape
<div style="background-image: ur\l('link');"> - css escape
<div style="background-image: url('link\')link');"> - css escape
<div style="background-image: url(\
'link')"> - CSS folding
<div style="background-image: url
('link')"> - newline (or tab)

and that's just completely valid markup that won't match the right link, not any of the possible invalid markup, markup that shouldn't but does match a link, or any of the many problems with your other technique of splitting markup from text. This is the tip of the iceberg.

bobince 2010-10-17 01:15:19
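
A rough way to check a few of the examples above against the short form of the link regex posted further down in these comments. This is a Python sketch under stated assumptions: the literal `\(` and `\)` that the poster later says were lost in posting are restored, `CHARS` is filled in with the character class from the first posted version, and `found` is a hypothetical helper.

```python
import re

# Short form of the link regex from the comments. Assumptions: literal \( and \)
# are restored, and CHARS is replaced by the class from the longer version.
LINK = re.compile(
    r"""\b((((src|href|action|url) *(=|:) *("|'| ))CHARS+\6)|url *\(("|'| *)CHARS+\7\))"""
    .replace("CHARS", r"[\w/.?=#\-\[\]]")
)


def found(markup):
    """True if the pattern picks up any link in the given markup."""
    return LINK.search(markup) is not None


EXAMPLES = [
    '<a href="link"></a>',                              # plain quoted link
    '<a href=link></a>',                                # unquoted
    '<a href= link></a>',                               # space in front only
    '<a href="~/link"></a>',                            # '~' not in CHARS
    "<div style=\"background-image: url('link');\">",   # quoted CSS url()
    "<div style=\"background-image: url( 'link' );\">", # spaced CSS url()
]

results = {markup: found(markup) for markup in EXAMPLES}
for markup, hit in results.items():
    print("match" if hit else "MISS ", markup)
```

Only the plain quoted `href` and the unspaced `url('link')` are picked up here; widening `CHARS` fixes the `~` case but not the quoting and spacing ones.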

Regarding the "kind of works" thing: It baffles me because the regex *correctly* matches everything it's supposed to when I test it in RegexBuddy. It's only during testing in .Net that a couple of script / style strings show up.

d7samurai 2010-10-17 01:22:59

Don't know what would cause a difference between the two regex engines for this—maybe default case sensitivity options?—but the expression as-is will fall over for common valid HTML constructs that might not be in your test set. To match a tag you really need at least to go in-depth on attribute value delimiters. At which point you end up with a big unwieldy regex that really isn't preferable to an HTML parser in any way.

bobince 2010-10-17 01:39:20

But as I said, I'm not interested in the html itself - at all. In fact, it's essential that all the html is left untouched (except for the links, which are processed to be converted from relative to absolute where necessary - but that's easy). This is why it's a very clean and efficient way (especially code-wise) to just get all the html/tags/comments/scripts *masked out* in one fell swoop by the regex, and then just perform word replacement on the text during match iteration.

d7samurai 2010-10-17 01:45:15

Here's an example: `<style type="text/css" media="print"> @import url(/layouts/Standard/styles/print_style.css); </style>`. This is part of the html document (as in the rendered source, retrieved through "view source" of the page in question). I'm using the exact same document in RegexBuddy as I am in my VS2010 project. This particular part of the document is correctly picked up by the regex in RegexBuddy. In my app, it shows up as not picked up (although it's stripped down to `@import url(/layouts/Standard/styles/stylesGlobal-min.css);`). It's like you say, a discrepancy between the two engines.

d7samurai 2010-10-17 02:08:22

I'm such an idiot! Since RegexBuddy and .Net have different syntax for declaring capture group names, I had overlooked one of the tags when I converted from RegexBuddy. That solved that problem - and the regex works like a charm now. So what I do is iterate through the match collection and only process matches that have content in their "text" capture group. And I can do this in almost no code at all.

d7samurai 2010-10-17 02:49:43
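
The slip described here comes from the two spellings of named groups: the patterns in this thread use the `(?P<name>...)` / `(?P=name)` form, while .NET writes `(?<name>...)` and `\k<name>`. A naive conversion could be sketched in Python as below. `to_dotnet` is a hypothetical helper, and it makes no attempt to handle escaped parentheses or character classes that happen to contain the same characters.

```python
import re


def to_dotnet(pattern):
    """Rewrite (?P<name>...) and (?P=name) into the .NET spellings."""
    # (?P<name>  ->  (?<name>
    pattern = re.sub(r"\(\?P<(\w+)>", r"(?<\1>", pattern)
    # (?P=name)  ->  \k<name>
    pattern = re.sub(r"\(\?P=(\w+)\)", r"\\k<\1>", pattern)
    return pattern


source = r"<(?P<tag>script|style)[\s\S]*?</(?P=tag)>"
converted = to_dotnet(source)
print(converted)
```

Running every named group through one function avoids the "overlooked one of the tags" problem that a manual conversion invites.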

Ah, yeah... advanced features like this are often different across regex engines. Seriously, though, I don't think you realise what you're getting yourself into. You keep using words like “clean” and “elegant”, but this is really anything but that. Detecting `a href` attributes with regex is absolutely *not* simple, you would need to take apart a tag by quoted or unquoted attributes just to begin with. One piece of malformed markup or a `>` where you don't expect it and the results will fall apart. This can only ever work for *extremely* limited input (ie input you created yourself).

bobince 2010-10-17 03:05:45

Hehe. Believe me, it's the right way for this. It's only for entertainment purposes, so an occasional hiccup in the resulting html page is totally acceptable. Here's the regex for pulling out links (href, src, action, url, background etc). The actual link (as well as the leading attribute) can be polled through their capture groups during match iteration: `\b(((?:(?P<html>src|href|background|action|url) *(=|:) *(?P<mh>"|'| ))(?P<link>[\w/.?=#\-\[\]]+)(?P=mh))|(?P<css>url) *$(?P<mc>"|'| *)(?P<link>[\w/.?=#\-\[\]]+)(?P=mc)$)`. Try it.

d7samurai 2010-10-17 03:14:56

Also, since I knew nothing about regex until today, when I finally decided to check out what it was all about earlier tonight, it's kind of a fun thing to play with, too :) And believe me, the code is clean and simple. In only a matter of about 10 lines of code, I can scan through an html document and reconstruct all its links to absolute (if they are relative) with a precision that is pretty high for something like this. I don't see that happening with DOM like models..

d7samurai 2010-10-17 03:21:20

And no, it's not for input that I create. In fact, this is what it does: You give it a link to some web page. Then you give it a list of words that you want replaced, along with corresponding replacement words. The app takes the html code, reconstructs all relative links to absolute, and then does the word replacement. In the end you have a document that looks perfectly like the original, but with an absurd twist, content wise. The page is stored as a compacted binary in a database, and can be served from a totally different server than it originated from, since all links are absolute.

d7samurai 2010-10-17 03:24:51

@bobince Btw - here's a screenshot of how the regex for finding html links handles the source for the cnn.com frontpage: http://www.martinwardener.com/regex_links.jpg . So far I haven't seen it mess up once (except it won't find links that are "camouflaged" / constructed in scripts or just parts of paths that are built later - but then again, what generic routine would?). The regex will find not only href, but src, url (in css references) and action, background and url attributes. And they can quote their links with ", ', or be unquoted. It seems to work quite well.

d7samurai 2010-10-17 05:34:05

And finally - the way this regex finds links is independent of the tagging. So unescaped tags won't even affect it. If the link itself is well formed (which it would be, or it wouldn't link), it works. The rest of the html code can be as malformed as it wants to be :)

d7samurai 2010-10-17 05:54:38

You consider the above long, messy link-attribute regex ‘clean’? Then I don't know what to say to you. It's a complicated and far-from-complete attempt to detect two completely different and incompatible syntaxes (CSS and HTML attributes) in one, that can be broken in about a hundred ways even by perfectly valid markup. HTML escapes, CSS escapes, `>` in attributes, IRI, false matches in attribute values, whitespace around quoted attributes, `url()` not detected properly... This sort of thing is really easy to do *correctly* with an HTML parser and hideously impossible in plain regex.

bobince 2010-10-17 13:20:33

The regex is long only because I put literal names into it. Don't confuse the written line with 'mess' just because it's hard to read. All this will be *compiled* into symbols by the engine anyway. And no matter how you look at it, any parser needs to do traversing and string matching - just because it's hidden from you doesn't make it 'cleaner'. And regex is very efficient at string matching - which needs to be done in any implementation. If you wanted to, you could write it much 'cleaner' if the *look* of a regex string bothers you.

d7samurai 2010-10-17 19:57:34

`\b((((src|href|action|url) *(=|:) *("|'| ))[CHARS]+\6)|url *$("|'| *)[CHARS]+\7$)`

d7samurai 2010-10-17 19:59:00

Now, this regex does the same job (sans the first "background" attribute, which is obsolete anyway). Just substitute CHARS for whatever characters you are allowing in the link. Since this is *not* a *generic* html parser, but just looking for links (of attributes you can decide), there is not much that will break it. It handles whitespace around attributes and their values. Remember, it won't pick up malformed links anyway, which is fine. And it's compiled before use in the application. It doesn't look for "css" sections, or specific tags, just links. It's actually *very* clean - and efficient.

d7samurai 2010-10-17 20:04:15

The fact that you have to spell out the attribute names *somewhere* doesn't make it "messy". As I've been saying here, I'd be interested in seeing a "proper" HTML parser implementation with the same functionality, to compare amount of code, precision and execution speed. Its purpose is to list all links in a document, using the attributes src, href, action and url.

d7samurai 2010-10-17 20:06:40

The regex isn't complicated because of the names in it, it's complicated because of the number of paths through the expression, making it hard to read, covering all the different cases it tries to handle... and fails, in many cases: seriously, that's not what URLs look like in CSS; and it really doesn't handle whitespace and attribute quoting anything like what the HTML grammar says... I don't know why you're so confident that “nothing” will break it.

bobince 2010-10-17 22:22:28

I think you need to forget the DOM paradigm for a second. You're not seeing the forest for the trees here. When there is no tag parsing, there is nothing to "break". A `>` or `<` won't matter to the regex, because it's not looking for any. And unless you allow it to accept `<>` as valid url characters, it won't pick up any. An html file is really nothing more than a text file. Within that file you're looking for links. Not having to parse tags and nesting saves a lot of effort. And since links in html documents *are* "marked", it makes it even easier.

d7samurai 2010-10-17 22:33:10

And regarding code paths. Seriously, if there's anything a DOM model does, it's pursuing bifurcating paths. The regex is simply a very compact way to describe what you're looking for. It doesn't have to first parse the text into DOM compliant entities and structure them - it simply looks for what you want, since it doesn't matter *where* in the structure it is located. The only thing one might want to filter for is "part of html code" and "not part of html code", but I made another regex for that, if that was a requirement.

d7samurai 2010-10-17 22:36:54

Links in html are formatted like this: attribute="link", attribute='link', attribute=link, or even attribute: "link" etc, for the various attribute types allowed in the regex (all of these can also have spaces before or after the : or =, it will still pick them up). In css, it's url("link"), url('link'), url(link) (or in some cases with imports, url "link", but I haven't added that, which would be easy). Also with spaces before or after the parenthesis. It will still pick it up. So tell me, how can a valid link be "broken" or not picked up? And how is it more efficient to run this through DOM?

d7samurai 2010-10-17 22:41:46

The reason there are two "paths" in the search is mainly because in css, the parenthesis used to mark the link uses a different character to mark the start `(` and the end `)` of the link. Please give me an example of a valid link in a piece of html code (otherwise compliant or not) that this regex won't pick up or mess up or be "broken" by.

d7samurai 2010-10-17 22:46:50

What did you mean by it not handling whitespace? It does. Or attribute quoting? It handles all types of attribute quoting. And how hard the regex is to read to some person is totally irrelevant. It's made, and it works. No need to read it. Besides, it's not really that hard to read either, any more than regex generally is. But that's not the point of regex, is it. I used to do assembler programming on the 6502 and the 68000 processors years and years ago. Believe me, VB.Net is child's play compared to that, but it's far more efficient, even though it's not very reader friendly.

d7samurai 2010-10-17 22:50:22

The regex is only interested in links that are coded so that they actually work as links, so "links" that are so malformed that they would not link to anything anyway are ignored. Which is good. No need to push DOM as if my point is to propose regex as an alternative to general html parsing models. But in this particular case, I have yet to see anyone show me an alternative, DOM-based method that does this easier, faster or more precise.

d7samurai 2010-10-17 22:59:51

PS. How do you think DOM parsers work internally? By using a DOM parser? ;)

d7samurai 2010-10-17 23:41:21

DOM-constructing parsers use a variety of string methods (yes, potentially including regex) to parse the low-level tokens of the basic grammar. (And yes, I've written one, and yes I'm also an assembler coder, thanks.) But regex really doesn't have the power to parse higher-level constructs. I am aware that CSS `url` tokens use parentheses, however, the expression pasted above does not contain any literal parentheses and won't match such a URL. It does seem to allow `url` as an attribute, which doesn't exist. It also can't cope with CSS-escaping or HTML-escaping inside the value.

bobince 2010-10-18 00:11:29
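
The HTML-escaping part of this objection can be illustrated with a small Python sketch. The value below is taken from the `u&#114;l` example in the answer; `CSS_URL` is a simplified stand-in pattern for illustration, not the one from the thread. Scanning the raw source misses the link, whereas scanning the decoded attribute value (which is what a parser hands back) finds it. CSS escapes such as `ur\l` would need a further decoding step that is not shown here.

```python
import html
import re

# Simplified illustrative pattern for a CSS url(...) token.
CSS_URL = re.compile(r"""url\(\s*['"]?([^'")]+)""")

# Attribute value as it appears in the raw HTML source.
raw_value = "background-image: u&#114;l('link');"

# What an HTML parser would hand back after decoding character references.
decoded_value = html.unescape(raw_value)

raw_hit = CSS_URL.search(raw_value)
decoded_hit = CSS_URL.search(decoded_value)

print(raw_hit)                 # nothing found in the raw source
print(decoded_hit.group(1))    # the link, once the escape is decoded
```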

No, it actually *does* include that: `url *$ *("|'|)[CHARS]+\7$`, hence the last main alternation (at least my actual regex does - I now see that the `\`s before the `(` and `)` containing the link is missing in the ones i posted. I guess I must have been too eager when I was trimming it during post. Sorry).

d7samurai 2010-10-18 00:20:39

Damn, now I see why - it even happened again - it must happen when "difficult" strings like that are posted here.. Well, at least you should be able to see that the parentheses needed for it are there, it's just the backslashes to make them literal that are missing. Yes, it allows url as an attribute, to also sweep up some url attributes in scripts etc, of which many have the same format. But regarding escaping - what escaping doesn't it cope with?

d7samurai 2010-10-18 00:25:03

As for regex - of course it doesn't have the power to deal with higher level constructs. As I've been trying to say - I'm not posing regex as an alternative to DOM etc in that regard. I have repeatedly stressed that this is a specific task: to find links in an html document. No more no less. Doesn't matter where the link is, doesn't matter what it's for. Just find the links. And with that in mind, I see no reason to put this through some bloated, relatively speaking, higher level parsing mechanism, that will in fact make it clumsier, slower and require more code.

d7samurai 2010-10-18 00:28:02

And the same goes for the original regex - for filtering out all code from a document to be left with only the text - and doing that efficiently while retaining the original structure of the document. Please please please show me a better way through DOM for any of those tasks.

d7samurai 2010-10-18 00:29:58

And if you've written a DOM parser (from scratch), I'm even more baffled why you don't see that for a specific pattern search like this, it is an unnecessary detour to first have a parser parse for generic tokens, build an object hierarchy (including a lot of unneeded data and processing), just to have it then iterate through all its branches and leaves on my behalf to look for almost the same patterns I can get by scanning the flat source document itself directly.

d7samurai 2010-10-18 00:38:59

Here's a link to a little page I set up: http://martinwardener.com/regex. It uses the aforementioned regex patterns to extract links and text from html pages. Simply listing the links or the text will give you a parse time (that includes building the string of html markup for displaying it in the browser). You can also see the full html markup of the page you entered, with the regex matches highlighted. It might give you an idea about how reliable and precise it actually is. A comparison with agility pack would have to include finding all the non-href-links, too.

d7samurai 2010-10-18 04:05:39

Hi! I saw your "trip-up-suggestions" :) Some of them are moot regarding my implementation, since it is slightly modified from the one you are basing this on. The missing URL characters are added - that's just an oversight, and I added those (thanks). But the character string can be whatever one wishes to allow, so it doesn't point out any flaws in the regex structure. Some codes (like tab and newline) can be stripped out in a pre-parse document trim, like my web example already uses (optionally) - or just added to the regex. The unquoted and spaced-at-front-but-missing-at-back ones are handled

d7samurai 2010-10-18 10:21:47

Your input is helpful, since I'm not up to speed on (especially) the css escaping and folding issues. Then again, I'm not so sure it's a problem in this case - both because it to a certain extent can be stripped out before parsing, and secondly I am not so sure how much it's encountered in the wild, within a html file. Regarding the escaping in the middle of the css 'url' attributes - who would want to do that anyway? It's a tradeoff - and as I've mentioned, for these purposes, it's OK to slip a bit on semi-obscure notation.

d7samurai 2010-10-18 10:37:57

Even without pre-processing the html and special escape handling in the regex, half of the examples are non-issues in the actual implementation, with missing characters being the problem with a handful alone (some of them were already in the regex in use).

d7samurai 2010-10-18 10:43:48

Then it's the issue of detecting "unconventional" links within script blocks etc (ref comments under Vantomex' answer) - which I'm not sure Agility et al picks up on? Again - check out http://www.martinwardener.com/regex to easily see how the regex performs (both speed-wise and detection-wise) on real world html pages.

d7samurai 2010-10-18 10:53:25

Also, remember that cases where the pattern wrongly detects a "link" are generally not a problem, either - because the point of the link pickup is to check whether they need rewriting (from relative to absolute), so they'll all go through some validation. If they don't qualify as links, they won't be touched anyway.

d7samurai 2010-10-18 11:00:40
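
The relative-to-absolute step described in these comments can be sketched with Python's standard `urllib.parse`. This is only an illustration under assumptions: the poster's app is VB.Net, the example URLs are invented, and `absolutize` and `is_relative` are hypothetical helper names.

```python
from urllib.parse import urljoin, urlsplit


def is_relative(link):
    """True if the link has neither a scheme nor a host of its own."""
    parts = urlsplit(link)
    return not parts.scheme and not parts.netloc


def absolutize(base_url, link):
    """Resolve link against the page URL; absolute links come back unchanged."""
    return urljoin(base_url, link)


base = "http://example.com/news/story.html"  # invented page URL
for link in ["../img/a.png", "/find", "http://other.example/x"]:
    print(is_relative(link), absolutize(base, link))
```

Because already-absolute links pass through `urljoin` unchanged, a false positive that is not really a link can be filtered by the validation step before any rewriting happens.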

Answer 2

A:

Regex is not reliable for retrieving the textual contents of HTML documents. Regex cannot handle nested tags. Even supposing a document doesn't contain any nested tags, regex still requires that every tag is properly closed.

If you are using PHP, for simplicity, I strongly recommend that you use DOM (Document Object Model) to parse/extract HTML documents. A DOM library usually exists for every programming language.

Vantomex 2010-10-17 01:20:14

As I said in my post, I'm using VB.Net in Visual Studio 2010. Also, this is not a critical application - it's an entertainment utility, so if some pages have malformed html, that's just a minor scratch in the paintjob. Also, I don't see why it wouldn't handle nested tags? As far as I can see, any tag is masked out - after the "problem tags" like SCRIPT, STYLE and comment are removed, it doesn't concern itself with the document structure at all. Which is good. The point here is to just have access to the raw text without messing with the structure. How would this be done in DOM?

d7samurai 2010-10-17 01:29:23

Even if it is not a critical application, I bet regex will fail to parse and extract the bad HTML documents spread all over the Internet. As I said, a DOM library should exist for every programming language. Also, regex cannot parse nested tags.

Vantomex 2010-10-17 01:47:44

Could you ensure that every tag in the HTML documents is properly closed, for example P, H, LI, etc.? Also, SCRIPT tags might contain tags inside them.

Vantomex 2010-10-17 01:52:07

I don't see how a regex that blindly masks out anything enclosed in < and > can fail (apart from where there's no opening/closing bracket, in which case the document is to blame for the garbled up results). And as you can see from the regex, it masks out whole blocks of code by not only searching for bracket markers, but whole sequences of code (as in, starting with `<script` and ending with `</script>`). If there's an orphan script tag somewhere, it will either result in that being isolated on its own (since it will be picked up by the generic `<tag>` match afterwards), or it masks out too much.

d7samurai 2010-10-17 01:53:56

IMO, when a browser can display HTML documents with unclosed tags properly, our program should be able to extract them properly too. As the HTML 4 Specification says, many tags don't need to have closing tags. It's an official specification.

Vantomex 2010-10-17 02:00:40

There is an exception of course: regex can be used to parse particular HTML documents whose format is already known.

Vantomex 2010-10-17 02:02:46

Well, reliability isn't so important here. Making it simple and clean is the priority (although, the time I've spent commenting here now probably cancels out any gain from that :)

d7samurai 2010-10-17 02:30:16

One more example: suppose your HTML document contains a `P` tag like this: `<p>test <span class="a">test</span> <b>test</b> <i>test</i> <span class="b">test</span></p>`. After you use `<p>(.*?)</p>`, you get the content of the `P` tag, so what next? The extracted result still contains many tags, and you don't get what you called "textual content".

Vantomex 2010-10-17 02:53:05

Wrong. The regex I showed in the original posts correctly picks up *all* tags and separates them from plain text (which is also picked up, but as a separate match), due to the sequence of alternate patterns. It returns a collection of matches, effectively covering the complete document, but since I have created a capture group for the "just plain text" pattern - with a name reference - I can check whether that capture result has content. If it does, it means it was a "pure text string" match, and I can process the match string.

d7samurai 2010-10-17 03:38:16

As you can see up top, it's not a simple `<p>(.*?)</p>` pattern. The regex looks like this: `(?:(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:)|(?:<[\s\S]*?>))|(?P<text>[^<>]*)`.

d7samurai 2010-10-17 03:39:09

(and yes, I tested pasting in your example string in my test documents, and it isolates all of it perfectly. Using RegexBuddy, you can see each match highlighted in alternating colours, so it's easy to see what each match contains. All tags are separate matches, and all the text strings are separate matches - identifiable during iteration because they are tagged with "text").

d7samurai 2010-10-17 03:44:25

OK, I missed your current regex part, so how would you solve this: `<textarea>The <p>, <li>, and, <table> elements are very common</textarea>`. However, if you insist on using regex, you don't have to invert the matching, simply delete the matching pattern and store it in a variable.

Vantomex 2010-10-17 04:10:11

Same thing there. Those tags are picked up and isolated from the text in between automatically, just as everything else. The text that is left is *only* plain *content*, which is the purpose. As you'll see in my answer above (with complete VB code), I have solved the actual goal, only not *completely* in regex. What you suggest would be too cumbersome, since I don't only need the text extracted, I need to *replace* a set of given words in it - and *put it back into the html* so that the page behaves exactly as it did.

d7samurai 2010-10-17 04:45:21

Simply taking away the html means I'd have to build a mechanism for keeping track of which pieces of html/script etc goes where, and where to put the text back in. That's why this is preferable - since you can just hook onto the match iteration and replace as it goes along, preserving the document all the way.

d7samurai 2010-10-17 04:46:22

In fact, I took a screenshot of the regex at work, in RegexBuddy, where you can clearly see how it picks out everything but the text with perfect consistency. I grabbed the source html from this page on newscientist.com: http://www.newscientist.com/article/mg20827826.200-a-3d-model-of-the-ultimate-ear.html. The picture is here: http://www.martinwardener.com/regex.jpg. As you can see, I pasted your examples in there, as well, so you can get a more visual impression of it.

d7samurai 2010-10-17 04:55:54

No @Martin, the "<p>, <li>, and, <table>" inside `<textarea>` in my example is text, not tags. Your regex excludes them even though they are part of the textual content.

Vantomex 2010-10-17 05:11:42
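
A quick way to see what a masking pattern of this shape reports for the `<textarea>` example. This is a Python sketch, not the poster's VB.Net code: `MASK` is a variant of the pattern quoted earlier in these comments, the `<!--...-->` alternative is an assumption (that part looks stripped from the posted pattern), and `[^<>]+` is used instead of `[^<>]*` to avoid empty matches.

```python
import re

# Variant of the thread's masking pattern; the comment alternative is assumed.
MASK = re.compile(
    r"<(?P<tag>script|style)\b[\s\S]*?</(?P=tag)>"  # whole script/style blocks
    r"|<!--[\s\S]*?-->"                             # comments (assumption)
    r"|<[\s\S]*?>"                                  # any other tag
    r"|(?P<text>[^<>]+)",                           # leftover plain text
    re.IGNORECASE,
)


def text_pieces(html):
    """Return only the matches that landed in the 'text' group."""
    return [m.group("text") for m in MASK.finditer(html)
            if m.group("text") is not None]


sample = "<textarea>The <p>, <li>, and, <table> elements are very common</textarea>"
pieces = text_pieces(sample)
print(pieces)
```

The `<p>`, `<li>` and `<table>` that a browser would show as literal text inside the textarea are masked as tags, so they are missing from the extracted text.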

d7samurai 2010-10-17 05:38:50

Yes, if your program only processes HTML documents made by you.

Vantomex 2010-10-17 07:31:31

d7samurai 2010-10-17 07:42:42

Yes, by definition, but not by practice. I just give one popular website as a sample, have a look at all "try" pages in `www.w3schools.com`, e.g. `http://www.w3schools.com/html/tryit.asp?filename=tryhtml_intro`

Vantomex 2010-10-17 07:54:35

I'm not sure I'm following. That's an html editor... Why don't you give me some example html code that you think would cause the regex to fail, and I'll test it out.

d7samurai 2010-10-17 08:03:31

Have you done a survey of the HTML documents out there? They usually contain messy/bad HTML code, e.g. `<p>The < sign is a less-than operator</p>` or `<p>The <> sign is an unequal operator</p>`. All browsers render them correctly, even IE5, or maybe even IE4.

Vantomex 2010-10-17 08:30:30

No I haven't LOL. But even so, number one, it's on them. Number two, for my purposes, it would not create a big problem, it would only mask a few words more (between the < sign and the next tag, whatever that would be). It's a problem when you're attempting to parse nesting etc, but as I've said, here the engine is just masking everything, so it doesn't attempt to "make sense" of the tag pairing.

d7samurai 2010-10-17 08:37:57

Example number one works? Of course not: first your regex matches `<p>`, then it matches `< sign is a less-than operator</p>`, not `</p>`, so your regex considers the problematic textual content as a part of the last `</p>` tag. No, I haven't LOL too here.

Vantomex 2010-10-17 09:29:20
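
The behaviour described in this comment can be reproduced with just the generic tag alternative of the pattern. A minimal Python sketch, where `GENERIC_TAG` is an illustrative name:

```python
import re

# Only the generic "anything between < and >" alternative of the masking pattern.
GENERIC_TAG = re.compile(r"<[\s\S]*?>")

sample = "<p>The < sign is a less-than operator</p>"

masked = GENERIC_TAG.findall(sample)     # what gets treated as markup
visible = GENERIC_TAG.sub("", sample)    # what is left as text

print(masked)
print(repr(visible))
```

Everything from the stray `<` up to the end of `</p>` is swallowed as one "tag", so only `The ` survives as text.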

Finally, as I said before, if you insist on using regex (regardless of its weakness and inefficiency), you don't have to invert the matching, simply delete the matching pattern and store it in a variable, then you'll get the inverse of it. There is no way in Regex to invert the matching of a whole match. PowerGREP was based on the open source TPerlRegex library. Look into it. The inverse feature in PowerGREP is not done by a regex magic formula. So, we need a dirty trick (as I mentioned above) to achieve the same result.

Vantomex 2010-10-17 09:51:07

Well, as long as I have to do the final processing in code outside the regex itself, what I have done (and posted the code for as an answer to my own question here) is much more efficient. Deleting it and storing it in a variable would mean I'd have to keep track of which pieces of text belonged between what pieces of code etc. Instead, the regex matches *everything* in the code, except that it tags the plain text through its capture group. So during match iteration/regex replace, I just check for the tag and do the processing of the text *as it's traversing* the document. It works very well.

d7samurai 2010-10-17 10:04:51
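
The "replace as it traverses" approach described in this comment could be sketched as follows. This is Python rather than the poster's VB.Net, and it is not the posted code: `MASK` is a variant of the pattern quoted earlier (the `<!--...-->` alternative is an assumption, and `[^<>]+` replaces `[^<>]*`), while `replace_words` and the sample input are made up for illustration.

```python
import re

# Variant of the thread's masking pattern; the comment alternative is assumed.
MASK = re.compile(
    r"<(?P<tag>script|style)\b[\s\S]*?</(?P=tag)>"  # whole script/style blocks
    r"|<!--[\s\S]*?-->"                             # comments (assumption)
    r"|<[\s\S]*?>"                                  # any other tag
    r"|(?P<text>[^<>]+)",                           # leftover plain text
    re.IGNORECASE,
)


def replace_words(html, replacements):
    """Swap whole words in text matches only; markup matches pass through."""

    def fix(match):
        text = match.group("text")
        if text is None:
            return match.group(0)  # markup: leave untouched
        for old, new in replacements.items():
            text = re.sub(r"\b%s\b" % re.escape(old), new, text)
        return text

    return MASK.sub(fix, html)


page = '<p class="cat">cat <b>cat</b></p><script>var cat = 1;</script>'
print(replace_words(page, {"cat": "dog"}))
```

Because every match is written back in order, the document structure is preserved without tracking positions separately; the limitations raised elsewhere in this thread (stray `<`, `>` inside attribute values) still apply.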

Yes, of course it does consider the "< is a less than operator</p>" as the closing tag for the <p>. Not really much to do about that, unless you want to take it upon yourself to clean up the web for people. But the problem is minuscule. But regarding the efficiency.. 1) have a look at the code I posted. 12 lines of code. 2) It's very fast - and the regex is compiled for performance. 3) It's very accurate. Show me the corresponding code with the technique you suggest - I'm curious to see how that looks and performs.

d7samurai 2010-10-17 10:09:55

http://www.martinwardener.com/booyaa/?article=10

d7samurai 2010-10-17 10:28:56

I must confess, I have never learned VB/VB.NET, but three weeks ago I started learning VBA. So, I understand some of your code, but not all. Thus, I can't say much about your challenge. Good Luck!

Vantomex 2010-10-17 10:51:13

But VB.Net or not. Since you seem to be familiar with DOM (in whatever language), I'd be interested in seeing how you would implement a subroutine that would find *every* parseable link in a given html document and allow you to iterate through them.

d7samurai 2010-10-17 23:04:31

See the *very first example* at htmlagilitypack's examples page for a trivially simple and *correct* method to extract all `a href`s. Repeat for any other URL-containing attributes you want. If you need CSS you can extract that from inline styles and stylesheet, but again, using regex on it isn't reliable—though more reliable than using regex over CSS-in-HTML.

bobince 2010-10-18 00:44:16
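
For comparison, a parser-based link collector is only a few lines. The sketch below uses Python's standard `html.parser` as a stand-in for HTML Agility Pack (a .NET library, not shown here); `LinkCollector`, `extract_links` and the `URL_ATTRS` set are assumptions for illustration, and CSS `url()` values and links built in scripts are out of its scope.

```python
from html.parser import HTMLParser

# Attributes treated as URL-carrying; an illustrative choice, extend as needed.
URL_ATTRS = {"href", "src", "action", "background"}


class LinkCollector(HTMLParser):
    """Collects (tag, attribute, value) for every URL-carrying attribute."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in URL_ATTRS and value:
                self.links.append((tag, name, value))


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


sample = '<a href=one>x</a><img src="two.png"><form action=\'/find\' method="get"></form>'
links = extract_links(sample)
print(links)
```

Quoted, single-quoted and unquoted attribute values all come back already unwrapped, which is the part the regex has to handle by hand.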

Well. Efficiency-wise, how much parsing do you think is going on under the hood in that operation? That parsing is being done, even if I'm not the one coding it. So efficiency-wise, I think the overhead is way way higher with agility pack (for this particular purpose). In addition, the regex doesn't just find href links, it finds a bunch of different ones. And it also finds links within scripts etc. The point here is to rebuild the links in a document (primarily from relative to absolute), so the document will function when served from another server.

d7samurai 2010-10-18 03:51:52

Here's a link to a little page I set up: http://www.martinwardener.com/regex/. It uses the aforementioned regex patterns to extract links and text from html pages. Simply listing the links or the text will give you a parse time (that includes building the string of html markup for displaying it in the browser). You can also see the full html markup of the page you entered, with the regex matches highlighted. It might give you an idea about how reliable and precise it actually is. A comparison with agility pack would have to include finding all the non-href-links, too.

d7samurai 2010-10-18 03:57:52

Question: How does Agility pack handle hrefs in tags like ``?

d7samurai 2010-10-18 04:22:25

or `<form onsubmit="(new Image()).src='/rg/SEARCH-BOX/HEADER/images/b.gif?link=/find';" action="/find" method="get">`

d7samurai 2010-10-18 04:24:09

or `<script language="JavaScript1.1" src="/js/Layout.js" type="text/javascript>` or `<style type="text/css" media="print"> @import url(/layouts/Standard/styles/print_style.css); </style` or urls inside javascript document.write blocks etc? I'm just curious..

d7samurai 2010-10-18 04:38:25

`window.location.href = 'http://htmlagilitypack.codeplex.com/Wiki/Search.aspx' + '?tab=Home`

d7samurai 2010-10-18 04:51:25

When I mentioned the inefficiency of regex, I was comparing regex with XPath (not DOM); XPath is faster than DOM in terms of execution speed. As I said in my answer, I used DOM for simplicity AND reliability because you said your program is not a critical application. Do you think regex would be faster than XPath when parsing HTML documents? Regex was not designed for working with HTML docs (they are not regular). Efficiency here means efficiency in terms of processor load. Regex often tries MANY possible permutations to match a pattern, especially when the alternation operator `|` is used.

Vantomex 2010-10-18 05:00:11

One more thing: even if, for your specific cases, you find that regex is faster than XPath, that is NOT efficient, because efficiency means "a minimum effort to achieve reliability". I forgot to say, one line of code isn't always faster than 100 lines of code.

Vantomex 2010-10-18 05:06:00

You have to distinguish between general HTML parsing and the parsing that is needed here. In this case, yes, I'm claiming regex is faster. All parsers need to perform something similar to regex at the lowest level, then use *that* information to build up a model of the document hierarchy. **Then** you can start doing your searches through the document - and as it turns out - you'll probably have to do many passes to pin down the same information as these regexes do. Check the link to a demo I made: http://www.martinwardener.com/regex/

d7samurai 2010-10-18 05:09:46

As I showed earlier, this takes a regex (that is *compiled* in itself before the application runs) and gets all the job done with a few lines of code. Even XPath would have to *somehow* search the whole text to pin down the tokens it is looking for - and probably even uses regex for that internally (!). Then you have a bunch of integrity checks and creating objects of the parsed data and organizing it. Then *you* can start actually searching through it. It's of course the price to pay for general capabilities. Specialization will always be better at some particular task. That's just how it is.

d7samurai 2010-10-18 05:15:54

The "regular" in "regular expressions" is not in reference to the text you are searching *through* - it is to the fact that what you are searching *for* has a regular form - a pattern.

d7samurai 2010-10-18 05:18:07

No, (X)HTML/XML parsers don't try permutations the way regex does. Sometimes regex accidentally hits a worst case. Feel free to post your benchmark results here, along with the methods you used to obtain them; hopefully you are willing to write a definitive conclusion. Sure, I will vote up if it turns out you are right.

Vantomex 2010-10-18 05:26:01
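
Since the thread asks for a benchmark method and not just opinions, here is one way such a comparison might be set up, sketched in Python. Everything in it is an assumption made for illustration: the naive `HREF_RE` pattern is not the pattern discussed in this thread, `html.parser` stands in for a DOM/XPath library, and the input is synthetic. No timings are claimed here; the numbers depend entirely on the machine and on the documents used.

```python
import re
import timeit
from html.parser import HTMLParser

# Deliberately naive pattern: quoted href values only.
HREF_RE = re.compile(r'href\s*=\s*["\']([^"\']*)["\']', re.IGNORECASE)


class HrefParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value is not None:
                self.hrefs.append(value)


def links_by_regex(html):
    return HREF_RE.findall(html)


def links_by_parser(html):
    parser = HrefParser()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Synthetic input; a real comparison should use saved copies of real pages.
document = '<p>text <a href="/page">link</a> more text</p>\n' * 2000

# Check that both methods agree before timing them.
same_result = links_by_regex(document) == links_by_parser(document)

regex_seconds = timeit.timeit(lambda: links_by_regex(document), number=5)
parser_seconds = timeit.timeit(lambda: links_by_parser(document), number=5)
print(f"regex: {regex_seconds:.4f}s  parser: {parser_seconds:.4f}s  same result: {same_result}")
```

Checking that both methods return the same links matters as much as the timing: a faster method that misses or misreads links is not doing the same job.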

I'm not talking about how the parsers work *after* they have scanned the raw document for tokens, I'm talking about how they scan it in the first place. As in, the source code for the parser itself, not the user interface. At some level, there will be plain text searching going on through the raw data. If the parser uses more time simply reading the document and creating the object model than the regex does on the whole job, it's already lost the efficiency test. If parsers somehow use some magic method for searching through text, I'd like to know how to use *that* directly.

d7samurai 2010-10-18 05:33:11

As for "voting up".. Lol. This whole thing was a question about whether regex was able to provide a certain functionality, not a competition between regex and various DOM implementations regarding general html parsing. But everyone seems to be so set on promoting DOM-type models that they lose track of what this is about. As for benchmarking, the online demo I put up should give you some pointers, at least (although it's running as an asp.net application on a shared server at discountasp.net).

d7samurai 2010-10-18 05:37:34

Well, regarding your last comment, I would say again that efficiency means "a minimum effort to achieve reliability with the fastest speed". I have looked at the link you gave; it is amazing and I appreciate your work, but since the links there refer to online pages, I couldn't see the real speed it offers.

Vantomex 2010-10-18 05:41:51

The task at hand is this: Find and extract as many links as possible in a given html document (in any variety). Then see how fast it's done, and how much code you need to write to do it.

d7samurai 2010-10-18 05:42:22

LOL! Didn't you start the competition by offering a challenge in your previous comment?

Vantomex 2010-10-18 05:44:30

(The other task is, of course, to find and extract all plain text from a html page). As for the demo, it doesn't count the time it takes to download the html - it starts the clock when it has the text and starts doing the search, and stops it when it has created the highlighted list of links/the plain text extract. But I included the option to see the markup, too, so that you can go through it and see what it has missed (if anything) and if it picked out something wrongly. It rarely does, and if so, it's a special case that generally is easily detected in the post processing (if you need any).

d7samurai 2010-10-18 05:46:22

Well, I'd be happy to see the two methods go head to head, but then someone would have to make a .Net implementation in DOM..

d7samurai 2010-10-18 05:47:51

No, what I meant was that no one here is saying that regex is a challenger to DOM when it comes to general html parsing. I'm only using it for these particular tasks. And that's why it surprises me that no one seems to want to admit that in a "flat file" parsing scenario like this, DOM might not be the best solution.

d7samurai 2010-10-18 05:49:50

And Vantomex, I have something for you LOL: http://www.martinwardener.com/booyaa/?article=14

d7samurai 2010-10-18 06:04:15

LOL! O damn good, thanks for publishing my personality and heroics to public. :-)

Vantomex 2010-10-18 07:20:14

See, this is what the routine is intended for :) It needs to isolate the text from the html so that it can perform search and replace on it. Then it detects all links and reconstructs them from relative to absolute where needed, before it packs the whole file into a database. It can then be retrieved like you just saw, and served from a different server with all links (and most functionality) intact.. :) BTW, I just uploaded a touched up version of the online regex demo, you should check it out. After I removed the result color coding from the actual extraction loop, it's much faster.

d7samurai 2010-10-18 08:11:28

Answer 3

A:

If you're looking to extract parts of a string not matched by a regex, you could simply replace the parts that are matched with an empty string for the same effect.

Note that the only reason this might work is because the tags you're interested in removing, <script> and <style> tags, cannot be nested.

However, it's not uncommon for one <script> tag to contain code to programmatically append another <script> tag, in which case your regex will fail. It will also fail in the case where any tag isn't properly closed.

meagar 2010-10-17 01:25:36

I can't do that, because the html page is supposed to work. It's supposed to be an exact replica of the original, functionality included, with just the text contents changed. If I just remove the tags and code, I'm left with the text, yes, but that means I'd have to store each removed portion in some other index to be able to glue it all back together after I'm done processing the text.

d7samurai 2010-10-17 01:31:48
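
The "store each removed portion and glue it back together" concern can be handled without a separate index if the split keeps the delimiters. A minimal sketch in Python, assuming a simplified markup pattern (`MARKUP` and `replace_in_text` are names invented here, and the pattern is a rough stand-in for the one in the question): `re.split` with a single capturing group returns the text and the markup pieces in their original order.

```python
import re

# One capturing group around the whole alternation, so re.split keeps the markup pieces.
MARKUP = re.compile(
    r"(<script\b[\s\S]*?</script>"   # script blocks, including their code
    r"|<style\b[\s\S]*?</style>"     # style blocks
    r"|<!--[\s\S]*?-->"              # comments
    r"|<[^>]*>)",                    # any other tag (naive: stops at the first '>')
    re.IGNORECASE,
)


def replace_in_text(html, old, new):
    pieces = MARKUP.split(html)
    # With a single capturing group, even indices are text and odd indices are markup.
    for i in range(0, len(pieces), 2):
        pieces[i] = pieces[i].replace(old, new)
    return "".join(pieces)


page = '<p class="href">An href is a link.</p><script>var href = 1;</script>'
result = replace_in_text(page, "href", "hello")
# result == '<p class="href">An hello is a link.</p><script>var href = 1;</script>'
```

Joining the pieces again reproduces the document exactly, apart from the replacements made in the text pieces.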

If you're actually interested in interacting with the non-script, non-style tags, use an xhtml parser. Regex isn't suitable.

meagar 2010-10-17 01:33:17

I'm not interested in interacting with any tags. I just want to flip certain words on the page. The application takes a web page (that it gets from a URL - from somewhere) and works on the "source" code, that is, the same document you get when you click "view source" in your browser. It goes through the page, detecting relative links, replaces them with absolute links and then stores the page in a compact, binary form in a database - so that the page can be retrieved directly later and displayed from a different server than the one it originated from.

d7samurai 2010-10-17 01:41:09

How can a `<script>` tag change a document that is already displayed in the browser? Is javascript capable of rewriting its own document after it's been rendered?

d7samurai 2010-10-17 01:47:53

While the tags themselves cannot be nested in the DOM, it is possible for one script tag to contain a statement like `$('body').append('<script src="/myscript"></script>');` which will break your regex solution.

meagar 2010-10-17 01:58:28

True. But only until the next `</script>` tag. And since the search and replace that will be performed on the "plain text" parts are unlikely to match with a few lines of programming syntax, it's acceptable. Unless some of the text in the script is fatally replaced (which isn't a big deal either - it only means that that web page doesn't render well after the replacement), it won't harm anything - and the page will work fine.

d7samurai 2010-10-17 02:28:13

I've noticed that many script authors write those passages as `('</scr'+'ipt>');` - surely to avoid this very problem. And with that being the case, the potential snags for this regex are even more negligible..

d7samurai 2010-10-17 05:05:22
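
The effect discussed in the last few comments can be reproduced with a small sketch. It is written in Python, using the pattern from this thread in Python's named-group spelling; the helper `text_chunks` and the two sample strings are made up for the demonstration.

```python
import re

# Python spelling of the thread's pattern: markup alternatives first, leftover text last.
PATTERN = re.compile(
    r"(?:(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)"
    r"|(?:<!--[\s\S]*?-->)"
    r"|(?:<[\s\S]*?>))"
    r"|(?P<text>[^<>]*)",
    re.IGNORECASE,
)


def text_chunks(html):
    """Return the non-empty chunks that the pattern classifies as plain text."""
    return [m.group("text") for m in PATTERN.finditer(html) if m.group("text")]


# A script that writes a literal closing script tag ...
literal = "<script>document.write('<script src=\"/x.js\"></script>');</script>done"
# ... and the same script with the closing tag split into two strings.
split_up = "<script>document.write('<script src=\"/x.js\"></scr' + 'ipt>');</script>done"

leaked = text_chunks(literal)     # ["');", "done"]
clean = text_chunks(split_up)     # ["done"]
```

In the first case the lazy match ends at the inner `</script>`, so the tail of the JavaScript statement is reported as page text; in the second case the whole script block is masked.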

Also, I'm not interested in removing `<script>` and `<style>` tags - or any other tags. The point here is to *preserve* everything - so that I can freely change the textual content (by means of search and replace on specific words and phrases) without accidentally "renaming" something in the code. If you do a .Replace("href", "hello") on the whole document, you'd also be replacing the hrefs in the code, and I want to be able to do that without affecting the code. I simply *mask* the code so I can get at the text, and I have done that, it works very well, but it's not 100% regex (see posted code).

d7samurai 2010-10-17 06:19:52

See this: http://www.martinwardener.com/regex/

d7samurai 2010-10-18 04:03:56

Answer 4

A:

OK, so here's how I'm doing it:

Using my original regex (with the added search pattern for the plain text, which happens to be any text that's left over after the tag searches are done):

(?:(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?P<text>[^<>]*)

Then in VB.Net:

Dim regexText As New Regex("(?:(?:<(?<tag>script|style)[\s\S]*?</\k<tag>>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?<text>[^<>]*)", RegexOptions.IgnoreCase)
Dim source As String = File.ReadAllText("html.txt")
Dim evaluator As New MatchEvaluator(AddressOf MatchEvalFunction)
Dim newHtml As String = regexText.Replace(source, evaluator)

The actual replacing of text happens here:

Private Function MatchEvalFunction(ByVal match As Match) As String
    Dim plainText As String = match.Groups("text").Value
    If plainText IsNot Nothing AndAlso plainText <> "" Then
        MatchEvalFunction = match.Value.Replace(plainText, plainText.Replace("Original word", "Replacement word"))
    Else
        MatchEvalFunction = match.Value
    End If
End Function

Voila. newHtml now contains an exact copy of the original, except every occurrence of "Original word" in the page (as it's presented in a browser) is switched with "Replacement word", and all html and script code is preserved untouched. Of course, one could / would put in a more elaborate replacement routine, but this shows the basic principle. This is 12 lines of code, including function declaration and loading of html code etc. I'd be very interested in seeing a parallel solution, done in DOM etc for comparison (yes, I know this approach can be thrown off balance by certain occurrences of some nested tags quirks - in SCRIPT rewriting - but the damage from that will still be very limited, if any (see some of the comments above), and in general this will do the job pretty darn well).

d7samurai 2010-10-17 04:32:20
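
For readers who don't use VB.Net, the same masking idea can be sketched in Python. This is a port written for illustration, not the code that runs in the application described here; `replace_words` and the sample `source` string are hypothetical.

```python
import re

# Same pattern as above, in Python's named-group syntax.
PATTERN = re.compile(
    r"(?:(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)"
    r"|(?:<!--[\s\S]*?-->)"
    r"|(?:<[\s\S]*?>))"
    r"|(?P<text>[^<>]*)",
    re.IGNORECASE,
)


def replace_words(html, old, new):
    """Replace `old` with `new` only in chunks captured by the "text" group."""

    def evaluate(match):
        text = match.group("text")
        if text:
            return text.replace(old, new)
        # Markup (or an empty match): hand it back unchanged.
        return match.group(0)

    return PATTERN.sub(evaluate, html)


source = '<a href="/href.html" title="Original word">Original word</a><!-- Original word -->'
new_html = replace_words(source, "Original word", "Replacement word")
# new_html == '<a href="/href.html" title="Original word">Replacement word</a><!-- Original word -->'
```

Only the link text changes; the `title` attribute and the comment keep the original wording because they are matched by the markup alternatives first.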

For those interested, here's a screenshot of the regex itself at work in RegexBuddy (showing the html source for this page: http://www.newscientist.com/article/mg20827826.200-a-3d-model-of-the-ultimate-ear.html). I also pasted in a couple of "challenges" to the regex by some commenters, so it's easy to see how the regex easily picks up only what it's supposed to. The regex used here is without the trailing part that also selects the text parts and tags them as "text" through a capture group - for clarity. Each coloured block shows a match from the regex. http://www.martinwardener.com/regex.jpg

d7samurai 2010-10-17 04:59:44

And here is an online implementation that shows the regex matches when applied to a given web page: http://www.martinwardener.com/regex/

d7samurai 2010-10-18 04:04:32

Answer 5

A:

You cannot parse HTML with regular expressions.

Parsing HTML with regular expressions leads to sadness.

I know you're just doing it for fun, but there are so many packages out there that actually do the parsing the right way, AND do it reliably, AND have been tested.

Don't go reinventing the wheel, and doing it a way that is all but guaranteed to frustrate you down the road.

Andy Lester 2010-10-17 05:25:46

But if you read the comments here, you'll see that it's not about parsing the html per se - "merely" to mask *all of it* out so that one can run generic text processing on the pure text alone, leaving the html untouched (because the point here is to leave the entire *structure* intact, and simply edit the textual content). The html page is supposed to be stored locally (with textual changes) and then served on demand - which is why I also made a regex for extracting links (of all sorts) so that all relative links can be reconstructed in an absolute form. See screenshots in other comments.

d7samurai 2010-10-17 05:45:22

The solution I went for is posted in my own answer above, and performs this task very reliably in just 12 lines of program code. As I have mentioned earlier, I'm not a DOM experienced person, so I would really like to see code for that same task done in the DOM/HTML Agility/XPath etc way.

d7samurai 2010-10-17 05:50:21

"to mask all of it" requires finding the HTML markup which is part of parsing.

Andy Lester 2010-10-17 06:48:44

Fair enough. But then this regex does that extremely efficiently: `(?:(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?P<text>[^<>]*)`. Try it on some hefty html sourcecode. I have posted a couple of links to RegexBuddy screenshots (which will visualize that well) in the comments of some of the other answers.

d7samurai 2010-10-17 07:13:21

Try this: http://www.martinwardener.com/regex/

d7samurai 2010-10-18 04:03:37

Answer 6

A:

I suggest you read http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags.

Ms2ger 2010-10-29 08:24:48

I have already read it. Since you suggested that link, I suggest you read the threads and my comments on THIS page. And also check out the actual implementation demo here: http://www.martinwardener.com/regex where you can see the results and performance for yourself.

d7samurai 2010-10-29 08:50:39

Also, I'd like to REPEAT the fact that this question is really about whether regex is capable of "inverting" its match results, not whether regex is in any way an alternative to DOM parsing. However, commentators seemed to be more concerned with attacking THAT notion, most notably bobince, who seems to have been violated by a regex/html combo as a child, based on his attitude. But regardless, in THIS case, it's not a question of parsing HTML as such. This task can be boiled down to a flat text pattern search, since it doesn't concern itself with the specifics of tags or their hierarchy.

d7samurai 2010-10-29 09:05:40
