views: 3476
answers: 8
I've saved an entire web page's HTML to a string, and now I want to grab the "href" values from the links, preferably with the ability to save them to different strings later. What's the best way to do this?

I've tried saving the string as an .xml doc and parsing it using an XPathDocument navigator, but (surprise, surprise) it doesn't navigate a not-really-XML document very well.

Are regular expressions the best way to achieve what I'm trying to accomplish?

+6  A: 

Regular expressions are one way to do it, but it can be problematic.

Most HTML pages can't be parsed with strict XML techniques because, as you've found out, most don't validate.

You could spend the time trying to integrate HTML Tidy or a similar tool, but it would be much faster to just build the regex you need.

Chris Lively
Good answer - regex is your friend!
Jarrod Dixon
Bad answer. Don't do this.
SLaks
-1 Hmmm, using Regex to parse HTML. What could possibly go wrong? Oh that's right: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
Ash
Is it a bad idea to try to parse ALL the tags with regex? Yes. However, regex is well suited to grabbing all the href="whatever" values out of a string, which is what the OP wanted to do.
Chris Lively
A: 

You might have more luck with XML if you know the document is, or can be fixed to be, at least well-formed. If you have good HTML (or rather, XHTML), the XML system in .NET should be able to handle it. Unfortunately, good HTML is extremely rare.

On the other hand, regular expressions are really bad at parsing HTML. Fortunately, you don't need to handle the full HTML spec. All you need to worry about is parsing href= strings to get the URL. Even this can be tricky, so I won't attempt it right away. Instead I'll start by asking a few questions to try to establish a few ground rules. They basically all boil down to "How much do you know about the document?", but here goes:

  • Do you know if the "href" text will always be lower case?
  • Do you know if it will always use double quotes, single quotes, or nothing around the url?
  • Will it always be a valid URL, or do you need to account for things like '#', javascript statements, and the like?
  • Is it possible you're working with a document where the content itself describes HTML features (i.e., href= could appear in the text without belonging to an anchor tag)?
  • What else can you tell us about the document?
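If the answers to those questions came back as "always lowercase, always double quotes, but href= can appear outside a tag", a minimal sketch might anchor the match on a literal <a> so stray href= text is skipped. The pattern, test input, and names below are illustrative assumptions, not anything from this thread:

```csharp
using System;
using System.Text.RegularExpressions;

class AnchorHrefSketch
{
    static void Main()
    {
        // A stray href= outside an anchor tag, plus a real link.
        string html = "<p>href=\"decoy\"</p><a href=\"http://example.com/\">link</a>";

        // Requiring the literal <a keeps the decoy from matching.
        var anchorHref = new Regex(@"<a\b[^>]*\bhref=""(?<url>[^""]*)""");

        foreach (Match m in anchorHref.Matches(html))
            Console.WriteLine(m.Groups["url"].Value); // http://example.com/
    }
}
```

Single quotes or unquoted values would need extra alternations, and that is roughly the point where a lenient parsing library starts to win.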
Joel Coehoorn
I know the href text will always be lower case. It will always use double quotes. It may or may not always be a valid URL, but I'm assuming it will be 99% of the time. The doc has a chance of having "href" elsewhere. That's all I can think of. Would a parsing function really be better than regex?
Matt S
The killer here is allowing href= elsewhere. It sends you back to finding a real anchor tag, and that means you're better off using a (very lenient) parsing library. You might even try loading it into a WebBrowser control.
Joel Coehoorn
+2  A: 

You probably want something like the Majestic-12 parser: http://www.majestic12.co.uk/projects/html_parser.php

There are a few other options that can deal with flaky html, as well. The Html Agility Pack is worth a look, as someone else mentioned.

I don't think regexes are an ideal solution for HTML, since HTML is not a regular language. They'll probably produce an adequate, if imprecise, result; even deterministically identifying a URI is a messy problem.

JasonTrue
+2  A: 

I agree with Chris Lively: because HTML is often not well formed, you're probably best off with a regular expression for this.

href=[\"\'](http:\/\/|\.\/|\/)?\w+(\.\w+)*(\/\w+(\.\w+)?)*(\/|\?\w*=\w*(&\w*=\w*)*)?[\"\']

From here on, RegExLib should get you started
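In C#, a pattern like this trips the compiler's "unrecognized escape sequence" errors in an ordinary string, because the backslashes are read as string escapes before the regex ever sees them. A verbatim string (the @ prefix) passes backslashes through untouched; the only change needed is writing any embedded double quote as two double quotes. A sketch, with a test input of my own invention:

```csharp
using System;
using System.Text.RegularExpressions;

class VerbatimRegexDemo
{
    static void Main()
    {
        // @-string: backslashes stay literal; embedded " is written "".
        var href = new Regex(
            @"href=[""'](http:\/\/|\.\/|\/)?\w+(\.\w+)*(\/\w+(\.\w+)?)*(\/|\?\w*=\w*(&\w*=\w*)*)?[""']");

        Console.WriteLine(href.IsMatch(@"<a href=""http://example.com/index.html"">")); // True
    }
}
```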

Tim Jarvis
Thanks Time. I'm trying to use this, however, C# keeps telling me all the backslashes are "unrecognized escape sequence"s. Throwing an @ in there doesn't help either. Do you know what's going on?
Matt S
Hahah, I meant "Thanks TIM". Time doesn't deserve any thanks.
Matt S
This link helped me figure it out http://regexadvice.com/forums/thread/36529.aspx
Matt S
It's because HTML is so often not well formed that you shouldn't use RegEx: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
Ash
I'd agree in the general case: parsing HTML is a lot harder than anyone expects, and generally way too hard for regex alone. But in this specific case, parsing just for hrefs, regex is fine and easier than an XML DOM.
Tim Jarvis
+2  A: 

For dealing with HTML of all shapes and sizes I prefer to use the Html Agility Pack (http://www.codeplex.com/htmlagilitypack). It lets you write XPath queries against the nodes you want and get them back in a collection.

Duncan
+22  A: 

I can recommend the HTML Agility Pack. I've used it in a few cases where I needed to parse HTML and it works great. Once you load your HTML into it, you can use XPath expressions to query the document and get your anchor tags (as well as just about anything else in there).

HtmlDocument yourDoc = new HtmlDocument();
yourDoc.LoadHtml(htmlString); // the page HTML you already have as a string
int someCount = yourDoc.DocumentNode.SelectNodes("//a[@href]").Count;
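A sketch of the OP's actual goal, pulling each href into its own string with the Agility Pack. Note that this assumes the external HtmlAgilityPack NuGet package, the variable names are my own, and SelectNodes returns null when nothing matches, so a real program should check for that:

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // external NuGet package

class HrefCollector
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<a href=\"/one\">1</a><a href=\"/two\">2</a>");

        // "//a[@href]" selects only anchors that actually carry an href.
        var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
        var hrefs = new List<string>();
        if (nodes != null)
            foreach (var a in nodes)
                hrefs.Add(a.GetAttributeValue("href", ""));

        Console.WriteLine(string.Join(", ", hrefs)); // /one, /two
    }
}
```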
Jeff Donnici
And it's really easy to use.
Carles
+1  A: 

It is always better, if possible, not to reinvent the wheel. Some good tools exist that either convert HTML to well-formed XML or act as an XmlReader:

Here are three good tools:

  1. TagSoup, an open-source program, is a Java- and SAX-based tool developed by John Cowan. It is a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
    Taggle is a commercial C++ port of TagSoup.

  2. SgmlReader is a tool developed by Microsoft's Chris Lovett.
SgmlReader is an XmlReader API over any SGML document (including built-in support for HTML). A command-line utility is also provided which outputs the well-formed XML result.
    Download the zip file including the standalone executable and the full source code: SgmlReader.zip

  3. An outstanding achievement is the pure XSLT 2.0 Parser of HTML written by David Carlisle.

Reading its code would be a great learning exercise for every one of us.

From the description:

"d:htmlparse(string)
 d:htmlparse(string,namespace,html-mode)

  The one argument form is equivalent to
  d:htmlparse(string,'http://www.w3.org/1999/xhtml',true())

  Parses the string as HTML and/or XML using some inbuilt heuristics to
  control implied opening and closing of elements.

  It doesn't have full knowledge of HTML DTD but does have full list of
  empty elements and full list of entity definitions. HTML entities, and
  decimal and hex character references are all accepted. Note html-entities
  are recognised even if html-mode=false().

  Element names are lowercased (if html-mode is true()) and placed into the
  namespace specified by the namespace parameter (which may be "" to denote
  no-namespace) unless the input has explicit namespace declarations, in
  which case these will be honoured.

  Attribute names are lowercased if html-mode=true()
"

Read a more detailed description here.

Hope this helped.

Cheers,

Dimitre Novatchev.

Dimitre Novatchev
A: 

I've linked some code here that will let you use "LINQ to HTML"...

http://stackoverflow.com/questions/100358/looking-for-c-html-parser/624410#624410

Frank Schwieterman