I'm trying to use System.Windows.Forms.HTMLDocument in a console application. First, is this even possible? If so, how can I load up a page from the web into it? I was trying to use WebBrowser, but it's telling me:

Unhandled Exception: System.Threading.ThreadStateException: ActiveX control '8856f961-340a-11d0-a96b-00c04fd705a2' cannot be instantiated because the current thread is not in a single-threaded apartment.

There seems to be a severe lack of tutorials on the HTMLDocument object (or Google is just turning up useless results).


Just discovered mshtml.HTMLDocument.createDocumentFromUrl, but that throws:

Unhandled Exception: System.Runtime.InteropServices.COMException (0x80010105): The server threw an exception. (Exception from HRESULT: 0x80010105 (RPC_E_SERVERFAULT))
   at System.RuntimeType.ForwardCallToInvokeMember(String memberName, BindingFlags flags, Object target, Int32[] aWrapperTypes, MessageData& msgData)
   at mshtml.HTMLDocumentClass.createDocumentFromUrl(String bstrUrl, String bstrOptions)
   at iget.Program.Main(String[] args)

What the heck? All I want is a list of <a> tags on a page. Why is this so hard?


For those that are curious, here's the solution I came up with, thanks to TrueWill:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Net;
using System.IO;
using HtmlAgilityPack;

namespace iget
{
    class Program
    {
        static void Main(string[] args)
        {
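            WebClient wc = new WebClient();
            HtmlDocument doc = new HtmlDocument();
            doc.Load(wc.OpenRead("http://google.com"));
            // SelectNodes returns null, not an empty collection, when nothing matches.
            HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
            if (links == null) return;
            foreach (HtmlNode a in links)
            {
                Console.WriteLine(a.Attributes["href"].Value);
            }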
        }
    }
}
A: 

If it's XHTML, load it into an XDocument and parse the anchor tags out. If all you need is the anchor tags, you could also do it with a regex.

Wil P
It's not XHTML. RegEx is a hack... I have no idea how malformed the HTML I'll be working with is. I need the links (hrefs) in the anchors.
Mark
Why is regex a hack? It's easy enough to get the hrefs too. Plus, regex is fast.
Wil P
As for why regex (in this case) is a hack, see http://www.codinghorror.com/blog/archives/001311.html
TrueWill
Good enough. Thx
Wil P
+2  A: 

Add the [STAThread] attribute to your Main method:

    [STAThread]
    static void Main(string[] args)
    {
    }

That should fix it.

chris.w.mclean
Yep. That seems to get rid of the error. Thank you!
Mark
I don't think it solves the problem though. I've created a WebBrowser object, and then I `Navigate` to `google.com`. I've attached a `DocumentCompleted` event handler so I know when it's done loading, but it never gets fired. In fact, the program just runs to completion almost immediately, which tells me it's not waiting for the page to load at all. I don't think it likes being single-threaded.
Mark
Looks like you'd also need a message pump. See http://stackoverflow.com/questions/764869/c-console-app-event-handling
TrueWill
That sounds nasty. Way too much work just to read an HTML doc from the web :) Thanks though.
Mark
mshtml is definitely not designed for console use. Using it in server-side applications has long been discouraged for the same reasons. Html Agility Pack is a great alternative for parsing though.
Josh Einstein
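
For anyone who still wants the WebBrowser route discussed in the comments above, here is a rough sketch of what "STA thread plus message pump" could look like in a console app. It is an untested illustration, not a recommended approach: the class name `StaBrowserDemo` is made up, it assumes a reference to System.Windows.Forms, and `DocumentCompleted` can fire more than once on pages with frames, so a real program would want a stricter completion check.

```csharp
using System;
using System.Threading;
using System.Windows.Forms;

class StaBrowserDemo // hypothetical name, for illustration only
{
    static void Main()
    {
        // The WebBrowser ActiveX control must live on a single-threaded apartment thread.
        var thread = new Thread(() =>
        {
            using (var browser = new WebBrowser())
            {
                browser.ScriptErrorsSuppressed = true;

                browser.DocumentCompleted += (sender, e) =>
                {
                    // Print the href of every <a> element in the loaded document.
                    foreach (HtmlElement a in browser.Document.GetElementsByTagName("a"))
                    {
                        Console.WriteLine(a.GetAttribute("href"));
                    }

                    // Stop the message loop so Application.Run() returns.
                    Application.ExitThread();
                };

                browser.Navigate("http://google.com");

                // Message pump: without it, navigation events are never dispatched
                // and the program exits before the page loads.
                Application.Run();
            }
        });

        thread.SetApartmentState(ApartmentState.STA);
        thread.Start();
        thread.Join();
    }
}
```

Running the browser on its own thread keeps the apartment state and the message loop in one place, and `thread.Join()` keeps `Main` from returning before the page has loaded.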
+4  A: 

As an alternative, you could use the free Html Agility Pack library. That can parse HTML and will let you query it with LINQ. I used an older version for a project at home and it worked great.

EDIT: You may also want to use the WebClient or WebRequest classes to download the web page. See my blog post on Web scraping in .NET. (Note that I haven't tried this in a console app.)

TrueWill
Not familiar with LINQ, but a quick glance over that front page mentions XPATH, which is good! Might give this a go if chris's solution doesn't work.
Mark
@Mark: You don't have to use LINQ - when I was using the library that feature hadn't been added. It was still pretty easy. You could create an XPathNavigator, call Select on that and pass in an XPath string, then iterate over the result. SelectSingleNode is the other major method I used.
TrueWill
I added some code to my question. Works great in a console :)
Mark
@Mark: Thanks! Your code is very concise. One aside: It probably isn't relevant in your program, but WebClient is IDisposable.
TrueWill
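
Pulling together the two points from this answer's thread (querying with LINQ, and WebClient being IDisposable), a variant might look roughly like the following. This is only a sketch: it assumes a version of Html Agility Pack that exposes the LINQ-style `Descendants` method, the class name `LinkLister` is made up, and `DownloadString` relies on WebClient's default encoding handling, which may not suit every page.

```csharp
using System;
using System.Linq;
using System.Net;
using HtmlAgilityPack;

class LinkLister // hypothetical name, for illustration only
{
    static void Main()
    {
        string html;

        // WebClient is IDisposable, so release it as soon as the download finishes.
        using (var wc = new WebClient())
        {
            html = wc.DownloadString("http://google.com");
        }

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // LINQ alternative to the XPath query "//a[@href]".
        var hrefs = doc.DocumentNode
            .Descendants("a")
            .Select(a => a.GetAttributeValue("href", ""))
            .Where(href => href.Length > 0);

        foreach (string href in hrefs)
        {
            Console.WriteLine(href);
        }
    }
}
```

Unlike `SelectNodes`, `Descendants` yields an empty sequence when nothing matches, so no null check is needed here.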