Is there a unified way to do this? Browsers usually don't respond as expected to user32's GetWindowText and SendMessage, which you can use to scrape the text out of most win32 applications.

I'd like to get the equivalent of "View Source" on the open web page.

Currently, I'm using the API for screen readers to scrape from IE, but that doesn't work for FF or Chrome or etc.

ANY suggestions (not requiring me to proxy or try to connect to a web server) to do this for different browsers are GREATLY appreciated. Again, I'm using the screen reader API for IE (which doesn't always work correctly!)...


This question now has a bounty.

To collect the bounty, I want to see code, in C#, to scrape an IE window and a FireFox window.

Alternatively, code for a HTTP proxy that does the same thing (i.e., record the current request's content) that works for HTTPS as well.


Let me be clear... My requirements are to scrape a window. Oh, and by the way, that window is a browser. I give you the option to show me a proxy to claim the bounty. I already understand there are proxies and what they do, telling me about their existence isn't helping me.

I also can't just connect to the url in code and scrape the response object. Why? Requirements won't let me, among other issues.

I require a solution in .net, as the tags say.

+3  A: 

Why don't you just issue a simple HttpWebRequest for the page to get all the HTML as a string?

Edit:

Here is some code that illustrates the meaning of my first sentence.
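
The snippet itself didn't survive the page scrape; a minimal sketch of what it likely showed, fetching a page's HTML as a string with HttpWebRequest (class and method names here are made up for the example):

    using System.IO;
    using System.Net;

    class PageFetcher
    {
        // Fetch the raw HTML of a URL as a string.
        public static string GetHtml(string url)
        {
            var request = (HttpWebRequest)WebRequest.Create(url);
            using (var response = (HttpWebResponse)request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
            {
                return reader.ReadToEnd();
            }
        }
    }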

Daok
That was my first thought as well but I think he's wanting to monitor a user's browsing - not just scrape an html page.
Lance McNearney
Won't work--cookies, security, etc. BTW, I didn't down you. I don't down people who answer my questions.
Will
First of all, you can handle cookies and security with HttpWebRequest. Second, -1 ...
Daok
I have yet to see an authentication system I couldn't scrape using HttpWebRequest and keeping the cookie container. There's always a session token somewhere at the client; all you have to do is find it.
Joel Coehoorn
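
To illustrate Joel's point, a rough sketch of keeping one CookieContainer across requests so the session token rides along (the class and method names are hypothetical):

    using System.IO;
    using System.Net;

    class AuthenticatedScraper
    {
        // Sharing one CookieContainer across requests preserves the session.
        private readonly CookieContainer _cookies = new CookieContainer();

        public string Get(string url)
        {
            var request = (HttpWebRequest)WebRequest.Create(url);
            request.CookieContainer = _cookies; // session cookies ride along
            using (var response = (HttpWebResponse)request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
            {
                return reader.ReadToEnd();
            }
        }
    }
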
No... no... The user has logged into a secure system and is viewing a webpage. In order for me to do this from my application, I must log into the secure system, if I want to use a HttpRequest. Understand the issue?
Will
You aren't being clear about the secure system... I think you need to define what you need more precisely before requesting a clear solution. Whatever security your browser's request handles, you should be able to handle with a programmatic request too...
Daok
I just want to scrape a browser, not browse to where the user is. Those are my requirements.
Will
What if there's JavaScript-generated content on the page?
Ates Goral
I want it. I want everything. You see the browser window? You see text in window? I want text.
Will
Will, I would automate logging into the secure system and then from there send an HttpRequest (or equivalent). I understand that's not what you want, but anyhow, that's how I'd do it. :)
skiphoppy
Well, unfortunately that won't work in this case. In other cases (e.g., when not scraping the window) it would be the best way, yes.
Will
A: 

I'd have thought it was a per-browser thing.

mackenir
Yeah, and it's tough, hence the question.
Will
The question is "is there a single way that works for all browsers?". The answer (I am guessing) is surely "no, any solution would be tailored to an individual browser".
mackenir
Can you imagine a single piece of code that gets HTML both from a browser written in Java and from Internet Explorer? Without a switch that effectively makes it two pieces of code? I can't.
mackenir
There's a unified way to get text out of every single window on a MS box. Why are browsers different?
Will
Because a browser window doesn't contain 'text'.
mackenir
@mackenir - That's entirely false; most text in windows flows through a handful of APIs. A browser window does contain text, but it also contains images as well.
Gavin Miller
So when I navigate to a jpg, my browser contains text. Interesting. Browsers can display HTML but that isn't text - it's markup.
mackenir
@mackenir - You're absolutely right. However let's clarify on logic "a browser window doesn't contain text" - that's false. "Some browser windows don't contain text" - that's true and follows with the example you give of a jpg in a browser window.
Gavin Miller
Browsers render markup to text. They don't display markup. When we "View Source" of the page, that is the markup.
Gavin Miller
A browser takes some content (sometimes binary, sometimes textual format) and renders it to the screen. IYAM it only ever can be said to contain plain text when you use it to view a plain text file. I think the 'clarification of logic' is hair-splitting and a bit of a waste of time.
mackenir
BTW, if you drew a Venn diagram, "Markup" would be a smaller circle entirely encompassed by "Text". Splitting hairs doesn't help. Of course, to make a decent browser scraper you not only need to be able to grab the markup/text displayed, but also the text within Java, Silverlight, and other similar embedded content types. Not fun.
Will
+2  A: 

I would suggest you use a mechanize "module". It will eliminate all need for browsers (some of them include JavaScript support, too). Here are some links:

Of course there are alternatives for other languages too; for example, Java has HtmlUnit, etc. If there is a reason why you need the output from the browser's window, please elaborate on that.

EDIT: having read your comments, why don't you make the browser go through a proxy? This way you can monitor all browser traffic.

Geo
My requirements. That's why.
Will
Will, the question is "why do your requirements keep you from doing this?" "My requirements" is not an answer to that question. You don't have to answer, of course, but people are curious.
skiphoppy
If you'll provide more information about your requirements, including the reasoning behind them, somebody may be able to find a solution that will solve all your problems even if your preferred solution is not possible.
skiphoppy
Well, the requirements are essentially "user clicks button, you scrape the contents of the browser. No complex setup required." Proxies, while an option, are fragile and require relatively complex setup on the user side. It's an option I would want to avoid at all costs.
Will
A: 

Not sure if this is exactly what you want, but you could use WatiN (which uses SHDocVw to control the browser) to grab the source of a specific url that you control from the program. The downside is that to make it work with FF you only get what's inside the <body> element. It would go something like:

        IBrowser browser = BrowserFactory.Create(BrowserType.FireFox);

        browser.GoTo(url);
        browser.WaitForComplete();

        string src = browser.Html;
Jaime
Thanks for the try. My requirements won't let me do this. I need to scrape any and all visible text out of a window that just happens to be a browser.
Will
+9  A: 
Gavin Miller
Damnit... I was really thinking nobody would give me a real answer. This might very well be it. Give me a couple days to research it. If I have any concerns I'll update here.
Will
It's still on my plate....
Will
This doesn't give you the HTML.
mackenir
Not the optimal answer, definitely.
Will
+2  A: 

You could use Selenium (cross-browser) to collect the data you need directly from the DOM tree via id/CSS/XPath locators. That way, you would have access to DHTML content (which can't possibly be scraped by just parsing the fetched HTML). You can either issue a runScript command or write your own Selenium extension command to post the collected data (JSON via Ajax) to a URL. Or if you don't want to deal with Ajax, you could change the <title> and grab the window title of the running browser window.

Most answers here focus on directly scraping the fetched HTML. However, since most modern websites employ DHTML in one form or another, wouldn't you want to scrape the content off the rendered page?
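
If you went the title-grabbing route, the receiving side in .NET could be as small as a GetWindowText P/Invoke. A rough sketch (finding the browser's window handle is left out and would vary per browser):

    using System;
    using System.Runtime.InteropServices;
    using System.Text;

    class WindowTitleReader
    {
        [DllImport("user32.dll", CharSet = CharSet.Auto)]
        static extern int GetWindowText(IntPtr hWnd, StringBuilder text, int maxLength);

        // hWnd is assumed to be the browser's top-level window handle,
        // e.g. obtained from Process.MainWindowHandle.
        public static string ReadTitle(IntPtr hWnd)
        {
            var buffer = new StringBuilder(4096);
            GetWindowText(hWnd, buffer, buffer.Capacity);
            return buffer.ToString();
        }
    }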

Ates Goral
Let me do some research on this one, both in how it might work (yes, I need to scrape a rendered page), and licensing issues (I'd have to distribute it as part of an application). Gimme a couple days.
Will
Am I right in understanding this has to be installed on the server? I need to scrape on a user workstation...
Will
No, Selenium runs inside the browser (client-side).
Ates Goral
I'm rejecting this answer, sorry. It would add a crapton of requirements on users; I'm looking for the "it just works" solution. There's no direct API, just an API for interacting with a local server, which requires Java and a proxy and has to own the browser process... Thanks for the suggestion.
Will
Actually, you should look into the Selenium test runner. You don't have to use Selenium RC or Selenium Server or Java etc. The test runner is a stand-alone HTML file that will launch any test suite inside any browser you want. Your test suite can scrape data from the DOM.
Ates Goral
A: 

I can see some reasons why your requirements require you to get the HTML from within the browser rather than scrape separately with Mechanize, Scrapy, etc.:

  • You don't know ahead of time what page you want to scrape
  • You need to login first and can't easily automate it

You could try to produce a solution at the Win32 level that works with any browser, but it would be specific to Windows and would probably contain a lot of code to deal with the differences between browsers.

How about writing a Firefox extension? This would easily allow you to extract the current page's HTML on demand. I'm not sure how well it will apply, but take a look at Piggy Bank. One additional benefit of a Firefox extension is that it's cross-platform. I'm not too familiar with Google Chrome; does it allow similar extensions?

Mat
A Firefox extension would be great, except for those running IE, Chrome, etc. Writing an extension for a single browser (which means writing at least three) isn't ideal. However, it definitely is a way to have the entire DOM open for inspection. I was hoping for a more general and simple solution.
Will
Having just looked at this again, Piggy Bank and/or Solvent seem like a ready-made solution for Firefox (I've not used them, though); maybe something similar is available for IE? Anything outside the browser is unlikely to be able to get the HTML unless you go down to the network level.
Mat
A: 

Hmm. Just a thought: add a FileSystemWatcher on the known temp folders for all browsers' cache storage. When the FileSystemWatcher signals that a new page has been cached, read the file in your own instance of Internet Explorer in your program, and fetch the outerText (or whatever you want) from the DOM to get all the text but not the HTML tags. See the sketch below.

A possible pitfall is if secure pages don't land in the cache; I think they do but are erased after the visit, but I'm not sure..
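
A minimal sketch of the watching half, assuming you already know a cache folder to point it at (the folder path and the IE/DOM step are left to the caller):

    using System;
    using System.IO;

    class CacheWatcher
    {
        // Watch a browser cache folder for newly written files.
        public static FileSystemWatcher Watch(string cacheFolder)
        {
            var watcher = new FileSystemWatcher(cacheFolder)
            {
                IncludeSubdirectories = true
            };
            watcher.Created += (sender, e) =>
            {
                // e.FullPath is the newly cached file; feed it to an
                // offscreen browser or DOM parser to pull out the text.
                Console.WriteLine("Cached: " + e.FullPath);
            };
            watcher.EnableRaisingEvents = true;
            return watcher;
        }
    }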

Stefan
Nice backdoor; however, it would be pretty hard to monitor all the files that get dumped in the temp cache(s) and determine which specifically hold the current browser contents.
Will
A: 

Can't be done in purely managed code. You can do this in something like AutoHotkey:

Select the window with title xxx, then send Ctrl+A, Ctrl+C,

and paste where you need it.
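
That said, the same select-all/copy trick can be approximated from .NET with SendKeys and the clipboard. A fragile sketch, assuming you have the browser's window handle; it depends on focus and timing, and must run on an STA thread for the clipboard call:

    using System;
    using System.Runtime.InteropServices;
    using System.Threading;
    using System.Windows.Forms;

    class ClipboardScraper
    {
        [DllImport("user32.dll")]
        static extern bool SetForegroundWindow(IntPtr hWnd);

        // Bring the target window forward, select all, copy, read the clipboard.
        // Must be called from an STA thread (a Clipboard requirement).
        public static string Scrape(IntPtr browserWindow)
        {
            SetForegroundWindow(browserWindow);
            Thread.Sleep(250);          // give the window time to take focus
            SendKeys.SendWait("^a");    // Ctrl+A: select all
            SendKeys.SendWait("^c");    // Ctrl+C: copy the selection
            Thread.Sleep(250);          // let the clipboard update
            return Clipboard.GetText();
        }
    }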

Mischa Kroon
+1  A: 

This is strictly for viewing/displaying in IE, but you could use the WebBrowser control.

    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
            Scrape();
        }

        private void Scrape()
        {
            // Hook the completion event, then start the (asynchronous) navigation.
            this.webBrowser1.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(webBrowser1_DocumentCompleted);
            this.webBrowser1.Url = new Uri("http://www.stackoverflow.com");
        }

        void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
        {
            // DocumentCompleted also fires for frames; only react to the top-level page.
            if (e.Url == webBrowser1.Url)
            {
                MessageBox.Show(webBrowser1.DocumentText);
            }
        }
    }
Pat