I'm working on a web scraping application and was testing it against eBay. The application should follow the "Next" link (the one at the bottom of the page that goes to the next page of results), but it seems to stay on the same page (though I'm actually not sure about that). If you open eBay, search for any term that returns multiple pages of results, and then either copy the "Next" link and paste it into a new window, or right-click it and choose open in a new tab/window, it stays on the same page. I tested this in Chrome and IE8. So my question is: what are these browsers doing when they actually follow the link (when I just click it), so that I can do the same in my scraping application? (Oh, and by the way, I'm working in C#.)

+1  A: 

In the case of eBay it is just a normal link (at least on http://www.ebay.com; look for page 2 of TVs), so the problem is probably in your code (are you storing cookies, for instance?). From your description it sounds like an AJAX request, which would go "under the hood" and fetch XML from the server that is then rendered by JavaScript on the client side.

Traditionally, AJAX requests are hard to follow. In the case of eBay, however, I'd suggest using the API that eBay provides for querying information. If you are building a generalized web crawler, stay away from AJAX requests; Google doesn't bother with them either, most of the time.
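If missing cookies turn out to be the problem, a common fix is a `WebClient` subclass that keeps one cookie jar across requests. This is a sketch of a standard .NET pattern, not code from the thread, and the class name is made up:

```csharp
using System;
using System.Net;

// A WebClient that keeps cookies between requests. The stock WebClient
// discards them, so a site that tracks paging state in a cookie can keep
// serving the first page of results over and over.
class CookieAwareWebClient : WebClient
{
    private readonly CookieContainer cookies = new CookieContainer();

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        HttpWebRequest httpRequest = request as HttpWebRequest;
        if (httpRequest != null)
        {
            // Reuse the same cookie jar for every request this client makes.
            httpRequest.CookieContainer = cookies;
        }
        return request;
    }
}
```

With `new CookieAwareWebClient().DownloadString(nextUrl)`, the server sees on page 2 the same session cookies it set on page 1.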

Abel
+1 for using the provided APIs.
overslacked
But have you tried copying the link and pasting it into a new window? Doesn't it go back to the first page? I just want to know what the browser does when I click "Next" so that it lands on the next page, so that my program can do the same.
jsoldi
@jsoldi: I have no idea what you are doing to get a different result, but that's exactly what I did and it *just works*. Right-click the link above (under *"page 2 of TVs"*) and select *"Copy Link Location"* (or the equivalent in your language / browser). Paste it into a new browser window, or even another browser (from FF to IE to Chrome), and you'll see that it shows the second page, just as when you click the link. Your program will do exactly the same if it executes an `HTTP GET` (e.g., with `System.Net.WebClient`) on that address and processes the received page.
Abel
@jsoldi, quote *"what the browser does when I click on "Next" "* >> maybe I misunderstood your question, but what the browser does is interpret whatever is under the `<a href="...">` (or whatever a script has placed there), and then simply issue an `HTTP GET`. If you want to know exactly how this works, *HTTP: The Definitive Guide* by Gourley et al. and *High Performance Web Sites* by Souders are books that both explain this process in great detail.
Abel
A: 

I did an element.InvokeMember("click"); (where element is an HtmlElement) and it worked. I'm not sure why, though. I'll take a look at that HTTP GET thing anyway.

jsoldi
`HTTP GET` happens for every request, *always* (well, unless it's an `HTTP POST`). Invoking the click event works, but it's a terrible method if you are building a crawler, because it assumes interaction and might run JavaScript. You should just take the URL (which I assume you got by parsing the raw document data you received) and request it (with `WebClient`, for instance).
Abel
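Abel's suggestion can be sketched like this: pull the href out of the received HTML, then request it directly. The markup and URL below are made up for illustration, and a real crawler should use a proper HTML parser rather than a regex:

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

class NextLinkFollower
{
    // Pull the href out of a "Next" anchor in raw HTML. A regex is
    // enough for a sketch; an HTML parser is the robust choice.
    public static string ExtractNextUrl(string html)
    {
        Match m = Regex.Match(
            html,
            "<a[^>]+href=\"([^\"]+)\"[^>]*>\\s*Next",
            RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : null;
    }

    static void Main()
    {
        // Hypothetical snippet of a results page; markup made up for illustration.
        string html = "<a class=\"pg\" href=\"http://example.com/search?page=2\">Next</a>";
        string nextUrl = ExtractNextUrl(html);
        Console.WriteLine(nextUrl); // http://example.com/search?page=2

        // Fetching it is then a single HTTP GET, the same request a click triggers:
        // using (var client = new WebClient())
        // {
        //     string page2 = client.DownloadString(nextUrl);
        // }
    }
}
```

No JavaScript runs and no click is simulated; the crawler just issues the same `GET` the browser would.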