I'm trying to fetch this page (it's in Chinese, sorry for that):

amazon(dot)cn/s?rh=n:663227051

using the following code:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Application {
    public static void main(String[] args) throws IOException {
        final URL url = new URL("http://www.amazon.cn/s?rh=n:663227051");
        final String agentString = "Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-CN; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)";

        URLConnection urlConnection = url.openConnection();
        // Present a browser-like User-Agent to the server.
        urlConnection.setRequestProperty("User-Agent", agentString);

        final String path = "d:\\desktop\\Test.html";
        // Assumption: the page is UTF-8 encoded; change the charset if it is not.
        // try-with-resources closes both the reader and the writer.
        try (BufferedReader reader = new BufferedReader(
                     new InputStreamReader(urlConnection.getInputStream(), StandardCharsets.UTF_8));
             BufferedWriter writer = Files.newBufferedWriter(Paths.get(path), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                writer.newLine();
            }
        }
    }
}

But after running this code several times, I found that I randomly get one of two different results (see the screenshots at http://www.flickr.com/photos/31629891@N07/4173636464/).

No matter how many times I refresh this page in a browser, it returns the same result.

Why does this happen?

A: 

Seems to me like it is an Amazon issue. Maybe you should ask them about this.

Ngu Soon Hui
I think they'll simply ignore me :(
Typocolder
A: 

You should examine the traffic being sent from your program and compare it to what the browser sends. Use Fiddler to capture the browser transaction and Wireshark to capture your program's transaction (or use Wireshark for both). You will probably find that there's a subtle difference that's causing the server to return different results, possibly having to do with cookies.
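
As a rough first step before reaching for Fiddler or Wireshark, the program can print its own view of the exchange so that two runs can be compared side by side. This is only a sketch: `HeaderDump` and `format` are made-up names, and `getRequestProperties()` lists only the headers set explicitly in code, not the ones the JVM adds on its own (such as `Host` or `Accept`), so a packet capture is still needed for the full picture.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class HeaderDump {
    // Renders a header map as sorted "Name: value" lines so two runs can be diffed.
    static String format(Map<String, List<String>> headers) {
        TreeMap<String, List<String>> sorted = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        for (Map.Entry<String, List<String>> entry : headers.entrySet()) {
            // URLConnection stores the HTTP status line under a null key.
            String name = entry.getKey() == null ? "(status)" : entry.getKey();
            sorted.put(name, entry.getValue());
        }
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, List<String>> entry : sorted.entrySet()) {
            out.append(entry.getKey()).append(": ")
               .append(String.join(", ", entry.getValue())).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        URLConnection connection = new URL("http://www.amazon.cn/s?rh=n:663227051").openConnection();
        connection.setRequestProperty("User-Agent", "Mozilla/5.0");

        // Request headers must be read before the connection is opened.
        System.out.println("--- request headers set in code ---");
        System.out.print(format(connection.getRequestProperties()));

        // getHeaderFields() sends the request and returns the response headers.
        System.out.println("--- response headers ---");
        System.out.print(format(connection.getHeaderFields()));
    }
}
```

Saving the output of several runs and diffing the files should show whether headers such as `Set-Cookie` differ between the two kinds of result.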

Jim Garrison
I captured the packets sent by the browser and added the same header fields to my code (except for "Accept-Encoding", which is "gzip,deflate"), but it doesn't seem to help. I also don't think it's the cookies: when I use a browser with cookies disabled, it still always returns the correct result.
Typocolder
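
The comment above leaves out "Accept-Encoding", presumably because the response body would then arrive compressed. If that header is to be matched as well, a sketch along the following lines might do. `GzipAwareFetch` and `decode` are hypothetical names, only `gzip` is handled (not `deflate`), and UTF-8 is assumed for the page encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

public class GzipAwareFetch {
    // Wraps the body in a GZIPInputStream only when the server reports gzip compression.
    static InputStream decode(InputStream body, String contentEncoding) throws IOException {
        if (contentEncoding != null && contentEncoding.trim().equalsIgnoreCase("gzip")) {
            return new GZIPInputStream(body);
        }
        return body;
    }

    public static void main(String[] args) throws IOException {
        URLConnection connection = new URL("http://www.amazon.cn/s?rh=n:663227051").openConnection();
        connection.setRequestProperty("User-Agent", "Mozilla/5.0");
        // Only gzip is requested here, because only gzip is decoded below.
        connection.setRequestProperty("Accept-Encoding", "gzip");

        try (InputStream in = decode(connection.getInputStream(), connection.getContentEncoding());
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) { // assumed charset
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```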
A: 

Amazon goes to a lot of effort to tailor the search results to what the (potential) customer is likely to want to buy. All sorts of things happen that (to the outside observer) are not exactly predictable / explicable. I could say more ... but I think I'm still under an NDA.

In short, I'm not surprised that your application is seeing different results all the time.

EDIT: By the way, if you are screen-scraping the Amazon site for some reason, you should pay attention to the following excerpt from the "Conditions of Use" page:

Amazon grants you a limited license to access and make personal use of this site and not to download (other than page caching) or modify it, or any portion of it, except with express written consent of Amazon. This license does not include any resale or commercial use of this site or its contents; any collection and use of any product listings, descriptions, or prices; any derivative use of this site or its contents; any downloading or copying of account information for the benefit of another merchant; or any use of data mining, robots, or similar data gathering and extraction tools. This site or any portion of this site may not be reproduced, duplicated, copied, sold, resold, visited, or otherwise exploited for any commercial purpose without express written consent of Amazon. You may not frame or utilize framing techniques to enclose any trademark, logo, or other proprietary information (including images, text, page layout, or form) of Amazon without express written consent. You may not use any meta tags or any other "hidden text" utilizing Amazon's name or trademarks without the express written consent of Amazon. Any unauthorized use terminates the permission or license granted by Amazon. You are granted a limited, revocable, and nonexclusive right to create a hyperlink to the home page of Amazon.com so long as the link does not portray Amazon, or its products or services in a false, misleading, derogatory, or otherwise offensive matter. You may not use any Amazon logo or other proprietary graphic or trademark as part of the link without express written permission.

In short, GET PERMISSION.

Stephen C
Any more hints, please? The problem is just this: when I use a browser, the sold-out items are always displayed, but when I use Java code to fetch the same page, these sold-out items are sometimes hidden.
Typocolder
I really cannot help, not least because I don't know. I suggest that you try to contact Amazon, explain what you are trying to do, and ask for advice. They might not ignore you. (I know for a fact that some "feedback" email definitely does get through to people who could do something about it.)
Stephen C
OK, thanks anyway. :)
Typocolder
Oh, I didn't notice that; thanks for the reminder. By the way, what I am working on is just a student project, homework for the Information Search and Retrieval course. I was not careful enough, sorry about that.
Typocolder
A: 

You can probably get rid of some of this variability by adding an HTTP Cache-Control: no-cache header to your request (see http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html). Otherwise your request may be satisfied by any of a number of intermediate HTTP caches along the route to Amazon's "origin web server", and these caches may each hold a different version of the page, depending on how long Amazon allows copies of the page to be cached. A web site gets much higher scalability if it sacrifices a bit of consistency for content that doesn't absolutely have to be up to date.
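
With `URLConnection`, the header could be set roughly as follows. This is a minimal sketch rather than a guaranteed fix: `NoCacheRequest` and `open` are illustrative names, and the extra `Pragma` header and `setUseCaches(false)` call are additions for older HTTP/1.0 caches and for any local cache respectively.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

public class NoCacheRequest {
    // Prepares (but does not yet send) a request that asks caches not to answer from a stored copy.
    static URLConnection open(String address, String userAgent) throws IOException {
        URLConnection connection = new URL(address).openConnection();
        connection.setRequestProperty("User-Agent", userAgent);
        connection.setRequestProperty("Cache-Control", "no-cache");
        connection.setRequestProperty("Pragma", "no-cache"); // understood by HTTP/1.0 caches
        connection.setUseCaches(false);                      // skip any cache on the client side
        return connection;
    }

    public static void main(String[] args) throws IOException {
        URLConnection connection = open("http://www.amazon.cn/s?rh=n:663227051", "Mozilla/5.0");
        // Nothing has been sent yet; this only prints the headers set above.
        System.out.println(connection.getRequestProperties());
    }
}
```

The headers only affect caches that honor them; they have no influence on what the origin servers themselves decide to return.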

The same sacrifice of consistency for scalability holds true once your request enters an Amazon data center. It can be load-balanced to any free web server, and that web server in general can draw on different sources for the components on the page. Perhaps the difference is that the pages got assembled from parts stored on two different clusters of memcached (distributed in-memory cache) machines.

And on top of this, as @Stephen C alludes to, you may be seeing personalization effects.

Jim Ferrans
"no-cache" doesn't work...
Typocolder
Hmm, so this variability would seem to be happening inside the Amazon data center(s) or be due to personalization effects.
Jim Ferrans