Hi,

Suppose I have an Amazon product URL like this:

http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C/ref=amb_link_86123711_2?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=center-1&pf_rd_r=0AY9N5GXRYHCADJP5P0V&pf_rd_t=101&pf_rd_p=500528151&pf_rd_i=507846

How can I extract just the ASIN using JavaScript? Thanks!

A: 

What's an ASIN? Generally, you load the page into an invisible iframe and then parse it using something like XPath.

glebm
The ASIN is in the URL - in this case, it's `B0015T963C` (right after `/dp/`). No need to load the page itself to get something out of the URL string.
ceejayoz
Then just get it with a regular expression.
glebm
The pattern you are looking for is "/dp/(.*)/", right?
glebm
@Glex, the ASIN is, I believe, the Amazon Standard Identification Number, and is common amongst all of Amazon's sites (so far as I'm aware). It serves a similar purpose to the ISBN/ISSN, and presumably served to help normalize Amazon's database structure.
David Thomas
cool, thank you!
glebm
A: 

If the ASIN is always in that position in the URL:

var asin = decodeURIComponent(url.split('/')[5]);

though there's probably little chance of an ASIN getting %-escaped.

bobince
It isn't always in that position. Amazon URLs have many forms, like http://www.amazon.com/dp/B0015T963C
ceejayoz
A: 

Something like this should work (not tested):

var match = /\/dp\/(.*?)\/ref=amb_link/.exec(amazon_url);
var asin = match ? match[1] : '';
Scott Evernden
A: 

The Wikipedia article on ASIN (which I've linkified in your question) gives the various forms of Amazon URLs. You can fairly easily create a regular expression (or series of them) to fetch this data using the match() method.
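
As a rough sketch of the "series of regular expressions" idea, the helper below tries one pattern per URL shape using match(). The function name `findAsin` and the pattern list are assumptions for illustration: the patterns cover only the forms mentioned in this thread, not necessarily every form the Wikipedia article lists.

```javascript
// Hypothetical pattern list: one regex per URL shape mentioned in this thread.
// Each captures a 10-character alphanumeric path segment as group 1.
var ASIN_PATTERNS = [
    /\/dp\/([A-Za-z0-9]{10})(?:[/?#]|$)/,
    /\/gp\/product\/(?:[\w-]+\/)?([A-Za-z0-9]{10})(?:[/?#]|$)/,
    /\/exec\/obidos\/asin\/([A-Za-z0-9]{10})(?:[/?#]|$)/i
];

// Hypothetical helper: returns the first ASIN found, or null if no pattern matches.
function findAsin(url) {
    for (var i = 0; i < ASIN_PATTERNS.length; i++) {
        var m = url.match(ASIN_PATTERNS[i]);
        if (m) {
            return m[1];
        }
    }
    return null;
}
```

Keeping the patterns in a list means a newly discovered URL shape can be supported by adding one entry rather than growing a single large expression.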

ceejayoz
A: 

Since the ASIN is always a sequence of 10 letters and/or numbers immediately after a slash, try this:

url.match("/([a-zA-Z0-9]{10})(?:[/?]|$)")

The additional (?:[/?]|$) after the ASIN is to ensure that only a full path segment is taken.
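
To illustrate why that trailing group matters, here is a small comparison. The URL with a long first path segment is made up for the demonstration, and the names `loose` and `strict` are just labels for the pattern without and with the trailing check.

```javascript
// Made-up URL whose first path segment is longer than 10 characters.
var sampleUrl = "http://www.amazon.com/ABCDEFGHIJKLMNOP/dp/B0015T963C";

// Without the trailing check, the first 10 characters of any long segment match.
var loose = /\/([a-zA-Z0-9]{10})/;

// With the trailing check, the 10 characters must be a complete path segment.
var strict = /\/([a-zA-Z0-9]{10})(?:[/?]|$)/;

var looseResult = sampleUrl.match(loose)[1];   // "ABCDEFGHIJ" (wrong segment)
var strictResult = sampleUrl.match(strict)[1]; // "B0015T963C"
```

Note that the strict pattern would still pick up any other path segment that happens to be exactly 10 alphanumeric characters long, so it relies on the rest of the URL not containing one before the ASIN.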

Gumbo
A: 

Amazon's detail pages can have several forms, so to be thorough you should check for them all. These are all equivalent:

http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C
http://www.amazon.com/dp/B0015T963C
http://www.amazon.com/gp/product/B0015T963C
http://www.amazon.com/gp/product/glance/B0015T963C

They always take one of these two forms:

http://www.amazon.com/<SEO STRING>/dp/<VIEW>/ASIN
http://www.amazon.com/gp/product/<VIEW>/ASIN

This should do it:

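var url = "http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C";
// Dots are escaped so they match a literal "." rather than any character.
var regex = new RegExp("http://www\\.amazon\\.com/([\\w-]+/)?(dp|gp/product)/(\\w+/)?(\\w{10})");
var m = url.match(regex);
if (m) {
    // Group 4 holds the ASIN; groups 1-3 are the SEO string, path type, and view.
    alert("ASIN=" + m[4]);
}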
darkporter
One more possible form: amazon.com/exec/obidos/asin/B0015T963C. Just to be completely comprehensive the regex could be extended with `dp|gp/product|exec/obidos/asin`.
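
A minimal sketch of that extension, assuming the same capture-group layout as the answer above. The function name `asinFromUrl` is hypothetical, and making `www.` and the `s` in `https` optional is an added assumption, since the form quoted in this comment omits the `www.` prefix.

```javascript
// Same structure as the regex in the answer, with the extra
// "exec/obidos/asin" alternative added to the path-type group.
var asinRegex = new RegExp(
    "^https?://(?:www\\.)?amazon\\.com/([\\w-]+/)?" +
    "(dp|gp/product|exec/obidos/asin)/(\\w+/)?(\\w{10})"
);

// Hypothetical helper: returns the ASIN (capture group 4) or null.
function asinFromUrl(url) {
    var m = url.match(asinRegex);
    return m ? m[4] : null;
}
```

For example, `asinFromUrl("http://www.amazon.com/exec/obidos/asin/B0015T963C")` yields the same ASIN as the `/dp/` and `/gp/product/` forms listed in the answer.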
darkporter