Hello Everyone,

I am currently designing a focused web crawler. It worked on the websites I tested until I encountered the following anchor (an <a href="..."> element):

href="javascript: openDocument('DATA//PCP200803.pdf');"

My HTML parsing routine returns:

javascript: openDocument('DATA//PCP200803.pdf');

Does anyone have any idea how to download the referenced document?

Thanks a lot.

A: 

In the case of the openDocument() command, you could just add "DATA/PCP200803.pdf" to your collection of other resources to fetch/crawl, same as any other hyperlink in the page.
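As a rough illustration of that idea, here is a minimal Python sketch. It assumes that openDocument() simply navigates to the path it is given, relative to the current page, which the page's own script would have to confirm. The names resolve_open_document and OPEN_DOCUMENT_RE and the example.com URL are hypothetical, and collapsing the doubled slash is likewise an assumption about what the site intends.

```python
import re
from urllib.parse import urljoin

# Matches hrefs such as: javascript: openDocument('DATA//PCP200803.pdf');
# Group 1 is the quote character, group 2 is the quoted argument.
OPEN_DOCUMENT_RE = re.compile(
    r"""^\s*(?i:javascript):\s*openDocument\(\s*(['"])(.*?)\1\s*\)\s*;?\s*$"""
)


def resolve_open_document(href, page_url):
    """Return an absolute URL for an openDocument() href, or None if href does not match."""
    match = OPEN_DOCUMENT_RE.match(href)
    if match is None:
        return None
    path = match.group(2)
    # Assumption: a doubled slash inside the path is unintended, so collapse it.
    # The lookbehind leaves the "//" that follows a scheme (e.g. "http://") alone.
    path = re.sub(r"(?<!:)/{2,}", "/", path)
    # Assumption: the path is relative to the page that contains the link.
    return urljoin(page_url, path)


if __name__ == "__main__":
    href = "javascript: openDocument('DATA//PCP200803.pdf');"
    print(resolve_open_document(href, "http://example.com/docs/index.html"))
```

Any href that returns None can fall through to the crawler's normal link handling. A pattern like this covers only this one function name, so other javascript: links would need their own rules.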

Other JavaScript calls, though (e.g., XMLHttpRequest's open()), may not be as straightforward.

Jason Hall
Thanks, ImJasonH. I was actually hoping there might be a good third-party utility that offers a higher level of support for resolving these JavaScript hrefs, since I assume there are many different variants of them. Anyway, thanks :)
Jojo