ansaurus

Question

How to identify only the links to press release pages

Answer 1

+4 A:

I don't think there is any definitive way to achieve this. You can build a set of likely keywords such as 'press', 'release' and 'pr', and match the URLs against them, for example with a regex. How correct the results are depends on how comprehensive your set of keywords is.

Gopi 2010-08-12 14:30:37
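A minimal sketch of the keyword approach described above, in plain Java. The class name, the exact pattern and all sample paths except the first are assumptions made for illustration; only the keywords 'press', 'release' and 'pr' come from the answer. The pattern requires a keyword to stand as its own token so that "pr" does not match inside a word like "products".

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class KeywordLinkFilter {
    // Keywords from the answer; extend this list for the sites you care about.
    // A keyword only counts when it is not glued to other letters or digits.
    private static final Pattern PRESS_KEYWORDS = Pattern.compile(
            "(^|[^a-z0-9])(press|releases?|pr)([^a-z0-9]|$)",
            Pattern.CASE_INSENSITIVE);

    /** Returns true when the URL contains one of the keywords as a separate token. */
    static boolean looksLikePressRelease(String url) {
        return PRESS_KEYWORDS.matcher(url).find();
    }

    /** Keeps only the URLs that look like press release links. */
    static List<String> filter(List<String> urls) {
        return urls.stream()
                .filter(KeywordLinkFilter::looksLikePressRelease)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
                "/pr/library/2010/07/28safari.html",
                "/products/ipad.html",                   // hypothetical
                "/company/press-releases/2010/q3.html",  // hypothetical
                "/express/shipping.html");               // hypothetical
        filter(urls).forEach(System.out::println);
    }
}
```

With this pattern the first and third sample paths are kept and the other two are dropped. False positives and misses are still to be expected on real sites, which is the limitation the answer points out.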

Do you mean searching the URLs found on the website for the keywords and selecting the ones that match?

LGAP 2010-08-12 14:32:20

Yes, keywords or combinations of keywords.

Gopi 2010-08-12 14:44:12

You're not taking advantage of Jsoup's capabilities.

BalusC 2010-08-13 19:14:20

Answer 2

+3 A:

Look at the site today. Cache to a file whatever links you saw. Look at the site tomorrow; any new links are links to news articles, maybe? You'll get incorrect results - once - any time they change the rest of the page around you.

You could, you know, just use the RSS feed provided, which is designed to do exactly what you're asking for.

Dean J 2010-08-12 14:31:02
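The snapshot-and-compare idea from this answer could be sketched roughly as follows. The class name, the cache file name and the one-link-per-line file format are assumptions for illustration; in a real crawler, today's links would come from the parsed page instead of a hard-coded list.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class NewLinkDetector {
    /** Returns the links present in the current crawl but absent from the cached snapshot. */
    static Set<String> newLinks(Set<String> previouslySeen, List<String> currentLinks) {
        Set<String> fresh = new LinkedHashSet<>(currentLinks);
        fresh.removeAll(previouslySeen);
        return fresh;
    }

    /** Loads the snapshot, one link per line; a missing file means nothing was seen yet. */
    static Set<String> loadSeen(Path cacheFile) throws IOException {
        if (!Files.exists(cacheFile)) {
            return new LinkedHashSet<>();
        }
        return new LinkedHashSet<>(Files.readAllLines(cacheFile));
    }

    /** Overwrites the snapshot with everything seen so far. */
    static void saveSeen(Path cacheFile, Set<String> seen) throws IOException {
        Files.write(cacheFile, seen);
    }

    public static void main(String[] args) throws IOException {
        Path cache = Path.of("seen-links.txt"); // hypothetical cache location
        Set<String> seen = loadSeen(cache);
        List<String> today = List.of(
                "/pr/library/2010/07/28safari.html",
                "/pr/library/2010/07/27imac.html");
        newLinks(seen, today).forEach(System.out::println);
        seen.addAll(today);
        saveSeen(cache, seen);
    }
}
```

On the very first run every link counts as new, so the first snapshot is only a baseline. As the answer notes, a page redesign will also produce one round of spurious "new" links.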

I have been assigned this task for pages without an RSS feed, hence the difficulty in finding a solution. Any suggestions are welcome.

LGAP 2010-08-12 14:34:08

@Anand, well, in that case, create your own website that is backed by an RSS feed, and parse that website instead. The solution is more difficult if you choose to write a knowledge retrieval engine and an inference engine at the same time.

Vineet Reynolds 2010-08-12 14:48:34

Answer 3

+2 A:

You need to find some attribute which defines a "press release link". In the case of that site, pointing to "/pr/library/" indicates that it's an Apple press release.

Borealid 2010-08-12 14:31:41
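As a rough sketch of that idea, the check below keeps a link only when its path starts with "/pr/library/". The class and method names are made up for illustration, and the prefix is specific to the Apple site discussed here; another site would need its own distinguishing attribute.

```java
import java.net.URI;
import java.util.List;
import java.util.stream.Collectors;

class PathPrefixFilter {
    // From the answer: Apple's press releases live under this path.
    static final String PRESS_RELEASE_PREFIX = "/pr/library/";

    /** Returns true when the href (relative or absolute) points below the press release path. */
    static boolean isPressRelease(String href) {
        try {
            String path = URI.create(href).getPath();
            // getPath() is null for opaque URIs such as mailto: links.
            return path != null && path.startsWith(PRESS_RELEASE_PREFIX);
        } catch (IllegalArgumentException malformed) {
            return false; // hrefs that are not valid URIs are skipped
        }
    }

    /** Keeps only the hrefs that pass the prefix check. */
    static List<String> pressReleaseLinks(List<String> hrefs) {
        return hrefs.stream()
                .filter(PathPrefixFilter::isPressRelease)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> hrefs = List.of(
                "/pr/library/2010/07/28safari.html",
                "http://www.apple.com/pr/library/2010/07/27imac.html",
                "/pr/",                      // index page, not a release
                "mailto:pr@example.com");    // hypothetical
        pressReleaseLinks(hrefs).forEach(System.out::println);
    }
}
```

If the page is already parsed with Jsoup, an attribute-prefix selector along the lines of a[href^=/pr/library/] should express the same filter in one step.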

Answer 4

+2 A:

Look at the HTML source code. Open the page in a normal web browser, right-click and choose View Source. You have to find a path in the HTML document tree that uniquely identifies those links.

They are all housed in a <ul class="stories"> element inside a <div id="releases"> element. The appropriate CSS selector would then be "div#releases ul.stories a".

Here's what it should look like:

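import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class PressReleaseLinks {
    public static void main(String... args) throws Exception {
        URL url = new URL("http://www.apple.com/pr/");
        Document document = Jsoup.parse(url, 3000); // fetch and parse, 3000 ms timeout
        for (Element element : document.select("div#releases ul.stories a")) {
            System.out.println(element.attr("href"));
        }
    }
}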

As of now, this yields exactly what you want:

/pr/library/2010/07/28safari.html
/pr/library/2010/07/27imac.html
/pr/library/2010/07/27macpro.html
/pr/library/2010/07/27display.html
/pr/library/2010/07/26iphone.html
/pr/library/2010/07/23iphonestatement.html
/pr/library/2010/07/20results.html
/pr/library/2010/07/19ipad.html
/pr/library/2010/07/19alert_results.html
/pr/library/2010/07/02appleletter.html
/pr/library/2010/06/28iphone.html
/pr/library/2010/06/23iphonestatement.html
/pr/library/2010/06/22ipad.html
/pr/library/2010/06/16iphone.html
/pr/library/2010/06/15applestoreapp.html
/pr/library/2010/06/15macmini.html
/pr/library/2010/06/07iphone.html
/pr/library/2010/06/07iads.html
/pr/library/2010/06/07safari.html

To learn more about CSS selectors, read the Jsoup manual and the W3 CSS selector spec.

BalusC 2010-08-13 19:12:40

But will this work for all webpages?????? Please advise. I need a generic solution, not one for apple.com alone...

LGAP 2010-08-16 14:02:06

HTML parsing can never be generic. At most, you can make the Java code dynamic so that you end up with just a mapping of links and selectors in some configuration file. P.S.: one question mark is really enough to denote a question.

BalusC 2010-08-16 14:12:45
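One possible shape for such a mapping, sketched with java.util.Properties. The class name, the host-keyed format and the second entry are assumptions for illustration; only the apple.com selector comes from the answer above. The config is inlined as a string here to keep the example self-contained, whereas a real setup would read it from a file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class SelectorConfig {
    private final Properties selectorsByHost = new Properties();

    /** Parses "host=css selector" lines in the standard properties format. */
    static SelectorConfig parse(String configText) {
        SelectorConfig config = new SelectorConfig();
        try {
            config.selectorsByHost.load(new StringReader(configText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return config;
    }

    /** Returns the selector configured for a host, or null when the site has no entry yet. */
    String selectorFor(String host) {
        return selectorsByHost.getProperty(host);
    }

    public static void main(String[] args) {
        String configText = String.join("\n",
                "www.apple.com=div#releases ul.stories a",
                "www.example.com=div.news a.headline"); // hypothetical entry
        SelectorConfig config = SelectorConfig.parse(configText);
        System.out.println(config.selectorFor("www.apple.com"));
    }
}
```

The selector returned for a host would then be passed to document.select(...) as in the answer's code, so supporting a new site means adding a line to the configuration rather than changing Java code.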
