My task is to find the actual press release links on a given page, for example http://www.apple.com/pr/.

My tool has to extract only the press release links from that URL, excluding advertisement links, tab links, and any other links found on the site.

The program below prints every link present in the given web page.

How can I modify it to find only the press release links from a given URL? I also want the program to be generic, so that it identifies the press release links on any press release page it is given.

import java.net.URL;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class linksfind {
    public static void main(String[] args) {
        try {
            URL url = new URL("http://www.apple.com/pr/");
            // Fetch and parse the page; the second argument is the timeout in milliseconds.
            Document document = Jsoup.parse(url, 1000);
            // Print the href of every anchor on the page.
            for (Element element : document.getElementsByTag("a")) {
                System.out.println(element.attr("href"));
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
+4  A: 

I don't think there is any definitive way to achieve this. You can build a set of likely keywords such as 'press', 'release' and 'pr', and match the URLs against them using a regex. The accuracy of this depends on how comprehensive your set of keywords is.
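
A minimal sketch of that keyword matching, using only the standard library. The class name `KeywordLinkFilter`, the keyword list, and the sample hrefs are assumptions made for illustration, not a vetted set. "pr" is matched only as a whole path segment so that paths like "/products/" do not match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class KeywordLinkFilter {
    // Hypothetical keyword set; extend it for the sites you target.
    // "pr" must be a complete path segment, the other keywords may appear anywhere.
    private static final Pattern PRESS_KEYWORDS = Pattern.compile(
            "press|release|newsroom|(^|/)pr(/|$)", Pattern.CASE_INSENSITIVE);

    /** Returns the hrefs that contain at least one of the keywords, in input order. */
    static List<String> filter(List<String> hrefs) {
        List<String> matches = new ArrayList<>();
        for (String href : hrefs) {
            if (PRESS_KEYWORDS.matcher(href).find()) {
                matches.add(href);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Sample hrefs; in practice these would come from the Jsoup loop in the question.
        List<String> hrefs = List.of(
                "/pr/library/2010/07/28safari.html",
                "/store/",
                "/products/ipad/",
                "http://example.com/press-releases/2010/launch.html");
        filter(hrefs).forEach(System.out::println);
    }
}
```

Substring keywords such as "press" will also match unrelated paths like "/express/", so expect some false positives with this approach.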

Gopi
Do you mean searching the URLs found on the website for the keywords and selecting the ones that match?
LGAP
Yes, keywords or combinations of keywords.
Gopi
You're not taking advantage of Jsoup's capabilities.
BalusC
+3  A: 

Look at the site today. Cache to a file whatever links you saw. Look at the site tomorrow; any new links are links to news articles, maybe? You'll get incorrect results - once - any time they change the rest of the page around you.
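
The caching idea could be sketched roughly as follows. The class `LinkSnapshotDiff`, its method names, and the one-link-per-line cache format are assumptions for illustration. The link sets would come from whatever scraper you already have.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class LinkSnapshotDiff {
    /** Returns the links in current that were not in the previous snapshot. */
    static Set<String> newLinks(Set<String> previous, Set<String> current) {
        Set<String> added = new LinkedHashSet<>(current);
        added.removeAll(previous);
        return added;
    }

    /** Loads the cached snapshot (one link per line), or an empty set on the first run. */
    static Set<String> load(Path cache) throws IOException {
        Set<String> links = new LinkedHashSet<>();
        if (Files.exists(cache)) {
            links.addAll(Files.readAllLines(cache));
        }
        return links;
    }

    /** Overwrites the cache file with the given links, one per line. */
    static void save(Path cache, Set<String> links) throws IOException {
        Files.write(cache, links);
    }

    public static void main(String[] args) throws IOException {
        // A temp file stands in for a real cache location.
        Path cache = Files.createTempFile("links-cache", ".txt");
        save(cache, new LinkedHashSet<>(List.of(
                "/store/", "/pr/library/2010/07/27imac.html")));

        // "Today's" links, as they might come back from the scraper.
        Set<String> today = new LinkedHashSet<>(List.of(
                "/store/",
                "/pr/library/2010/07/27imac.html",
                "/pr/library/2010/07/28safari.html"));

        System.out.println(newLinks(load(cache), today));
        save(cache, today); // today's snapshot becomes tomorrow's baseline
    }
}
```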

You could, you know, just use the RSS feed provided, which is designed to do exactly what you're asking for.

Dean J
I've been assigned this task for pages without RSS feeds, hence the problem in finding a solution. Any suggestions are welcome.
LGAP
@Anand, well, in that case, create your own website that is backed by an RSS feed, and parse the website instead. The solution is more difficult if you choose to write a knowledge retrieval engine and an inference engine at the same time.
Vineet Reynolds
+2  A: 

You need to find some attribute which defines a "press release link". In the case of that site, pointing to "/pr/library/" indicates that it's an Apple press release.
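
As a rough sketch of that idea, the hrefs can be resolved against the page URL and kept only when they stay on the same host and their path starts with a known prefix. The class `PathPrefixFilter` and its method are hypothetical names. The prefix "/pr/library/" is specific to the Apple page and would have to be found by hand for each site.

```java
import java.net.URI;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

final class PathPrefixFilter {
    /**
     * Resolves each href against baseUrl and keeps the absolute URLs that are
     * on the same host and whose path starts with the given prefix.
     */
    static List<String> withPathPrefix(String baseUrl, List<String> hrefs, String prefix) {
        URI base = URI.create(baseUrl);
        return hrefs.stream()
                .map(href -> base.resolve(href)) // relative -> absolute
                .filter(uri -> Objects.equals(base.getHost(), uri.getHost()))
                .filter(uri -> uri.getPath() != null && uri.getPath().startsWith(prefix))
                .map(URI::toString)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Sample hrefs; in practice these would come from the scraper.
        List<String> hrefs = List.of(
                "/pr/library/2010/07/28safari.html",
                "/store/",
                "http://example.com/pr/library/other.html");
        withPathPrefix("http://www.apple.com/pr/", hrefs, "/pr/library/")
                .forEach(System.out::println);
    }
}
```

Note that `URI.create` and `URI.resolve(String)` throw `IllegalArgumentException` on malformed input, so real-world hrefs may need a try/catch around the resolution step.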

Borealid
+2  A: 

Look at the HTML source code. Open the page in a normal web browser, right-click and choose View Source. You have to find a path in the HTML document tree that uniquely identifies those links.

They are all housed in a <ul class="stories"> element inside a <div id="releases"> element. The appropriate CSS selector would then be "div#releases ul.stories a".

Here's how it should look:

public static void main(String... args) throws Exception {
    URL url = new URL("http://www.apple.com/pr/");
    Document document = Jsoup.parse(url, 3000);
    for (Element element : document.select("div#releases ul.stories a")) {
        System.out.println(element.attr("href"));
    }
}

As of now, this yields exactly what you want:

/pr/library/2010/07/28safari.html
/pr/library/2010/07/27imac.html
/pr/library/2010/07/27macpro.html
/pr/library/2010/07/27display.html
/pr/library/2010/07/26iphone.html
/pr/library/2010/07/23iphonestatement.html
/pr/library/2010/07/20results.html
/pr/library/2010/07/19ipad.html
/pr/library/2010/07/19alert_results.html
/pr/library/2010/07/02appleletter.html
/pr/library/2010/06/28iphone.html
/pr/library/2010/06/23iphonestatement.html
/pr/library/2010/06/22ipad.html
/pr/library/2010/06/16iphone.html
/pr/library/2010/06/15applestoreapp.html
/pr/library/2010/06/15macmini.html
/pr/library/2010/06/07iphone.html
/pr/library/2010/06/07iads.html
/pr/library/2010/06/07safari.html

To learn more about CSS selectors, read the Jsoup manual and the W3 CSS selector spec.

BalusC
But will this be applicable to all web pages? Please advise. I need a generic solution, not one for apple.com alone.
LGAP
HTML parsing can never be generic. At most, you can make the Java code dynamic so that you end up with just a mapping of links and selectors in some configuration file. P.S.: one question mark is really enough to denote a question.
BalusC
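
One possible shape for such a mapping is a properties file keyed by host name, sketched below with the standard library only. The class `SelectorConfig`, the choice of host as key, and the `news.example.com` entry are assumptions for illustration. Only the Apple selector comes from the answer above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.net.URI;
import java.util.Optional;
import java.util.Properties;

final class SelectorConfig {
    private final Properties selectors = new Properties();

    /** Parses "host = css selector" lines in java.util.Properties format. */
    SelectorConfig(String propertiesText) throws IOException {
        selectors.load(new StringReader(propertiesText));
    }

    /** Looks up the selector configured for the host of the given page URL. */
    Optional<String> selectorFor(String pageUrl) {
        String host = URI.create(pageUrl).getHost();
        return host == null
                ? Optional.empty()
                : Optional.ofNullable(selectors.getProperty(host));
    }

    public static void main(String[] args) throws IOException {
        // Inline text stands in for a real configuration file.
        String config = String.join("\n",
                "www.apple.com = div#releases ul.stories a",
                "news.example.com = ul.press-list a"); // hypothetical entry
        SelectorConfig cfg = new SelectorConfig(config);
        System.out.println(cfg.selectorFor("http://www.apple.com/pr/")
                .orElse("(no selector configured)"));
    }
}
```

The returned string could then be passed to `document.select(...)` in place of the hard-coded selector, so that supporting a new site means adding a line to the configuration rather than changing the code.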