views:

184

answers:

7

Hi people!

I'm having a little problem in Java. I want to search an HTML file for the href and src attributes, and then get the URLs associated with those attributes.

What is the best way to do it?

Thanks for the help. Best regards.

+2  A: 

Do you want to do this as a one-time editing task, or do you need a systematic (i.e. code) implementation? In the second case, find a Java HTML parser implementation and walk the DOM tree.

http://java-source.net/open-source/html-parsers

Wouter Lievens
I need to do this using some Java code.
arpf
http://www.benmccann.com/dev-blog/java-html-parsing-library-comparison/
bwalliser
A: 

If your file is an XHTML document, it is a standard XML document, and the best way to parse it is with JDOM. JDOM is very powerful and easy to use and understand.

If you have an HTML document, you can try htmlparser, in particular the class LinkTag.

alexmeia
A: 

Take a look at this question:

The answer I used was JTidy

Richie_W
A: 

You can use Rhino and then load the HTML file. Once it is loaded, you can use getElementBy to go to any node or to get its value.

Thej
A: 

I would have a look at tagsoup, which will build a DOM tree from any HTML document, even the most non-compliant ones.

Then use XPath and iterate over the NodeList returned by:

//a

and

//img
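To make the XPath step concrete, here is a minimal sketch. It assumes the input is already well-formed XHTML, so the JDK's own DocumentBuilder stands in for tagsoup here; with real-world HTML you would build the Document with tagsoup (or another lenient parser) first. The class name XPathLinks and the sample page are made up for illustration, and the expression selects the href and src attributes directly instead of the //a and //img elements.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: collects href values of <a> and src values of <img>.
final class XPathLinks {

    static List<String> extract(String xhtml) {
        try {
            // Assumption: the markup is well-formed, so the standard XML parser accepts it.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            // Select the attribute nodes themselves; the union is returned in document order.
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//a/@href | //img/@src", doc, XPathConstants.NODESET);

            List<String> urls = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                urls.add(nodes.item(i).getNodeValue());
            }
            return urls;
        } catch (Exception e) {
            // Keep the sketch simple: wrap parser and XPath failures in an unchecked exception.
            throw new IllegalStateException("Could not extract links", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><a href=\"http://example.com/\">home</a>"
                + "<img src=\"logo.png\"/></body></html>";
        for (String url : extract(page)) {
            System.out.println(url);
        }
    }
}
```

If the document declares the XHTML namespace, the plain //a and //img steps will only match when the DocumentBuilderFactory is left non-namespace-aware, which is its default and what this sketch relies on.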

+1  A: 

This is the code I used to accomplish exactly what you'd like to do, but first let me give you a few tips.

If you're in a Java Swing environment, make sure to use the methods in the javax.swing.text.html and javax.swing.text.html.parser packages. Unfortunately, they're mostly intended for use on a JEditorPane, but I'd still strongly recommend that you take a look at these.

There's a class in the Java 6 API called HTML.Tag that identifies the HTML start and end tags, which you can then use in order to determine where the links are that you'd like your program to follow. See http://java.sun.com/javase/6/docs/api/javax/swing/text/html/HTML.Tag.html

When I wrote a program very similar to this, I used 3 main methods:

public void handleStartTag(HTML.Tag t, MutableAttributeSet atts, int pos)
public void handleEndTag(HTML.Tag t, int pos)
public void handleText(char[] text, int pos)

If you need more help on how to write these methods, you can message me. Basically, you look for a start tag and its matching end tag; the start tag's attributes give you the URL, and then you can proceed to the next step, which is following the URL.
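For reference, here is a rough sketch of how such a callback could collect the URLs without any GUI. It is not the program described above, just one possible shape: the class name LinkCollector and the extract helper are invented for this example. It extends HTMLEditorKit.ParserCallback and feeds it with ParserDelegator, both from the JDK. Note that it also overrides handleSimpleTag, because the Swing parser reports empty elements such as img through that method and not through handleStartTag.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical callback that records every href and src value it sees.
final class LinkCollector extends HTMLEditorKit.ParserCallback {

    private final List<String> urls = new ArrayList<>();

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet atts, int pos) {
        collect(atts); // tags with content, e.g. <a href="...">text</a>
    }

    @Override
    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet atts, int pos) {
        collect(atts); // empty tags, e.g. <img src="...">
    }

    private void collect(MutableAttributeSet atts) {
        HTML.Attribute[] wanted = {HTML.Attribute.HREF, HTML.Attribute.SRC};
        for (HTML.Attribute name : wanted) {
            Object value = atts.getAttribute(name);
            if (value != null) {
                urls.add(value.toString());
            }
        }
    }

    // Parses the given HTML text and returns the URLs in the order they appear.
    static List<String> extract(String html) {
        LinkCollector collector = new LinkCollector();
        try {
            new ParserDelegator().parse(new StringReader(html), collector, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return collector.urls;
    }

    public static void main(String[] args) {
        String page = "<html><body><a href=\"http://example.com/\">home</a>"
                + "<img src=\"logo.png\"></body></html>";
        for (String url : extract(page)) {
            System.out.println(url);
        }
    }
}
```

The third argument to parse tells the parser to ignore any charset declared in the document, which is what you want when you hand it a Reader that has already been decoded.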

To follow the URL, I advise you to use a JEditorPane object. The javax.swing.event.HyperlinkListener interface defines only one method, hyperlinkUpdate(HyperlinkEvent e); inside it you can call .setPage(e.getURL()) on your JEditorPane object. This updates the pane with the new page and allows you to start the process again.

Msg me if you have any probs and please vote this answer!

Yoely
A: 

I've used the Neko HTML Parser successfully for this sort of thing (screen scraping).

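import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Node;

public class TestParser {

     public static void main(String[] argv) throws Exception {
       DOMParser parser = new DOMParser();
       // The original listing is cut off after "i". The loop bounds and body
       // below are an assumed reconstruction: parse each file, print its root.
       for (int i = 0; i < argv.length; i++) {
         parser.parse(argv[i]);
         Node root = parser.getDocument().getDocumentElement();
         System.out.println(root.getNodeName());
       }
     }
}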
Damo