ansaurus

Question

How to search an HTML file for some tags?

Answer 1

+2 A:

Do you want to do this as a one-time editing task, or do you need a systematic (i.e. code) implementation? In the second case, find a Java HTML parser implementation and walk the DOM tree.

http://java-source.net/open-source/html-parsers
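To give a rough idea of what "walk the DOM tree" means, here is a minimal sketch that uses only the JDK's built-in XML parser. It therefore assumes well-formed XHTML input without a DOCTYPE; for real-world HTML, one of the parsers from the list above would replace the parsing step. The class name TagWalker and its method names are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
public class TagWalker {

    // Returns the value of `attr` for every <tag> element, in document order.
    public static List<String> collect(String xhtml, String tag, String attr) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
        List<String> found = new ArrayList<>();
        walk(doc.getDocumentElement(), tag, attr, found);
        return found;
    }

    // Depth-first walk: check the current node, then recurse into its children.
    private static void walk(Node node, String tag, String attr, List<String> found) {
        if (node.getNodeType() == Node.ELEMENT_NODE && node.getNodeName().equalsIgnoreCase(tag)) {
            found.add(((Element) node).getAttribute(attr));
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, tag, attr, found);
        }
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><a href=\"a.html\">A</a>"
                + "<p><img src=\"b.png\"/></p></body></html>";
        System.out.println(collect(page, "a", "href"));
        System.out.println(collect(page, "img", "src"));
    }
}
```

The recursion in walk is the part that carries over to other parsers: as long as the parser hands back an org.w3c.dom.Document, the same traversal applies.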

Wouter Lievens 2009-03-23 10:11:57

I need to do this using some Java code.

arpf 2009-03-23 10:19:35

http://www.benmccann.com/dev-blog/java-html-parsing-library-comparison/

bwalliser 2009-03-23 10:34:13

Answer 2

A:

If your file is an XHTML document, it is a standard XML document and the best way to parse it is with JDOM. JDOM is very powerful and easy to use and understand.

If you have an HTML document, you can try htmlparser, in particular the class LinkTag.

alexmeia 2009-03-23 10:27:14

Answer 3

A:

Take a look at this question:

The answer I used was JTidy

Richie_W 2009-03-23 10:32:12

Answer 4

A:

You can use Rhino and load the HTML file into it. Once it is loaded, you can use getElementBy to go to any node or to get a value.

Thej 2009-03-23 10:36:01

Answer 5

A:

I would have a look at tagsoup, which will build a DOM tree from any HTML document, even the most non-compliant ones.

Then use XPath and iterate over the NodeList returned by:

//a

and

//img
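The XPath step can be sketched with the javax.xml.xpath classes that ship with the JDK. This is only an illustration: the parse method below is a stand-in that uses the JDK's XML parser on well-formed markup, where the real thing would be the TagSoup step. The class and method names (XPathTagSearch, attributeValues) are made up for the example. Depending on the parser, element names may come back upper-cased or namespaced, in which case the expressions need adjusting.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
public class XPathTagSearch {

    // Evaluates an element-selecting expression such as "//a" and
    // returns the given attribute of each matched element.
    public static List<String> attributeValues(Document doc, String expression, String attr)
            throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            // Safe only because the expressions used here match elements.
            values.add(((Element) nodes.item(i)).getAttribute(attr));
        }
        return values;
    }

    // Stand-in for the HTML parsing step: handles well-formed markup only.
    public static Document parse(String markup) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<html><body><a href=\"next.html\">next</a>"
                + "<img src=\"logo.png\"/></body></html>");
        System.out.println(attributeValues(doc, "//a", "href"));
        System.out.println(attributeValues(doc, "//img", "src"));
    }
}
```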

2009-03-23 10:49:42

Answer 6

+1 A:

This is the code I used to accomplish exactly what you'd like to do, but first let me give you a few tips.

If you're in a Java Swing environment, make sure to use the methods in the javax.swing.text.html and javax.swing.text.html.parser packages. Unfortunately, they're mostly intended for use on a JEditorPane, but I'd still strongly recommend that you take a look at these.

There's a class in the Java 6 API called HTML.Tag that identifies the HTML start and end tags, which you can then use in order to determine where the links are that you'd like your program to follow. http://java.sun.com/javase/6/docs/api/javax/swing/text/html/HTML.Tag.html

When I wrote a program very similar to this, I used 3 main methods:

public void handleStartTag(HTML.Tag t, MutableAttributeSet atts, int pos)
public void handleEndTag(HTML.Tag t, int pos)
public void handleText(char[] text, int pos)

If you need more help on how to write these methods, you can message me. Basically, you look for a start tag and its matching end tag; from those you can identify the URL and then proceed to the next step, which is following the URL.
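For illustration, the callback approach described above might look roughly like this. It is a sketch rather than the original program: it overrides only handleStartTag, collects the href of every anchor, and drives the callback with ParserDelegator from javax.swing.text.html.parser. The class name LinkCollector and the collect method are made up for the example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical example class, not the answer author's code.
public class LinkCollector extends HTMLEditorKit.ParserCallback {

    private final List<String> urls = new ArrayList<>();

    // Called for every start tag; only <a> tags with an href are recorded.
    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet atts, int pos) {
        if (t == HTML.Tag.A) {
            Object href = atts.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                urls.add(href.toString());
            }
        }
    }

    // Parses the HTML from the reader and returns the collected href values.
    public static List<String> collect(Reader html) throws IOException {
        LinkCollector collector = new LinkCollector();
        // true = ignore any charset declaration inside the document.
        new ParserDelegator().parse(html, collector, true);
        return collector.urls;
    }

    public static void main(String[] args) throws IOException {
        String page = "<html><body><a href=\"next.html\">next</a></body></html>";
        System.out.println(collect(new StringReader(page)));
    }
}
```

Tags that have no end tag, such as img, are reported through handleSimpleTag instead of handleStartTag, so collecting images would mean overriding that method as well.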

To follow the URL, I advise you to use the JEditorPane object. The javax.swing.event.HyperlinkListener interface defines only one method, hyperlinkUpdate(HyperlinkEvent e). The event carries the URL, so inside that method you can call .setPage(e.getURL()) on your JEditorPane object. This will then update the pane with the new page and allow you to start the process again.

Msg me if you have any probs and please vote this answer!

Yoely 2009-03-23 11:07:23

Answer 7

A:

I've used the Neko HTML Parser successfully for this sort of thing (screen scraping).

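import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Node;

public class TestParser {

    public static void main(String[] argv) throws Exception {
        DOMParser parser = new DOMParser();
        // The original snippet breaks off after "i"; the loop below is a
        // reconstruction that assumes each argument names a file or URL.
        for (int i = 0; i < argv.length; i++) {
            parser.parse(argv[i]);
            Node root = parser.getDocument().getDocumentElement();
            System.out.println(root.getNodeName());
        }
    }
}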

Damo 2009-03-23 11:07:48
