ansaurus

Question

How to parse a large HTML file with Java HTMLParser library

Answer 1

+2 A:

Have you tried increasing the maximum heap size of the JVM?

The following command-line argument raises it to 512 megabytes: -Xmx512M

E.g.

java -Xmx512M myrunclass
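To confirm that the flag took effect, the JVM can report its own heap ceiling. The following is only a minimal sketch; the class name HeapCheck and the method maxHeapMegabytes are made up for illustration, and the reported figure may come out slightly below the requested value depending on the JVM and garbage collector.

```java
public class HeapCheck {
    /** Returns the maximum heap the JVM will attempt to use, in megabytes. */
    static long maxHeapMegabytes() {
        // Runtime.maxMemory() reflects the -Xmx setting (or the JVM default if none is given).
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Running it as `java -Xmx512M HeapCheck` and again without the flag should show whether the setting is being picked up.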

Kris 2009-05-26 12:37:00

Thanks, it works. I upvoted you. I will wait to accept your answer because I want to see if someone posts a more efficient way.

Sergio del Amo 2009-05-26 12:47:57

Answer 2

+1 A:

Don't build a DOM when you only want to extract some information and have no need for XPath or other queries that work best on a DOM structure (parent-child relations, etc.).

Use Parser.visitAllNodesWith() instead of Parser.parse().

adrian.tarau 2009-05-26 13:02:28

Could you post an example?

Sergio del Amo 2009-05-26 14:06:53

Have a look at the org.htmlparser.tests.visitorsTests package; it contains all the test cases related to visitors. Everything you need to know about parsing with a visitor is there. One implementation is close to what you need: TagFindingVisitor.

TagFindingVisitor visitor = new TagFindingVisitor(new String[] { "LI", "BODY", "UL", "A" });
parser.visitAllNodesWith(visitor);
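For readers who want to see the visit-as-you-parse idea in isolation, here is a rough sketch that does not use HTMLParser at all. It swaps in the JDK's built-in callback parser (javax.swing.text.html.parser.ParserDelegator) so that it runs without extra jars. The class name LinkCollector and the method collectLinks are invented for this illustration. The point is the same as with a visitor: each tag is handled as it is encountered and no document tree is kept.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class LinkCollector {
    /** Streams through the HTML and returns the href of every <a> start tag. */
    static List<String> collectLinks(Reader html) throws IOException {
        final List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // Called once per opening tag; only the data we care about is retained.
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        // 'true' tells the parser to ignore any charset declared inside the document.
        new ParserDelegator().parse(html, callback, true);
        return links;
    }

    public static void main(String[] args) throws IOException {
        String sample = "<html><body><ul><li><a href=\"a.html\">A</a></li>"
                      + "<li><a href=\"b.html\">B</a></li></ul></body></html>";
        System.out.println(collectLinks(new StringReader(sample)));
    }
}
```

For a large file, the StringReader would be replaced by a buffered file reader, so memory use depends mainly on what the callback chooses to keep.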

adrian.tarau 2009-05-27 02:08:59
