ansaurus

Question

Question about parsing HTML using Regex and Java

Answer 1

+7 A:

The correct way to loop through matches is:

Pattern p = Pattern.compile("<.*?>");
Matcher m = p.matcher(htmlString);
while (m.find()) {
  System.out.println(m.group());
}

That being said, regular expressions are an extremely poor method of parsing HTML. The reason comes down to this: regular expressions work well for matching regular languages, but HTML is a context-free language. Regular expressions break down on things like nested tags, > inside attribute values, and so on.

Use a dedicated HTML parser instead, such as HTML Parser.
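To make the attribute-value problem concrete, here is a small sketch that runs the same `<.*?>` pattern over a made-up input. The class name, the `findTags` helper and the sample string are assumptions for illustration only; they are not part of the original question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagRegexPitfall {
    // Same pattern as in the loop above: lazily match from '<' to the next '>'.
    private static final Pattern TAG = Pattern.compile("<.*?>");

    // Hypothetical helper: collect every match of the pattern, in order.
    static List<String> findTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        // Made-up input: the title attribute contains a literal '>'.
        String html = "<a title=\"a > b\" href=\"x.html\">link</a>";
        for (String tag : findTags(html)) {
            System.out.println(tag);
        }
        // Prints:
        // <a title="a >
        // </a>
    }
}
```

The opening tag is cut off at the `>` inside the quoted attribute value, so the `href` attribute never shows up in a match.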

cletus 2010-03-06 23:25:15

Thanks, it works. :) I will use an HTML parser later.

Elham 2010-03-07 00:13:27

"I will use html parser later". That's what they all say ... :-)

Stephen C 2010-03-07 00:18:57

Answer 2

A:

What if someone wants to write their own HTML parser?

If not regex, then what is the best way to do it?

Raha 2010-03-06 23:38:31

I'd post this as a separate question, if it hasn't already been asked. You might consider if you want a language-specific answer or a general parsing answer when phrasing the question.

TrueWill 2010-03-07 03:14:30

Regex only solves the scanning part. Parsing is a completely different exercise. And as HTML has a nested (recursive) syntax, you need a parser, not just a scanner.

EJP 2010-03-07 05:03:37

Answer 3

+2 A:

Why don't you try looking at the source code of some open source HTML parsers, such as HtmlCleaner or TagSoup?

The general strategy seems to be to attempt to parse and clean the HTML and return an XML tree.

Personally, I would read through the HTML, pushing opening tags onto a stack (a LIFO queue) and popping the matching opening tag off the top when a closing tag is encountered, shifting entries as needed to allow for tag mismatches.
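One possible reading of that stack idea, as a rough sketch: a regex acts only as the scanner that finds tags, and a `Deque` tracks which tags are still open. The class name, the `check` method and the way mismatches are reported are all assumptions made here. The scanner is deliberately simplistic: it ignores comments, treats void or self-closing tags such as `<br/>` as ordinary opening tags, and is still fooled by `>` inside attribute values.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagStackSketch {
    // Simplified scanner: group 1 is "/" for a closing tag, group 2 is the tag name.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");

    // Returns a list of mismatch descriptions; an empty list means every tag was balanced.
    static List<String> check(String html) {
        Deque<String> open = new ArrayDeque<>();   // top of stack = most recently opened tag
        List<String> problems = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            boolean closing = !m.group(1).isEmpty();
            String name = m.group(2).toLowerCase();
            if (!closing) {
                open.push(name);
                continue;
            }
            if (!open.contains(name)) {
                // Closing tag with no matching opening tag anywhere on the stack.
                problems.add("stray </" + name + ">");
                continue;
            }
            // Pop until the matching opening tag; anything popped on the way was never closed.
            while (true) {
                String top = open.pop();
                if (top.equals(name)) {
                    break;
                }
                problems.add("unclosed <" + top + ">");
            }
        }
        // Whatever is left on the stack at the end of input was never closed.
        while (!open.isEmpty()) {
            problems.add("unclosed <" + open.pop() + ">");
        }
        return problems;
    }

    public static void main(String[] args) {
        System.out.println(check("<ul><li>a</li></ul>")); // prints []
        System.out.println(check("<div><p>text</div>"));  // prints [unclosed <p>]
        System.out.println(check("<b>ok</b></i>"));       // prints [stray </i>]
    }
}
```

A real parser would build a tree at the points where this sketch only records a problem, and would need rules for which tags may be closed implicitly.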

Finbarr 2010-03-06 23:59:26

Is this answer in response to @Raha's question about writing one's own HTML parser?

Alan Moore 2010-03-07 01:49:53

Answer 4

+1 A:

I want to get the keywords content from an HTML meta tag. I wrote:

Pattern keyLineContents = Pattern.compile("<(.[^<]*)(keywords)(.[^<]*)>");
Matcher keyLineMatcher = keyLineContents.matcher(documentURL);
boolean result = keyLineMatcher.find();
while(result)
{
  String metaTagContent = keyLineMatcher.group(1) + " " + keyLineMatcher.group(3);
  Pattern kcontent = Pattern.compile("(.*?content=\")(.[^<]*?)(\".[^<]*?)");
  Matcher keyLineMatcher2 = kcontent.matcher(metaTagContent);
  boolean result2 = keyLineMatcher.find();
  while (result2)
  {
    String metaTagContent2 = keyLineMatcher.group(1);
    result2 = keyLineMatcher.find();
  }
}

But I don't understand why result2 is false. The first result is fine and gives the full content of the keywords tag.

thanks

Elham 2010-03-07 00:16:09

Try these regexes instead: `"<([^<]*)(keywords)([^<]*)>"` and `".*?content=\"([^<]*?)\""`
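Apart from the patterns, the posted code calls `find()` and `group(1)` on `keyLineMatcher` inside the inner loop instead of on `keyLineMatcher2`, so `result2` only reports whether a second keywords tag exists. It also never updates `result` in the outer loop. Below is a sketch of how the loops might look with those points addressed, using the two patterns suggested here. The class name, the `extract` method and the sample input are assumptions, and it presumes the string being matched is the page's HTML text rather than its URL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaKeywords {
    // Patterns taken from the comment above.
    private static final Pattern KEYWORD_TAG =
            Pattern.compile("<([^<]*)(keywords)([^<]*)>");
    private static final Pattern CONTENT =
            Pattern.compile(".*?content=\"([^<]*?)\"");

    // Hypothetical helper: returns the content attribute of every tag that mentions "keywords".
    static List<String> extract(String html) {
        List<String> contents = new ArrayList<>();
        Matcher tagMatcher = KEYWORD_TAG.matcher(html);
        // Calling find() in the loop condition advances the outer matcher on every pass.
        while (tagMatcher.find()) {
            String tagText = tagMatcher.group(1) + " " + tagMatcher.group(3);
            // The inner loop uses its own matcher, not the outer one.
            Matcher contentMatcher = CONTENT.matcher(tagText);
            while (contentMatcher.find()) {
                contents.add(contentMatcher.group(1));
            }
        }
        return contents;
    }

    public static void main(String[] args) {
        // Made-up input for illustration.
        String html = "<meta name=\"keywords\" content=\"java, regex, html\">";
        System.out.println(extract(html)); // prints [java, regex, html]
    }
}
```

This remains regex-based scanning, so the caveats about `>` inside attribute values and unusual markup still apply.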

Alan Moore 2010-03-07 01:46:19
