ansaurus

Question

HTML parser: my recent project needs a web spider.

Answer 1

A:

I think the subject you need to learn is regular expressions.

Regular expressions are available on all platforms and in all major languages (Java, PHP, Python, C#, Ruby, JavaScript). Using a regular expression, you can easily extract the content in whatever form you want.

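import java.util.regex.Matcher;
import java.util.regex.Pattern;

// pageContent is assumed to be a String holding the page's HTML.
// Group 1 captures the value of a double-quoted href inside an <a> tag.
Pattern p = Pattern.compile("<a\\s[^>]*href=\"([^\"]+?)\"[^>]*>");
Matcher m = p.matcher(pageContent);
while (m.find()) {
  System.out.println(m.group(1)); // print the captured URL
}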

The Java code above finds every anchor tag in a page that has a double-quoted href attribute and prints its URL.
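
To check what the pattern does and does not catch, here is a minimal self-contained sketch. The class name RegexLinkDemo, the method extractLinks, and the sample page are made up for illustration; only the pattern itself comes from the snippet above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexLinkDemo {
    // Same pattern as in the snippet above: a lowercase <a> tag with a double-quoted href.
    private static final Pattern HREF =
        Pattern.compile("<a\\s[^>]*href=\"([^\"]+?)\"[^>]*>");

    // Collects group 1 (the href value) of every match, in document order.
    static List<String> extractLinks(String pageContent) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(pageContent);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // Hypothetical sample page; the last link uses single quotes.
        String page = "<a href=\"http://example.com/a\">A</a>"
                    + "<a class=\"x\" href=\"/b\">B</a>"
                    + "<a href='/c'>C</a>";
        System.out.println(extractLinks(page)); // [http://example.com/a, /b]
    }
}
```

Note that the single-quoted link /c is silently skipped, because the pattern only accepts double quotes. The same goes for uppercase tags or unquoted attribute values, so the pattern would need to grow for each variation a real page can contain.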

If you don't have enough time to learn regular expressions, the following reference will help you.

http://htmlparser.sourceforge.net/

xrath 2009-09-25 03:10:14

You should never use regular expressions to parse non-regular languages. Even if this works, what happens when your requirements change? Why not start with the right tool for the job rather than try to hack something together? (X|HT)ML parsers are available in almost every modern language, and are fairly easy to work with.

Chris Lutz 2009-09-25 03:12:09

regex to parse html? wtf?

hasen j 2009-09-25 03:31:04

Answer 2

+2 A:

Here is a StackOverflow question showing how to use a number of XML/HTML parsers in different languages. If you tell us what language you're using, I can be more specific, but your answer may already be in there.
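
In Java, for instance, a rough sketch of the parser-based approach could use the HTML parser that ships with the JDK's Swing classes. This assumes a JDK that includes the java.desktop module; the class name ParserLinkDemo and the method extractLinks are hypothetical names. The bundled parser targets HTML 3.2, so a dedicated library is probably a better fit for a real spider.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class ParserLinkDemo {
    // Returns the href of every <a> start tag the parser reports.
    static List<String> extractLinks(String html) throws IOException {
        final List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        // The third argument tells the parser to ignore any charset declared in the page.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return links;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample page mixing double- and single-quoted attributes.
        String page = "<a href=\"http://example.com/a\">A</a><a href='/c'>C</a>";
        System.out.println(extractLinks(page));
    }
}
```

Because the parser handles tokenizing, quoting style and tag case stop being your problem; the callback only has to decide which tags and attributes it cares about.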

Chris Lutz 2009-09-25 03:15:38

Answer 3

A:

It depends on what language you are developing in. Try googling:

html parser languagename

Hpricot is a good one for Ruby, for example.

David Claridge 2009-09-25 03:18:41

I just need that in C or C++

Macroideal 2009-09-27 09:05:07

http://www.lmgtfy.com/?q=html+parser+c%2B%2B

David Claridge 2009-10-12 04:54:15
