Parsing web pages

tags: html-parsing

views: 105

I have a question about parsing HTML pages, specifically forums. I want to parse a forum or thread for posts that match certain criteria. I haven't defined the algorithm yet, since I have only parsed structured text formats before. A use case might be to copy and paste each thread into the program by hand, or to enter a URL like http://www.forums.com/forum/showthread.php?t=46875&page=3 and let the program parse the pages.

Given all this, I would like to know:

Is it possible to parse a forum thread on an HTML page?
What would be the best/fastest/easiest language for doing this?
If I prefer Java, what tools/libraries do I need for this?
Is there anything else I should consider?
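
For the URL use case, a minimal sketch of how the page addresses of one thread might be generated, assuming the forum numbers pages from 1 through a `page` query parameter (as the example URL suggests) and that the last page number is already known. The function name `page_urls` and its parameters are hypothetical; downloading the pages is left out.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode

def page_urls(thread_url, last_page, param="page"):
    """Return URLs for pages 1..last_page by rewriting one query parameter."""
    parts = urlsplit(thread_url)
    query = parse_qs(parts.query)          # e.g. {'t': ['46875'], 'page': ['3']}
    urls = []
    for page in range(1, last_page + 1):
        query[param] = [str(page)]         # overwrite (or add) the page number
        new_query = urlencode(query, doseq=True)
        urls.append(urlunsplit(parts._replace(query=new_query)))
    return urls

for url in page_urls("http://www.forums.com/forum/showthread.php?t=46875&page=3", 3):
    print(url)
```

Each resulting URL could then be fetched and handed to whatever parser is chosen.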

1 / Yes.
2 / Regular expressions, any flavor.
3 / Probably the ones with regex support.
4 / There are tools out there that will do this for you.

Jason 2009-11-23 23:09:08

I wouldn't want to test that regular expression! :P

Aiden Bell 2009-11-23 23:10:28

Matching HTML tags via regex might be difficult; see: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

The MYYN 2009-11-23 23:11:46

@myyn - That answer is the greatest SO answer I've ever seen. But if you can assume that all the HTML you're trying to parse is valid, it's actually fairly easy to do so with regex, as I frequently do. Of course, this is a large assumption.

Jason 2009-11-23 23:18:43

True, I myself use re.compile(r'<.*?>').sub('', html) frequently ;)

The MYYN 2009-11-23 23:23:00
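
For reference, a small runnable sketch of the tag-stripping one-liner from the comment above, wrapped in a hypothetical helper named `strip_tags`. The second call illustrates the kind of input where the pattern goes wrong: a ">" inside an attribute value ends the match early.

```python
import re

# Non-greedy pattern: "<", then as few characters as possible, then ">".
# Note that "." does not match newlines, so tags spanning lines are missed.
TAG_RE = re.compile(r'<.*?>')

def strip_tags(html):
    """Remove anything that looks like a tag; no real HTML parsing happens."""
    return TAG_RE.sub('', html)

print(strip_tags('<p>Hello <b>world</b></p>'))    # Hello world
print(strip_tags('<a title="a > b">link</a>'))    # leaves part of the attribute behind
```

This is fine for quick cleanup of markup you control, and it shows why the comments above treat regex as a shortcut rather than a parser.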

+1 A:

1 / Yes.

2 / Use some compact language like Python or Ruby for prototyping.

For Python there is a neat library for HTML/XML parsing called BeautifulSoup.
For Ruby, you could try Nokogiri or Hpricot.

3 / A Java tool to consider: htmlparser

4 / If you are interested only in some particular text or some special classes, a regular expression might be sufficient. But as soon as you want to dig deeper into the structure of the content, you'll need some kind of model to hold your data, and hence a parser, which, in the best case, can cope with the inconsistencies that occur in real-world HTML.

The MYYN 2009-11-23 23:13:41
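
To illustrate the parser route from point 4, here is a minimal sketch using only Python's standard-library html.parser instead of the libraries named above. It assumes, purely for illustration, that each post sits in a `<div class="post">`; real forum markup will differ, and the names `PostExtractor` and `extract_posts` are hypothetical.

```python
from html.parser import HTMLParser

class PostExtractor(HTMLParser):
    """Collects the text of every <div> carrying the given class (assumed markup)."""

    def __init__(self, post_class="post"):
        super().__init__()
        self.post_class = post_class
        self.posts = []      # finished post texts
        self._depth = 0      # > 0 while inside a post <div>
        self._chunks = []    # text pieces of the current post

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            self._depth += 1              # nested <div> inside a post
        elif self.post_class in classes:
            self._depth = 1               # a new post starts here

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:          # the post <div> just closed
                text = "".join(self._chunks)
                self.posts.append(" ".join(text.split()))  # normalize whitespace
                self._chunks = []

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)

def extract_posts(html, post_class="post"):
    parser = PostExtractor(post_class)
    parser.feed(html)
    parser.close()
    return parser.posts

sample = '<div class="post">First <b>post</b></div><div class="nav">skip</div>'
print(extract_posts(sample))
```

The sketch relies on `<div>` tags being balanced; a dedicated library such as those mentioned above is likely to cope better with broken markup.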

+1 A:

You might want to look into some sort of HTML parsing library, rather than using regular expressions to do this. There are some really good HTML parsers for Ruby and Python, but a quick Google search shows a number of parsers for Java as well. The benefit of these libraries is that you don't have to handle every edge case yourself, and they handle malformed HTML (both of which can be impossible with regexes, depending on what you want to do). They also give you a much nicer way of dealing with the data (for example, Beautiful Soup lets you grab all elements which belong to a specific class, or use some other CSS selector to limit which page elements you want to deal with).

Personally, I would start in Ruby or Python, at least in the beginning, as the libraries are well known and there is a lot of information about using them for this purpose. Also, I find it easier to quickly prototype these types of things in Ruby or Python than on the JVM. You could even bring that code onto the JVM later with JRuby or Jython, if it becomes necessary.

Paul Wicks 2009-11-23 23:17:31
