Hi,

I am very new to Erlang and as part of my learning exercise, I would like to write an HTML parser in Erlang.

I want to extract certain values from a web page, perhaps using a pattern to describe what data I want to extract.

Can anybody offer me some high level advice as to how they would approach this problem in Erlang?

I think I need to turn the document into a stack of tokens, perhaps using a finite state machine to track where I am with regard to nesting and where I am within the element.

Cheers

Paul

+3  A: 

I would suggest having a look at the one included in MochiWeb:

http://github.com/mochi/mochiweb/blob/master/src/mochiweb_html.erl

The parse/1 function is probably the entry point you're interested in.
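
As a rough sketch of what you might do with the result: as far as I know, parse/1 returns nested {Tag, Attrs, Children} tuples, with binaries for tag names and text. The module name html_walk and the function find_text/2 below are made up for illustration, and the tree in demo/0 is written by hand rather than produced by MochiWeb, so the example runs without any dependency.

```erlang
-module(html_walk).
-export([find_text/2, demo/0]).

%% Collect the text found directly inside every element named Tag.
%% The {Tag, Attrs, Children} shape is assumed to match the output of
%% mochiweb_html:parse/1; text nodes are plain binaries.
find_text(Tag, {Tag, _Attrs, Children}) ->
    [C || C <- Children, is_binary(C)] ++ find_in_children(Tag, Children);
find_text(Tag, {_Other, _Attrs, Children}) ->
    find_in_children(Tag, Children);
find_text(_Tag, _TextOrOther) ->
    %% Text nodes and anything else that is not an element.
    [].

find_in_children(Tag, Children) ->
    lists:flatmap(fun(C) -> find_text(Tag, C) end, Children).

%% Hand-written tree standing in for a parsed page.
demo() ->
    Tree = {<<"html">>, [],
            [{<<"body">>, [],
              [{<<"p">>, [{<<"class">>, <<"a">>}], [<<"one">>]},
               {<<"div">>, [], [{<<"p">>, [], [<<"two">>]}]}]}]},
    find_text(<<"p">>, Tree).
```

Calling html_walk:demo() collects the text of both p elements, in document order. With MochiWeb on the code path you would pass the result of mochiweb_html:parse/1 instead of the literal tree.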

Roberto Aloi
+1  A: 

This is a big job if you plan to be complete about it. You are best off using the one that Roberto suggests, but if you are determined to write your own as a project to get familiar with Erlang, here are some suggestions...

You should first decide whether you are going to hand-code your parser or use leex and yecc to generate your parser from a grammar. Hand coding might be a better learning experience if you want to learn how to write idiomatic Erlang. Writing a parser is an excellent way to introduce yourself to Erlang; functional programming languages excel at implementing parsers.
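
To give a feel for the hand-coded route, here is a minimal sketch of a tokenizer written as a two-state machine over a binary. The module name html_tok and the token shapes {open, Name}, {close, Name} and {text, Bin} are my own choices for illustration. It treats everything between '<' and '>' as the tag name, so attributes are not split out, and it knows nothing about comments, entities or encodings.

```erlang
-module(html_tok).
-export([tokens/1]).

%% Split a binary into open tags, close tags and text runs.
tokens(Bin) when is_binary(Bin) ->
    text(Bin, <<>>, []).

%% State "text": accumulate bytes until '<' or the end of input.
text(<<"</", Rest/binary>>, Acc, Toks) ->
    tag(Rest, <<>>, close, emit_text(Acc, Toks));
text(<<"<", Rest/binary>>, Acc, Toks) ->
    tag(Rest, <<>>, open, emit_text(Acc, Toks));
text(<<C, Rest/binary>>, Acc, Toks) ->
    text(Rest, <<Acc/binary, C>>, Toks);
text(<<>>, Acc, Toks) ->
    lists:reverse(emit_text(Acc, Toks)).

%% State "tag": accumulate the tag name until '>'.
tag(<<">", Rest/binary>>, Name, Kind, Toks) ->
    text(Rest, <<>>, [{Kind, Name} | Toks]);
tag(<<C, Rest/binary>>, Name, Kind, Toks) ->
    tag(Rest, <<Name/binary, C>>, Kind, Toks);
tag(<<>>, Name, Kind, Toks) ->
    %% Unterminated tag: keep what we have rather than crash.
    lists:reverse([{Kind, Name} | Toks]).

%% Only emit a text token when something was accumulated.
emit_text(<<>>, Toks) -> Toks;
emit_text(Acc, Toks) -> [{text, Acc} | Toks].
```

For example, html_tok:tokens(<<"<p>hi</p>">>) yields an open token, a text token and a close token, in that order. Each function is one state, and pattern matching on the head of the binary picks the transition.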

Second, you should decide whether you want to generate a DOM-like structure or use a SAX-like callback model, which in Erlang can be expressed as a behaviour. If you do the latter, you could simply implement the behaviour to create a DOM.
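
A minimal sketch of the callback approach might look like the following. The module name html_sax, the callback handle_event/2 and the event shapes are assumptions made for this example. To keep it in one file, the module both declares the callback contract and acts as a sample handler that builds a {Tag, Attrs, Children} tree; a separate handler module would normally declare -behaviour(html_sax).

```erlang
-module(html_sax).
-export([run/3, handle_event/2, demo/0]).

%% Callback contract that handler modules are expected to fulfil.
-callback handle_event(Event :: tuple(), State :: term()) -> term().

%% Driver: feed each event to the handler, threading its state through.
run(Events, Handler, State0) ->
    lists:foldl(fun(E, S) -> Handler:handle_event(E, S) end, State0, Events).

%% Sample handler: builds a tree. The state is a stack of unfinished
%% elements, each holding its children in reverse order.
handle_event({open, Tag}, Stack) ->
    [{Tag, [], []} | Stack];
handle_event({text, Text}, [{Tag, As, Kids} | Stack]) ->
    [{Tag, As, [Text | Kids]} | Stack];
handle_event({close, Tag}, [{Tag, As, Kids}, {PTag, PAs, PKids} | Stack]) ->
    %% Finish the current element and attach it to its parent.
    Done = {Tag, As, lists:reverse(Kids)},
    [{PTag, PAs, [Done | PKids]} | Stack].

demo() ->
    Events = [{open, <<"p">>}, {text, <<"hi ">>},
              {open, <<"b">>}, {text, <<"there">>}, {close, <<"b">>},
              {close, <<"p">>}],
    %% A synthetic root element collects the top-level nodes.
    [{root, [], Kids}] = run(Events, ?MODULE, [{root, [], []}]),
    lists:reverse(Kids).
```

The driver knows nothing about trees; swapping in a handler that, say, only counts tags would not require touching run/3. This sketch crashes on mismatched close tags, which a real HTML parser would have to tolerate.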

If you look at behaviours, you may also want to look into parametrized modules. This is an experimental feature that can complement behaviours by allowing immutable state to be stored within an "instance" of a module. It is not yet known whether the community will support this new feature. (To some people it just looks too OO.)

Another excellent resource is the xmerl code. Pay close attention to how it determines the character encoding and parses accordingly. HTML (in its various standards) works slightly differently, but it's important that you take the proper character encoding into account when you read the file.
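
As a small illustration of the encoding step, the sketch below uses only the byte order mark (BOM) to pick an input encoding and converts the rest to UTF-8. The module name html_enc is hypothetical, and falling back to UTF-8 when there is no BOM is an assumption made for this example; a real HTML parser would also have to consider things like a meta charset declaration or an HTTP Content-Type header.

```erlang
-module(html_enc).
-export([to_utf8/1]).

%% Convert raw bytes to UTF-8 based on the BOM alone.
%% Without a BOM, unicode:bom_to_encoding/1 reports {latin1, 0}; this
%% sketch treats that case as UTF-8, which is an assumption, not a rule.
to_utf8(Bytes) when is_binary(Bytes) ->
    {Enc, BomLen} = unicode:bom_to_encoding(Bytes),
    %% Drop the BOM itself before converting.
    <<_Bom:BomLen/binary, Body/binary>> = Bytes,
    InEnc = case BomLen of
                0 -> utf8;
                _ -> Enc
            end,
    %% Returns a binary on success, or an error/incomplete tuple.
    unicode:characters_to_binary(Body, InEnc, utf8).
```

For instance, a UTF-16 big-endian input starting with the bytes FE FF comes back as a plain UTF-8 binary with the BOM removed.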

Also from xmerl, you can see how that library constructs a DOM using Erlang tuples. You might want to do something similar.
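
To see what xmerl's tree looks like in practice, here is a short sketch that parses a well-formed snippet and pulls values out with an XPath expression. The module name xmerl_demo and the sample markup are made up for the example. Bear in mind that xmerl expects well-formed XML, which real-world HTML often is not.

```erlang
-module(xmerl_demo).
-export([titles/0]).

%% Record definitions (#xmlElement{}, #xmlText{}, ...) ship with xmerl.
-include_lib("xmerl/include/xmerl.hrl").

titles() ->
    Xml = "<list><item><title>One</title></item>"
          "<item><title>Two</title></item></list>",
    %% xmerl_scan:string/1 returns the root #xmlElement{} together with
    %% any unparsed trailing input.
    {Doc, _Rest} = xmerl_scan:string(Xml),
    %% The XPath selects the text nodes under every <title>.
    Nodes = xmerl_xpath:string("//title/text()", Doc),
    [Value || #xmlText{value = Value} <- Nodes].
```

Calling xmerl_demo:titles() returns the two title strings. Printing Doc in the shell is a quick way to inspect the nested records xmerl uses for its tree.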

dsmith
Perhaps I was hasty suggesting that you look into parametrised modules. There are good arguments for avoiding it (http://stackoverflow.com/questions/2291155/what-alternatives-are-there-to-parameterised-modules-in-erlang).
dsmith
Thank you for your answer, this is a learning exercise rather than something that will be used by many. XMerl is a great resource to look at.
dagda1