I'm looking for a fast library/class to parse plain text using expressions like the one below:

Text is: <b>Name:</b>John<br><i>Age</i>32<br>

Pattern is: {*}Name:</b>{%}<br>{*}Age</i>{%}<br>

It should find two values: John and 32. The intent is to parse simple HTML web pages without involving heavy-duty tools. Internally it should not use string operations or regexps, but rather parse character by character.

A: 

A regex replacement would work. Just get it to return both values together like "John%32" and then split the response to get the two separate values.

mlathe
A: 

There's really no advantage to implementing character-by-character parsing manually here, as problems of this type have by and large been solved.

  • If you're dealing with an extremely normalized set of data (i.e. the template you described above is formatted exactly the same in every circumstance with no possibility of missing closing tags, HTML being inserted in odd places, etc.), regular expressions are a perfectly appropriate tool to parse this sort of data.
  • If the HTML cannot be guaranteed to be perfect, then the most straightforward solution is to use a tool to load the HTML structure into a DOM and find the appropriate elements in the document tree.

Developing a character-by-character approach will probably end up being equivalent to manually implementing one of the two options above, which is not trivial.
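
As a rough illustration of the second option, here is a minimal Python sketch. It uses the standard library's `html.parser`, which is event-based rather than a true DOM, but it shows the same idea of letting a tolerant parser deal with the markup. The names `TextCollector` and `value_after` are made up for this example, and it assumes each value is the text node that directly follows its label.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the non-empty text nodes of a document, in order."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        data = data.strip()
        if data:
            self.chunks.append(data)


def value_after(label, html):
    """Return the text node that follows `label`, or None if not found."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    chunks = collector.chunks
    # Stop one short of the end: the last chunk has nothing after it.
    for i, chunk in enumerate(chunks[:-1]):
        if chunk == label:
            return chunks[i + 1]
    return None


html = "<b>Name:</b>John<br><i>Age</i>32<br>"
print(value_after("Name:", html))
print(value_after("Age", html))
```

The lookup here is by exact label text, so it does not depend on which tags surround the label.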

Ryan Brunner
I cannot use a DOM, since I want a plain-text parsing solution, so that I can match on part of a tag name, for example. I do not want regexps, since the text to parse may be quite long. Basically, I want an algorithm that parses text against wildcard-based patterns with just two notions: 'any character sequence to ignore' ({*}) and 'any character sequence to store and return to the user' ({%}). Patterns are user-defined and make it easy to parse text data and extract specific parts of it.
Igor Romanov
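
For reference, the wildcard matching described in the comment above can be sketched in Python without regexps. This is only one possible reading of the `{*}` / `{%}` semantics: each wildcard runs up to the first occurrence of the literal text that follows it, and a pattern that starts with literal text must match at the start of the input. The function name `extract_by_scanning` is hypothetical, and `str.find` stands in for the inner character loop.

```python
SKIP, CAPTURE = "{*}", "{%}"  # both markers are three characters long


def extract_by_scanning(pattern, text):
    """Return the list of {%} captures, or None if the pattern does not match."""
    values = []
    pos = 0         # current position in text
    pending = None  # wildcard waiting for the next literal, if any
    i = 0           # current position in pattern
    while i < len(pattern):
        if pattern.startswith((SKIP, CAPTURE), i):
            marker = pattern[i:i + 3]
            # If wildcards are adjacent, a capture takes precedence.
            if pending != CAPTURE:
                pending = marker
            i += 3
            continue
        # Take the literal run up to the next wildcard (or end of pattern).
        nexts = [j for j in (pattern.find(SKIP, i), pattern.find(CAPTURE, i)) if j != -1]
        end = min(nexts, default=len(pattern))
        literal = pattern[i:end]
        if pending is None:
            # No wildcard before this literal: it must match right here.
            if not text.startswith(literal, pos):
                return None
            start = pos
        else:
            start = text.find(literal, pos)
            if start == -1:
                return None
            if pending == CAPTURE:
                values.append(text[pos:start])
        pos = start + len(literal)
        pending = None
        i = end
    if pending == CAPTURE:
        # A trailing {%} captures the rest of the text.
        values.append(text[pos:])
    return values
```

With the text and pattern from the question, this returns `['John', '32']`. The text is scanned once from left to right with no backtracking, which is why long inputs are not a concern for this approach.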
It just seems what you're describing is essentially a version of regular expressions with a customized syntax. Regular expression libraries are quite mature and should be able to handle fairly hefty page sizes. Perhaps a solution is to take your custom syntax and translate that into standard regex syntax?
Ryan Brunner
Funny how these days people tend to use very high-level APIs regardless of the task :-) I guess the final method will be 20-30 lines of code, but it's hard to make myself sit down with paper and pencil and do some thinking instead of Googling... :-(
Igor Romanov
More like 5 if you're willing to use regex and perform only minimal validation.
Anon.
I wouldn't say people use high-level APIs regardless of task. I *would* say people will view an API as an appropriate solution for a problem that it was specifically designed for. Your problem is more or less the definition of what regular expressions were designed to solve.
Ryan Brunner
A: 

Since you appear to be asking the user to specify the HTML content you want, it's probably alright to use regular expressions here (why do you have an aversion to them?). It's not HTML parsing anymore, just simple text matching, which is what regular expressions are designed for.

Here's an example:

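# Translate the custom wildcards in $match into regex syntax.
$match =~ s/\{\*\}/.*?/g;    # {*}: skip any characters
$match =~ s/\{%\}/(.*?)/g;   # {%}: capture any characters
$html =~ /$match/s;          # /s lets . match newlines too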

Which will leave what you need in your capturing groups.
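
One caveat with the snippet above: the literal parts of the user's pattern are passed to the regex engine as-is, so characters such as `.` or `?` in the pattern would be treated as regex syntax. A hedged Python sketch of the same translation that escapes the literal parts first (the function names `compile_pattern` and `extract_with_regex` are made up for this example):

```python
import re


def compile_pattern(pattern):
    """Translate a {*} / {%} wildcard pattern into a compiled regex."""
    # Splitting with a capturing group keeps the wildcard markers in the result.
    parts = re.split(r"(\{\*\}|\{%\})", pattern)
    pieces = []
    for part in parts:
        if part == "{*}":
            pieces.append(".*?")      # skip any characters
        elif part == "{%}":
            pieces.append("(.*?)")    # capture any characters
        else:
            pieces.append(re.escape(part))  # literal text, matched exactly
    # DOTALL lets the wildcards span line breaks.
    return re.compile("".join(pieces), re.DOTALL)


def extract_with_regex(pattern, text):
    """Return the list of {%} captures, or None if the pattern does not match."""
    match = compile_pattern(pattern).search(text)
    return list(match.groups()) if match else None


text = "<b>Name:</b>John<br><i>Age</i>32<br>"
print(extract_with_regex("{*}Name:</b>{%}<br>{*}Age</i>{%}<br>", text))
```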

Anon.
Basically, it's to be used in a small app that downloads an HTML-like (WAP) file and extracts some numbers from it. The text structure is not guaranteed to be the same, but some parts of it can be recognized and treated as fixed. For example, if you are fetching data from a mobile banking page, you may want to look for `<b>Card number:</b>{%}<br>{*}<b>Balance:</b>{%}<br>`. Doing this with regexps will, I guess, make things more complicated.
Igor Romanov
What makes you think that having regular expressions be used somewhere deep in the bowels of your app will somehow make the whole thing more complicated? Simplifying text matching is the whole reason regular expressions exist.
Anon.
You may be right, I'm going to try now. Thanks.
Igor Romanov