tags:

views:

78

answers:

3

Are there any libraries or technologies(in any language) that provide a regular-expression-like tool for any sort of stream-like or list-like data(as opposed to only character strings)?

For example, suppose you were writing a parser for your pet programming language. You've already got it lexed into a list of Common Lisp objects representing the tokens.

You might use a pattern like this to parse function calls(using C-style syntax):

(pattern (:var (:class ident)) (:class left-paren) (:optional (:var object)) (:star (:class comma) (:var :object)) (:class right-paren))

Which would bind variables for the function name and each of the function arguments(actually, it would probably be implemented so that this pattern would probably bind a variable for the function name, one for the first argument, and a list of the rest, but that's not really an important detail).

Would something like this be useful at all?

+1  A: 

I don't know how many replies you'll receive on a subject like this, as most languages lack the sort of robust stream APIs you seem to have in mind; thus, most of the people reading this probably don't know what you're talking about.

Smalltalk is a notable exception, shipping with a rich hierarchy of Stream classes that--coupled with its Collection classes--allow you to do some pretty impressive stuff. While most Smalltalks also ship with regex support (the pure ST implementation by Vassili Bykov is a popular choice), the regex classes unfortunately are not integrated with the Stream classes in the same way the Collection classes are. This means that using streams and regexes in Smalltalk usually involves reading character strings from a stream and then testing those strings separately with regex patterns--not the sort "read next n characters up until a pattern matches," or "read next n characters matching this pattern" type of functionally you likely have in mind.

I think a powerful stream API coupled with powerful regex support would be great. However, I think you'd have trouble generalizing about different stream types. A read stream on a character string would pose few difficulties, but file and TCP streams would have their own exceptions and latencies that you would have to handle gracefully.

Actually, I thought the VisualWorks 7.6-packaged Regex parcel does have a few stream-matching methods. I'd verify but I don't have an image handy on this machine.
Ash Wilson
A: 

Try looking at scala.util.regexp, both the API documentation, and the code example at http://scala.sygneca.com/code/automata. I think would allow a computational linguist to match strings of words by looking for part of speech patterns, for example.

Ken Bloom
A: 

This is the principle behind most syntactic parsers, which operate in two phases. The first phase is the lexer, where identifiers, language keywords, and other special characters (arithmetic operators, braces, etc) are identified and split into Token objects that typically have a numeric field indicating the type of the lexeme, and optionally another field indicating the text of the lexeme.

In the second phase, a syntactic parser operates on the Token objects, matching them by magic number alone, to parse phrases. (Software for doing this includes Antlr, yacc/bison, Scala's cala.util.parsing.combinator.syntactical library, and plenty of others). The two phases don't entirely have to depend on each other -- you can get your Token objects from anywhere else that you like. The magic number aspect seems to be important, though, because the magic numbers are assigned to constants, and they're what make it easy to express your grammar in a readable language.

And remember, that anything you can accomplish with a regular expression can also be accomplished with a context-free grammar (usually just as easily).

Ken Bloom