views:

1027

answers:

6

I have a set of strings with numbers embedded in them. They look something like /cal/long/3/4/145:999 or /pa/metrics/CosmicRay/24:4:bgp:EnergyKurtosis. I'd like to have an expression parser that is

  • Easy to use. Given a few examples someone should be able to form a new expression. I want end users to be able to form new expressions to query this set of strings. Some of the potential users are software engineers, others are testers and some are scientists.
  • Allows for constraints on numbers. Something like '/cal/long/3/4/143:#>100&<1110' to specify that a string prefix with '/cal/long/3/4/143:' and then a number between (100,1110) is expected.
  • Supports '|' and . So the expression '/cal/(long|short)/3/4/' would match '/cal/long/3/4/1:2' as well as '/cal/short/3/4/1:2'.
  • Has a Java implementation available or would be easy to implement in Java.

Interesting alternative ideas would be useful. I'm also entertaining the idea of just implementing the subset of regular expressions that I need plus the numerical constraints.

Thanks!

+7  A: 

There's no reason to reinvent the wheel! The core of a regular expression engine is built on a strong foundation of mathematics and computer science; the reason we continue to use them today is they are principally sound and won't be improved in the foreseeable future.

If you do find or create some alternative parsing language that only covers a subset of the possibilities Regex can, you will quickly have a user asking for a concept that can be expressed in Regex but your flavor just plain leaves out. Spend your time solving problems that haven't been solved instead!

Rex M
Regular Expressions are mathematically sound and fast. But they *suck* really hard in terms of ease of use and maintainability. They're pure evil in that regard. Thats why theres a reason to reinvent.
B T
@BT that can be said for any language that isn't familiar to the person saying it.
Rex M
@Rex M, I disagree. Regex is at the very least OVERLY concise and hard to read. This is an opinion thing, I believe, but I' have learned them, unlearned them, relearned them.
Yar
A: 

Actually what you described is the Java Pattern Matcher. Which just happens to use Regex as its language.

WolfmanDragon
As far as I can tell there is no way to extend or change the grammar that Pattern uses.
Sean McCauliff
I don't understand why you would want to, but you should be able to pull vars into the pattern, that would extend it. Java Pattern Matcher is one of the most efficient RegEx tools there are. There are too many bad things to change in java to mess with one of the greats.
WolfmanDragon
+1  A: 

Unfortunately, not all programmers (myself included) are as familiar with RegEx as they ought be. This often means we end up writing our own string-parsing logic where RegEx could otherwise have served us well.

This isn't always bad. It's possible in some cases to write a DSL (a class, a cohesive set of methods) that's more elegant and readable and meets the precise needs of your problem domain. The trouble is that it can take dozens of iterations to distill the problem into a DSL that is simple and intuitive. And only if the DSL will be used far and wide in the application or by a large community is this trouble warranted. Don't write a elegant solution to a problem that only appears sporadically.

Mario
I've not heard the term DSL in this context before. That has led to some useful googling. Thanks!
Sean McCauliff
+3  A: 

I'm inclined to agree with Rex M, although your second requirement for numerical constraints complicates things. Unless you only allowed very basic constraints, I'm not aware of a way to succinctly express that in a regular expression. If there is such a way, please disregard the rest of my answer and follow the other suggestions here. :)

You might want to consider a parser generator - things like the classic lex and yacc. I'm not really familiar with the Java choices, but here's a list:

http://java-source.net/open-source/parser-generators

If you're not familiar, the standard approach would be to first create a lexer that turns your strings into tokens. Then you would pass those tokens onto a parser that applies your grammar to them and spits out some kind of result.

In your case, I envision the parser resulting in a combination of a regular expression and additional conditions. For your numerical constraint example, it might give you the regular expression \/cal/long/3/4/143:(\d+)\ and a constraint to apply to the first grouping (the \d+ portion) that requires that the number lie between 100 and 1100. You'd then apply the RE to your strings for candidates, and apply the constraint to those candidates to find your matches.

It's a pretty complicated approach, so hopefully there's a simpler way. I hope that gives you some ideas, at least.

Matthew Maravillas
It's a useful link. Thanks!
Sean McCauliff
+3  A: 

The Java constraint is a severe one. I would recommend using parsing combinators, but you will have to translate the ideas to Java using classes instead of functions. There are many, many papers available on this topic; one of the easiest to approach is Graham Hutton's Higher-Order Functions for Parsing. Hutton's approach makes it especially easy to decide to succeed or fail based on conditions like the magnitude of a number, as you show in your example.

Norman Ramsey
A: 

http://java-source.net/open-source/parser-generators and http://catalog.compilertools.net/java.html contain catalogs of tools for this. Compare also the stackoverflow question How can I parse code to build a compiler in Java?.

hstoerr