I am looking for a regex to validate SI unit input in an HTML form. For example,

kg/m^3

would be valid for density or

m/s^2

for acceleration.

It seems like the kind of problem that may have been solved in some open library; or there may be a clever way to do it starting with a limited set of base units. It is for use in an academic context where it is acceptable to require users to follow specific rules for inputs.

+1  A: 

Hmm, I don't think a regular expression can express that in the general case; a context-free formalism should be used instead.

You can, however, do it for the usual types like the ones you have given.

This regular expression handles your examples as well as the following ones (replace every \d+ with a pattern for decimals if you need them):

kg^2/m^3
kg*2/m
kg^2*2*m/m^3
kg/m/kg^3

(kg|m|s|A|K|cd|mol|\d+)((\*|\^)(kg|m|s|A|K|cd|mol|\d+))*(/(kg|m|s|A|K|cd|mol|\d+)((\*|\^)(kg|m|s|A|K|cd|mol|\d+))*)*

But not something like:

kg^(kg/m)/m/kg^3
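
A minimal sketch of how a pattern of this shape could be checked in Python. The names `TERM`, `PRODUCT`, `SIMPLE_UNIT` and `is_simple_unit` are hypothetical, and two things are assumptions on my part: the whole input must match (`re.fullmatch`), and the division group may repeat so that chained denominators such as kg/m/kg^3 are accepted.

```python
import re

# One term: a base unit or an integer (the pattern treats numbers like units).
TERM = r"(?:kg|m|s|A|K|cd|mol|\d+)"
# Terms joined by '*' or '^'.
PRODUCT = rf"{TERM}(?:[*^]{TERM})*"
# A product, followed by any number of '/product' groups (assumption: repeatable).
SIMPLE_UNIT = re.compile(rf"{PRODUCT}(?:/{PRODUCT})*")

def is_simple_unit(text: str) -> bool:
    """Return True only if the entire input matches the pattern."""
    return SIMPLE_UNIT.fullmatch(text) is not None

for sample in ["kg/m^3", "m/s^2", "kg^2*2*m/m^3", "kg/m/kg^3", "kg^(kg/m)/m/kg^3"]:
    print(sample, is_simple_unit(sample))
```

Because numbers are allowed wherever a unit is, odd inputs such as 2^kg also pass; tighten `TERM` if that matters.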

Units in Python:

http://stackoverflow.com/questions/2125076/unit-conversion-in-python

lasseespeholt
Thanks lasseespeholt; 2nd link led to `Scientific.Physics.PhysicalQuantities.PhysicalQuantity` which is fantastic on the Python side.
bvmou
+1  A: 

This question is related: http://stackoverflow.com/questions/1483280/regex-for-distance-in-metric-system. You just need to extend its regex with the optional denominator and the exponent.

Jerome
+4  A: 

A pure regular expression solution is a bit inelegant, because it has to repeat the same capture groups several times to accommodate multiplication and division:

(?x)
(    # Capture 1: the entire matched composed unit string
  (  # Capture 2: one unit including the optional prefix
    (Y|Z|E|P|T|G|M|k|h|da|d|c|m|µ|n|p|f|a|z|y)?  # Capture 3: optional prefix, taken from http://en.wikipedia.org/wiki/SI_prefix
    (m|g|s|A|K|mol|cd|Hz|N|Pa|J|W|C|V|F|Ω|S|Wb|T|H|lm|lx|Bq|Gy|Sv|kat|l|L) # Capture 4: Base units and derived units w/o °C, rad and sr, but with L/l for litre
    (\^[+-]?[1-9]\d*)? # Capture 5: Optional power with optional sign. \^0 and \^-0 are not permitted
  |                 # or
    1               # one permitted, e.g. in 1/s
  )
  (?:    # Zero or more repetitions of one unit, plus the multiplication sign
     ·(  # Capture 6: one unit including the optional prefix
        (Y|Z|E|P|T|G|M|k|h|da|d|c|m|µ|n|p|f|a|z|y)?  # Capture 7
        (m|g|s|A|K|mol|cd|Hz|N|Pa|J|W|C|V|F|Ω|S|Wb|T|H|lm|lx|Bq|Gy|Sv|kat|l|L) # Capture 8
        (\^[+-]?[1-9]\d*)? # Capture 9
      |                 # or
        1               # one permitted, e.g. in 1/s
      )
  )*
  (?:    # Optional: possibly multiplied units underneath a denominator sign
      \/(  # Capture 10
        (Y|Z|E|P|T|G|M|k|h|da|d|c|m|µ|n|p|f|a|z|y)?  # Capture 11
        (m|g|s|A|K|mol|cd|Hz|N|Pa|J|W|C|V|F|Ω|S|Wb|T|H|lm|lx|Bq|Gy|Sv|kat|l|L) # Capture 12
        (\^[+-]?[1-9]\d*)? # Capture 13
      |                 # or
        1               # one permitted, e.g. in 1/s
      )
      (?:    # Zero or more repetitions of one unit, plus the multiplication sign
         ·(  # Capture 14
            (Y|Z|E|P|T|G|M|k|h|da|d|c|m|µ|n|p|f|a|z|y)?  # Capture 15
            (m|g|s|A|K|mol|cd|Hz|N|Pa|J|W|C|V|F|Ω|S|Wb|T|H|lm|lx|Bq|Gy|Sv|kat|l|L) # Capture 16
            (\^[+-]?[1-9]\d*)? # Capture 17
          |                 # or
            1               # one permitted, e.g. in 1/s
          )
      )*
  )?
)

I have included the litre as a unit, even though it is not an SI unit. I also require the middle dot (·) as the multiplication sign. You may modify this if needed. If you construct the regular expression from several base strings, it becomes much easier to grasp:

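# Raw strings (Python syntax) keep the regex backslashes intact.
prefix = r"(Y|Z|E|P|T|G|M|k|h|da|d|c|m|µ|n|p|f|a|z|y)"
unit = r"(m|g|s|A|K|mol|cd|Hz|N|Pa|J|W|C|V|F|Ω|S|Wb|T|H|lm|lx|Bq|Gy|Sv|kat|l|L)"
power = r"(\^[+-]?[1-9]\d*)"
# One factor: optional prefix, unit, optional power; or the literal 1.
unitAndPrefix = "(" + prefix + "?" + unit + power + "?" + "|1" + ")"
# One or more factors joined by the multiplication sign.
multiplied = unitAndPrefix + "(?:·" + unitAndPrefix + ")*"
# A product, optionally followed by a single denominator product.
withDenominator = multiplied + r"(?:\/" + multiplied + ")?"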

The regular expression does not do any consistency checking, of course; it also accepts things like kg^-1·kg^-1·1/kg^-2 as valid.

Of course, you can modify the regular expression as required, e.g. by using * as the multiplication character, etc.
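
As a rough sketch, the same building blocks can be compiled and applied to whole inputs in Python. The upper-case names and `is_composed_unit` are my own, I use non-capturing groups since only a yes/no answer is needed, and `re.fullmatch` is an assumption about how the form input would be checked.

```python
import re

# Building blocks mirroring the pattern above (non-capturing groups).
PREFIX = r"(?:Y|Z|E|P|T|G|M|k|h|da|d|c|m|µ|n|p|f|a|z|y)"
UNIT = r"(?:m|g|s|A|K|mol|cd|Hz|N|Pa|J|W|C|V|F|Ω|S|Wb|T|H|lm|lx|Bq|Gy|Sv|kat|l|L)"
POWER = r"(?:\^[+-]?[1-9]\d*)"
# One factor: optional prefix, unit, optional power; or the literal 1.
FACTOR = rf"(?:{PREFIX}?{UNIT}{POWER}?|1)"
# Factors joined by the middle dot.
PRODUCT = rf"{FACTOR}(?:·{FACTOR})*"
# A product with an optional single denominator product.
COMPOSED_UNIT = re.compile(rf"{PRODUCT}(?:/{PRODUCT})?")

def is_composed_unit(text: str) -> bool:
    """Return True only if the entire input is a composed unit string."""
    return COMPOSED_UNIT.fullmatch(text) is not None

for sample in ["kg/m^3", "kg·m/s^2", "1/s", "kg*m", "m^0"]:
    print(sample, is_composed_unit(sample))
```

Swapping the `·` in `PRODUCT` for `\*` would switch the multiplication character to an asterisk.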

nd
+1 for a very complete answer. Just a remark: you could also allow an optional + sign for the power, i.e. (\^[+-]?[1-9]\d*)?
M42
+1 Insanity ;-) Nice answer, and yes, complete. But the complexity of this regular expression also demonstrates why you shouldn't use a regular expression for this problem. A parser for a context-free language would be cleaner and more expressive.
lasseespeholt
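
To illustrate the context-free approach, here is a minimal recursive-descent sketch in Python. Everything in it is an assumption for illustration: the grammar, the small set of base units, `*` as the multiplication sign, and the names `tokenize`, `Parser` and `is_valid_unit`. Unlike the regular expressions above, it accepts nested parentheses such as kg/(m*s^2).

```python
import re

# Assumed token set: a few base units, signed integers, and * / ^ ( ).
TOKEN = re.compile(r"kg|mol|cd|m|s|A|K|[-+]?\d+|[*/^()]")
UNITS = {"kg", "m", "s", "A", "K", "cd", "mol"}

def tokenize(text):
    """Split the input into tokens, rejecting anything unrecognised."""
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group())
        pos = match.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):
        # expr := factor (('*' | '/') factor)*
        self.factor()
        while self.peek() in ("*", "/"):
            self.take()
            self.factor()

    def factor(self):
        # factor := atom ('^' integer)?
        self.atom()
        if self.peek() == "^":
            self.take()
            tok = self.take()
            if tok is None or not re.fullmatch(r"[-+]?\d+", tok):
                raise ValueError("expected an integer exponent")

    def atom(self):
        # atom := unit | '1' | '(' expr ')'
        tok = self.take()
        if tok == "(":
            self.expr()
            if self.take() != ")":
                raise ValueError("expected ')'")
        elif tok not in UNITS and tok != "1":
            raise ValueError(f"unexpected token {tok!r}")

def is_valid_unit(text):
    """Return True if the whole input parses as a unit expression."""
    try:
        parser = Parser(tokenize(text))
        parser.expr()
        return parser.peek() is None  # no trailing tokens allowed
    except ValueError:
        return False

for sample in ["kg/m^3", "kg/(m*s^2)", "1/s", "kg^(kg/m)", "kg/(m*s^2"]:
    print(sample, is_valid_unit(sample))
```

Each grammar rule maps to one method, so adding prefixes or derived units only requires extending the tokenizer and the `UNITS` set.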