ansaurus

Question

How do you match only valid roman numerals with a regular expression?

Answer 1

+8 A:

Dive Into Python explains how you can do this in 7.3 Case Study: Roman Numerals.

yjerem 2008-11-06 01:33:09

Answer 2

+17 A:

Try:

^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$

Breaking it down:

M{0,4}

This specifies the thousands section and basically restrains it to between 0 and 4000. It's a relatively simple:

   0: <empty>  matched by M{0}
1000: M        matched by M{1}
2000: MM       matched by M{2}
3000: MMM      matched by M{3}
4000: MMMM     matched by M{4}

(CM|CD|D?C{0,3})

Slightly more complex, this is for the hundreds section and covers all the possibilities:

  0: <empty>  matched by D?C{0} (with D not there)
100: C        matched by D?C{1} (with D not there)
200: CC       matched by D?C{2} (with D not there)
300: CCC      matched by D?C{3} (with D not there)
400: CD       matched by CD
500: D        matched by D?C{0} (with D there)
600: DC       matched by D?C{1} (with D there)
700: DCC      matched by D?C{2} (with D there)
800: DCCC     matched by D?C{3} (with D there)
900: CM       matched by CM

(XC|XL|L?X{0,3})

Same rules as previous section but for the tens place:

 0: <empty>  matched by L?X{0} (with L not there)
10: X        matched by L?X{1} (with L not there)
20: XX       matched by L?X{2} (with L not there)
30: XXX      matched by L?X{3} (with L not there)
40: XL       matched by XL
50: L        matched by L?X{0} (with L there)
60: LX       matched by L?X{1} (with L there)
70: LXX      matched by L?X{2} (with L there)
80: LXXX     matched by L?X{3} (with L there)
90: XC       matched by XC

(IX|IV|V?I{0,3})

This is the units section, handling 0 through 9 and also similar to the previous two sections (Roman numerals, despite their seeming weirdness, follow some logical rules once you figure out what they are):

0: <empty>  matched by V?I{0} (with V not there)
1: I        matched by V?I{1} (with V not there)
2: II       matched by V?I{2} (with V not there)
3: III      matched by V?I{3} (with V not there)
4: IV       matched by IV
5: V        matched by V?I{0} (with V there)
6: VI       matched by V?I{1} (with V there)
7: VII      matched by V?I{2} (with V there)
8: VIII     matched by V?I{3} (with V there)
9: IX       matched by IX

paxdiablo 2008-11-06 01:35:41

+1 Bet you weren't expecting to get anymore upvotes on this old answer were you ;)

AnthonyWJones 2009-12-04 14:02:52

Shouldn't it be M{0,3}?

lemon 2010-03-25 02:18:58

Possibly. I can see only inferences in the question that it needs to go up to 3999 (not a specific definite requirement) but I allow it up to 4999 anyway. If you truly want to restrict it to 3999 then by all means remove one of the Ms.

paxdiablo 2010-03-25 02:41:26

Thanks, helped a lot!

Raj 2010-06-29 09:57:22

Answer 3

+5 A:

Fortunately, the range of numbers is limited to 1..3999 or thereabouts. Therefore, you can build up the regex piece-meal.

<opt-thousands-part><opt-hundreds-part><opt-tens-part><opt-units-part>

Each of those parts will deal with the vagaries of Roman notation. For example, using Perl notation:

<opt-hundreds-part> = m/(CM|DC{0,3}|CD|C{1,3})?/;

Repeat and assemble.

Added: The <opt-hundreds-part> can be compressed further:

<opt-hundreds-part> = m/(C[MD]|D?C{0,3})/;

Since the 'D?C{0,3}' clause can match nothing, there's no need for the question mark. And, most likely, the parentheses should be the non-capturing type - in Perl:

<opt-hundreds-part> = m/(?:C[MD]|D?C{0,3})/;

Of course, it should all be case-insensitive, too.

You can also extend this to deal with the options mentioned by James Curran (to allow XM or IM for 990 or 999, and CCCC for 400, etc).

<opt-hundreds-part> = m/(?:[IXC][MD]|D?C{0,4})/;

Jonathan Leffler 2008-11-06 01:36:18

Answer 4

+4 A:

Actually, your premise is flawed. 990 IS "XM", as well as "CMXC".

The Romans were far less concerned about the "rules" than your third grade teacher. As long as it added up, it was OK. Hence "IIII" was just as good as "IV" for 4. And "IIM" was completely cool for 998.

(If you have trouble dealing with that... Remember English spellings were not formalized until the 1700s. Until then, as long as the reader could figure it out, it was good enough).

James Curran 2008-11-06 01:51:36

Humanity's funny that way. The rules are rules, except when they're not.

sblundy 2008-11-06 01:57:09

Sure, that's cool. But my "strict third grade teacher" syntax need makes a much more interesting regex problem, in my opinion...

Daniel Magliola 2008-11-06 03:03:30

Answer 5

A:

As Jeremy and Pax pointed out above ...

'^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'

should be the solution you're after ...

The specific URL that should have been attached (IMHO) is http://thehazeltree.org/diveintopython/7.html

Example 7.8 is the short form using {n,m}

2008-11-06 03:03:07

ansaurus

tags:

views:

answers:

How do you match only valid roman numerals with a regular expression?

related questions