tags:

views:

2602

answers:

5

Thinking about my other problem, i decided I can't even create a regular expression that will match roman numerals (let alone a context-free grammar that will generate them)

The problem is matching only valid roman numerals. Eg, 990 is NOT "XM", it's "CMXC"

My problem in making the regex for this is that in order to allow or not allow certain characters, I need to look back. Let's take thousands and hundreds, for example.

I can allow M{0,2}C?M (to allow for 900, 1000, 1900, 2000, 2900 and 3000). However, If the match is on CM, I can't allow following characters to be C or D (because I'm already at 900).

How can I express this in a regex?
If it's simply not expressible in a regex, is it expressible in a context-free grammar?

Thanks for any pointers!

+8  A: 

Dive Into Python explains how you can do this in 7.3 Case Study: Roman Numerals.

yjerem
+17  A: 

Try:

^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$

Breaking it down:


M{0,4}

This specifies the thousands section and basically restrains it to between 0 and 4000. It's a relatively simple:

   0: <empty>  matched by M{0}
1000: M        matched by M{1}
2000: MM       matched by M{2}
3000: MMM      matched by M{3}
4000: MMMM     matched by M{4}

(CM|CD|D?C{0,3})

Slightly more complex, this is for the hundreds section and covers all the possibilities:

  0: <empty>  matched by D?C{0} (with D not there)
100: C        matched by D?C{1} (with D not there)
200: CC       matched by D?C{2} (with D not there)
300: CCC      matched by D?C{3} (with D not there)
400: CD       matched by CD
500: D        matched by D?C{0} (with D there)
600: DC       matched by D?C{1} (with D there)
700: DCC      matched by D?C{2} (with D there)
800: DCCC     matched by D?C{3} (with D there)
900: CM       matched by CM

(XC|XL|L?X{0,3})

Same rules as previous section but for the tens place:

 0: <empty>  matched by L?X{0} (with L not there)
10: X        matched by L?X{1} (with L not there)
20: XX       matched by L?X{2} (with L not there)
30: XXX      matched by L?X{3} (with L not there)
40: XL       matched by XL
50: L        matched by L?X{0} (with L there)
60: LX       matched by L?X{1} (with L there)
70: LXX      matched by L?X{2} (with L there)
80: LXXX     matched by L?X{3} (with L there)
90: XC       matched by XC

(IX|IV|V?I{0,3})

This is the units section, handling 0 through 9 and also similar to the previous two sections (Roman numerals, despite their seeming weirdness, follow some logical rules once you figure out what they are):

0: <empty>  matched by V?I{0} (with V not there)
1: I        matched by V?I{1} (with V not there)
2: II       matched by V?I{2} (with V not there)
3: III      matched by V?I{3} (with V not there)
4: IV       matched by IV
5: V        matched by V?I{0} (with V there)
6: VI       matched by V?I{1} (with V there)
7: VII      matched by V?I{2} (with V there)
8: VIII     matched by V?I{3} (with V there)
9: IX       matched by IX
paxdiablo
+1 Bet you weren't expecting to get anymore upvotes on this old answer were you ;)
AnthonyWJones
Shouldn't it be M{0,3}?
lemon
Possibly. I can see only inferences in the question that it needs to go up to 3999 (not a specific definite requirement) but I allow it up to 4999 anyway. If you truly want to restrict it to 3999 then by all means remove one of the Ms.
paxdiablo
Thanks, helped a lot!
Raj
+5  A: 

Fortunately, the range of numbers is limited to 1..3999 or thereabouts. Therefore, you can build up the regex piece-meal.

<opt-thousands-part><opt-hundreds-part><opt-tens-part><opt-units-part>

Each of those parts will deal with the vagaries of Roman notation. For example, using Perl notation:

<opt-hundreds-part> = m/(CM|DC{0,3}|CD|C{1,3})?/;

Repeat and assemble.

Added: The <opt-hundreds-part> can be compressed further:

<opt-hundreds-part> = m/(C[MD]|D?C{0,3})/;

Since the 'D?C{0,3}' clause can match nothing, there's no need for the question mark. And, most likely, the parentheses should be the non-capturing type - in Perl:

<opt-hundreds-part> = m/(?:C[MD]|D?C{0,3})/;

Of course, it should all be case-insensitive, too.

You can also extend this to deal with the options mentioned by James Curran (to allow XM or IM for 990 or 999, and CCCC for 400, etc).

<opt-hundreds-part> = m/(?:[IXC][MD]|D?C{0,4})/;
Jonathan Leffler
+4  A: 

Actually, your premise is flawed. 990 IS "XM", as well as "CMXC".

The Romans were far less concerned about the "rules" than your third grade teacher. As long as it added up, it was OK. Hence "IIII" was just as good as "IV" for 4. And "IIM" was completely cool for 998.

(If you have trouble dealing with that... Remember English spellings were not formalized until the 1700s. Until then, as long as the reader could figure it out, it was good enough).

James Curran
Humanity's funny that way. The rules are rules, except when they're not.
sblundy
Sure, that's cool. But my "strict third grade teacher" syntax need makes a much more interesting regex problem, in my opinion...
Daniel Magliola
A: 

As Jeremy and Pax pointed out above ...

'^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'
should be the solution you're after ...

The specific URL that should have been attached (IMHO) is http://thehazeltree.org/diveintopython/7.html

Example 7.8 is the short form using {n,m}