I have two questions:

1) I have a regular expression ([A-Z][a-z]{0,2})(\d*) and I am using Python's re.finditer() to match appropriate strings. My problem is that I want to match only strings that contain no extra characters; otherwise I want to raise an exception.

I want to match the following pattern: capital letter, followed by 0, 1 or 2 small letters, followed by 0 or more numbers.

The pattern represents a chemical formula, i.e. an atom followed by the number of its occurrences. I want to put each atom into a dictionary with its number of occurrences, so I need to separate atoms (capital letter followed by 0, 1 or 2 small letters) from numbers, while remembering that they belong together.

Example:

C6H5Fe2I   # this string should be matched successfully. Result: C6 H5 Fe2 I
H2TeO4     # this string should be matched successfully Result: H2 Te O4
H3PoooO5   # exception should be raised
C2tH6      # exception should be raised

2) My second question is what kind of exception I should raise when the input string is invalid.

Thank you, Tomas

+2  A: 

You need a slightly different regex:

^([A-Z][a-z]{0,2})(\d*)$

which won't match any of your example strings, however. You need to describe more precisely why those strings are supposed to match.

To test whether the whole string matches, you could use:

>>> re.match(r'(([A-Z][a-z]{,2})(\d*))+$', 'H2TeO4')
<_sre.SRE_Match object at 0x920f520>
>>> re.match(r'(([A-Z][a-z]{,2})(\d*))+$', 'H3PoooO5')
>>> 

I didn't find a pure regex solution, but here is how to test and collect the matches:

>>> s = 'C6H5Fe2I'
>>> res = re.findall(r'([A-Z][a-z]{,2})(\d*)(?=(?:[A-Z][a-z]{,2}\d*|$))', s)
>>> res
[('C', '6'), ('H', '5'), ('Fe', '2'), ('I', '')]
>>> ''.join(''.join(i) for i in res) == s
True
SilentGhost
Yes, exactly. This was the first thing that I tried, but I didn't match anything.
Tomas Novotny
@Manoj, instead of debating what the OP might or might not want, let's just ask the OP for clarification? Both SilentGhost's and Michel's answers look okay to me.
Bart Kiers
@Bart: See my comment attached to the OP's question. I did ask him for clarification.
Manoj Govindan
@Manoj, yes, I see that now. And I see that the OP clarified him/herself now by indicating what the expected output should be. My point is that before getting this clarification, there's no need to comment on other people's answers.
Bart Kiers
@Tomas, @Manoj, @Bart: see my edit.
SilentGhost
@Bart: I beg to differ about not posting comments. I base my comments on "do unto others ...". If ***I*** misread a question and provided an answer (that didn't meet the requirement), AND someone else saw it and understood it, then ***I'd*** like to be told. This would help *me* make changes (and it has in the past).
Manoj Govindan
@Bart: That said, I fully agree that the OP needs to make his requirement clear. I kind of guessed from the `finditer` that he was trying to match repeatedly in each string.
Manoj Govindan
@Manoj, okay, then we have a different opinion about it. As I said, IMO, the original question was rather vague and needed clarification first before dismissing other people's answers.
Bart Kiers
@Bart: Agreed about the clarity. And I was certainly not trying to dismiss other people's answers! There is alas no tone in text; otherwise I would have used a curious-helpful tone for my comments. I'm knocking off my comments anyhoo.
Manoj Govindan
@Manoj, yes, that is always a problem: how to get ones message over in such a way it was actually meant (especially in these small comment-boxes). Thanks for *your* clarification :)
Bart Kiers
+4  A: 

Here are a few different approaches you could use:

Compare lengths

  • Find the length of the original string.
  • Sum the length of the matched strings.
  • If the two numbers differ there were unused characters.

Note that you can also combine this method with your existing code rather than doing it as an extra step if you want to avoid parsing the string twice.
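A minimal sketch of the length comparison, done in the same pass as the matching. The function name parse_by_length is made up for illustration, and ValueError is just one reasonable choice of exception:

```python
import re

# Same pattern as in the question: atom symbol, then an optional count.
ATOM = re.compile(r'([A-Z][a-z]{0,2})(\d*)')

def parse_by_length(formula):
    """Return (atom, count) pairs; raise ValueError if any character is unused.

    parse_by_length is a hypothetical helper name, not part of any library.
    """
    pairs = []
    matched = 0
    for m in ATOM.finditer(formula):
        matched += len(m.group(0))            # length of this atom+count chunk
        pairs.append((m.group(1), m.group(2)))
    if matched != len(formula):               # finditer skipped something
        raise ValueError('unexpected characters in %r' % formula)
    return pairs
```

Note that an empty string passes this check (zero characters matched, zero characters in total), so reject it separately if that matters to you.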

Regular expression for entire string

You can check if this regular expression matches the entire string:

^([A-Z][a-z]{0,2}\d*)*$

(Rubular)
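In Python this check could look roughly like the sketch below. It deviates from the regex above in two ways, both of which are my assumptions about what you want: it uses re.fullmatch (Python 3.4+) instead of ^...$, because $ also matches just before a trailing newline, and it uses + instead of * so that the empty string is rejected. The name is_valid_formula is made up.

```python
import re

# One or more atom+count chunks; fullmatch() requires the whole string to match.
FORMULA = re.compile(r'(?:[A-Z][a-z]{0,2}\d*)+')

def is_valid_formula(text):
    """Return True if text consists only of atom+count chunks (hypothetical helper)."""
    return FORMULA.fullmatch(text) is not None
```

This only validates; it does not give you the individual atoms, so you would still run finditer() or findall() afterwards.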

Tokenize

You can use the following regular expression to tokenize the original string:

[A-Z][^A-Z]*

Then check each token to see if it matches your original regular expression.
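A rough sketch of the tokenize-then-check idea. The helper name tokenize_formula is hypothetical, and the extra join() comparison is my addition: findall() silently skips anything that comes before the first capital letter, so that case needs its own check.

```python
import re

TOKEN = re.compile(r'[A-Z][^A-Z]*')             # a capital letter plus everything up to the next one
ATOM = re.compile(r'([A-Z][a-z]{0,2})(\d*)')    # the pattern from the question

def tokenize_formula(formula):
    """Split formula into (atom, count) pairs or raise ValueError (hypothetical helper)."""
    tokens = TOKEN.findall(formula)
    # Anything before the first capital letter is dropped by findall().
    if ''.join(tokens) != formula:
        raise ValueError('%r does not start with a capital letter' % formula)
    pairs = []
    for token in tokens:
        m = ATOM.fullmatch(token)               # the whole token must be atom+count
        if m is None:
            raise ValueError('invalid token %r in %r' % (token, formula))
        pairs.append(m.groups())
    return pairs
```

An advantage of this variant is that the error message can name the offending token, e.g. 'Pooo' for H3PoooO5.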

Mark Byers
Nice approach. +1
JoshD
First regex: why start with `^`? Why end with `$` instead of `\Z`? Shouldn't 2nd `*` be `+`?
John Machin
+3  A: 

capital letter, followed by 0, 1 or 2 small letters, followed by 0 or more numbers

Ok then.

/^([A-Z][a-z]{0,2}\d*)+$/

Difference here being the extra grouping (foo)+ within the ^$ allowing you to capture pattern foo N times.

No global flag? Guess you'll have to split the result of that regex on the pattern again then.
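To illustrate what that means in Python (a sketch, with variable names chosen only for this example): a repeated capturing group keeps just its last repetition, so the anchored regex is good for validation, and a second pass with findall() recovers the pieces.

```python
import re

whole = re.compile(r'^([A-Z][a-z]{0,2}\d*)+$')
atom = re.compile(r'([A-Z][a-z]{0,2})(\d*)')

text = 'C6H5Fe2I'
m = whole.match(text)

# The repeated group only remembers its final repetition.
last_only = m.group(1)

# Second pass: split the validated string into its atom/count parts.
parts = atom.findall(text) if m else []
```

Here last_only is 'I', while parts holds all four (atom, count) tuples.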

annakata
Last time I checked, python had no G flag, so you won't get the individual matches..
poke
Wow, for real? How do you guys cope?
annakata
There is a findall() function, though. And sub() is always considered global. Normally it doesn't cause any problems
AHM
findall() is a good substitute for /g, you also might want to use the modifiers re.MULTILINE and/or re.DOTALL if you've got a blob of text that includes newlines.
synthesizerpatel
A: 

Use this pattern

(([A-Z][a-z]{0,2})(\d*))+

If it matches, great! If not, then handle it. I see no reason to raise an exception if it doesn't match. You'll have to provide more info.

JoshD
@JoshD This will match the empty string, which is not accepted by the original implementation.
Darth Android
@Darth Android: Thanks; changed.
JoshD
If the application requires a well formatted string, and he splits that string into the needed chunks and notices that it is invalid, raising an exception is the most appropriate thing to do.
poke
@poke: I agree, I would just like it if that was explicitly stated... I guess any error, even a custom one could be raised then.
JoshD
+2  A: 

Do you need to extract each individual part to process, or simply match for input validation? If you just need to match for validation, try ^([A-Z][a-z]{0,2}\d*)+$.

Darth Android
I like your solution. +1
JoshD
+2  A: 
>>> import re
>>> reMatch = re.compile( r'([A-Z][a-z]{0,2})(\d*)' )
>>> def matchText ( text ):
        matches, i = [], 0
        for m in reMatch.finditer( text ):
            if m.start() > i:
                break
            matches.append( m )
            i = m.end()
        else:
            if i == len( text ):
                return matches
        raise ValueError( 'invalid text' )

>>> matchText( 'C6H5Fe2I' )
[<_sre.SRE_Match object at 0x021E2800>, <_sre.SRE_Match object at 0x021E28D8>, <_sre.SRE_Match object at 0x021E2920>, <_sre.SRE_Match object at 0x021E2968>]
>>> matchText( 'H2TeO4' )
[<_sre.SRE_Match object at 0x021E2890>, <_sre.SRE_Match object at 0x021E29F8>, <_sre.SRE_Match object at 0x021E2A40>]
>>> matchText( 'H3PoooO5' )
Traceback (most recent call last):
  File "<pyshell#6>", line 1, in <module>
    matchText( 'H3PoooO5' )
  File "<pyshell#3>", line 11, in matchText
    raise ValueError( 'invalid text' )
ValueError: invalid text
>>> matchText( 'C2tH6' )
Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
    matchText( 'C2tH6' )
  File "<pyshell#3>", line 11, in matchText
    raise ValueError( 'invalid text' )
ValueError: invalid text

To answer your second question a bit more clearly than with the code above: a ValueError is meant for cases where an argument has the correct type but an inappropriate value. For a function that validates a string against a regex, it is the natural choice.
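If you want callers to be able to tell formula errors apart from other ValueErrors, one option is a small subclass. The sketch below also builds the dictionary you asked about. FormulaError and count_atoms are names I made up, and treating a missing number as 1 and summing repeated atoms are assumptions on my part:

```python
import re

class FormulaError(ValueError):
    """Hypothetical exception for malformed formulas; still caught by `except ValueError`."""

ATOM = re.compile(r'([A-Z][a-z]{0,2})(\d*)')

def count_atoms(formula):
    """Return {atom: count}; raise FormulaError on any character that does not fit."""
    counts = {}
    pos = 0
    while pos < len(formula):
        m = ATOM.match(formula, pos)      # must match exactly at pos, no skipping
        if m is None:
            raise FormulaError('unexpected %r at position %d in %r'
                               % (formula[pos], pos, formula))
        symbol, digits = m.groups()
        # Assumption: a missing number means one occurrence.
        counts[symbol] = counts.get(symbol, 0) + (int(digits) if digits else 1)
        pos = m.end()
    return counts
```

With this, count_atoms('C6H5Fe2I') gives {'C': 6, 'H': 5, 'Fe': 2, 'I': 1}, and 'H3PoooO5' fails at the third 'o'. An empty string returns an empty dictionary; add a check if that should be an error too.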

poke
A: 

For validation, if you are using the .NET Framework, try:

([A-Z][a-b]??[0-9]??)*

Otherwise:

([A-Z][a-b]?[a-b]?[0-9]?[0-9]?)*
Mario
(1) Python supports reluctant quantifiers too, (2) those two regexes are not equivalent, and (3) both of them are wrong. `[a-z]??` matches at most one letter (the OP said there could be up to two); while `[0-9]?[0-9]?` matches a maximum of two digits (the OP didn't place any limit on those).
Alan Moore
(1) I don't know anything about Python. (2) Those regexes are equivalent in the .NET Framework, because [a-z]?? is equivalent to [a-z]?[a-z]? there, so it takes between 0 and 2 characters between a and z. (3) For an unlimited number of digits, replace [0-9]?? with [0-9]*.
Mario
@Mario, "?? Matches the previous element zero or one time, but as few times as possible." So you are simply wrong, the extra "?" suffix switches matching to modest (non-greedy) mode.
macias
A: 

My go without regexp:

tests= (
'C6H5Fe2I',   # this string should be matched successfully. Result: C6 H5 Fe2 I
'H2TeO4',     # this string should be matched successfully Result: H2 Te O4
'H3PoooO5',   # exception should be raised
'C2tH6')      # exception should be raised

def splitter(case):
    case, original = list(case), case
    while case:
        if case[0].isupper():
            result = case.pop(0)
        else:
            raise ValueError('%r is not a capital letter in %s at position %i.' %
                             (case[0], original, len(original)-len(case)))
        for count in range(2):
            if case and case[0].islower():
                result += case.pop(0)
            else:
                break
        # The question allows any number of digits, so consume them all.
        while case and case[0].isdigit():
            result += case.pop(0)
        yield result

for testcase in tests:
    try:
        print(tuple(splitter(testcase)))
    except ValueError as e:
        print(e)
Tony Veijalainen
A: 

You can do this in not much code with re.split -- yes, that's correct, re.split.

See the documentation for re.split for the details.

Invert your problem: split your input with a delimiter pattern that matches a valid atom+count. Have a capturing group so that the delimiter strings are kept. If the input string is valid, the non-delimiters in the result will all be empty strings.

>>> tests= (
... 'C6H5Fe2I',
... 'H2TeO4',
... 'H3PoooO5',
... 'C2tH6',
... 'Bad\n')
>>> import re
>>> pattern = r'([A-Z][a-z]{0,2}\d*)'
>>> for test in tests:
...     pieces = re.split(pattern, test)
...     print("\ntest=%r pieces=%r" % (test, pieces))
...     data = pieces[1::2]
...     rubbish = [p for p in pieces[0::2] if p]
...     print("rubbish=%r data=%r" % (rubbish, data))
...

test='C6H5Fe2I' pieces=['', 'C6', '', 'H5', '', 'Fe2', '', 'I', '']
rubbish=[] data=['C6', 'H5', 'Fe2', 'I']

test='H2TeO4' pieces=['', 'H2', '', 'Te', '', 'O4', '']
rubbish=[] data=['H2', 'Te', 'O4']

test='H3PoooO5' pieces=['', 'H3', '', 'Poo', 'o', 'O5', '']
rubbish=['o'] data=['H3', 'Poo', 'O5']

test='C2tH6' pieces=['', 'C2', 't', 'H6', '']
rubbish=['t'] data=['C2', 'H6']

test='Bad\n' pieces=['', 'Bad', '\n']
rubbish=['\n'] data=['Bad']
>>>
John Machin