ansaurus

Question

How can I check that a regular expression has matched a string completely, i.e. that the string did not contain any extra characters?

Answer 1

+2 A:

You need a slightly different regex:

^([A-Z][a-z]{0,2})(\d*)$

which won't match any of your example strings, however. You need to provide a better description of why those strings are supposed to match.

Just to test whether the whole string matches, you could use:

>>> re.match(r'(([A-Z][a-z]{,2})(\d*))+$', 'H2TeO4')
<_sre.SRE_Match object at 0x920f520>
>>> re.match(r'(([A-Z][a-z]{,2})(\d*))+$', 'H3PoooO5')
>>>

I didn't find a pure regex solution, but here is how to test and collect matches:

>>> s = 'C6H5Fe2I'
>>> res = re.findall(r'([A-Z][a-z]{,2})(\d*)(?=(?:[A-Z][a-z]{,2}\d*|$))', s)
>>> res
[('C', '6'), ('H', '5'), ('Fe', '2'), ('I', '')]
>>> ''.join(''.join(i) for i in res) == s
True

SilentGhost 2010-10-07 07:35:59

Yes, exactly. This was the first thing that I tried, but it didn't match anything.

Tomas Novotny 2010-10-07 07:39:24

@Manoj, instead of debating what the OP might or might not want, let's just ask the OP for clarification? Both SilentGhost's and Michel's answers look okay to me.

Bart Kiers 2010-10-07 07:39:57

@Bart: See my comment attached to the OP's question. I did ask him for clarification.

Manoj Govindan 2010-10-07 07:44:28

@Manoj, yes, I see that now. And I see that the OP clarified him/herself now by indicating what the expected output should be. My point is that before getting this clarification, there's no need to comment on other people's answers.

Bart Kiers 2010-10-07 07:49:00

@Tomas, @Manoj, @Bart: see my edit.

SilentGhost 2010-10-07 07:51:15

@Bart: I beg to differ about not posting comments. I base my comments on "do unto others ...". If ***I*** misread a question and provided an answer (that didn't meet the requirement), AND someone else saw it and understood it, then ***I'd*** like to be told. This would help *me* make changes (and it has in the past).

Manoj Govindan 2010-10-07 07:52:03

@Bart: That said, I fully agree that the OP needs to make his requirement clear. I kind of guessed from the `finditer` that he was trying to match repeatedly in each string.

Manoj Govindan 2010-10-07 07:54:02

@Manoj, okay, then we have a different opinion about it. As I said, IMO, the original question was rather vague and needed clarification first before dismissing other people's answers.

Bart Kiers 2010-10-07 07:55:35

@Bart: Agreed about the clarity. And I was certainly not trying to dismiss other people's answers! There is alas no tone in text; otherwise I would have used a curious-helpful tone for my comments. I'm knocking off my comments anyhoo.

Manoj Govindan 2010-10-07 07:57:36

@Manoj, yes, that is always a problem: how to get one's message across the way it was actually meant (especially in these small comment-boxes). Thanks for *your* clarification :)

Bart Kiers 2010-10-07 08:03:13

Answer 2

+4 A:

Here's a few different approaches you could use:

Compare lengths

Find the length of the original string.
Sum the length of the matched strings.
If the two numbers differ there were unused characters.

Note that you can also combine this method with your existing code rather than doing it as an extra step if you want to avoid parsing the string twice.
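
A minimal sketch of the length comparison, assuming the pattern described in the question (one capital letter, up to two lowercase letters, then any number of digits). The names `ATOM` and `fully_consumed` are made up for illustration:

```python
import re

# Assumed pattern: capital letter, 0-2 lowercase letters, 0 or more digits.
ATOM = re.compile(r'[A-Z][a-z]{0,2}\d*')

def fully_consumed(text):
    """Return True if the matches of ATOM account for every character."""
    # finditer yields non-overlapping matches, so if their lengths add up
    # to len(text) there can be no unmatched characters in between.
    matched = sum(len(m.group(0)) for m in ATOM.finditer(text))
    return matched == len(text)

print(fully_consumed('C6H5Fe2I'))   # True
print(fully_consumed('H3PoooO5'))   # False: one 'o' is left over
```

Note that an empty string also passes this check, so it may need to be rejected separately.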

Regular expression for entire string

You can check if this regular expression matches the entire string:

^([A-Z][a-z]{0,2}\d*)*$

(Rubular)
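
In Python 3.4 and later, `re.fullmatch` can do the whole-string check without anchors. The sketch below assumes that version; `is_valid` is a hypothetical helper name. It also shows one difference from the anchored form: `$` still matches just before a trailing newline, while `fullmatch` does not accept one.

```python
import re

# Same pattern as above, without the ^ and $ anchors.
WHOLE = re.compile(r'([A-Z][a-z]{0,2}\d*)*')

def is_valid(text):
    # fullmatch succeeds only if the pattern spans the entire string.
    return WHOLE.fullmatch(text) is not None

print(is_valid('H2TeO4'))     # True
print(is_valid('H3PoooO5'))   # False
print(is_valid('Bad\n'))      # False

# The anchored version accepts the trailing newline:
anchored = re.match(r'^([A-Z][a-z]{0,2}\d*)*$', 'Bad\n')
print(anchored is not None)   # True
```

Because the group is repeated with `*`, both forms accept the empty string; use `+` if that should be rejected.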

Tokenize

You can use the following regular expression to tokenize the original string:

[A-Z][^A-Z]*

Then check each token to see if it matches your original regular expression.
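
A rough sketch of the tokenize-then-check idea, assuming Python 3.4+ for `fullmatch`. The names `TOKEN`, `ATOM` and `split_formula` are illustrative, and raising `ValueError` is an assumption about how invalid input should be reported:

```python
import re

TOKEN = re.compile(r'[A-Z][^A-Z]*')          # tokenizer from above
ATOM = re.compile(r'[A-Z][a-z]{0,2}\d*')     # assumed per-token pattern

def split_formula(text):
    tokens = TOKEN.findall(text)
    # findall silently skips anything before the first capital letter,
    # so make sure the tokens cover the whole input.
    if ''.join(tokens) != text:
        raise ValueError('invalid text: %r' % text)
    for token in tokens:
        if ATOM.fullmatch(token) is None:
            raise ValueError('invalid token %r in %r' % (token, text))
    return tokens

print(split_formula('C6H5Fe2I'))   # ['C6', 'H5', 'Fe2', 'I']
```

With this approach the error message can name the offending token, for example 'Pooo' in 'H3PoooO5'.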

Mark Byers 2010-10-07 07:36:44

Nice approach. +1

JoshD 2010-10-07 07:53:03

First regex: why start with `^`? Why end with `$` instead of `\Z`? Shouldn't 2nd `*` be `+`?

John Machin 2010-10-07 11:36:12

Answer 3

+3 A:

capital letter, followed by 0, 1 or 2 small letters, followed by 0 or more numbers

Ok then.

/^([A-Z][a-z]{0,2}\d*)+$/

Difference here being the extra grouping (foo)+ within the ^$ allowing you to capture pattern foo N times.

No global flag? Guess you'll have to split the result of that regex on the pattern again then.
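
For Python, a sketch of that two-step approach might look like the following. It validates with the anchored pattern first and then extracts the pieces with `findall`. Two things here are assumptions rather than part of the answer above: `\Z` is swapped in for `$` so that a trailing newline is rejected, and the name `parse` is hypothetical.

```python
import re

VALID = re.compile(r'^([A-Z][a-z]{0,2}\d*)+\Z')
PART = re.compile(r'([A-Z][a-z]{0,2})(\d*)')

def parse(text):
    if VALID.match(text) is None:
        raise ValueError('invalid text: %r' % text)
    # The input is known to be valid, so findall cannot skip anything.
    return PART.findall(text)

print(parse('C6H5Fe2I'))
# [('C', '6'), ('H', '5'), ('Fe', '2'), ('I', '')]

# A repeated group only remembers its last repetition, which is why
# the second pass is needed:
last = VALID.match('C6H5Fe2I').group(1)
print(last)   # I
```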

annakata 2010-10-07 07:44:46

Last time I checked, python had no G flag, so you won't get the individual matches..

poke 2010-10-07 07:56:40

Wow, for real? How do you guys cope?

annakata 2010-10-07 07:58:22

There is a findall() function, though. And sub() is always considered global. Normally it doesn't cause any problems

AHM 2010-10-07 08:08:19

findall() is a good substitute for /g. You might also want to use the modifiers re.MULTILINE and/or re.DOTALL if you've got a blob of text that includes newlines.

synthesizerpatel 2010-10-07 11:22:02

Answer 4

A:

Use this pattern

(([A-Z][a-z]{0,2})(\d*))+

If it matches, great! If not, then handle it. I see no reason to raise an exception if it doesn't match. You'll have to provide more info.

JoshD 2010-10-07 07:45:22

@JoshD This will match the empty string, which is not accepted by the original implementation.

Darth Android 2010-10-07 07:48:02

@Darth Android: Thanks; changed.

JoshD 2010-10-07 07:49:46

If the application requires a well formatted string, and he splits that string into the needed chunks and notices that it is invalid, raising an exception is the most appropriate thing to do.

poke 2010-10-07 08:07:18

@poke: I agree, I would just like it if that was explicitly stated... I guess any error, even a custom one could be raised then.

JoshD 2010-10-07 08:11:22

Answer 5

+2 A:

Do you need to extract each individual part to process, or simply match for input validation? If you just need to match for validation, try ^([A-Z][a-z]{0,2}\d*)+$.

Darth Android 2010-10-07 07:47:05

I like your solution. +1

JoshD 2010-10-07 07:54:42

Answer 6

+2 A:

>>> import re
>>> reMatch = re.compile( r'([A-Z][a-z]{0,2})(\d*)' )
>>> def matchText ( text ):
        matches, i = [], 0
        for m in reMatch.finditer( text ):
            if m.start() > i:
                break
            matches.append( m )
            i = m.end()
        else:
            if i == len( text ):
                return matches
        raise ValueError( 'invalid text' )

>>> matchText( 'C6H5Fe2I' )
[<_sre.SRE_Match object at 0x021E2800>, <_sre.SRE_Match object at 0x021E28D8>, <_sre.SRE_Match object at 0x021E2920>, <_sre.SRE_Match object at 0x021E2968>]
>>> matchText( 'H2TeO4' )
[<_sre.SRE_Match object at 0x021E2890>, <_sre.SRE_Match object at 0x021E29F8>, <_sre.SRE_Match object at 0x021E2A40>]
>>> matchText( 'H3PoooO5' )
Traceback (most recent call last):
  File "<pyshell#6>", line 1, in <module>
    matchText( 'H3PoooO5' )
  File "<pyshell#3>", line 11, in matchText
    raise ValueError( 'invalid text' )
ValueError: invalid text
>>> matchText( 'C2tH6' )
Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
    matchText( 'C2tH6' )
  File "<pyshell#3>", line 11, in matchText
    raise ValueError( 'invalid text' )
ValueError: invalid text

To answer your second question a bit more clearly than with the code above: A ValueError is used in cases where a parameter was of the correct type but the value was not right. So for a function that uses a regex, it is obviously the best you can choose.

poke 2010-10-07 07:49:25

Answer 7

A:

For validation, if you are using the .NET Framework, try:

([A-Z][a-b]??[0-9]??)*

Otherwise:

([A-Z][a-b]?[a-b]?[0-9]?[0-9]?)*

Mario 2010-10-07 08:10:06

(1) Python supports reluctant quantifiers too, (2) those two regexes are not equivalent, and (3) both of them are wrong. `[a-z]??` matches at most one letter (the OP said there could be up to two); while `[0-9]?[0-9]?` matches a maximum of two digits (the OP didn't place any limit on those).

Alan Moore 2010-10-08 21:39:49

(1) I don't know anything about Python. (2) Those regexes are equivalent in the .NET Framework, because [a-z]?? is equivalent to [a-z]?[a-z]? there, so it takes between 0 and 2 characters between a and z. (3) As for the unlimited number of digits, replace [0-9]?? with [0-9]*.

Mario 2010-10-12 06:08:25

@Mario, "?? Matches the previous element zero or one time, but as few times as possible." So you are simply wrong, the extra "?" suffix switches matching to modest (non-greedy) mode.
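
The difference is easy to check. The snippet below is only an illustration using Python's `re` module (which, as noted above, supports reluctant quantifiers as well), not .NET:

```python
import re

# Greedy: take the optional letter if it is there.
greedy = re.match(r'[A-Z][a-z]?', 'Fe').group()
print(greedy)    # Fe

# Lazy: skip the optional letter unless the rest of the pattern needs it.
lazy = re.match(r'[A-Z][a-z]??', 'Fe').group()
print(lazy)      # F

# Either way the quantifier allows at most one letter, never two.
one_lazy = re.fullmatch(r'[A-Z][a-z]??', 'Poo')
print(one_lazy)  # None

two_optional = re.fullmatch(r'[A-Z][a-z]?[a-z]?', 'Poo').group()
print(two_optional)  # Poo
```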

macias 2010-10-14 09:10:26

Answer 8

A:

My go without regexp:

tests= (
'C6H5Fe2I',   # this string should be matched successfully. Result: C6 H5 Fe2 I
'H2TeO4',     # this string should be matched successfully Result: H2 Te O4
'H3PoooO5',   # exception should be raised
'C2tH6')      # exception should be raised

def splitter(case):
    case, original = list(case), case
    while case:
        if case[0].isupper():
            result = case.pop(0)
        else:
            raise ValueError('%r is not capital letter in %s position %i.' %
                             (case[0], original, len(original)-len(case)))
        for count in range(2):
            if case and case[0].islower():
                result += case.pop(0)
            else:
                break
        # the question allows any number of digits, not just two
        while case and case[0].isdigit():
            result += case.pop(0)
        yield result

for testcase in tests:
    try:
        print(tuple(splitter(testcase)))
    except ValueError as e:
        print(e)

Tony Veijalainen 2010-10-07 08:29:05

Answer 9

A:

You can do this in not much code with re.split -- yes, that's correct, re.split.

Here are the docs:

Invert your problem: split your input with a delimiter pattern that matches a valid atom+count. Have a capturing group so that the delimiter strings are kept. If the input string is valid, the non-delimiters in the result will all be empty strings.

>>> tests= (
... 'C6H5Fe2I',
... 'H2TeO4',
... 'H3PoooO5',
... 'C2tH6',
... 'Bad\n')
>>> import re
>>> pattern = r'([A-Z][a-z]{0,2}\d*)'
>>> for test in tests:
...     pieces = re.split(pattern, test)
...     print("\ntest=%r pieces=%r" % (test, pieces))
...     data = pieces[1::2]
...     rubbish = [p for p in pieces[0::2] if p]
...     print("rubbish=%r data=%r" % (rubbish, data))
...

test='C6H5Fe2I' pieces=['', 'C6', '', 'H5', '', 'Fe2', '', 'I', '']
rubbish=[] data=['C6', 'H5', 'Fe2', 'I']

test='H2TeO4' pieces=['', 'H2', '', 'Te', '', 'O4', '']
rubbish=[] data=['H2', 'Te', 'O4']

test='H3PoooO5' pieces=['', 'H3', '', 'Poo', 'o', 'O5', '']
rubbish=['o'] data=['H3', 'Poo', 'O5']

test='C2tH6' pieces=['', 'C2', 't', 'H6', '']
rubbish=['t'] data=['C2', 'H6']

test='Bad\n' pieces=['', 'Bad', '\n']
rubbish=['\n'] data=['Bad']
>>>

John Machin 2010-10-07 10:41:43
