Say I have a text file to parse, which contains some fixed length content:

123jackysee        45678887
456charliewong     32145644
<3><------16------><--8---> # Not part of the data.

The first three characters are the ID, followed by a 16-character user name and an 8-digit phone number.

I would like to write a regular expression to match and validate each line. The one I came up with is:

(\d{3})([A-Za-z ]{16})(\d{8})

The user name should contain 8-16 characters, but ([A-Za-z ]{16}) would also match an empty value or all spaces. I thought of ([A-Za-z]{8,16} {0,8}), but it would accept more than 16 characters. Any suggestions?

A: 
William Pursell
+6  A: 

No, no, no, no! :-)

Why do people insist on trying to pack so much functionality into a single RE or SQL statement?

My suggestion, do something like:

  • Ensure the length is 27.
  • Extract the three components into separate strings (0-2, 3-18, 19-26).
  • Check that the first matches "\d{3}".
  • Check that the second matches "[A-Za-z]{8,} *".
  • Check that the third matches "\d{8}".

If you want the entire check to fit on one line of source code, put it in a function, isValidLine(), and call it.

Even something like this would do the trick:

def isValidLine(s):
    if s.len() != 27 return false
    return s.match("^\d{3}[A-Za-z]{8,} *\d{8}$")

Don't be fooled into thinking that's clean Python code; it's actually PaxLang, my own proprietary pseudo-code. Hopefully it's clear enough: the first line checks that the length is 27, the second that the line matches the given RE.

The middle field is automatically 16 characters total due to the first line and the fact that the other two fields are fixed-length in the RE. The RE also ensures that it's eight or more alphas followed by the right number of spaces.
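
For readers who want something runnable, here is a minimal Python sketch of the same two checks. It is one possible translation of the pseudo-code, not the author's own code; the name `is_valid_line`, the constant `LINE_RE`, and the choice of Python's `re` module are assumptions made for illustration.

```python
import re

# Assumed pattern: 3 digits, 8 or more ASCII letters, optional space padding, 8 digits.
# re.ASCII restricts \d to the characters 0-9.
LINE_RE = re.compile(r"\d{3}[A-Za-z]{8,} *\d{8}", re.ASCII)

def is_valid_line(s: str) -> bool:
    """Return True if s is a 27-character record in the layout described above."""
    # The length check pins the middle field to exactly 16 characters,
    # because the other two fields are fixed-length in the pattern.
    if len(s) != 27:
        return False
    # fullmatch() anchors the pattern at both ends of the string.
    return LINE_RE.fullmatch(s) is not None
```

A caller would strip the line terminator first, for example `is_valid_line(line.rstrip("\n"))`.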

To do this sort of thing with a single RE would be some monstrosity like:

^\d{3}(([A-Za-z]{8} {8})
      |([A-Za-z]{9} {7})
      |([A-Za-z]{10} {6})
      |([A-Za-z]{11} {5})
      |([A-Za-z]{12}    )
      |([A-Za-z]{13}   )
      |([A-Za-z]{14}  )
      |([A-Za-z]{15} )
      |([A-Za-z]{16}))
      \d{8}$
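
Nobody would want to maintain that pattern by hand, but it can be generated. The following Python sketch builds a nine-way alternation of the same shape programmatically, one branch per allowed name length. The names `build_padded_name_pattern` and `SINGLE_RE` are hypothetical and exist only for this illustration.

```python
import re

def build_padded_name_pattern(min_len: int = 8, width: int = 16) -> str:
    """Build an alternation with one branch per allowed name length."""
    branches = [
        # n letters followed by exactly (width - n) spaces of padding
        f"[A-Za-z]{{{n}}} {{{width - n}}}"
        for n in range(min_len, width + 1)
    ]
    return "(?:" + "|".join(branches) + ")"

# Three digits, the generated name column, eight digits.
# Use SINGLE_RE.fullmatch(line) so the whole line must match.
SINGLE_RE = re.compile(r"\d{3}" + build_padded_name_pattern() + r"\d{8}")
```

Generating the pattern keeps the field widths in one place, but the length check plus a short RE is still easier to read.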

You could do it by ensuring it passes two separate REs:

^\d{3}[A-Za-z]{8,} *\d{8}$
^.{27}$

but, since that last one is simply a length check, it's no different to the isValidLine() above.

paxdiablo
Indeed. The first thing you think of when faced with a problem is "I know, I'll use regular expressions." Now you have two problems.
Kevin Peterson
Alan Moore
The reason I asked is that it 'seems' to be solvable by a simple regex. Turns out that it's not that simple. I agree with taking a simpler approach instead of a long regex.
jackysee
A: 

Hmm... Depending on the exact version of Regex you're running, consider:

(?P<id>\d{3})(?=[A-Za-z\s]{16}\d)(?P<username>[A-Za-z]{8,16})\s*(?P<phone>\d{8})

Not 100% sure this will work, and I've used the whitespace escape (\s) instead of a literal space. I get nervous with just the space character myself, but you may want to be more restrictive.

See if it works. I'm only intermediate with RegEx myself, so I might be in error.

Check that the named-group syntax a) exists in your version of RegEx and b) matches the one I've used above.

EDIT:

Just to expand what I'm trying to do (sorry to make your eyes bleed, Pax!) for those without a lot of RegEx experience:

(?P<id>\d{3})

This will try to match a named capture group - 'id' - that is three digits in length. Most versions of RegEx let you use named capture groups to extract the values you matched against. This lets you do validation and data capture at the same time. Different versions of RegEx have slightly different syntaxes for this - check out http://www.regular-expressions.info/named.html for more detail regarding your particular implementation.

(?=[A-Za-z\s]{16}\d)

The ?= is a lookahead operator. It looks ahead at the next sixteen characters and succeeds if they are all letters or whitespace characters AND are followed by a digit. The lookahead is zero-width, so it doesn't consume anything: matching continues from the point where the lookahead started. Check out http://www.regular-expressions.info/lookaround.html for more detail on lookahead.

(?P<username>[A-Za-z]{8,16})\s*

If the lookahead passes, then we keep counting from the fourth character in. We want to find eight-to-sixteen characters, followed by zero or more whitespaces. The 'or more' is actually safe, as we've already made sure in the lookahead that there can't be more than sixteen characters in total before the next digit.

Finally,

(?P<phone>\d{8})

This should check the eight-digit phone number.

I'm a bit nervous that this won't exactly work - your version of RegEx may not support the named group syntax or the lookahead syntax that I'm used to.

I'm also a bit nervous that this Regex will successfully match an empty string. Different versions of Regex handle empty strings differently.

You may also want to consider anchoring this Regex between a ^ and $ to ensure you're matching against the whole line, and not just part of a bigger line.
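
As a rough check, here is how the pattern could be tried in Python, whose `re` module happens to support both the `(?P<name>...)` group syntax and `(?=...)` lookahead. The helper name `parse_record` and the constant `RECORD_RE` are made up for this sketch, and `fullmatch()` stands in for the `^` and `$` anchors.

```python
import re

# The pattern from this answer, unchanged; fullmatch() supplies the anchoring.
RECORD_RE = re.compile(
    r"(?P<id>\d{3})"
    r"(?=[A-Za-z\s]{16}\d)"            # zero-width check of the 16-character name column
    r"(?P<username>[A-Za-z]{8,16})\s*"
    r"(?P<phone>\d{8})"
)

def parse_record(line: str):
    """Return a dict of the three named fields, or None if the line is invalid."""
    m = RECORD_RE.fullmatch(line)
    return m.groupdict() if m else None
```

In Python at least, an empty string does not match, because `\d{3}` needs three digits before anything else is tried; other engines may need their own check.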

Ubiquitous Che
And now, thanks to you, my eyes are bleeding :-)
paxdiablo
Taking providermr's answer below, you could also try (?=[A-Za-z\s]{17})(\d{3})([A-Za-z]{3,16} {0,13})(\d{8}) - look up the ?= (lookahead) RegEx operator at http://www.regular-expressions.info/lookaround.html
Ubiquitous Che
Heh. Ninja's me as I was commenting. Yeah, RegEx is painful, but you can do some cool things with it if you persevere.
Ubiquitous Che
Oops... That should be 27, not 17. Damn, I'm bad at typing. ^_^
Ubiquitous Che
@Che (since I can't be bothered to write Ubiquitous), you can also write language parsers in COBOL or GUIs in native Xlib, but that doesn't mean it's a good idea.
paxdiablo
Fair call - but although it may hurt the eyeballs to *learn* how to do it in one line of RegEx, I don't think it hurts to have learned how to do it. :D Also, if you're looking for economy of scale, most RegEx libraries will easily outperform modern programming languages in terms of speed, especially on long strings.
Ubiquitous Che
I'll agree with you if you're talking about interpreted-type languages, @Che, since the RE engine is generally compiled - it will outperform complex string handling. But in a compiled language, the RE is at a disadvantage since it has to handle all generalities - a crafted parser will outdo it since it can be optimized for specific cases. In this case, since you'll probably compile the RE once and use it *many* times, it will be not so important - it's the continual compile-and-run-RE-once which slows down most code.
paxdiablo
I was thinking C# - I've tried to do some complex string processing in the past in C# that took *ages*, and was stupidly complex. It took me a while to learn how to do the same thing in Regex, but I was very, very glad I did. Much faster, and although the Regex was horrible, it was actually *less* complicated than my C# code - appropriate use of the named capture groups and lookaround assertions made it a breeze. Of course, the key word here is 'appropriate', and you've got me bang to rights that when not appropriate, it's better not to bother.
Ubiquitous Che
... I just thought it would be useful to show how one *could* write it in one Regex line. Whether or not jackysee *should* do this is an open question - and I grant that you're probably right in this case.
Ubiquitous Che
A: 

I would use the regex you suggested with a small addition:

(\d{3})([A-Za-z]{3,16} {0,13})(\d{8})

which will match things that have a non-whitespace username but still allow space padding. The only addition is that you would then have to check the length of each input to verify the correct number of characters.

Mitch
A: 

@OP, not every problem needs a regex. Your problem is pretty simple to check. Depending on what language you are using, there will be some sort of built-in string functions; use them. The following minimal example is done in Python.

import sys
for line in open("file"):
    line = line.rstrip("\n")
    # check overall length: 3 + 16 + 8 characters
    if len(line) != 27: sys.exit()
    # check first 3 chars are digits
    if not line[0:3].isdigit(): sys.exit()
    # check username: 8-16 letters once the space padding is removed
    name = line[3:19].rstrip(" ")
    if not (name.isalpha() and 8 <= len(name) <= 16): sys.exit()
    # check phone number: the last 8 chars must be digits
    if not line[19:27].isdigit(): sys.exit()
    print(line)
ghostdog74
A: 

I also don't think you should try to pack all the functionality into a single regex. Here is one way to do it:

#!/usr/bin/perl

use strict;
use warnings;

while ( <DATA> ) {
    chomp;
    last unless /\S/;
    my @fields = split;
    if (
        ( my ($id, $name) = $fields[0] =~ /^([0-9]{3})([A-Za-z]{8,16})$/ )
            and ( my ($phone) = $fields[1] =~ /^([0-9]{8})$/ )
    ) {
        print "ID=$id\nNAME=$name\nPHONE=$phone\n";
    }
    else {
        warn "Invalid line: $_\n";
    }
}

__DATA__
123jackysee       45678887
456charliewong    32145644
678sdjkfhsdjhksadkjfhsdjjh 12345678

And here is another way:

#!/usr/bin/perl

use strict;
use warnings;

while ( <DATA> ) {
    chomp;
    last unless /\S/;
    my ($id, $name, $phone) = unpack 'A3A16A8';
    if ( is_valid_id($id)
            and is_valid_name($name)
            and is_valid_phone($phone)
    ) {
        print "ID=$id\nNAME=$name\nPHONE=$phone\n";
    }
    else {
        warn "Invalid line: $_\n";
    }
}

sub is_valid_id    { ($_[0]) = ($_[0] =~ /^([0-9]{3})$/) }

sub is_valid_name  { ($_[0]) = ($_[0] =~ /^([A-Za-z]{8,16})\s*$/) }

sub is_valid_phone { ($_[0]) = ($_[0] =~ /^([0-9]{8})$/) }

__DATA__
123jackysee        45678887
456charliewong     32145644
678sdjkfhsdjhksadkjfhsdjjh 12345678

Generalizing:

#!/usr/bin/perl

use strict;
use warnings;

my %validators = (
    id    => make_validator( qr/^([0-9]{3})$/ ),
    name  => make_validator( qr/^([A-Za-z]{8,16})\s*$/ ),
    phone => make_validator( qr/^([0-9]{8})$/ ),
);

INPUT:
while ( <DATA> ) {
    chomp;
    last unless /\S/;
    my %fields;
    @fields{qw(id name phone)} = unpack 'A3A16A8';

    for my $field ( keys %fields ) {
        unless ( $validators{$field}->($fields{$field}) ) {
            warn "Invalid line: $_\n";
            next INPUT;
        }
    }

    print "$_ : $fields{$_}\n" for qw(id name phone);
}

sub make_validator {
    my ($re) = @_;
    return sub { ($_[0]) = ($_[0] =~ $re) };
}

__DATA__
123jackysee        45678887
456charliewong     32145644
678sdjkfhsdjhksadkjfhsdjjh 12345678
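
For comparison, the same table-of-validators idea can be sketched in Python. This is a loose port rather than a line-for-line translation: the names `FIELDS`, `VALIDATORS` and `parse_line` are hypothetical, and plain string slicing stands in for Perl's unpack.

```python
import re

# Hypothetical field layout: (name, start, end) slices of a fixed-width line.
FIELDS = [("id", 0, 3), ("name", 3, 19), ("phone", 19, 27)]

# One compiled pattern per field, applied with fullmatch().
VALIDATORS = {
    "id": re.compile(r"[0-9]{3}"),
    "name": re.compile(r"[A-Za-z]{8,16}"),
    "phone": re.compile(r"[0-9]{8}"),
}

def parse_line(line: str):
    """Slice the line into fields and validate each; return a dict or None."""
    record = {}
    for name, start, end in FIELDS:
        # Drop trailing space padding before validating the field.
        value = line[start:end].rstrip(" ")
        if not VALIDATORS[name].fullmatch(value):
            return None
        record[name] = value
    return record
```

Like the unpack version, this sketch ignores anything past column 27, so a separate length check is still needed if over-long lines must be rejected.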
Sinan Ünür
A: 

You can use lookahead: ^(\d{3})((?=[a-zA-Z]{8,})([a-zA-Z ]{16}))(\d{8})$

Testing:

    123jackysee        45678887      Match
    456charliewong     32145644      Match
    789jop             12345678      No Match - username too short
    999abcdefghijabcde12345678       No Match - username 'column' is less than 16 characters
    999abcdefghijabcdef12345678      Match
    999abcdefghijabcdefg12345678     No Match - username column more than 16 characters
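
The test table can be reproduced with a few lines of Python. This is only a sketch: the names `LOOKAHEAD_RE`, `pad`, `cases` and `results` are invented for it, and the padded inputs are built with `ljust` so the column widths are explicit.

```python
import re

# The lookahead pattern from this answer, unchanged.
LOOKAHEAD_RE = re.compile(r"^(\d{3})((?=[a-zA-Z]{8,})([a-zA-Z ]{16}))(\d{8})$")

def pad(name: str) -> str:
    """Right-pad a name with spaces to the 16-character column width."""
    return name.ljust(16)

# Input line -> whether the table above says it should match.
cases = {
    "123" + pad("jackysee") + "45678887": True,
    "456" + pad("charliewong") + "32145644": True,
    "789" + pad("jop") + "12345678": False,        # username too short
    "999abcdefghijabcde12345678": False,           # name column only 15 wide
    "999abcdefghijabcdef12345678": True,
    "999abcdefghijabcdefg12345678": False,         # name column 17 wide
}

# Actual outcome for each line, for comparison with the expected values.
results = {line: bool(LOOKAHEAD_RE.match(line)) for line in cases}
```

Comparing `results` with `cases` confirms whether the engine in use agrees with the table.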
jop