I'm awful at regexes, but would love some help defining a rule that would take this text:

  1. Il Cuccio, via Ronchi 43/b, 14047 Mombercelli, Asti.
    Tel: 380 7277050 Fax: 0141 959282 E-mail: [email protected] www.ilcuccio.it
    Accommodation in communal room or tent. French and English spoken. Contact: Cristina Belotti.

  2. Apicoltura Leida Barbara, Strada Crevenzolo 21, Viguzzolo, 15058 Alessandria.
    Tel: 0131 899166 & 392 9078020 E-mail: [email protected] The farm, situated in the plains, is certified organic (CCPB).

and return the addresses, that is, the rest of the line past [1-9].

Extra points for a coherent explanation that would actually help me learn a tad.

EDIT: I'll show my work as I go, until someone else steps in. Right now I have ^\d+\. which reads as: start of line, one or more digits, a period.

+1  A: 
#!/usr/bin/perl
use strict; use warnings;

my $str = <<'EO_STR';
2. Il Cuccio, via Ronchi 43/b, 14047 Mombercelli, Asti.
Tel: 380 7277050  Fax: 0141 959282  E-mail: [email protected]  www.ilcuccio.it
Accommodation in communal room or tent. French and English
spoken. Contact: Cristina Belotti.

3. Apicoltura Leida Barbara, Strada Crevenzolo 21, Viguzzolo, 15058 Alessandria.
Tel: 0131 899166 & 392 9078020  E-mail: [email protected]
The farm, situated in the plains, is certified organic (CCPB).
EO_STR

while ( $str =~ /^[0-9]+\. ([^.]+)\./mg ) {
    print "$1\n";
}

As I understand it, no . appears in the address part, so the address is everything between the leading [0-9]+\. and the next period. The expression above therefore captures all non-. characters between the two. It uses the m modifier so that ^ matches at the beginning of each line rather than only at the beginning of the string, and the g modifier to step through each match in turn.

If you just want to grab all the captures at once:

my @addresses = $str =~ /^[0-9]+\. ([^.]+)\./mg;

print $_, "\n" for @addresses;
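
For comparison, here is a rough Python sketch of the same idea. It is a translation under assumptions, not part of the Perl above: re.MULTILINE stands in for the m modifier, re.findall stands in for the g modifier in list context, and the names text and addresses are made up for the illustration. It uses [0-9]+ so that multi-digit entry numbers are accepted too.

```python
import re

# Sample input shortened from the question (e-mail addresses omitted).
text = """1. Il Cuccio, via Ronchi 43/b, 14047 Mombercelli, Asti.
Tel: 380 7277050  Fax: 0141 959282
Accommodation in communal room or tent.

2. Apicoltura Leida Barbara, Strada Crevenzolo 21, Viguzzolo, 15058 Alessandria.
Tel: 0131 899166 & 392 9078020
"""

# re.MULTILINE makes ^ match at the start of every line.
# [^.]+ captures everything up to (but not including) the next period.
addresses = re.findall(r"^[0-9]+\. ([^.]+)\.", text, flags=re.MULTILINE)

for address in addresses:
    print(address)
```

Because the capture stops before the period, the trailing . of each address is not part of the result.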
Sinan Ünür
assuming no '.' in addresses seems like a very poor assumption.
Carl Coryell-Martin
@Carl Coryell-Martin: There is no spec. My assumption is based on the input shown in the post.
Sinan Ünür
+1  A: 

You want something like:

/^[1-9]+\. (.*)$/

^ anchors the match at the beginning of the line.

[1-9] matches any single digit from 1 to 9, but I think you knew that one.

+ means one or more of the previous item, i.e. one or more of the digits 1-9.

\. matches a literal . (an unescaped . would match any character).

(.*) grabs whatever is left on the line and stores it in a capture variable for you to use.

$ anchors the match at the end of the line.

In Perl you can then pull the address out of $1. If the whole text is held in one string, add the m modifier so that ^ and $ apply to every line.
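
A small Python sketch of how the character class affects which entry numbers match. The two sample lines are hypothetical, invented only to show a number containing a 0.

```python
import re

# Hypothetical entries; the second number contains a 0.
lines = ["9. Ninth farm, Asti.", "10. Tenth farm, Alessandria."]

nonzero_only = re.compile(r"^[1-9]+\. (.*)$")  # digits 1-9 only
any_digit = re.compile(r"^[0-9]+\. (.*)$")     # digits 0-9

for line in lines:
    print(line, bool(nonzero_only.match(line)), bool(any_digit.match(line)))

matched_by_nonzero = [bool(nonzero_only.match(line)) for line in lines]
matched_by_any = [bool(any_digit.match(line)) for line in lines]
```

With [1-9] the line starting with 10. is rejected, because 0 is not in the class; with [0-9] both lines match.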

devNoise
I think this is buggy; it won't match `10.`
Jason Orendorff
It should be /^[0-9]+\. (.*)$/
devNoise
A: 
^\d+\. (.*?)$

Meaning:

^       At line start
\d+     take one or more digits
\.      followed by a period character (and then a literal space)
(.*?)   match (and remember) as few characters as possible
$       up to the line end
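
A lazy group such as (.*?) takes as few characters as it can, so it only reaches the end of the line when something after it, such as $, forces it to. A minimal Python sketch of the difference, using one line from the question:

```python
import re

line = "1. Il Cuccio, via Ronchi 43/b, 14047 Mombercelli, Asti."

# Lazy group with nothing after it: it matches as little as possible, i.e. nothing.
unanchored = re.match(r"^\d+\. (.*?)", line).group(1)

# Lazy group followed by $: it is forced to extend to the end of the line.
anchored = re.match(r"^\d+\. (.*?)$", line).group(1)

print(repr(unanchored))
print(repr(anchored))
```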

You can test your regular expressions online at RegExr, a free online regex testing tool.

Rubens Farias
A: 

/^\d+\.\s+(.+)$/

  • Assert position at the start of the string «^»
  • Match a single digit 0..9 «\d+»
    • Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
  • Match the character "." literally «\.»
  • Match a single character that is a "whitespace character" (spaces, tabs, line breaks, etc.) «\s+»
    • Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
  • Match the regular expression below and capture its match into backreference number 1 «(.+)»
    • Match any single character that is not a line break character «.+»
      • Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
  • Assert position at the end of the string (or before the line break at the end of the string, if any) «$»
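
For the period to be matched literally it has to be escaped; an unescaped . matches any character. A short Python sketch of the difference, using a hypothetical input line that is not a numbered entry:

```python
import re

# Hypothetical line that is not a numbered entry: no period after the number.
not_an_entry = "12 rooms available"

unescaped = re.compile(r"^\d+.\s+(.+)$")   # . matches any character
escaped = re.compile(r"^\d+\.\s+(.+)$")    # \. matches only a period

unescaped_hit = unescaped.match(not_an_entry)
escaped_hit = escaped.match(not_an_entry)

print(unescaped_hit.group(1) if unescaped_hit else None)
print(escaped_hit)
```

The unescaped version accepts the line by letting \d+ take the 1 and . take the 2, while the escaped version rejects it.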

I use RegexBuddy for all my regexing. It has excellent help and an easy testing interface to check how your regex will work on some sample text.

PHP-Steven
A: 

What language are you using? There is no need for a regex. Here's an example in Python:

myaddr="""2. Il Cuccio, via Ronchi 43/b, 14047 Mombercelli, Asti.
Tel: 380 7277050  Fax: 0141 959282  E-mail: [email protected]  www.ilcuccio.it
Accommodation in communal room or tent. French and English
spoken. Contact: Cristina Belotti.
"""

print(myaddr.split("\n", 1)[0].split(" ", 1)[-1])

It says: split the string on newlines (your sample strings do have newlines, right?), then take the first element of the result. That is your address line. Split it again on the first space and drop the first element, which is the number. The rest is your address. No regex needed; it is a simple algorithm you can implement in your favourite language.
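
The same no-regex idea can be stretched to text that holds several entries. This is only a sketch: the function name extract_addresses and the shortened sample are made up, and it assumes every entry line looks like a number, a period, a space, then the address.

```python
def extract_addresses(text):
    """Return the text after 'N. ' for every line that starts with a number and a period."""
    addresses = []
    for line in text.splitlines():
        number, sep, rest = line.strip().partition(". ")
        # Keep the line only if everything before the first ". " is digits.
        if sep and number.isdigit():
            addresses.append(rest)
    return addresses

sample = """1. Il Cuccio, via Ronchi 43/b, 14047 Mombercelli, Asti.
Tel: 380 7277050  Fax: 0141 959282
Accommodation in communal room or tent. French and English spoken.

2. Apicoltura Leida Barbara, Strada Crevenzolo 21, Viguzzolo, 15058 Alessandria.
Tel: 0131 899166 & 392 9078020
"""

for address in extract_addresses(sample):
    print(address)
```

The isdigit check is what keeps lines such as the "Accommodation ..." one out, even though they also contain a period followed by a space.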

PHP version:

$str = <<<EOF
2. Il Cuccio, via Ronchi 43/b, 14047 Mombercelli, Asti.
    Tel: 380 7277050  Fax: 0141 959282  E-mail: [email protected]  www.ilcuccio.it
    Accommodation in communal room or tent. French and English
    spoken. Contact: Cristina Belotti.
EOF;

$s = explode("\n",$str,2);
$addr = explode(" ",$s[0]);
array_shift($addr);
print "Address is: " . implode(" ", $addr);
A: 

You really have two problems: finding the lines that start with numbers, and extracting the address portion. This little expression should find the lines:

^[[:space:]]*[[:digit:]]+\.[[:space:]]

The hat ("^") character matches the beginning of the line. This expression finds lines beginning with numbers and a period. It ignores any white space at the beginning.

The second problem - extracting the address - depends on the tool. For example, this Perl script prints only the address lines:

# perl -ne 'if (m/^\s*\d+\.\s*/) { s/^\s*\d+\.\s*//; print}' test.txt 

Il Cuccio, via Ronchi 43/b, 14047 Mombercelli, Asti.
Apicoltura Leida Barbara, Strada Crevenzolo 21, Viguzzolo, 15058 Alessandria.

The "\s" and "\d" are Perl shorthand for whitespace (\s) and digits (\d). It is essentially the same regular expression; it just fits neatly on one line.

I used the expression twice. The first use finds the lines to print. The second is a "substitute" command: it replaces whatever the expression matches with the replacement text. In this case the replacement is empty, which effectively erases the number prefix.
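
A rough Python counterpart of the find-then-substitute step, as a sketch only. The names PREFIX and strip_numbered_lines are invented for the example. Here the substitution itself reports whether it matched, so one pass both filters and strips.

```python
import re

PREFIX = re.compile(r"^\s*\d+\.\s*")

def strip_numbered_lines(lines):
    """Yield numbered lines with the 'N. ' prefix removed; skip all other lines."""
    for line in lines:
        stripped, count = PREFIX.subn("", line, count=1)
        # count is 1 only when the prefix was found, so no separate match test is needed.
        if count:
            yield stripped

sample = [
    "  1. Il Cuccio, via Ronchi 43/b, 14047 Mombercelli, Asti.",
    "    Tel: 380 7277050 Fax: 0141 959282",
    "  2. Apicoltura Leida Barbara, Strada Crevenzolo 21, Viguzzolo, 15058 Alessandria.",
]

result = list(strip_numbered_lines(sample))
print("\n".join(result))
```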

Robert Wohlfarth
@Robert: Please use SO's built-in formatting instead of hand-written HTML.
Alan Moore
That `if (m//)` clause is redundant; the `s///` operation does that for itself.
Alan Moore
+1  A: 

In Ruby:

mystring="1. Il Cuccio, via Ronchi 43/b, 14047 Mombercelli, Asti.  \nTel: 380 7277050  Fax: 0141 959282  E-mail: [email protected]  www.ilcuccio.it  \nAccommodation in communal room or tent. French and English \nspoken. Contact: Cristina Belotti. \n\n2. Apicoltura Leida Barbara, Strada Crevenzolo 21, Viguzzolo, 15058 Alessandria.  \nTel: 0131 899166 & 392 9078020  E-mail: [email protected] \nThe farm, situated in the plains, is certified organic (CCPB).\n\n"

# scan returns a list like [['addr1'], ['addr2'], ['addr3'], ...]
puts mystring.scan(/^\d+\. (.+)$/)

output:

Il Cuccio, via Ronchi 43/b, 14047 Mombercelli, Asti.  
Apicoltura Leida Barbara, Strada Crevenzolo 21, Viguzzolo, 15058 Alessandria.
gnibbler