I need to extract the text (characters and numbers) from a multiline string. Nothing I have tried strips out the line feeds and carriage returns.

Here is the string in question:

"\r\n        50145395\r\n    "

In HEX it is: 0D 0A 20 20 20 20 20 20 20 20 35 30 31 34 35 33 39 35 0D 0A 20 20 20 20

I have tried the following:

$sitename =~ m/(\d+)/g;  
$sitename = $1;

and

$sitename =~ s/^\D+//g;  
$sitename =~ s/\D+$//g;

and

$sitename =~ s/^\s+//g;  
$sitename =~ s/\s+$//g;

In all cases I cannot get rid of any of the unwanted characters. I have run this in Cygwin Perl and Strawberry Perl.

Thanks.

+4  A: 

A capturing match in list context returns the captured strings:

#!/usr/bin/perl

use strict; use warnings;

my $s = join('', map chr(hex), qw(
    0D 0A 20 20 20 20 20 20 20 20 35 30 
    31 34 35 33 39 35 0D 0A 20 20 20 20
));

my ($x) = $s =~ /([A-Za-z0-9]+)/;

print "'$x'\n";

Output:

C:\Temp> uio
'50145395'
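
To make the role of context explicit, here is a minimal sketch that runs the same pattern in list and in scalar context. The variable name `$matched` is introduced only for this illustration.

```perl
use strict;
use warnings;

# Same bytes as the string in the question, written with escapes.
my $s = "\r\n        50145395\r\n    ";

# List context: the parentheses around $x make the match return its captures.
my ($x) = $s =~ /([A-Za-z0-9]+)/;

# Scalar context: the match returns a true/false value, not the captured text.
my $matched = $s =~ /([A-Za-z0-9]+)/;

print "list context:   '$x'\n";        # prints '50145395'
print "scalar context: '$matched'\n";  # prints '1'
```

Dropping the parentheses around `$x` is a common way to end up with `1` instead of the captured text.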
Sinan Ünür
I am getting the string from an XML document and I put up the hex representation to show the hex characters of this string.
Mel
@Mel: **So?** I used the hex representation of the string to test my code with the exact data you claimed to be using. Anyway, is this part of an attempt to use regular expressions to parse XML?
Sinan Ünür
+1 For the nice test case
Andomar
+2  A: 

I'm not sure what you need, but here is code that extracts all words from the string:

my @words = ( $sitename =~ m/(\w+)/g );

It can also be done with split, but then the pattern has to match the whitespace between the words:

my @words = split( m/\s+/, $sitename );
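
One caveat with the question's string, shown as a small sketch: because the string starts with whitespace, a pattern split yields an empty leading field, while the special `split ' '` form skips leading whitespace. The array names below are made up for this example.

```perl
use strict;
use warnings;

my $sitename = "\r\n        50145395\r\n    ";

# A pattern split keeps an empty leading field when the string
# starts with a separator; trailing empty fields are dropped.
my @by_pattern = split /\s+/, $sitename;

# The special single-space string form skips leading whitespace first.
my @by_space = split ' ', $sitename;

print scalar(@by_pattern), " fields from split /\\s+/\n";  # 2 fields: '' and '50145395'
print scalar(@by_space),   " field from split ' '\n";      # 1 field: '50145395'
```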
Ivan Nevostruev
+1 For noticing that he said *characters and numbers*.
Sinan Ünür
Just to explain (as far as I understand it): this matches `m` all continuous word parts `\w+` and stores them into an array. You can combine them into a single string with `join('',@words)`
Andomar
+1  A: 

The obvious one I didn't see in your post:

$sitename =~ s/\D//g;

This removes all non-digits. To remove anything but word characters, you could use:

$sitename =~ s/\W//g;

There's no need for ^ or $ if your intention is to remove every non-digit. Also, with the global g option the substitution is applied at every match, so you can remove one character at a time; there is no need to match a run of non-digits with \D+.
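
As a rough sketch of the difference between the two substitutions, the input below is a hypothetical variant of the question's string with a `site_` prefix added, so that `\D` and `\W` give different results.

```perl
use strict;
use warnings;

# Hypothetical input: the question's digits with a made-up "site_" prefix.
my $sitename = "\r\n        site_50145395\r\n    ";

# Copy, then delete every non-digit, one character per match.
(my $digits = $sitename) =~ s/\D//g;

# Copy, then delete every non-word character (letters, digits, and _ stay).
(my $word_chars = $sitename) =~ s/\W//g;

print "'$digits'\n";      # prints '50145395'
print "'$word_chars'\n";  # prints 'site_50145395'
```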

Andomar
A: 

Edit: My solution was incorrect; please instead pay attention to Sinan Ünür's solution.

Conrad Meyer
But the `s` has no effect if you're not using `.` ? hehe
Andomar
There is no **`.`** character in the pattern so this is completely and utterly irrelevant.
Sinan Ünür
The point is, the expression is applied to the entire string, instead of a line at a time.
Conrad Meyer
@Conrad Meyer: `m//` and `s///` are always applied to the entire string. The `s` modifier changes how the **pattern** is interpreted.
Sinan Ünür
Aha! Please forgive my lack of Perl knowledge. Thanks for the clarification!
Conrad Meyer
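
To illustrate the point made in the comments above, here is a minimal sketch of what the `s` modifier does and does not change. The test string and variable names are invented for the example.

```perl
use strict;
use warnings;

my $text = "first\nsecond";

# Without /s, "." does not match a newline, so this cannot span both lines.
my $plain = $text =~ /first.second/ ? 1 : 0;

# With /s, "." also matches the newline.
my $dotall = $text =~ /first.second/s ? 1 : 0;

# No modifier is needed to find text after the first line:
# the whole string is always searched.
my $later = $text =~ /second/ ? 1 : 0;

print "without /s: $plain\n";   # prints 0
print "with /s:    $dotall\n";  # prints 1
print "later line: $later\n";   # prints 1
```

Since none of the patterns in the question contain `.`, adding `s` to them changes nothing.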
A: 

In the past I have done something like:

my $newline = chr(13) . chr(10);

$data =~ s/$newline/ /g;

You can look up other ASCII character codes at: http://www.asciitable.com/

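use strict;
use warnings;

my $newline  = chr(13);   # carriage return (CR)
my $newline2 = chr(10);   # line feed (LF)

my $words = "\r\n        50145395\r\n    ";

# Dump each character with its code to show what the string contains.
foreach my $char (split //, $words) {
    my $val = ord($char);
    print "->$char<- ($val)\n";
}

print "$words\n";

# Remove carriage returns, line feeds, and runs of spaces.
$words =~ s/$newline//g;
$words =~ s/$newline2//g;
$words =~ s/[ ]+//g;

# Dump again to confirm that only the digits remain.
foreach my $char (split //, $words) {
    my $val = ord($char);
    print "->$char<- ($val)\n";
}

print "$words\n";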
Courtland
A: 

Do you want to remove only newlines and carriage returns? If so, this is what you want:

$sitename =~ s/[\r\n]//g;

If you want to remove all whitespace, not just carriage returns and line feeds, use this instead:

$sitename =~ s/\s//g;
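
A small sketch of how the two substitutions differ on the question's string. The copies `$no_newlines` and `$no_space` are named only for this comparison.

```perl
use strict;
use warnings;

my $sitename = "\r\n        50145395\r\n    ";

# Remove only carriage returns and line feeds; the spaces stay.
(my $no_newlines = $sitename) =~ s/[\r\n]//g;

# Remove every whitespace character, spaces included.
(my $no_space = $sitename) =~ s/\s//g;

print "[$no_newlines]\n";  # prints [        50145395    ]
print "[$no_space]\n";     # prints [50145395]
```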
markusk
A: 
$x = <<END;
this is a multiline 
string. this is a multiline
string.
END

$x =~ s/\r?\n?//g;
print $x;
prime_number