Let's say I have some original text:

here is some text that has a substring that I'm interested in embedded in it.

I need to match a part of that text, say: "has a substring".

However, the original text and the matching string may have whitespace differences. For example the match text might be:

has a
substring

or

has  a substring

and/or the original text might be:

here is some
text that has
a substring that I'm interested in embedded in it.

What I need my program to output is:

here is some text that [match starts here]has a substring[match ends here] that I'm interested in embedded in it.

I also need to preserve the whitespace pattern in the original and just add the start and end markers to it.

Any ideas about a way of using Perl regexes to get this to happen? I tried, but ended up getting horribly confused.

+5  A: 

Been some time since I've used perl regular expressions, but what about:

$match =~ s/(has\s+a\s+substring)/[$1]/ig;

This matches one or more whitespace characters (newlines included) between the words. It will wrap the entire match with brackets while maintaining the original separation. It ain't automatic, but it does work.

You could play games with this, like taking the string "has a substring" and doing a transform on it to make it "has\s+a\s+substring" to make this a little less painful.

EDIT: Incorporated ysth's comment that the \s metacharacter matches newlines, and hobbs's correction to my \s usage.

David Andres
\s includes \r and \n, so just \s is the same as your [\s\r\n]
ysth
@ysth: you are correct, sir.
David Andres
I'd suggest `\s+` instead of `\s*` unless you want to match "hasasubstring", which I don't guess was one of the whitespace variations that the OP had in mind.
hobbs
hobbs is correct there, it's \s+. Seeing the answers, but not having tested them with my rather complex sets of data files, I think my problem was mainly not slicing the problem up into small enough slices.
singingfish
@hobbs: +1...thanks, updated the answer
David Andres
+3  A: 

This pattern will match the string that you're looking to find:

(has\s+a\s+substring)

So, when the user enters a search string, replace any whitespace in the search string with \s+ and you have your pattern. Then, just replace every match with [match starts here]$1[match ends here] where $1 is the matched text.
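
A minimal sketch of that recipe, assuming the search words contain no regex metacharacters (the sub name mark_match is made up for illustration and is not part of the answer above):

```perl
use strict;
use warnings;

# Hypothetical helper: wrap every whitespace-insensitive occurrence of
# $search in $text with start/end markers, leaving the whitespace of
# $text exactly as it was.
sub mark_match {
    my ($text, $search) = @_;

    # split ' ' ignores leading whitespace and splits on runs of whitespace.
    my @words = split ' ', $search;
    return $text unless @words;

    # Assumption: the words are plain text, so they are used unescaped.
    my $pattern = join '\s+', @words;

    # The "[" after $1 is escaped so Perl does not read it as an array subscript.
    $text =~ s/($pattern)/[match starts here]$1\[match ends here]/g;
    return $text;
}

my $text = "here is some\ntext that has\na substring that I'm interested in embedded in it.";
print mark_match($text, "has  a substring"), "\n";
```

The line breaks inside the original text survive because the markers are placed around $1, which holds the text exactly as it was matched.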

Doug Hays
+2  A: 

In regexes, you can use + to mean "one or more." So something like this

/has\s+a\s+substring/

matches "has", followed by one or more whitespace chars, followed by "a", followed by one or more whitespace chars, followed by "substring".

Putting it together with a substitution operator, you can say:

my $str = "here is some text that has     a  substring that I'm interested in embedded in it.";
$str =~ s/(has\s+a\s+substring)/\[match starts here]$1\[match ends here]/gs;

print $str;

And the output is:

here is some text that [match starts here]has     a  substring[match ends here] that I'm interested in embedded in it.
friedo
A: 

As many have suggested, use \s+ to match whitespace. Here is how you do it automatically:

my $original = "here is some text that has a substring that I'm interested in embedded in it.";
my $search = "has a\nsubstring";

my $re = $search;
$re =~ s/\s+/\\s+/g;

$original =~ s/\b$re\b/[match starts here]$&\[match ends here]/g;

print $original;

Output:

here is some text that [match starts here]has a substring[match ends here] that I'm interested in embedded in it.

You might want to escape any meta-characters in the string. If someone is interested, I could add it.
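
One way that escaping could look, as a sketch that assumes the search string is meant as literal text (build_pattern is an illustrative name, not something from the code above). Each word goes through Perl's built-in quotemeta before the words are joined with \s+. The \b anchors are left out here because they can fail when the search string starts or ends with a non-word character such as "(".

```perl
use strict;
use warnings;

# Illustrative helper: turn a literal search string into a regex that
# tolerates any run of whitespace between its words.
# Assumes $search contains at least one non-whitespace character.
sub build_pattern {
    my ($search) = @_;

    # quotemeta backslash-escapes every non-word character, so ".", "(",
    # "*" and friends match only themselves.
    my $pattern = join '\s+', map { quotemeta } split ' ', $search;
    return qr/$pattern/;
}

my $original = "the total cost\n(in USD)   is 3.50 per unit";
my $re = build_pattern("cost (in  USD) is 3.50");

$original =~ s/($re)/[match starts here]$1\[match ends here]/g;
print $original, "\n";
```

Without the escaping, the parentheses would be treated as a capture group and the dot in "3.50" would match any character.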

MizardX
A: 

This is an example of how you could do that.

#! /opt/perl/bin/perl
use strict;
use warnings;

my $submatch = "has a\nsubstring";

my $str = "
here is some
text that has
a substring that I'm interested in, embedded in it.
";

print substr_match($str, $submatch), "\n";

sub substr_match{
  my($string,$match) = @_;

  $match =~ s/\s+/\\s+/g;

  # This isn't safe the way it is now, you will need to sanitize $match
  $string =~ /\b$match\b/;
}

This currently doesn't do anything to check the $match variable for unsafe characters.
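
One possible way to close that gap, sketched under the assumption that $match is meant as literal text rather than as a regex: escape each word with quotemeta before joining on \s+, and return the text that actually matched (with the original's whitespace) instead of a plain true/false value. The name safe_substr_match is made up here.

```perl
use strict;
use warnings;

# Hypothetical variant of substr_match: returns the matched slice of
# $string (original whitespace intact), or undef when nothing matches.
sub safe_substr_match {
    my ($string, $match) = @_;

    my @words = split ' ', $match;
    return undef unless @words;

    # Every regex metacharacter in the input is escaped, so input such as
    # "(?{ ... })" or ".*" is matched literally and never interpreted.
    my $pattern = join '\s+', map { quotemeta } @words;

    return $string =~ /($pattern)/ ? $1 : undef;
}

my $str = "\nhere is some\ntext that has\na substring that I'm interested in, embedded in it.\n";
my $found = safe_substr_match($str, "has a\nsubstring");
print defined $found ? "matched: [$found]\n" : "no match\n";
```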

Brad Gilbert
what do you mean by sanitize?
singingfish
Someone could run code in the regex `(?{system('rm /')})`
Brad Gilbert