I have no experience using regular expressions in PHP, so I usually write some convoluted function using a series of str_replace(), substr(), strpos(), strstr(), etc. (you get the idea).

This time I want to do this correctly. I know I need to use a regex for this, but I am confused about which functions to use (ereg or preg) and what exactly the syntax should be.

NOTE: I am NOT parsing HTML or XML, and sometimes I will be using delimiters other than angle brackets (for example, | or ~ or [tag] or ::). I am looking for a generic way to do a wildcard replace between two known delimiters using regex; I am not building an HTML or XML parser.

What I need is a regex that replaces this:

<sometag>everything in here</sometag>

with this:

<sometag>new contents</sometag>

I have read the documentation online for a bit, but I am confused, and am hoping one of you regex experts can pop in a simple solution. I suspect I will pass the values to a function, something like this:

$new_text = swapText ( "<sometag>", $the_new_text_to_go_into_the_tag );

function swapText ( $in_tag_with_brackets_to_update, $in_new_text ) {
 // define tags
 $starting_tag  = $in_tag_with_brackets_to_update;
 $ending_tag    = str_replace( "<", "</", $in_tag_with_brackets_to_update );

 // not sure if this is the proper regex match string or not
 // and/or if any escaping needs to be done on the tags
 $find_string         = "{$starting_tag}.*{$ending_tag}";
 $replace_with_string = "{$starting_tag}{$in_new_text}{$ending_tag}";

 // after some regex, this function should return new version of <tag>data</tag>
}

Thanks.

+1  A: 

First, if it is HTML you are replacing, use something like Simple HTML DOM. If the format is exactly what you say (as in, <sometag> can't be <sometag >), then a regex may be OK to use.

Don't use the ereg-based functions, as they are deprecated; use the preg functions instead.

preg_replace('%(<sometag>)[^<]*(</sometag>)%i', '$1something else$2', $str);

EDIT
A slightly better version of the above, which now supports having a < in the text:

preg_replace('%(<sometag>).*?(</sometag>)%i', '$1something else$2', $str);

The $1 and $2 refer to the text matched by the first and second pair of parentheses. As these are constant, they could be replaced with the literal tags:

preg_replace('%<sometag>.*?</sometag>%i', '<sometag>something else</sometag>', $str);
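To make the difference between the first two patterns concrete, here is a minimal sketch. The sample string and the variable names are made up for illustration; only the patterns come from the answer.

```php
<?php
// Sample input where the content between the tags contains a literal '<'.
$str = '<sometag>1 < 2</sometag>';

// [^<]* stops at the first '<', which is not the start of '</sometag>',
// so the pattern does not match and the input comes back unchanged.
$negated = preg_replace('%(<sometag>)[^<]*(</sometag>)%i', '$1something else$2', $str);

// .*? keeps extending until '</sometag>' follows, so a '<' in the content is fine.
$lazy = preg_replace('%(<sometag>).*?(</sometag>)%i', '$1something else$2', $str);
// $lazy is '<sometag>something else</sometag>'
```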
Yacoby
Use closing tags much?
Ewan Todd
The ending piece should be </sometag>, not <sometag>. Do I need to escape a / with a \ (e.g. <\/sometag>)? Also, what does [^<] do? Is it looking for text that starts with a <? If so, that is not what I need. Thanks -
OneNerd
Fixed the end tag. [^<] matches all characters that are not '<'. Both examples fit your test data. If it's not what you want, you need to explain more clearly what you do want.
Yacoby
This does not work: the slash in `</sometag>` will be seen as the pattern delimiter, resulting in a parse error. Either escape it or (better) use different pattern delimiters.
kemp
Please clarify the reason for the -1
Yacoby
The -1 was not from me, but your solution will fail if there are line breaks between the opening and closing tag.
Bart Kiers
+6  A: 

You say that you are not going to parse XML and then go on to show an XML example. That's a bit confusing.

Now, the reason you can't use regular expressions to parse XML is that they aren't contextual. There is therefore a whole class of problems that regular expressions can't be used for. This includes nested tags (whether they are XML or not), so keep that in mind.

That out of the way, you should be using preg, not ereg. ereg is a less commonly used, slower, and now deprecated flavor of regular expressions. Just forget about it.

In PCRE (Perl Compatible Regular Expressions), which is the syntax that the preg functions use, a . (dot) is a wildcard that matches any single character (except newline). You can put a quantifier after a match. A quantifier can be an explicit range of numbers, such as {1,3} (meaning at least one, but up to three), or one of the shorthand symbols, such as + (short for {1,}, meaning at least one) or * (meaning any number, including zero). With this knowledge, you can match anything with .*.
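A small sketch of the quantifiers just described, using preg_match. The test strings and variable names are invented for illustration.

```php
<?php
// 'a{1,3}' matches between one and three consecutive 'a' characters.
preg_match('~a{1,3}~', 'caaaat', $m_range);    // $m_range[0] is 'aaa'

// '+' needs at least one character; '*' also accepts zero.
$plus_matches = preg_match('~^a+$~', '');       // 0: the empty string has no 'a'
$star_matches = preg_match('~^a*$~', '');       // 1: zero repetitions are allowed

// '.' does not match a newline, so '.*' only captures the first line.
preg_match('~.*~', "first\nsecond", $m_dot);    // $m_dot[0] is 'first'
```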

By default, expressions will match the largest possible pattern (known as being greedy). You can change this by putting ? after the quantifier. Thus .*? will match anything, but take the shortest possible match. This can then be used to match any delimited value as follows:

~<foo>.*?</foo>~

Note that I'm using ~ as the delimiter here to avoid having to escape / in the expression. The standard is to use / as delimiter, in which case the expression would have looked like this:

/<foo>.*?<\/foo>/
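The effect of greedy versus non-greedy matching is easiest to see when the input has two delimited values. A minimal sketch, with a made-up input string and replacement:

```php
<?php
$input = '<foo>one</foo> and <foo>two</foo>';

// Greedy: .* runs to the LAST </foo>, so both elements collapse into one.
$greedy = preg_replace('~<foo>.*</foo>~', '<foo>X</foo>', $input);
// $greedy is '<foo>X</foo>'

// Lazy: .*? stops at the FIRST </foo>, so each element is replaced separately.
$lazy = preg_replace('~<foo>.*?</foo>~', '<foo>X</foo>', $input);
// $lazy is '<foo>X</foo> and <foo>X</foo>'

// Same lazy pattern with '/' as the delimiter; the slash in </foo> must be escaped.
$lazy_slash = preg_replace('/<foo>.*?<\/foo>/', '<foo>X</foo>', $input);
```

Both delimiter styles give the same result; the choice only affects which characters need escaping inside the pattern.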

In general, the above is bad practice, since it's much better to match a negated character class than a dot, but to keep things simple, just ignore this until you have the basics down. It'll work in most cases. One caveat: since the . doesn't match newlines, this won't work if the content contains a newline character. If you need that, you can do one of two things: either add the s modifier to the expression, which makes . match newlines too, or replace the . with a character class that includes newlines, for example [\s\S] (meaning a whitespace character or a non-whitespace character, which is the same as anything). This is how the expression would look then:

~<foo>.*?</foo>~s

Or:

~<foo>[\s\S]*?</foo>~
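As a quick check of the newline behaviour, the following sketch runs the plain pattern and both variants against content that spans two lines. The input string is made up for illustration.

```php
<?php
$input = "<foo>line one\nline two</foo>";

// Without the s modifier the dot cannot cross the newline: no match, input unchanged.
$plain  = preg_replace('~<foo>.*?</foo>~', '<foo>X</foo>', $input);

// With the s modifier the dot also matches newlines.
$dotall = preg_replace('~<foo>.*?</foo>~s', '<foo>X</foo>', $input);

// [\s\S] matches any character, newline included, without needing a modifier.
$class  = preg_replace('~<foo>[\s\S]*?</foo>~', '<foo>X</foo>', $input);
// $dotall and $class are both '<foo>X</foo>'
```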

To put all this to work, let's pass it to the preg_replace function:

echo preg_replace('~<foo>.*?</foo>~s', '<foo>Lorem Ipsum</foo>', $input);

If your tag names are variable, you can build the expression up like you would an SQL query. Just as with SQL, you need to escape certain characters. Use preg_quote for that:

function swapText($tagname, $replacement_text, $input) {
  $tagname_escaped = preg_quote($tagname, '~');
  return preg_replace(
    '~<' . $tagname_escaped . '>.*?</' . $tagname_escaped . '>~s',
    '<' . $tagname . '>' . $replacement_text . '</' . $tagname . '>',
    $input);
}
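Since the question also mentions delimiters such as |, :: and [tag], here is one possible generalisation. The function name replaceBetween and its signature are hypothetical, not part of the answer. It assumes both delimiters are literal strings, and it uses preg_replace_callback so that a replacement containing something like $1 is inserted literally rather than treated as a backreference.

```php
<?php
// Hypothetical helper: replaces whatever sits between two literal
// delimiters, keeping the delimiters themselves.
function replaceBetween($start, $end, $replacement, $input) {
  // preg_quote escapes regex metacharacters such as | [ ] and the ~ pattern delimiter.
  $pattern = '~' . preg_quote($start, '~') . '.*?' . preg_quote($end, '~') . '~s';
  // The callback's return value is used as-is, so "$1" or "\1" in
  // $replacement is not interpreted as a backreference.
  return preg_replace_callback($pattern, function ($match) use ($start, $replacement, $end) {
    return $start . $replacement . $end;
  }, $input);
}

$a = replaceBetween('[tag]', '[/tag]', 'new', 'x [tag]old[/tag] y');  // 'x [tag]new[/tag] y'
$b = replaceBetween('|', '|', 'costs $10', 'a |old| b');              // 'a |costs $10| b'
$c = replaceBetween('::', '::', 'new', "a ::multi\nline:: b");        // 'a ::new:: b'
```

Like swapText above, this replaces every occurrence in the input and does not handle nested delimiters.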
troelskn
+1 for the good explanations
Yacoby
Note that `.` matches anything *except* line breaks. Besides that, excellent answer!
Bart Kiers
thanks. I think it will do what I need, and based on your excellent explanations, I think I can re-purpose the swapText function to handle other kinds of delimiters I am using throughout my app. Thanks again!
OneNerd
@bart Good point. I've updated the answer.
troelskn
+1  A: 

@OP, there's no need to use a complicated regex or a parser if your task is very simple. Here is an example using just ordinary substring functions:

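$mystr='<sometag>everything in here</sometag>';
// find where the opening and closing tags begin
$start=strpos($mystr,"<sometag>");
$end=strpos($mystr,"</sometag>");
// keep everything up to and including <sometag>, add the new text, then append from </sometag> onwards
print substr($mystr,0,$start+strlen("<sometag>") ) . "new value" . substr($mystr,$end);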
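The same idea can be wrapped in a reusable function. This is only a sketch: the name replaceBetweenPlain and its parameters are hypothetical, and it adds checks for a missing delimiter, which the snippet above does not have. It replaces only the first delimited value it finds.

```php
<?php
// Hypothetical helper: substring-based replacement between two literal delimiters.
function replaceBetweenPlain($start, $end, $replacement, $input) {
  $start_pos = strpos($input, $start);
  if ($start_pos === false) {
    return $input;  // opening delimiter not found
  }
  $content_pos = $start_pos + strlen($start);
  // Look for the closing delimiter only after the opening one.
  $end_pos = strpos($input, $end, $content_pos);
  if ($end_pos === false) {
    return $input;  // closing delimiter not found
  }
  return substr($input, 0, $content_pos) . $replacement . substr($input, $end_pos);
}

$result = replaceBetweenPlain('<sometag>', '</sometag>', 'new value', "<sometag>a\nb</sometag>");
// $result is '<sometag>new value</sometag>'
```

Because strpos and substr treat newlines like any other character, content that spans several lines needs no special handling here.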
thanks - thought the regex would work, but yours worked better and also with newline characters which the regex didn't.
OneNerd