Hi guys, I'm very new to regex. Can you help me with this?

I have a string like this: "<input attribute='value' >", where attribute='value' could be anything, and I want to do a preg_replace to get just <input />.

How do I specify a wildcard to replace any number of any characters in a string?

Like this? preg_replace("/<input.*>/",$replacement,$string);

Many thanks

+10  A: 

What you have:

.*

will match "any character, and as many as possible".

What you mean is

[^>]+

which translates to "any character that's not a '>', and there must be at least one",

or alternatively,

.*?

which means "any character, but only enough to make this rule work"
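
To make the difference concrete, here is a minimal sketch comparing the three forms. The sample string is made up for illustration and is not from the question.

```php
<?php
// Hypothetical sample input: the <input> tag is followed by other tags.
$string = "<input type='text' name='a' > and <b>bold</b>";

// Greedy dot: .* runs on to the LAST '>' in the string.
preg_match('/<input.*>/', $string, $greedy);

// Negated class: [^>]+ cannot cross a '>', so it stops at the first one.
preg_match('/<input[^>]+>/', $string, $negated);

// Lazy dot: .*? grows only until the first '>' that lets the match succeed.
preg_match('/<input.*?>/', $string, $lazy);

echo $greedy[0], "\n";
echo $negated[0], "\n";
echo $lazy[0], "\n";
```

On this input the greedy version swallows the whole string, while the other two stop at the end of the input tag.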

BUT DON'T

Parsing HTML with regexps is Bad

Use any of the existing HTML parsers, DOM libraries, anything, just NOT NAÏVE REGEX

For example:

 <foo attr=">">

Will get grabbed wrongly by regex as

'<foo attr=">' with following text of '">'

Which will lead you to this regex:

 <[a-zA-Z]+( [a-zA-Z]+=['"][^"']*['"])*>  etc etc

at which point you'll discover this lovely gem:

 <foo attr="'>\'\"">

and your head will explode.

(The syntax highlighter proves my point: it matches incorrectly, thinking I've ended the tag.)
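
As a rough sketch of the parser route, assuming PHP's bundled DOM extension is available and using a made-up input, something like this strips every attribute from each input element. It is one possible way to do it, not the only one.

```php
<?php
// Hypothetical input with a '>' inside an attribute value.
$html = '<p><input name="x" value="3>2"></p>';

// The simple regex stops at the first '>', which sits inside the value.
$viaRegex = preg_replace('/<input[^>]*>/', '<input />', $html);

// DOM approach: parse the markup, then remove the attributes node by node.
$doc = new DOMDocument();
$doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

foreach ($doc->getElementsByTagName('input') as $input) {
    // Keep removing the first attribute until none remain.
    while ($input->attributes->length > 0) {
        $input->removeAttribute($input->attributes->item(0)->nodeName);
    }
}

$viaDom = $doc->saveHTML();

echo $viaRegex, "\n"; // leftover 2"> from the attribute value stays behind
echo $viaDom, "\n";
```

The regex result keeps the tail of the attribute value as stray text, while the DOM version sees the tag as a whole.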

Kent Fredric
The concept with "as many as possible" vs. "just enough" is called "greediness" in most documentation.
John Nilsson
@John: yeah, I know, but this guy's obviously green on those terms ;)
Kent Fredric
You were right on the head explosion part... speaking from experience.
Mrgreen
How do you come to think such a monstrosity as "<foo attr='>'>" was even *possible* in HTML? I know you talk about XSS, but I guess we are not looking at a "How do I sanitize broken user input?" question here. Allowing users to input HTML is a big WTF in itself.
Tomalak
A: 
preg_replace("/<input[^>]*>/", $replacement, $string);
// [^>] means "any character except the greater than symbol / right tag bracket"

This is really basic stuff, you should catch up with some reading. :-)

Tomalak
This almost works, but it fails on attributes which have a '>' in the value, e.g. <input attr="3>2">.
Adam Rosenfield
@Adam: which is *exactly* why you shouldn't use Regex to parse html.
Kent Fredric
Funnily, it looks like allowing > in attributes values was made only to make a point against using regexes on HTML (I never saw it used in real life). But that's a good point.
PhiLho
@PhiLho, hopefully not used in real life, but it's one of the ways people create code for XSS purposes. And that gets ugly fast.
Kent Fredric
I'm sorry to say it, but there are no '>' characters in attribute values. Never. If there are in *your* HTML, you've got a completely different problem on your hands.
Tomalak
I wouldn't say that this is *basic* stuff by any stretch of the imagination. Making sure you are matching HTML properly is a fairly difficult thing to achieve, especially when you start trying to match nested tags.
localshred
He did not say anything about nested tags. His question was quite straight-forward, and the regex that does what he wants really *is* basic.
Tomalak
+1  A: 

Some people were close... but not 100%:

This:

preg_replace("/<input[^>]*>/", $replacement, $string);

should be this:

preg_replace("/<input[^>]*?>/", $replacement, $string);

You don't want that to be a greedy match.

Timothy Khouri
Greediness is irrelevant here, as the use of [^>]* instead of .* will cause it to match all non-> characters until a > is found and the longest (greedy) and shortest (non-greedy) runs of non-> characters followed by a > will be identical in all cases.
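
A quick way to check this, as a sketch on a made-up string with two tags:

```php
<?php
// Hypothetical sample with two <input> tags.
$string = "<input a='1'> text <input b='2'>";

// Greedy and lazy quantifiers on the negated class.
$greedy = preg_replace('/<input[^>]*>/', '<input />', $string);
$lazy   = preg_replace('/<input[^>]*?>/', '<input />', $string);

var_dump($greedy === $lazy); // both patterns produce the same string
echo $greedy, "\n";
```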
Dave Sherohman
A: 

I'd recommend

Regulazy or Regulator,

both free, both good, both by the awesome Osherove, both can be found here: Roy Osherove's tools

AndreasKnudsen
A: 

If I understand the question correctly, you have the code:

preg_replace("/<input.*>/",$replacement,$string);

and you want us to tell you what you should use for $replacement to delete what was matched by .*

You have to go about this the other way around. Use capturing groups to capture what you want to keep, and reinsert that into the replacement. E.g.:

preg_replace("/(<input).*(>)/","$1$2",$string);

Of course, you don't really need capturing groups here, as you're only reinserting literal text. But the above shows the technique, in case you want to do this in a situation where the tag can vary. This is a better solution:

preg_replace("/<input [^>]*>/","<input />",$string);

The negated character class is more specific than the dot. This regex will work if there are two HTML tags in the string. Your original regex won't.
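
Here is a small sketch of both points, on a made-up string with two tags. The (input|select) alternation is only an illustration of a tag that can vary; it is not part of the question.

```php
<?php
// Hypothetical input with two different tags to clean up.
$string = "<input type='text' ><select name='s' >";

// Greedy dot with capturing groups: .* runs to the last '>' in the string,
// so the second tag is swallowed along with the attributes.
$greedy = preg_replace('/(<input).*(>)/', '$1$2', $string);

// Capture the tag name and reinsert it, with a negated class for the rest.
$clean = preg_replace('/<(input|select)\b[^>]*>/', '<$1 />', $string);

echo $greedy, "\n";
echo $clean, "\n";
```

With the negated class each tag is matched separately, so $1 holds the right tag name for each replacement.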

Jan Goyvaerts