ansaurus

Question

Answer 1

+10 A:

What you have:

.*

will match "any character, and as many as possible".

What you mean is

[^>]+

which translates to "any character that's not a '>', and there must be at least one",

or alternatively,

.*?

which means "any character, but only enough to make this rule work"
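To make the difference between the three concrete, here is a minimal sketch. The sample string and the variable names are made up for illustration, and the patterns are anchored on `<input` only so there is something to match against:

```php
<?php
// Hypothetical sample input: two tags on one line.
$html = '<input name="a"> and <input name="b">';

// Greedy: .* runs on to the LAST ">" in the line.
preg_match('/<input.*>/', $html, $greedy);

// Negated class: [^>]+ cannot cross a ">", so it stops at the first one.
preg_match('/<input[^>]+>/', $html, $negated);

// Lazy: .*? takes only as many characters as the rest of the pattern needs.
preg_match('/<input.*?>/', $html, $lazy);

echo $greedy[0], "\n";  // <input name="a"> and <input name="b">
echo $negated[0], "\n"; // <input name="a">
echo $lazy[0], "\n";    // <input name="a">
```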

BUT DON'T

Parsing HTML with regexps is Bad.

Use any of the existing HTML parsers, DOM libraries, anything, just NOT NAÏVE REGEX.

For example:

 <foo attr=">">

Will get grabbed wrongly by regex as

'<foo attr=">' with following text of '">'

Which will lead you to this regex:

 <[a-zA-Z]+( [a-zA-Z]+=['"][^"']*['"])*>  etc etc

at which point you'll discover this lovely gem:

 <foo attr="'>\'\"">

and your head will explode.

(The syntax highlighter verifies my point: it matches incorrectly, thinking I've ended the tag.)
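For comparison, a minimal sketch of the parser route. It assumes PHP's DOM extension is installed (it usually is, but that is an assumption about your setup), and the sample markup is invented for illustration:

```php
<?php
// Hypothetical input: the attribute value contains a ">".
$html = '<p>before <input name="q" value="3>2"> after</p>';

// The naive regex stops at the ">" inside the attribute value.
preg_match('/<input[^>]*>/', $html, $m);
echo $m[0], "\n"; // <input name="q" value="3>

// A real parser tracks the quotes and sees the whole tag.
libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);

foreach ($doc->getElementsByTagName('input') as $input) {
    $value = $input->getAttribute('value');
    echo $value, "\n"; // 3>2
}
```

The parser hands back the attribute value intact, while the regex match is cut off halfway through the tag.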

Kent Fredric 2008-11-12 21:40:50

The concept with "as many as possible" vs. "just enough" is called "greediness" in most documentation.

John Nilsson 2008-11-12 21:43:35

@John: yeah, I know, but this guy's obviously green on those terms ;)

Kent Fredric 2008-11-12 21:44:56

You were right on the head explosion part... speaking from experience.

Mrgreen 2008-11-19 03:15:37

How do you come to think such a monstrosity as "<foo attr='>'>" was even *possible* in HTML? I know you talk about XSS, but I guess we are not looking at a "How do I sanitize broken user input?" question here. Allowing users to input HTML is a big WTF in itself.

Tomalak 2008-11-19 11:04:54

Answer 2

A:

preg_replace("/<input[^>]*>/", $replacement, $string);
// [^>] means "any character except the greater than symbol / right tag bracket"

This is really basic stuff, you should catch up with some reading. :-)

Tomalak 2008-11-12 21:41:39

This almost works, but it fails on attributes which have a '>' in the value, e.g. <input attr="3>2">.

Adam Rosenfield 2008-11-12 21:43:56

@Adam: which is *exactly* why you shouldn't use Regex to parse html.

Kent Fredric 2008-11-12 21:45:30

Funnily, it looks like allowing > in attribute values was made only to make a point against using regexes on HTML (I never saw it used in real life). But that's a good point.

PhiLho 2008-11-12 22:17:26

@PhiLho, hopefully not used in real life, but it's one of the ways people create code for XSS purposes. And that gets ugly fast.

Kent Fredric 2008-11-19 08:17:38

I'm sorry to say it, but there are no '>' characters in attribute values. Never. If there are in *your* HTML, you've got a completely different problem on your hands.

Tomalak 2008-11-19 10:50:45

I wouldn't say that this is *basic* stuff by any stretch of the imagination. Making sure you are matching HTML properly is a fairly difficult thing to achieve, especially when you start trying to match nested tags.

localshred 2008-11-20 23:56:24

He did not say anything about nested tags. His question was quite straight-forward, and the regex that does what he wants really *is* basic.

Tomalak 2008-11-21 08:26:12

Answer 3

+1 A:

Some people were close... but not 100%:

This:

preg_replace("/<input[^>]*>/", $replacement, $string);

should be this:

preg_replace("/<input[^>]*?>/", $replacement, $string);

You don't want that to be a greedy match.

Timothy Khouri 2008-11-12 22:32:21

Greediness is irrelevant here, as the use of [^>]* instead of .* will cause it to match all non-> characters until a > is found and the longest (greedy) and shortest (non-greedy) runs of non-> characters followed by a > will be identical in all cases.
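This is easy to check. A minimal sketch, assuming a made-up input string and a placeholder replacement of `[X]`:

```php
<?php
// Hypothetical input with two tags.
$html = '<input type="text" name="a"> text <input name="b">';

// Same negated class, once greedy and once lazy.
$greedy = preg_replace('/<input[^>]*>/', '[X]', $html);
$lazy   = preg_replace('/<input[^>]*?>/', '[X]', $html);

echo $greedy, "\n";          // [X] text [X]
var_dump($greedy === $lazy); // bool(true)
```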

Dave Sherohman 2008-11-13 00:25:37

Answer 4

A:

I'd recommend

Regulazy or Regulator,

both free, both good, both by the awesome Osherove, both can be found here: Roy Osherove's tools

AndreasKnudsen 2008-11-12 22:35:06

Answer 5

A:

If I understand the question correctly, you have the code:

preg_replace("/<input.*>/",$replacement,$string);

and you want us to tell you what you should use for $replacement to delete what was matched by .*

You have to go about this the other way around. Use capturing groups to capture what you want to keep, and reinsert that into the replacement. E.g.:

preg_replace("/(<input).*(>)/","$1$2",$string);

Of course, you don't really need capturing groups here, as you're only reinserting literal text. But the above shows the technique, in case you want to do this in a situation where the tag can vary. This is a better solution:

preg_replace("/<input [^>]*>/","<input />",$string);

The negated character class is more specific than the dot. This regex will work if there are two HTML tags in the string. Your original regex won't.
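A minimal sketch of the two-tag case, plus the capturing-group technique with a tag name that can vary. The input string and the `input|select` alternation are invented for illustration:

```php
<?php
// Hypothetical input: two <input> tags with other markup in between.
$string = '<input type="text" name="a"><b>label</b><input type="submit">';

// Original pattern: the greedy dot swallows everything up to the last ">".
$original = preg_replace('/<input.*>/', '<input />', $string);
echo $original, "\n"; // <input />

// Negated class: each tag is matched and replaced on its own.
$better = preg_replace('/<input [^>]*>/', '<input />', $string);
echo $better, "\n"; // <input /><b>label</b><input />

// Capturing groups earn their keep once the tag name can vary:
// $1 holds whichever opening was matched, $2 the closing bracket.
$stripped = preg_replace('/(<(?:input|select))[^>]*(>)/', '$1$2', $string);
echo $stripped, "\n"; // <input><b>label</b><input>
```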

Jan Goyvaerts 2008-11-13 16:55:21


Very simple regular expression help
