ansaurus

Question

Regular expression to remove empty <span> tags

Answer 1

+2 A:

qr{<span[^>]*(/>|>\s*?</span>)}

That should catch most of them (including XML-style self-closing tags, i.e. <span />).

But you really shouldn't use regex for HTML processing.

(This answer is only relevant to the question as it appeared before its formatting errors were corrected.)

Kent Fredric 2008-11-15 13:13:23

Perl code to an (unspecified!) PHP request? :-)

PhiLho 2008-11-15 13:16:34

Yeah, I couldn't be stuffed with nasty quoting styles needed :/ user exercise to make the regex suited to their language :p

Kent Fredric 2008-11-15 13:18:23

I'm really getting tired of people saying that you shouldn't use regexes on any sort of XML or HTML. Sometimes using something like Beautiful Soup *really isn't appropriate*.

nickf 2008-11-15 13:51:43

In this case it would be fine, as long as it never occurs inside quoted areas. That makes this very brittle, and I wouldn't use it, except in a pinch.

Brad Gilbert 2008-11-15 18:01:51

@nickf: it's to combat the problem of millions of novices who use it as the first port of call and then XSS-exploit themselves.

Kent Fredric 2008-11-15 23:04:24

Answer 2

A:

Could you explain your solution if possible?

2008-11-15 13:32:33

Answer 3

+1 A:

I suppose these spans are generated by some program, since they don't seem to have any attributes.
I am perplexed as to why you need to put the space they enclose between angle brackets, but then again I don't know the final purpose of the code.
I think the solution is the one given by Kent: you have to make the match non-greedy. Since you use the dotall option (s), you will otherwise match everything between the first span and the last closing span!

So the answer should look like:

preg_replace('#<span>(&nbsp;|\s)*?</span>#si', '<$1>', $encoded);

(untested)
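To illustrate the greedy versus non-greedy point, here is a minimal sketch. The patterns and the sample string are assumptions made up for illustration, since the regex from the original question is not shown in this thread.

```php
<?php
// Hypothetical sample input: two separate empty spans around real text.
$html = "<span> </span>keep me<span>\n</span>";

// Greedy: with the s (dotall) modifier, .* runs from the first <span>
// to the last </span>, swallowing the text in between.
$greedy = preg_replace('#<span>.*</span>#si', '', $html);

// Non-greedy: .*? stops at the nearest </span>, so each span is
// handled on its own.
$lazy = preg_replace('#<span>.*?</span>#si', '', $html);

var_dump($greedy); // string(0) ""
var_dump($lazy);   // string(7) "keep me"
```

The difference only shows up with something as broad as . under dotall. With a class that cannot match `<`, such as \s, the greedy and non-greedy forms end up matching the same text.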

PhiLho 2008-11-15 13:33:40

\s* and \s*? are equivalent

Scott Evernden 2008-11-15 17:54:57

Answer 4

A:

purpose: I'm trying to filter out directly pasted MS-WORD content.

P.S. I've tried the code above - the empty space still stays untouched...

2008-11-15 13:40:05

Answer 5

A:

The problem comes when the span gets nested like:    

2008-11-15 14:11:49

Answer 6

+3 A:

Translating Kent Fredric's regexp to PHP:

preg_match_all('#<span[^>]*(?:/>|>(?:\s|&nbsp;)*</span>)#im', $html, $result);

This will match:

self-closing spans
spans spread over several lines, in any letter case
spans with attributes
spans containing non-breaking spaces

Maybe you should think about including spans containing only <br /> as well...
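As a quick check of the cases listed above, the following sketch applies the same pattern with preg_replace instead of preg_match_all, so the matches are removed rather than collected. The sample strings are invented for illustration, and the m modifier is dropped on the assumption that it is not needed, since the pattern has no ^ or $ anchors for it to affect.

```php
<?php
// Same pattern as above, minus the m modifier.
$pattern = '#<span[^>]*(?:/>|>(?:\s|&nbsp;)*</span>)#i';

// Hypothetical samples, one per case from the list, plus a non-empty span.
$samples = [
    'a<span/>b',                    // self-closing
    "a<SPAN>\n\n</SPAN>b",          // several lines, upper case
    'a<span class="x"> </span>b',   // with attributes
    'a<span>&nbsp;&nbsp;</span>b',  // non-breaking spaces
    'a<span>text</span>b',          // not empty: left alone
];

$cleaned = [];
foreach ($samples as $sample) {
    $cleaned[] = preg_replace($pattern, '', $sample);
}

// The first four become "ab"; the last one is unchanged.
print_r($cleaned);
```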

As usual, when it comes to tweaking regexps, some tools come in handy:

http://regex.larsolavtorvik.com/

e-satis 2008-11-15 14:44:04

Answer 7

+1 A:

I've tried with this regex, but it needs adjusting:

In what way does the regex in the original question fail?

The problem comes when the span gets nested like:    

This is an example of why using regexes to parse HTML doesn't work particularly well. Depending on your regex flavor, this situation is either impossible to handle in a single pass or merely very difficult. I don't know PHP's regex engine well enough to say which category it falls into, but if the only problem is that it takes out the inner <span> and leaves the outer one alone, then you may want to consider simply re-running your substitution until it runs out of things to do.
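For what it's worth, PHP's preg functions are built on PCRE, which supports recursive patterns through (?R). On that basis, a single pass can remove a span whose content is only whitespace, &nbsp; entities, or other empty spans. This is a sketch rather than a tested answer from the thread; the pattern and the sample string are assumptions for illustration.

```php
<?php
// (?R) re-enters the whole pattern, so an "empty" span may itself
// contain whitespace, &nbsp; entities, or further empty spans.
$pattern = '#<span[^>]*>(?:\s|&nbsp;|(?R))*</span>#i';

// Hypothetical input: a nested empty span followed by a non-empty one.
$html = 'a<span> <span>&nbsp;</span> </span>b<span>keep</span>';

$result = preg_replace($pattern, '', $html);

echo $result, PHP_EOL; // ab<span>keep</span>
```

Deeply nested or very long input can run into PCRE's recursion or backtracking limits, in which case preg_replace returns null, so re-running a simpler substitution remains the more predictable fallback.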

Dave Sherohman 2008-11-15 17:46:41

Yes, I agree, but I wanted to know if there's a way to re-run it recursively. Otherwise it becomes difficult to predict the number and names of the nested tags...

2008-11-15 20:32:19

Answer 8

A:

Simply a good site when working with regular expressions: http://regexlib.com/. The tester and cheat sheet are very helpful.

c00ke 2008-11-15 18:43:55

Answer 9

A:

If your only issue is nested span tags, you can run the search-and-replace with the regex you have in a loop until the regex no longer finds any matches.

This may not be a very elegant solution, but it'll perform well enough.
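A minimal sketch of such a loop, assuming a span-only pattern along the lines of the ones posted in this thread. The function name removeEmptySpans is made up for illustration.

```php
<?php
/**
 * Remove empty <span> elements, repeating until a pass finds nothing.
 * The function name and the pattern are illustrative assumptions.
 */
function removeEmptySpans(string $html): string
{
    // A span counts as "empty" here if it is self-closing or holds
    // only whitespace and/or &nbsp; entities.
    $pattern = '#<span[^>]*(?:/>|>(?:\s|&nbsp;)*</span>)#i';

    do {
        // The fifth argument receives the number of replacements made.
        $next = preg_replace($pattern, '', $html, -1, $count);
        if ($next === null) {
            // preg_replace returns null on failure (e.g. a PCRE limit).
            throw new RuntimeException('preg_replace failed, code ' . preg_last_error());
        }
        $html = $next;
    } while ($count > 0);

    return $html;
}

echo removeEmptySpans('<span> <span>&nbsp;</span> </span>text'), PHP_EOL; // text
```

Each pass strips the innermost empty spans, which turns their parents into empty spans for the next pass, so the loop runs roughly once per nesting level plus one final pass that finds nothing.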

Jan Goyvaerts 2008-11-16 10:07:32

Answer 10

A:

Here is my solution to the nested-tags problem; it's still not complete, but close...

$test = "<span>   <span>&nbsp;  </span> test <span>&nbsp; <span>&nbsp;  </span>  </span> &nbsp;&nbsp; </span>";

// Match any element whose content is only whitespace and/or &nbsp; entities.
$pattern = '#<(\w+)[^>]*>(?:&nbsp;|\s)*</\1>#i';
// Repeat until a pass removes nothing, so that emptied parents go too.
do {
    $test = preg_replace($pattern, '', $test, -1, $count);
} while ($count > 0);

For short $test strings this works OK. The problem comes when trying it on a long text. Any help would be appreciated...

2008-11-19 08:39:48
