Hello, I would like empty span tags like this one (containing only &nbsp; and whitespace) to be removed:

<span> &nbsp; &nbsp; &nbsp; </span>

I've tried with this regex, but it needs adjusting:

(<span>(&nbsp;|\s)*</span>)

preg_replace('#<span>(&nbsp;|\s)*</span>#si','<\\1>',$encoded);

Cheers.

+2  A: 

qr{<span[^>]*(/>|>\s*?</span>)}

That should catch most of them (including XML-style self-closing tags, i.e. <span />).

But you really shouldn't use regex for HTML processing.

This answer is only relevant to the question as it was visible before its formatting errors were corrected.

Kent Fredric
Perl code to an (unspecified!) PHP request? :-)
PhiLho
Yeah, I couldn't be stuffed with nasty quoting styles needed :/ user exercise to make the regex suited to their language :p
Kent Fredric
I'm really getting tired of people saying that you shouldn't use regexes on any sort of XML or HTML. Sometimes using something like Beautiful Soup *really isn't appropriate*.
nickf
In this case it would be fine, as long as it never occurs inside quoted areas. That makes this very brittle, and I wouldn't use it, except in a pinch.
Brad Gilbert
@nickf: it's to combat the problem of millions of novices who use it as the first port of call and then XSS-exploit themselves.
Kent Fredric
A: 

Could you explain your solution if possible?

+1  A: 

I suppose these spans are generated by some program, since they don't seem to have any attributes.
I am perplexed as to why you need to put the space they enclose between angle brackets, but then again I don't know the final purpose of the code.
I think the solution is given by Kent: you have to make the match non-greedy. Since you use the dotall option (s), you will otherwise match everything between the first span and the last closing span!

So the answer should look like:

preg_replace('#<span>(&nbsp;|\s)*?</span>#si', '<$1>', $encoded);

(untested)
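
A quick way to check the pattern above is to run it on a small sample. The sketch below assumes the goal is to delete the span outright, so it also tries an empty replacement string. The sample input and variable names are made up for illustration. With '<$1>' as the replacement, the last captured character is kept and wrapped in angle brackets, which may be why the whitespace appears to stay in place.

```php
<?php
// Hypothetical sample input, not taken from the original post.
$encoded = 'before<span> &nbsp; &nbsp; &nbsp; </span>after';

$pattern = '#<span>(&nbsp;|\s)*?</span>#si';

// Replacement from the answer: $1 holds the last repetition of the group,
// so one character survives inside angle brackets.
$wrapped = preg_replace($pattern, '<$1>', $encoded);

// Assumed intent: drop the empty span entirely.
$removed = preg_replace($pattern, '', $encoded);

echo $wrapped, "\n";
echo $removed, "\n";
```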

PhiLho
\s* and \s*? are equivalent
Scott Evernden
A: 

purpose: I'm trying to filter out directly pasted MS-WORD content.

P.S. I've tried the code above - the empty space still stays untouched...

A: 

The problem comes when the span gets nested like: <span><span> &nbsp; </span></span>

+3  A: 

Translating Kent Fredric's regexp to PHP:

preg_match_all('#<span[^>]*(?:/>|>(?:\s|&nbsp;)*</span>)#im', $html, $result);

This will match:

  • self-closing spans
  • spans spread over multiple lines, in any letter case
  • spans with attributes
  • spans containing non-breaking spaces
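
To see the pattern above in action, here is a small sketch that runs it on a made-up sample covering the listed cases, then reuses the same pattern with preg_replace to drop the matches. The sample string and variable names are illustrative only.

```php
<?php
// Made-up sample: a self-closing span, an upper-case span with an attribute
// spread over two lines, an &nbsp;-only span, and one span with real content.
$html = "a<span />b<SPAN class=\"x\">\n </SPAN>c<span>&nbsp;&nbsp;</span>d<span>text</span>e";

$pattern = '#<span[^>]*(?:/>|>(?:\s|&nbsp;)*</span>)#im';

// List every empty span; <span>text</span> is not matched.
preg_match_all($pattern, $html, $result);
print_r($result[0]);

// The same pattern can remove the matches instead of just listing them.
$cleaned = preg_replace($pattern, '', $html);
echo $cleaned, "\n";
```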

Maybe you should think about including spans containing only <br /> as well...

As usual, when it comes to tweaking regexps, some tools are handy:

http://regex.larsolavtorvik.com/

e-satis
+1  A: 

I've tried with this regex, but it needs adjusting:

In what way does the regex in the original question fail?

The problem comes when the span gets nested like: <span><span> &nbsp; </span></span>

This is an example of why using regexes to parse HTML doesn't work particularly well. Depending on your regex flavor, this situation is either impossible to handle in a single pass or merely very difficult. I don't know PHP's regex engine well enough to say which category it falls into, but, if the only problem is that it takes out the inner <span> and leaves the outer one alone, then you may want to consider simply re-running your substitution repeatedly until it runs out of things to do.

Dave Sherohman
Yes, I agree but I wanted to know if there's a way to re-run it recursively? Otherwise it becomes difficult to predict the nested tags numbers/names...
A: 

A good site when working with regular expressions is http://regexlib.com/. Its tester and cheat sheet are very helpful.

c00ke
A: 

If your only issue is nested span tags, you can run the search-and-replace with the regex you have in a loop until the regex no longer finds any matches.

This may not be a very elegant solution, but it'll perform well enough.
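
One way such a loop might look in PHP is sketched below. The function name is hypothetical, and the pattern is adapted from the question (with a non-capturing group). It relies on the fifth argument of preg_replace, which receives the number of replacements made in that pass.

```php
<?php
// Hypothetical helper: repeats the replacement until a pass changes nothing.
function stripEmptySpans(string $html): string
{
    $pattern = '#<span>(?:&nbsp;|\s)*</span>#i';
    do {
        // $count is filled in with the number of replacements in this pass.
        $result = preg_replace($pattern, '', $html, -1, $count);
        if ($result === null) {
            break; // regex engine error: keep the last good string
        }
        $html = $result;
    } while ($count > 0);

    return $html;
}

// Removing the inner span leaves the outer one empty, so a second pass catches it.
echo stripEmptySpans('x<span><span> &nbsp; </span></span>y'), "\n";
```

Each pass peels off one level of nesting, so the number of passes is bounded by the nesting depth plus one.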

Jan Goyvaerts
A: 

Here is my solution to nesting tags problems, still not complete but close...

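$test = "<span>   <span>&nbsp;  </span> test <span>&nbsp; <span>&nbsp;  </span>  </span> &nbsp;&nbsp; </span>";

// Match any element whose body is only whitespace and/or &nbsp; entities.
$pattern = '#<(\w+)[^>]*>(?:&nbsp;|\s)*</\1>#i';
// Removing an inner empty element can leave its parent empty, so repeat until nothing matches.
while (preg_match($pattern, $test)) {
    $test = preg_replace($pattern, '', $test);
}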

For short $test strings the function works OK. The problem comes when trying it on a long text. Any help will be appreciated...
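
One possible cause, offered as a guess since the failing input is not shown: on long runs of whitespace, the repeated alternation can hit PCRE's backtracking or recursion limits, in which case preg_replace returns null instead of a string. The sketch below uses possessive quantifiers so the engine never backtracks into a whitespace run, and it checks for a null result. The function name, the added \b after the tag name, and the exception are illustrative choices, not part of the original code.

```php
<?php
// Hypothetical helper: removes elements that contain only whitespace or &nbsp;.
function stripEmptyElements(string $html): string
{
    // \s++ and (...)*+ are possessive: consumed whitespace is never given back.
    // \b keeps (\w+) from matching only part of a tag name.
    $pattern = '#<(\w+)\b[^>]*>(?:&nbsp;|\s++)*+</\1>#i';
    do {
        $result = preg_replace($pattern, '', $html, -1, $count);
        if ($result === null) {
            // preg_last_error() returns a code such as PREG_BACKTRACK_LIMIT_ERROR.
            throw new RuntimeException('PCRE error code ' . preg_last_error());
        }
        $html = $result;
    } while ($count > 0);

    return $html;
}
```

If the exception fires on real input, the error code indicates which limit was reached.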