tags:

views:

235

answers:

3

Hi,

How can I match all content outside of HTML tags?

My pseudo-HTML is:

<h1>aaa</h1>
bbb <img src="bla" /> ccc
<div>ddd</div>

I used the regular expression,

(?<=^|>)[^><]+?(?=<|$)

which gives me: "aaa bbb ccc ddd"

What I need is a way to skip the HTML elements together with their contents, so that the result is: "bbb ccc"
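
For reference, here is a minimal JavaScript sketch of why the expression above returns too much: it matches every run of text between a `>` and a `<`, whether or not that run sits inside an element. The function name `textBetweenTags` is made up for illustration, and a regex engine with lookbehind support (a recent Node.js or browser) is assumed.

```javascript
// Sketch only: reproduces the behaviour described in the question.
function textBetweenTags(html) {
  // Same expression as in the question, applied globally.
  const pattern = /(?<=^|>)[^><]+?(?=<|$)/g;
  const matches = html.match(pattern) || [];
  // Join the pieces and normalise whitespace for display.
  return matches.join('').replace(/\s+/g, ' ').trim();
}

const sample = '<h1>aaa</h1>\nbbb <img src="bla" /> ccc\n<div>ddd</div>';
console.log(textBetweenTags(sample)); // aaa bbb ccc ddd
```

The text inside `<h1>` and `<div>` is matched as well, because nothing in the pattern knows whether an opening tag is still waiting for its closing tag.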

+6  A: 

Regexes are a clunky and unreliable way to work on markup. I would suggest using a DOM parser such as SimpleHtmlDom:

//get the textual content of all hyperlinks on specified page.
//you can use selectors, e.g. 'a.pretty' - see the docs
//find() without an index returns an array of elements, so loop over it
foreach (file_get_html('http://www.example.org')->find('a') as $link) {
    echo $link->plaintext;
}

If you want to do that on the client, you can use a library such as jQuery like so:

$('a').each(function() {
    alert($(this).text());
});
karim79
A: 

Look for an appropriate regex to match complete tags (e.g. in a library like http://regexlib.com/) and remove them using the substitution operator s///. Then use the rest.
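
The substitution idea could be sketched in JavaScript roughly as follows. This is an illustration under assumptions, not a robust solution: it assumes flat, non-nested markup like the sample in the question, and the function name `stripElements` is hypothetical.

```javascript
// Sketch only: remove paired elements first, then leftover standalone tags.
function stripElements(html) {
  return html
    // Remove <tag ...>content</tag> including the content;
    // the backreference \1 requires the same tag name in the closing tag.
    .replace(/<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>[\s\S]*?<\/\1\s*>/g, ' ')
    // Remove remaining standalone tags such as <img ... />.
    .replace(/<[a-zA-Z\/][^>]*>/g, ' ')
    // Collapse the whitespace left behind by the removals.
    .replace(/\s+/g, ' ')
    .trim();
}

const sample = '<h1>aaa</h1>\nbbb <img src="bla" /> ccc\n<div>ddd</div>';
console.log(stripElements(sample)); // bbb ccc
```

Nested elements with the same tag name, comments, and `>` characters inside attribute values would break this, which is the usual argument for a real parser.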

fgm
I used this one, but the expression is not written correctly: s/(\<(.*?)\>)(.*?)(\<\/(.*?)\>)|(<[a-zA-Z\/][^>]*>)//|(?<=^|>)[^><]+?(?=<|$)
A: 

Thanks everybody,

Using both expressions together would be dirty work, and I would like the opposite output of this one:

(\<(.*?)\>)(.*?)(\<\/(.*?)\>)|(<[a-zA-Z\/][^>]*>)

My pseudo string is:

<h1>aaa</h1>

bbb <img src="bla" /> ccc

<div>ddd</div> jhgvjhgjh zhg zt <div>ddd</div>

<div>dsada</div> hbhgjh
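
One way to get the "opposite" output for a string like this, without a full DOM parser, is to walk over the tags and keep only the text seen while no element is open. The JavaScript below is a rough sketch under assumptions: the name `topLevelText` and the short `VOID_TAGS` list are made up for illustration, and comments, scripts, and `>` inside attribute values are not handled.

```javascript
// Sketch only: keep text that is not inside any element.
// Tags that never have a closing tag (the list is deliberately incomplete).
const VOID_TAGS = new Set(['img', 'br', 'hr', 'input', 'meta', 'link']);

function topLevelText(html) {
  // Captures: optional "/" of a closing tag, tag name, optional trailing "/".
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*?(\/?)>/g;
  const parts = [];
  let depth = 0; // number of currently open elements
  let last = 0;  // index just after the previous tag
  let m;
  while ((m = tagPattern.exec(html)) !== null) {
    // Text before this tag counts only if no element is open.
    if (depth === 0) parts.push(html.slice(last, m.index));
    last = tagPattern.lastIndex;
    const [, closing, name, selfClosing] = m;
    if (closing) {
      depth = Math.max(0, depth - 1);
    } else if (!selfClosing && !VOID_TAGS.has(name.toLowerCase())) {
      depth += 1;
    }
  }
  // Trailing text after the last tag.
  if (depth === 0) parts.push(html.slice(last));
  return parts.join(' ').replace(/\s+/g, ' ').trim();
}

const sample = [
  '<h1>aaa</h1>',
  'bbb <img src="bla" /> ccc',
  '<div>ddd</div> jhgvjhgjh zhg zt <div>ddd</div>',
  '<div>dsada</div> hbhgjh',
].join('\n\n');
console.log(topLevelText(sample)); // bbb ccc jhgvjhgjh zhg zt hbhgjh
```

Unlike a pure substitution, the depth counter also copes with nested elements, for example `<div><b>x</b></div>y` yields `y`.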

To keep things simple, I use this tool.

Greetings from Germany!