Hi, I want to strip all JavaScript out of a small snippet (4-6 lines) of HTML. I've read on here before that it's best not to use regex on HTML, so if anybody knows a better way, please advise.

So, for example, I have the following code:

<a href="go/to/my/link" onclick="fetchMeSomeData(this)">My Link</a>
<p onfocus="doSomethingAmazing();"></p>

Now, in PHP, I want to replace each on(whatever event it is) attribute with just an empty space.

Thanks

A: 

I built a regexp for this some time ago; it looks a bit scary, though :). Here is the pure regexp. You might need to additionally escape special characters to meet your language's requirements.

(\son[a-z]+\s*=\s*"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*"(?=[^<]*?>))|(\son[a-z]+\s*=\s*'[^'\\\r\n]*(?:\\.[^'\\\r\n]*)*'(?=[^<]*?>))

Here is the escaped version (following Java string-literal rules), which you should be able to use as a string.

(\\son[a-z]+\\s*=\\s*\"[^\"\\\\\\r\\n]*(?:\\\\.[^\"\\\\\\r\\n]*)*\"(?=[^<]*?>))|(\\son[a-z]+\\s*=\\s*'[^'\\\\\\r\\n]*(?:\\\\.[^'\\\\\\r\\n]*)*'(?=[^<]*?>))

It only looks inside tags, and it takes escaped quotes inside event handlers into account. I'm sure it is not 100% bulletproof, though.
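
Since the question asks about PHP, here is a rough sketch of how the pure regexp could be applied there with preg_replace. The function name strip_inline_handlers is made up for illustration, and the trailing "i" flag (so that onClick is matched as well) is an addition that is not in the pattern above. A nowdoc is used so the pattern needs no PHP-specific escaping.

```php
<?php
// Sketch only: applies the regexp from this answer through PCRE.
// The function name and the added "i" flag are assumptions, not part of the answer.
function strip_inline_handlers(string $html): string
{
    // Nowdoc keeps the pattern verbatim, so no extra PHP string escaping is needed.
    $pattern = <<<'REGEX'
~(\son[a-z]+\s*=\s*"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*"(?=[^<]*?>))|(\son[a-z]+\s*=\s*'[^'\\\r\n]*(?:\\.[^'\\\r\n]*)*'(?=[^<]*?>))~i
REGEX;

    // The match includes the leading whitespace, so replacing it with an
    // empty string leaves no stray gap inside the tag.
    $result = preg_replace($pattern, '', $html);
    if ($result === null) {
        throw new RuntimeException('preg_replace failed, error code ' . preg_last_error());
    }
    return $result;
}

$html = '<a href="go/to/my/link" onclick="fetchMeSomeData(this)">My Link</a>' . "\n"
      . '<p onfocus="doSomethingAmazing();"></p>';

echo strip_inline_handlers($html), "\n";
```

For the two lines from the question this prints the a tag with only its href left and a bare p tag. Unquoted attribute values (onclick=foo()), script elements and javascript: URLs are not touched by this pattern, and any attribute whose name merely starts with "on" is removed too.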

serg
+1  A: 

Use the HTML Purifier library to strip things like JavaScript and plugins from the code. It's much better than a blacklist-based regex approach because it uses a full HTML parser and a whitelist to clean the HTML.
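
A minimal sketch of what that might look like, assuming HTML Purifier was installed through Composer (package ezyang/htmlpurifier) and that vendor/ sits next to the script. The whitelist of a[href] and p is only an illustrative choice based on the snippet in the question; adjust it to whatever markup you actually want to keep.

```php
<?php
// Sketch only: assumes a Composer install of ezyang/htmlpurifier
// with the vendor/ directory next to this file.
require_once __DIR__ . '/vendor/autoload.php';

$config = HTMLPurifier_Config::createDefault();
// Whitelist: only <a href> and <p> are allowed through (illustrative choice).
$config->set('HTML.Allowed', 'a[href],p');
// Disable the on-disk definition cache so no writable cache directory is needed.
$config->set('Cache.DefinitionImpl', null);

$purifier = new HTMLPurifier($config);

$dirty = '<a href="go/to/my/link" onclick="fetchMeSomeData(this)">My Link</a>' . "\n"
       . '<p onfocus="doSomethingAmazing();"></p>';

// Anything not on the whitelist, including on* event attributes, is dropped.
$clean = $purifier->purify($dirty);

echo $clean, "\n";
```

Because the event attributes are simply not on the whitelist, there is no need to enumerate them, which is the main advantage over pattern matching. In a real application you would normally leave the definition cache enabled for speed.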

MiffTheFox