I need a clever regex to match ... in these:

<img src="..."
<img src='...'
<img src=...

I want to match the inner content of src, but only if it is surrounded by a matching pair of " or ' quotes, or by no quotes at all. This means that <img src=..." or <img src='... must not be accepted.

Any ideas on how to match these 3 cases with one regex?

So far I use something like ("|'|[\s\S])(.*?)\1, and the part that I want to get rid of is the hacky [\s\S], which I use to match the "missing symbol" at the beginning and the end of the ... part.

+5  A: 

Wow, second one I'm answering today.

Don't parse HTML with regex. Use an HTML/XML parser and your life will be much easier. Tidy will clean up your HTML code for you, so you can run the HTML through Tidy first and then through a parser. Some Tidy-based libraries perform parsing in addition to sanitizing, so you may not even have to run it through another parser.

Java, for example, has JTidy, and PHP has PHP Tidy.
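To make the parser route concrete, here is a minimal sketch. It assumes Python and its standard-library html.parser module rather than one of the Tidy libraries named above, and the names ImgSrcCollector and img_sources are made up for illustration. It collects the src value of every img tag, whether the value is double-quoted, single-quoted, or unquoted.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Hypothetical helper: collects the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; quotes are already removed.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)


def img_sources(html):
    """Return the src values of all <img> tags found in the given HTML text."""
    parser = ImgSrcCollector()
    parser.feed(html)
    parser.close()
    return parser.sources


sample = '<img src="a.png"><img src=\'b.png\'><img src=c.png>'
print(img_sources(sample))
```

The parser takes care of the quoting rules, so no pattern for the three cases has to be written by hand. It also keeps working when the tag carries other attributes before or after src.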

UPDATE

Against my better judgement, I'm giving you this:

/<img\s+src\s*=\s*(["'][^"']+["']|[^>]+)>/

Which works only for your specific case. Even so, it will not take into account escaped " or ' in your image-source names, or the > character. There are probably a bunch of other limitations as well. The capturing group gives you your image names (in the case of names surrounded by single or double quotes, it gives you those as well, but you can strip those out).
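A small sketch of how that pattern behaves, assuming Python's re module (the helper names raw_src and strip_quotes are hypothetical). Besides the three wanted cases, it shows two inputs where the pattern is more permissive than the question asks for: a stray trailing quote is accepted, and an extra attribute ends up inside the captured group.

```python
import re

# The pattern from the answer above, without the surrounding / delimiters.
IMG_SRC = re.compile(r"""<img\s+src\s*=\s*(["'][^"']+["']|[^>]+)>""")


def raw_src(tag):
    """Return the captured group exactly as matched, or None if no match."""
    m = IMG_SRC.search(tag)
    return m.group(1) if m else None


def strip_quotes(value):
    """Remove one pair of matching surrounding quotes, if present."""
    if len(value) >= 2 and value[0] in "\"'" and value[-1] == value[0]:
        return value[1:-1]
    return value


# The three wanted cases.
print(strip_quotes(raw_src('<img src="a.png">')))   # a.png
print(strip_quotes(raw_src("<img src='b.png'>")))   # b.png
print(strip_quotes(raw_src('<img src=c.png>')))     # c.png

# A stray trailing quote is still accepted via the [^>]+ alternative.
print(raw_src('<img src=d.png">'))                  # d.png"

# Any further attribute is swallowed into the capture.
print(raw_src('<img src="a.png" alt="x">'))         # "a.png" alt="x"
```

The last two results are the kind of limitation mentioned above: the pattern only behaves when the tag contains nothing but a well-formed src attribute.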

Vivin Paliath
No, I planned not to use a parser. The task is simple enough to be done by a small regex.
Lucho
What we are telling you is that the task is **not** simple enough to be done by a small regex. If it was, you'd have already made it happen.
Andy Lester
@Lucho, if the task is simple enough to be done by a regex, why are you asking us? We're telling you that the task is **not simple** enough to be solved by a regex (small or otherwise).
Vivin Paliath
OK, you've convinced me :-) The world is cruel and probably full of ugly, messed-up HTML, so a parser is the rescue... but in a perfect world it would probably be possible to just grep the content of the src attributes of img tags :D
Lucho
@Lucho perhaps, but probably not. HTML is not regular :)
Vivin Paliath
+2  A: 

I already solved this one today in this posting. That should show you how easily these things are dealt with using regexen.

Not.

tchrist