Hi, I need to process some HTML content and replace each IMG SRC value with the actual data. For this I have chosen regular expressions.

In my first attempt I need to find the IMG tags. For this I am using the following expression:

<img.*src.*=\s*".*"

Then, within the IMG tag, I look for SRC="..." and replace it with the new SRC value. I am using the following expression to get the SRC:

src\s*=\s*".*"\s*

The second expression has issues:

For the following text it works:

<img alt="3D&quot;&quot;" hspace=
    "3D0" src="3D&quot;cid:TDCJXACLPNZD.hills.jpg&quot;" align=
    "3dbaseline" border="3d0" />

But for the following it does not:

<img alt="3D&quot;&quot;" hspace="3D0" src=
    "3D&quot;cid:UHYNUEWHVTSH.lilies.jpg&quot;" align="3dbaseline"
    border="3d0" />

What happens is the expression returns

src="3D&quot;cid:TDCJXACLPNZD.hills.jpg&quot;" align=
    "3dbaseline"

It does not return only the src part as expected.

I am using the C++ Boost regex library.

Please help me to figure out the problem.

Thanks, Hilmi.

+2  A: 

The problem is that .* is a "greedy" match - it will grab as much text as it possibly can while still allowing the regex to match. What you probably want is something like this:

src\s*=\s*"[^"]*"\s*

which matches only non-double-quote characters inside the src value, and therefore cannot run past the closing double quote.
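
To see the difference side by side, here is a minimal sketch. It assumes std::regex rather than Boost so that it compiles without extra libraries; boost::regex offers similarly named regex_search and smatch, though details such as whether . matches a newline may differ. The helper first_match and the shortened sample tag kTag are made up for illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: returns the first match of `pattern` in `text`,
// or an empty string when there is no match.
std::string first_match(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern, std::regex::icase);
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str(0) : std::string();
}

// Illustrative tag, shortened from the question's sample and kept on one line.
const std::string kTag =
    "<img alt=\"a\" hspace=\"0\" src=\"cid:lilies.jpg\" align=\"baseline\" border=\"0\" />";

// Greedy: ".*" runs on to the last double quote it can reach.
const std::string kGreedy = R"re(src\s*=\s*".*"\s*)re";

// Negated class: "[^"]*" cannot cross a double quote, so it stops at the
// quote that closes the src value.
const std::string kBounded = R"re(src\s*=\s*"[^"]*"\s*)re";
```

With these definitions, first_match(kTag, kGreedy) also swallows the align and border attributes, while first_match(kTag, kBounded) returns only the src attribute (plus the trailing whitespace that the final \s* picks up).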

Amber
Thanks a lot, this works. :)
HJ
A: 

Your first regex doesn't work on your sample text for me. I usually use this instead, when looking for specific HTML tags:

<img[^>]*>

Also, try this for your second expression:

src\s*=\s*"[^"]*"\s*
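
The two expressions can be combined into the find-then-replace step the question describes. The following is only a sketch under some assumptions: it uses std::regex instead of Boost so it stands alone, the function name replace_img_src is invented for this example, and the src pattern gains two capture groups so the text around the value can be kept. It does not try to handle single-quoted or unquoted attributes, or other attributes whose names end in "src".

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: replaces the src value of every <img ...> tag in
// `html` with `new_src`. Assumes double-quoted attribute values and that
// `new_src` itself contains no double quote.
std::string replace_img_src(const std::string& html, const std::string& new_src) {
    static const std::regex img_re(R"re(<img[^>]*>)re", std::regex::icase);
    // Group 1: everything up to and including the opening quote.
    // Group 2: the closing quote.
    static const std::regex src_re(R"re((src\s*=\s*")[^"]*("))re", std::regex::icase);

    std::string out;
    auto last = html.cbegin();  // end of the region already copied to `out`
    for (std::sregex_iterator it(html.cbegin(), html.cend(), img_re), end;
         it != end; ++it) {
        const std::smatch& tag = *it;
        out.append(last, tag[0].first);  // text before this tag, unchanged
        const std::string tag_text = tag.str(0);
        std::smatch src;
        if (std::regex_search(tag_text, src, src_re)) {
            // Rebuild the tag by hand so that a '$' in new_src is not
            // interpreted as a replacement-format escape.
            out += src.prefix().str() + src.str(1) + new_src + src.str(2) +
                   src.suffix().str();
        } else {
            out += tag_text;  // no src attribute found: copy the tag as-is
        }
        last = tag[0].second;
    }
    out.append(last, html.cend());  // text after the last tag
    return out;
}
```

Because \s and [^>] both match line breaks here, a tag whose src= and quoted value sit on different lines, as in the question's second sample, is handled the same way as a single-line tag.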

Does that help?

Chris Nielsen
Thanks a lot, this works. :)
HJ