I have blocks of code that look like this:

<table border="0"><tr><td><img src='http://profile.ak.fbcdn.net/object3/686/9/q142163634919_249.jpg'/>&nbsp;&nbsp;</td><td>Gift of Life Marathon Blood Drive - "the group stood before a sea of 1,000 Long Trail Brewing Co. pint glasses..." (Rutland Herald, VT)</td></tr></table>

I need to find and replace everything except http://profile.ak.fbcdn.net/object3/686/9/q142163634919_249.jpg with nothing, so that at the end only the URL is left.

The only values that change as we go through the loop are the URL and the description inside the second set of td tags. The number of characters in the description won't always be the same.

I got RegexBuddy and looked at a reference site for hours last night. Matching a single character seems pretty straightforward, but I think it will take me a while to figure this one out.

I believe there are different types of RegEx. The one I am working with is in Yahoo Pipes, not sure what type it is: http://pipes.yahoo.com/pipes/pipe.edit?%5Fid=436a316234281be629d357bbecae46b1

+2  A: 

I would strongly recommend using an HTML parser. HTML is not regular and consequently parsing with regexps is going to be prone to errors, edge cases etc.
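
To make the parser suggestion concrete, here is a minimal sketch using Python's standard-library html.parser. It cannot run inside Yahoo Pipes; it only illustrates the approach. The class name ImgSrcCollector and the variable names are made up for this example, and the sample string assumes the markup arrives as plain, decoded HTML.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Hypothetical helper: collects the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img .../> also end up here, because the
        # default handle_startendtag() calls handle_starttag().
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)


# Assumed input: the sample block from the question, as decoded HTML.
sample = (
    '<table border="0"><tr><td>'
    "<img src='http://profile.ak.fbcdn.net/object3/686/9/q142163634919_249.jpg'/>"
    '&nbsp;&nbsp;</td><td>Gift of Life Marathon Blood Drive - "the group stood '
    'before a sea of 1,000 Long Trail Brewing Co. pint glasses..." '
    "(Rutland Herald, VT)</td></tr></table>"
)

collector = ImgSrcCollector()
collector.feed(sample)
collector.close()
print(collector.sources)
```

The parser does not care how long the description is or whether the attribute uses single or double quotes, which is where a hand-written pattern tends to break.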

Brian Agnew
An HTML parser is nice and needed when you need to create robust commercial software, but this would also mean that you do not use Yahoo Pipes and that you don't parse HTML which hasn't been generated by yourself. And it would be nice to suggest an HTML parser in order to help with the question...
Ralf
My focus is largely on the robust. Whether it's commercial or not is neither here nor there.
Brian Agnew
+2  A: 

If your HTML looks exactly like the sample above, it should be easy:
img src='([^']*)'
The parentheses () mean that the enclosed part of the match is stored in a special result variable (a capture group). So don't look at what the whole regexp matches, but at that result variable.
[^']* matches any run of characters other than a "'".
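
As an illustration outside Pipes, here is a small Python sketch of reading the capture group instead of the whole match. The variable names are made up for the example, and the sample string assumes the markup is plain, decoded HTML.

```python
import re

# Assumed input: the sample block from the question, as decoded HTML.
sample = (
    '<table border="0"><tr><td>'
    "<img src='http://profile.ak.fbcdn.net/object3/686/9/q142163634919_249.jpg'/>"
    '&nbsp;&nbsp;</td><td>Gift of Life Marathon Blood Drive - "the group stood '
    'before a sea of 1,000 Long Trail Brewing Co. pint glasses..." '
    "(Rutland Herald, VT)</td></tr></table>"
)

match = re.search(r"img src='([^']*)'", sample)

# group(0) is everything the pattern matched, including "img src='".
# group(1) is only the part inside the parentheses, i.e. the URL.
url = match.group(1) if match else None
print(url)
```

Replacing the whole match with nothing deletes the URL along with "img src='", which is why the value of interest is the group and not the match.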

... and I don't think you need an HTML parser for this task. Only if you want to create really robust code :-)

Ralf
Not sure if I implemented this correctly but I tried Replace img src='([^']*)' with [nothing] and got this output: <table border="0"><tr><td></>  </td><td>Gift of Life Marathon Blood Drive - "the group stood before a sea of 1,000 Long Trail Brewing Co. pint glasses..." (Rutland Herald, VT)</td></tr></table> - I also tried replace [nothing] with img src='([^']*) but no change resulted. I figured it out though. Please see solution below.
Adam
A: 

Pipes is a slightly different beast. Because I am new at this, I ended up creating 3 separate find and replace rules to get the code down to just the essential url:

Replace ^.*= with [nothing]

This leaves:

'http://profile.ak.fbcdn.net/object3/686/9/q142163634919_249.jpg'/>&nbsp;&nbsp; Gift of Life Marathon Blood Drive - "the group stood before a sea of 1,000 Long Trail Brewing Co. pint glasses..." (Rutland Herald, VT)

Replace . with [nothing]

This just removes the ' at the beginning (presumably because the rule is applied only once, so . matches only the first character).

Replace '.* with [nothing]

This removes everything after jpg, starting at the closing '.

End result: http://profile.ak.fbcdn.net/object3/686/9/q142163634919_249.jpg
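
For anyone who wants to try the three rules outside Pipes, here is a rough Python sketch. It assumes each rule is applied once rather than globally, and that the input is the decoded sample markup. The names sample, step1, step2 and step3 are made up for the illustration.

```python
import re

# Assumed input: the sample block from the question, as decoded HTML.
sample = (
    '<table border="0"><tr><td>'
    "<img src='http://profile.ak.fbcdn.net/object3/686/9/q142163634919_249.jpg'/>"
    '&nbsp;&nbsp;</td><td>Gift of Life Marathon Blood Drive - "the group stood '
    'before a sea of 1,000 Long Trail Brewing Co. pint glasses..." '
    "(Rutland Herald, VT)</td></tr></table>"
)

# Rule 1: ^.*= is greedy, so it removes everything up to the LAST "=",
# which in this sample is the one in src=.
step1 = re.sub(r"^.*=", "", sample, count=1)

# Rule 2: a single . applied once removes only the first character (the ').
step2 = re.sub(r".", "", step1, count=1)

# Rule 3: '.* removes the closing ' and everything after it.
step3 = re.sub(r"'.*", "", step2, count=1)

print(step3)
```

Note that rule 1 depends on there being no "=" later in the text: a description containing "=" would cause part of the URL to be removed as well.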

I'm sure there is a way to combine those 3 rules into one but I got errors when I tried to do that. This works and does so consistently.
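
One way the three rules could be collapsed into one is to match the whole string, capture only the URL, and replace the match with the captured group. Below is a sketch in Python, where the back-reference is written \1. The syntax inside Pipes may differ (for example $1), so treat that part as an assumption. The names pattern and url are made up for the example.

```python
import re

# Assumed input: the sample block from the question, as decoded HTML.
sample = (
    '<table border="0"><tr><td>'
    "<img src='http://profile.ak.fbcdn.net/object3/686/9/q142163634919_249.jpg'/>"
    '&nbsp;&nbsp;</td><td>Gift of Life Marathon Blood Drive - "the group stood '
    'before a sea of 1,000 Long Trail Brewing Co. pint glasses..." '
    "(Rutland Herald, VT)</td></tr></table>"
)

# ^.*?        everything before the image tag (as little as possible)
# img src='   the literal text in front of the URL
# ([^']*)     the URL itself, captured as group 1
# '.*$        the closing quote and everything after it
pattern = r"^.*?img src='([^']*)'.*$"

# re.DOTALL lets . match newlines too, in case a description spans lines.
# If the pattern does not match, re.sub returns the input unchanged.
url = re.sub(pattern, r"\1", sample, count=1, flags=re.DOTALL)
print(url)
```

Anchoring on img src=' instead of the last "=" also means a description containing "=" no longer breaks the result.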

Adam