I am parsing HTML to obtain all the hrefs that match a particular url (let's call it the "target url") and then get the anchor text for each. I have tried the LinkExtractor, TokenParser, Mechanize, and TreeBuilder modules. For the HTML below:
<a href="target_url">
<img src=somepath/nw.gf alt="Open this result in new window">
</a>
all of them give "Open this result in new window" as the anchor text.
Ideally I would like to see a blank value or a string like "image" returned, so that I know there was no anchor text but the href still matched the target url (http://www.yahoo.com in this case). Is there a way to get the desired result using another module or a Perl regex?
Thanks,