I need to make some simple modifications to HTML in C++, preferably without the whole document being rewritten, which is what happens when I use libxml2 or MSHTML.

In particular I need to be able to read, and then (potentially) modify, the "src" attribute of all "img" elements. I need it to be robust enough to be able to do this with any valid HTML, but preferably without changing any of the other HTML in the process.

Are there any libraries out there that would be able to handle this? Or is this something I can do with regular expressions? I'm not too savvy with regular expressions, and I've read a lot of questions here that say you shouldn't use them to parse HTML, but I'm not clear if that applies to something like this or if that principle applies primarily to parsing in the context of building a tree from the HTML.

+1  A: 

Try looking at HTMLTidy

I have used it for similar things in the past.

Jeff Leonard
Thanks, I'll give that a whirl.
Gerald
+2  A: 

Regular expressions aren't recommended for parsing HTML in general because they can't match arbitrarily nested tags. An img element has no closing tag and nothing nested inside it, so they should be fine for this purpose.
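
For what it's worth, here is a minimal sketch of that approach using std::regex from C++11. std::regex was modelled on Boost.Regex, so the same pattern should carry over if you prefer Boost. The function name rewrite_img_src and the callback parameter are made up for illustration. It assumes the src value is quoted and that no earlier attribute value in the same tag contains a ">"; it also can't tell a real img tag from one that appears inside a comment or script block.

```cpp
#include <functional>
#include <regex>
#include <string>

// Hypothetical helper: replaces the src value of every <img> tag in `html`
// and copies every other character through unchanged.
// `rewrite` receives the old src value and returns the new one.
std::string rewrite_img_src(
    const std::string& html,
    const std::function<std::string(const std::string&)>& rewrite)
{
    // Group 1: the opening quote character (" or ').
    // Group 2: the src value, up to the matching closing quote.
    // \s before "src" keeps attributes such as data-src from matching.
    static const std::regex img_src(
        R"re(<img\b[^>]*?\ssrc\s*=\s*(["'])(.*?)\1)re",
        std::regex::icase);

    std::string out;
    auto pos = html.cbegin();
    for (std::sregex_iterator it(html.cbegin(), html.cend(), img_src), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        out.append(pos, m[2].first);   // text before the src value, verbatim
        out += rewrite(m[2].str());    // the new src value
        pos = m[2].second;             // resume just after the old value
    }
    out.append(pos, html.cend());      // remainder of the document, verbatim
    return out;
}
```

Calling it with a lambda that returns "cdn/" + s, for example, would prefix every image path while leaving whitespace, attribute order and quoting exactly as they were in the input.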

Head Geek
Thanks, that's about what I gathered from the other questions/answers, but I wasn't positive. I suppose this could be a good excuse for me to finally learn regular expressions.
Gerald
I'd recommend it. They're extremely useful, and the learning curve really isn't that steep.
Head Geek
I used to dabble with regex in Perl about 8 or 9 years ago, but I've pretty much forgotten it all. But I just grabbed Boost Regex, and was able to figure out how to do what I needed to do in about an hour, with about 10 lines of code. And I ordered a couple of books on the subject, so I can actually understand everything I did :P
Gerald