ansaurus

Question

Answer 1

+6 A:

HTML Tidy is written in C, but there are bindings for practically every language/platform, including C++.

Rex M 2009-04-19 00:29:11

I'm not sure I understand. Are you suggesting I use some code from Tidy?

Klaim 2009-04-23 08:43:20

@Klaim Sanitizing HTML is ideally a two-step process: first, ensure the markup is standardized and conforms to spec; second, strip the unwanted HTML. If we try to do it all in one pass, we have to account for the countless ways HTML can be mangled and still be parsed/executed by a browser. If you run the potential markup through something like HTML Tidy, it comes out so clean and normalized that you can safely run it against a simple whitelist.
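
To make the second step concrete, here is a minimal C++ sketch of a whitelist pass. It is not taken from this thread or from HTML Tidy: the names `filter_tags` and `allowed` are hypothetical, and it assumes the input has already been normalized (well-formed tags, no '>' inside attribute values). Allowed tags are re-emitted without any attributes; everything else is dropped.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical helper: keeps only tags whose name is in `allowed`, re-emitting
// them without attributes; every other tag is dropped. Text between tags is
// copied unchanged. Assumes the markup was normalized beforehand.
std::string filter_tags(const std::string& html,
                        const std::set<std::string>& allowed) {
    std::string out;
    std::size_t i = 0;
    while (i < html.size()) {
        if (html[i] != '<') {            // plain text: copy as-is
            out += html[i++];
            continue;
        }
        std::size_t end = html.find('>', i);
        if (end == std::string::npos) {  // unterminated tag: drop the rest
            break;
        }
        std::size_t p = i + 1;
        bool closing = (p < end && html[p] == '/');
        if (closing) {
            ++p;
        }
        std::string name;                // lower-cased tag name
        while (p < end && std::isalnum(static_cast<unsigned char>(html[p]))) {
            name += static_cast<char>(
                std::tolower(static_cast<unsigned char>(html[p])));
            ++p;
        }
        if (allowed.count(name) != 0) {  // emit a bare tag, attributes removed
            out += closing ? "</" : "<";
            out += name;
            out += '>';
        }
        i = end + 1;                     // continue after the tag
    }
    return out;
}
```

Note that the content of a dropped element (a script body, for example) stays in the output as plain text, so a real filter would need extra handling for that.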

Rex M 2009-04-24 02:24:54

Thanks for the clarification, I'll try it.

Klaim 2009-04-24 20:52:01

Answer 2

A:

This was posted a few hours ago. It's just an article about regexes, but it happens to contain exactly what you want :) I think this might be of interest as well.

Dunya Degirmenci 2009-04-19 00:32:54

Um... your first link is to a post that was written nearly a year ago. Maybe "a few hours" was a misspeak? ;-)

Head Geek 2009-04-19 00:39:23

Haha, I actually meant that it was posted *here on SO* a few hours back from then. Guess I should've explained better - but then again, forgive me for it was 4AM here in Turkey and I had been struggling to write a compression program for several hours :)

Dunya Degirmenci 2009-04-19 10:41:27

Those regular expressions have known vulnerabilities in them. Also, I doubt you'd want to load it into PCRE.

Edward Z. Yang 2009-04-24 03:18:38

Answer 3

A:

You could use libxml2's xmlEncodeSpecialChars.

Ben Straub 2009-04-24 02:17:59

Interesting, I'll try that. The problem I'm seeing is adding such a "big" dependency just for sanitization. But if it works well, I can try to isolate the code and pull it into my project.

Klaim 2009-04-24 20:51:23

Answer 4

+1 A:

You are asking quite the question here. Before you are going to get a good answer, you need to be clear on what exactly you want to "parse" OUT of your input. For example, you could look for any "<" chars, and convert them to something else, so they are not parsed by any HTML parser.
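
As a minimal sketch of that first idea, assuming plain C++ with no external library, the conversion could look like the following. The name `escape_html` is hypothetical, and the set of characters it replaces is a common choice, not something specified in this thread.

```cpp
#include <string>

// Hypothetical helper: replaces the characters that are significant in HTML
// with entity references, so a browser renders them as text, not markup.
std::string escape_html(const std::string& input) {
    std::string out;
    out.reserve(input.size());
    for (char c : input) {
        switch (c) {
            case '&':  out += "&amp;";  break;
            case '<':  out += "&lt;";   break;
            case '>':  out += "&gt;";   break;
            case '"':  out += "&quot;"; break;
            case '\'': out += "&#39;";  break;
            default:   out += c;        break;
        }
    }
    return out;
}
```

With this approach nothing is removed: legitimate text such as "1 < 2" survives and is simply displayed literally.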

Or, you could search for a < > pair followed by a matching < / > pair. (Excuse the space; I had to put it in here so the HTML parser HERE would not eat it.) Then you also need to look for the "< single element tags / >" as well.

You can actually look for valid/known HTML tags and strip THOSE out.

So, the question becomes: which method is correct for your solution? Keep in mind that if you make a simple parser, you may actually rip out valid text that contains greater-than and less-than symbols.

So, here is my answer for you thus far.

If you simply want to REMOVE any HTML-esque text, I'd recommend going with a regular expression engine (PCRE) and using it to parse your input and remove all the matched strings. This is probably the easy solution, but it does require you to get and build PCRE, and there are licensing terms you need to be aware of for your project (PCRE is distributed under a BSD license, not the GPL). The parsing would probably be really easy to implement and would run quickly.

The second option is to do it by walking a buffer, looking for the open HTML char (<), then parsing until you hit the first white space, then start walking, looking for the closing HTML char (>), then start walking again, looking for the matching CLOSING tag, based on what you just parsed. (Say, it's a DIV tag, you want to look for /DIV.)

I have code that does this in an STL HTML parser, but there are a lot of issues to consider going this route also. For example, you have entity codes to deal with, single element tags like IMG, P, and BR, to name a few.
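
As a rough illustration of the buffer-walking idea (this is not the STL parser mentioned above, and not ClamAV's code), a minimal C++ sketch might look like this. The name `strip_tags` is hypothetical. It only drops everything from a '<' to the next '>' that is outside a quoted attribute value; it does not match closing tags, decode entity codes, or treat comments and script blocks specially.

```cpp
#include <string>

// Hypothetical helper: walks the buffer once and copies only the characters
// that are outside of <...> tags.
std::string strip_tags(const std::string& input) {
    std::string out;
    out.reserve(input.size());
    bool in_tag = false;  // true while between '<' and the closing '>'
    char quote = '\0';    // active quote character inside a tag, or '\0'
    for (char c : input) {
        if (!in_tag) {
            if (c == '<') {
                in_tag = true;      // start skipping
            } else {
                out += c;           // ordinary text
            }
        } else if (quote != '\0') {
            if (c == quote) {
                quote = '\0';       // end of quoted attribute value
            }
        } else if (c == '"' || c == '\'') {
            quote = c;              // '>' inside quotes must not end the tag
        } else if (c == '>') {
            in_tag = false;         // tag finished, resume copying
        }
    }
    return out;
}
```

For "<p>Hello <b>world</b></p>" this gives "Hello world", while "1 < 2" comes back as "1 ", which is exactly the valid-text problem with simple parsers described above.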

If you want some REALLY great C code to look at, go look at the ClamAV project. They have an HTML parser that strips all the tags out of a page, and leaves you with JUST the text left over. (among other things it does..). Look in the file libclamav\htmlnorm.c for a great example on 'buffer walking' and parsing. It's not the fastest thing in the world, but it does work... The latest Clam might even have so much stuff tied into the HTML parser, it might actually be difficult to understand. If so, go back and look at an earlier version, like .88.4 or so. Just please be aware of the bugs in those older code bases, there are some good ones. :)

Hope this helps.

LarryF 2009-04-24 19:07:11

I added some details about my needs. I'll try your last suggestion, hoping I can isolate the code enough.

Klaim 2009-04-24 20:50:20

It seems like you just need to 'filter' the < and > chars... So, just write a simple parser to remove them! The only glitch is that those MIGHT be needed in legit input, so you need to clarify that; if that IS the case, then you have a much larger problem on your hands. I'd be interested in helping you solve this issue, as I love C/C++ and am now forever stuck in the C# world, so this would be a nice project to work on. :)

LarryF 2009-04-25 00:39:42

I'll first try the solutions proposed here before considering a homemade solution, as it seems to be a complex problem (the input could contain JavaScript too...). I'll then consider your help. Anyway, the problem seems clear now? You can already work on a solution if you want, I guess. I started working on something and figured out that the problem was complex and already solved by web applications running on C# and RoR, for example. Now I need an equivalent robust solution for C++.

Klaim 2009-04-25 10:38:53

Answer 5

A:

Use Qt's QtWebKit to parse the HTML tree, then output the result from it. This would definitely clean up the HTML a little.

Ankur Gupta 2009-04-29 13:07:06

Isn't that a bit overkill? Qt is not a dependency of my project, and adding it just for that doesn't seem like a good idea...

Klaim 2009-04-29 14:04:01


HTML Sanitization in C++
