I've heard it claimed that the simplest way to prevent SQL injection attacks is to HTML-encode all text before inserting it into the database and then, obviously, decode all text when extracting it. The idea is that if the text contains only ampersands, semicolons and alphanumerics, then you can't do anything malicious with it.
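To make the claim concrete, here is a minimal sketch of the scheme as I understand it. It assumes Python with the standard `html` and `sqlite3` modules; `encode_all` and `store_and_load` are hypothetical helpers of my own, and the encoder uses numeric character references (so `#` also appears in the stored text), which is only one of several ways the idea could be implemented.

```python
import html
import sqlite3


def encode_all(text):
    """Hypothetical helper: replace every character that is not an ASCII
    letter or digit with a decimal numeric character reference."""
    return "".join(
        ch if ch.isascii() and ch.isalnum() else f"&#{ord(ch)};"
        for ch in text
    )


def store_and_load(value):
    """Write the encoded value using string concatenation, then read it
    back and decode it. Returns (stored_text, decoded_text)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE notes (body TEXT)")
    encoded = encode_all(value)
    # Built by concatenation on purpose: the claimed scheme relies on the
    # encoding alone to keep quotes out of the SQL text.
    conn.execute("INSERT INTO notes (body) VALUES ('" + encoded + "')")
    (stored,) = conn.execute("SELECT body FROM notes").fetchone()
    conn.close()
    return stored, html.unescape(stored)


stored, decoded = store_and_load("x'); DROP TABLE notes; --")
print(stored)   # what actually sits in the database
print(decoded)  # what the application sees after decoding
```

For ordinary printable text the value round-trips, and the stored form contains no quote character that could close the string literal. That is the whole of the claimed protection in this sketch; it only covers the one statement where the encoded text is used.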
While I see a number of cases where this may seem to work, I foresee the following problems in using this approach:
- It claims to be a silver bullet, which may stop users of the technique from understanding all the possible related issues, such as second-order attacks.
- It doesn't necessarily prevent any second-order / delayed payload attacks.
- It's using a tool for a purpose other than the one it was designed for. This may lead to confusion amongst future users/developers/maintainers of the code. It's also likely to be far from optimal in performance or effect.
- It adds a potential performance hit to every read and write of the database.
- It makes the data harder to read directly from the database.
- It increases the size of the data on disk. (Each encoded character becomes ~5 characters. In turn this may also impact disk space requirements, data paging, the size and performance of indexes, and more?)
- There are potential issues with high-range Unicode characters and combining characters?
- Some HTML [en|de]coding routines/libraries behave slightly differently (e.g. some encode an apostrophe and some don't; there may be more differences). This then ties the data to the code used to read & write it. If code that [en|de]codes differently is used, the data may be changed/corrupted.
- It potentially makes it harder to work with (or at least debug) any text which is already similarly encoded.
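A few of the points above (size growth, differing apostrophe handling, and text that is already encoded) can be illustrated with a short sketch. It assumes Python's standard `html` module as the encoder and decoder; other libraries may well behave differently, which is rather the point. The sample strings are made up for illustration.

```python
import html

text = "O'Brien & Sons"

# One library, two behaviours: the `quote` flag controls whether
# quotation marks and apostrophes are encoded at all.
with_quotes = html.escape(text, quote=True)
without_quotes = html.escape(text, quote=False)
print(with_quotes)
print(without_quotes)

# The encoded form is longer than the original text.
print(len(text), len(with_quotes))

# Text that already looks encoded gains a second layer when stored...
already_encoded = "AT&amp;T"
stored = html.escape(already_encoded)
print(stored)

# ...and a reader that decodes once too often silently changes it.
decoded_once = html.unescape(stored)
decoded_twice = html.unescape(decoded_once)
print(decoded_once == already_encoded)
print(decoded_twice == already_encoded)
```

With `quote=False` the apostrophe passes through untouched, so a writer using that setting would not even get the quote removal the scheme depends on.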
Is there anything I'm missing?
Is this actually a reasonable approach to the problem of preventing SQL injection attacks?
Are there any fundamental problems with trying to prevent injection attacks in this way?