I've run into a few problems using a C# regex to implement a whitelist of allowed characters on web inputs. I am trying to avoid SQL injection and XSS attacks. I've read that whitelists of the allowable characters are the way to go.

The inputs are people names and company names.

Some of the problems are:

  1. Company names that have ampersands, like "Jim & Sons". The ampersand is important, but it is risky.

  2. Unicode characters in names. We have Asian customers, for example, who enter their names using their own character sets. I need to whitelist all of these.

  3. Company names can have all kinds of slashes, like "S/A" and "S\A". Are those risky?

I find myself wanting to allow almost every character after seeing all the data that is in the DB already (and being entered by new users).

Any suggestions for a good whitelist that will handle these (and other) issues?

NOTE: It's a legacy system, so I don't have control of all the code. I was hoping to reduce the number of attacks by preventing bad data from getting into the system in the first place.

+2  A: 

Company names might have almost any kind of symbol in them, so I don't know how well this is going to work for you. I'd concentrate on shielding yourself directly from various attacks, not hoping that your strings are "naturally" safe.

(Certainly they can have ampersands, colons, semicolons, exclamation points, hyphens, percent signs, and all kinds of other things that could be "unsafe" in a host of contexts.)

mquander
+3  A: 

Do not try to sanitize names, especially with regex!

Just make sure that you are properly escaping the values and saving them safely in your DB, and then escaping them again when presenting them in HTML.

duckyflip
A: 

I think writing your own regexp is not a good idea: it would be very hard. Try leveraging the existing functions of your web framework; there are lots of resources on the net. Since you mention C#, I assume you are using ASP.NET, so try the following article: How To: Protect From Injection Attacks in ASP.NET

bbmud
A: 

This SO thread seems similar -- it might help.

JP Alioto
+3  A: 

This SO thread has a lot of good discussion on protecting yourself from injection attacks.

In short:

  1. Filter your input as best as you can
  2. Escape your strings using framework-based methods
  3. Parameterize your SQL statements

In your case, you can limit the name field to a small character set. The company field will be more difficult, and you need to balance your users' need for freedom of entry against your need for site security. As others have said, trying to write your own custom sanitization methods is tricky and risky. Keep it simple and protect yourself through your architecture - don't simply rely on strings being "safe", even after sanitization.

EDIT:

To clarify - if you're trying to develop a whitelist, it's not something that the community can hand out, since it's entirely dependent on the data you want. But let's look at an example of a regex whitelist, perhaps for names. Say I've whitelisted A-Z and a-z and space.

Regex reWhiteList = new Regex("^[A-Za-z ]+$");

That checks to see if the entire string is composed of those characters. Note that a string with a number, a period, a quote, or anything else would NOT match this regex and thus would fail the whitelist.

if (reWhiteList.IsMatch(strInput))
{
   // it's ok, proceed to step 2
}
else
{
   // it's not ok, inform user they've entered invalid characters and try again
}

Hopefully this helps some more! With names and company names you'll have a tough-to-impossible time developing a rigorous pattern to check against, but you can do a simple allowable character list, as I showed here.
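To make the trade-off concrete, here is a minimal sketch that runs the A-Z pattern above against a few sample names, next to a broader pattern built on Unicode letter classes. The class name, the field names, the sample names, and the broader pattern itself are assumptions for illustration, not a vetted whitelist. It uses \A and \z rather than ^ and $, because in .NET $ also matches just before a trailing newline.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NameWhitelist
{
    // The A-Z / a-z / space whitelist from the example above, anchored with \A and \z.
    public static readonly Regex AsciiOnly = new Regex(@"\A[A-Za-z ]+\z");

    // Hypothetical broader whitelist (an assumption): any Unicode letter or
    // combining mark, plus space, apostrophe, period and hyphen, 1 to 50 chars.
    public static readonly Regex UnicodeLetters = new Regex(@"\A[\p{L}\p{M} '.\-]{1,50}\z");

    public static void Main()
    {
        string[] samples = { "John Smith", "José García", "O'Brien", "山田太郎", "<script>" };
        foreach (string s in samples)
        {
            // Print whether each sample passes each whitelist.
            Console.WriteLine("{0}: ascii={1}, unicode={2}",
                s, AsciiOnly.IsMatch(s), UnicodeLetters.IsMatch(s));
        }
    }
}
```

Only "John Smith" passes the A-Z pattern. The broader pattern accepts the accented, apostrophe, and Asian-script names while still rejecting "<script>", but note that it lets the apostrophe through, so it is no substitute for steps 2 and 3.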

patjbs
Step 1 is what I am trying to figure out. The referenced article mentions white lists.
jm
>> "trying to develop a whitelist, it's not something that the community can hand out" I think it is something the community can help with. I'm trying to whitelist people's names. Most people have them :) It's not some outlandish, uncommon thing. I agree with your approach. I just need to figure out the "reWhiteList"
jm
+1  A: 

Why filter or regex the data at all, or even escape it? You should be using bind variables to access the database.

This way, the customer could enter something like: anything' OR 'x'='x

And your application doesn't care, because the database parses the SQL when you prepare the statement, before the variable is set, so the value is never interpreted as SQL. For example:

'SELECT count(username) FROM usertable WHERE username = ? and password = ?'

Then you execute the statement with those variables set.

This works in PHP, Perl, J2EE applications, and so on.
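Since the question is about C#, the same idea might look roughly like this in ADO.NET. This is a sketch under assumptions: it uses System.Data.SqlClient (so SQL Server), named @parameters instead of ?, and a made-up NVarChar(50) size for both columns. The LoginQuery class and Build method are hypothetical names. The table and column names come from the query above.

```csharp
using System;
using System.Data;
using System.Data.SqlClient;

public static class LoginQuery
{
    // Builds a parameterized command. The SQL text never contains user input;
    // the values are attached separately as typed parameters.
    public static SqlCommand Build(SqlConnection conn, string username, string password)
    {
        var cmd = new SqlCommand(
            "SELECT COUNT(username) FROM usertable " +
            "WHERE username = @username AND password = @password",
            conn);
        // Column type and size (NVarChar, 50) are assumptions for illustration.
        cmd.Parameters.Add("@username", SqlDbType.NVarChar, 50).Value = username;
        cmd.Parameters.Add("@password", SqlDbType.NVarChar, 50).Value = password;
        return cmd;
    }

    public static void Main()
    {
        // No connection is opened here; this only shows that the hostile
        // input stays in the parameter and out of the SQL text.
        SqlCommand cmd = Build(null, "anything' OR 'x'='x", "secret");
        Console.WriteLine(cmd.CommandText);
        Console.WriteLine(cmd.Parameters["@username"].Value);
    }
}
```

In real use you would pass an open SqlConnection and call cmd.ExecuteScalar(); the quote in the input is then just data.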

krypt0
Can't they still enter JavaScript and do an XSS attack?
jm
You also need to HTML-encode the data when you send it to the browser.
Dave Hinton
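As a rough illustration of that last comment, here is what HTML-encoding on output can look like in C#. The sketch uses System.Net.WebUtility.HtmlEncode; in an older ASP.NET page you would more likely call HttpUtility.HtmlEncode or Server.HtmlEncode. The OutputEncoding class and ForHtml method are hypothetical names.

```csharp
using System;
using System.Net;

public static class OutputEncoding
{
    // Encode a stored value just before it is written into an HTML page.
    public static string ForHtml(string value)
    {
        return WebUtility.HtmlEncode(value);
    }

    public static void Main()
    {
        Console.WriteLine(ForHtml("Jim & Sons"));                // Jim &amp; Sons
        Console.WriteLine(ForHtml("<script>alert(1)</script>")); // &lt;script&gt;alert(1)&lt;/script&gt;
    }
}
```

The value in the database stays as the user typed it ("Jim & Sons"); only the copy sent to the browser is encoded.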
A: 

This is my current regex WHITELIST for a company name. Any input outside of these characters is rejected:

"^[0-9\p{L} '.,\/\&-]{0,50}$"

The \p{L} matches any Unicode "letter". So, accented and Asian characters are whitelisted.

The \& is a bit problematic because & is a special character in HTML (it starts character entities), so the value still has to be HTML-encoded on output.

The ' is problematic if you are not using parameterized queries, because of SQL injection.

The - could allow "--" (the SQL comment marker), also a potential for SQL injection if you are not using parameterized queries.

Also, the \p{L} won't work client-side, because the validator's JavaScript regex does not understand it, so you can't use it in the ASP.NET regular expression validator without disabling client-side validation: EnableClientScript="False"
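One thing to watch in a character class like this: a hyphen placed between two characters, for example '-. is read as a range (here ' through . which also admits parentheses, * and +). The sketch below compares the two placements. The class name, field names, and sample company names are made up for illustration, and \A and \z are used in place of ^ and $.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CompanyWhitelist
{
    // Hyphen between ' and . forms the range U+0027..U+002E: ' ( ) * + , - .
    public static readonly Regex WithRange = new Regex(@"\A[0-9\p{L} '-.,/&]{0,50}\z");

    // Hyphen placed last in the class is a literal hyphen.
    public static readonly Regex HyphenLast = new Regex(@"\A[0-9\p{L} '.,/&-]{0,50}\z");

    public static void Main()
    {
        string[] samples = { "Jim & Sons", "S/A", "Smith-Jones Ltd.", "Acme (UK)", "a*b+c" };
        foreach (string s in samples)
        {
            // Print whether each sample passes each variant.
            Console.WriteLine("{0}: range={1}, hyphenLast={2}",
                s, WithRange.IsMatch(s), HyphenLast.IsMatch(s));
        }
    }
}
```

Both variants accept the first three samples. Only the range variant accepts "Acme (UK)" and "a*b+c". If parentheses are wanted in company names, it is probably clearer to list them explicitly than to get them through a range.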

jm