views:

1803

answers:

10

I am trying to come up with a function that does a good job of sanitizing certain strings so that they are safe to use in the URL (like a post slug) and also safe to use as file names. For example, when someone uploads a file I want to make sure that I remove all dangerous characters from the name.

So far I have come up with the following function which I hope solves this problem and allows foreign UTF-8 data also.

/**
 * Convert a string to the file/URL safe "slug" form
 *
 * @param string $string the string to clean
 * @param bool $is_filename TRUE will allow additional filename characters
 * @return string
 */
function sanitize($string = '', $is_filename = FALSE)
{
 // Replace all weird characters with dashes
 $string = preg_replace('/[^\w\-'. ($is_filename ? '~_\.' : ''). ']+/u', '-', $string);

 // Only allow one dash separator at a time (and make string lowercase)
 return mb_strtolower(preg_replace('/--+/u', '-', $string), 'UTF-8');
}

Does anyone have any tricky sample data I can run against this - or know of a better way to safeguard our apps from bad names?

$is-filename allows some additional characters like temp vim files

update: removed the star character since I could not think of a valid use

A: 
// CLEAN ILLEGAL CHARACTERS
function clean_filename($source_file)
{
    $search[] = " ";
    $search[] = "&";
    $search[] = "$";
    $search[] = ",";
    $search[] = "!";
    $search[] = "@";
    $search[] = "#";
    $search[] = "^";
    $search[] = "(";
    $search[] = ")";
    $search[] = "+";
    $search[] = "=";
    $search[] = "[";
    $search[] = "]";

    $replace[] = "_";
    $replace[] = "and";
    $replace[] = "S";
    $replace[] = "_";
    $replace[] = "";
    $replace[] = "";
    $replace[] = "";
    $replace[] = "";
    $replace[] = "";
    $replace[] = "";
    $replace[] = "";
    $replace[] = "";
    $replace[] = "";
    $replace[] = "";

    return str_replace($search,$replace,$source_file);

} 
Brant
+4  A: 

I found this larger function in the Chyrp code:

/**
 * Function: sanitize
 * Returns a sanitized string, typically for URLs.
 *
 * Parameters:
 *     $string - The string to sanitize.
 *     $force_lowercase - Force the string to lowercase?
 *     $anal - If set to *true*, will remove all non-alphanumeric characters.
 */
function sanitize($string, $force_lowercase = true, $anal = false) {
    $strip = array("~", "`", "!", "@", "#", "$", "%", "^", "&", "*", "(", ")", "_", "=", "+", "[", "{", "]",
                   "}", "\\", "|", ";", ":", "\"", "'", "‘", "’", "“", "”", "–", "—",
                   "—", "–", ",", "<", ".", ">", "/", "?");
    $clean = trim(str_replace($strip, "", strip_tags($string)));
    $clean = preg_replace('/\s+/', "-", $clean);
    $clean = ($anal) ? preg_replace("/[^a-zA-Z0-9]/", "", $clean) : $clean ;
    return ($force_lowercase) ?
        (function_exists('mb_strtolower')) ?
            mb_strtolower($clean, 'UTF-8') :
            strtolower($clean) :
        $clean;
}

and this one in the wordpress code

/**
 * Sanitizes a filename replacing whitespace with dashes
 *
 * Removes special characters that are illegal in filenames on certain
 * operating systems and special characters requiring special escaping
 * to manipulate at the command line. Replaces spaces and consecutive
 * dashes with a single dash. Trim period, dash and underscore from beginning
 * and end of filename.
 *
 * @since 2.1.0
 *
 * @param string $filename The filename to be sanitized
 * @return string The sanitized filename
 */
function sanitize_file_name( $filename ) {
    $filename_raw = $filename;
    $special_chars = array("?", "[", "]", "/", "\\", "=", "<", ">", ":", ";", ",", "'", "\"", "&", "$", "#", "*", "(", ")", "|", "~", "`", "!", "{", "}");
    $special_chars = apply_filters('sanitize_file_name_chars', $special_chars, $filename_raw);
    $filename = str_replace($special_chars, '', $filename);
    $filename = preg_replace('/[\s-]+/', '-', $filename);
    $filename = trim($filename, '.-_');
    return apply_filters('sanitize_file_name', $filename, $filename_raw);
}
Xeoncross
+2  A: 

I don't think having a list of chars to remove is save. I would rather use the following:

For filenames: Use an internal ID or a hash of the filecontent. Save the document name in a database. This way you can keep the original filename and still find the file.

For url parameters: Use urlencode() to encode any special characters.

ZeissS
I agree, most of the methods listed here remove *known dangerous* characters - my method removes everything that isn't a known *safe* character. Since most systems slug encode post URL's I would suggest we continue to follow this proven method rather than using the documented **UTF-8 unsafe** urlencode().
Xeoncross
+1  A: 

Try this:

function normal_chars($string)
{
    $string = htmlentities($string, ENT_QUOTES, 'UTF-8');
    $string = preg_replace('~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1', $string);
    $string = preg_replace(array('~[^0-9a-z]~i', '~-+~'), ' ', $string);
    return trim($string);
}

Examples:

echo normal_chars('Álix----_Ãxel!?!?'); // Alix Axel
echo normal_chars('áéíóúÁÉÍÓÚ'); // aeiouAEIOU
echo normal_chars('üÿÄËÏÖÜŸåÅ'); // uyAEIOUYaA

Based on the selected answer in this thread: http://stackoverflow.com/questions/2103797/url-friendly-username-in-php

John Conde
Very nice - I have never seen this done without a translation table (like wordpress uses). However, I don't think this function is enough as-is since it only translates special characters but does not remove dangerous characters. Maybe it can be *added to one above*...
Xeoncross
Alan Donnelly
+5  A: 

Some observations on your solution:

  1. 'u' at the end of your pattern means that the pattern, and not the text it's matching will be interpreted as UTF-8 (I presume you assumed the latter?).
  2. \w matches the underscore character. You specifically include it for files which leads to the assumption that you don't want them in URLs, but in the code you have URLs will be permitted to include an underscore.
  3. The inclusion of "foreign UTF-8" seems to be locale-dependent. It's not clear whether this is the locale of the server or client. From the PHP docs:

A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale-specific matching is taking place. For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.

Creating the slug

You probably shouldn't include accented etc. characters in your post slug since, technically, they should be percent encoded (per URL encoding rules) so you'll have ugly looking URLs.

So, if I were you, after lowercasing, I'd convert any 'special' characters to their equivalent (e.g. é -> e) and replace non [a-z] characters with '-', limiting to runs of a single '-' as you've done. There's an implementation of converting special characters here: http://neo22s.com/slug/

Sanitization in general

OWASP have a PHP implementation of their Enterprise Security API which among other things includes methods for safe encoding and decoding input and output in your application.

The Encoder interface provides:

canonicalize (string $input, [bool $strict = true])
decodeFromBase64 (string $input)
decodeFromURL (string $input)
encodeForBase64 (string $input, [bool $wrap = false])
encodeForCSS (string $input)
encodeForHTML (string $input)
encodeForHTMLAttribute (string $input)
encodeForJavaScript (string $input)
encodeForOS (Codec $codec, string $input)
encodeForSQL (Codec $codec, string $input)
encodeForURL (string $input)
encodeForVBScript (string $input)
encodeForXML (string $input)
encodeForXMLAttribute (string $input)
encodeForXPath (string $input)

http://code.google.com/p/owasp-esapi-php/

Alan Donnelly
You are correct about my assumption of the "u" modifier - I thought that it was for the text. I also forgot about the \w modifier including the underscore. I would normally convert all accented characters to ASCII - but I want this to work for other languages as well. I was assuming that there would be some kind of UTF-8 safe way that any character of a language could be used in a URL slug or filename so that even Arabic titles would work. After all, linux supports UTF-8 filenames and browsers *should* encode HTML links as needed. Big thanks for your input here.
Xeoncross
On second thought, you're actually right, but it's not just an issue with the browser encoding the links correctly. The easiest way to achieve close to what you want is to map non-ASCII characters to their closest ASCII equivalent and then URL-encode your link in the HTML body.The hard way is to ensure consistent UTF-8 encoding (or UTF-16, I think for some Chinese dialects) from your data store, through your webserver, application layer (PHP), page content, web browser and **not** urlencode your urls (but still strip 'undesirable' chars). This will give you nice non-encoded links and URLs.
Alan Donnelly
Good advice. I'm going to try to create a pure UTF-8 environment. Then, taking a several strings from non-ASCII languages, I'll remove dangerous chars (./;:etc...) and creating files and then HTML links to those files to see if I can click them and see if all this works. If not then I'll probably have to drop back to (raw)?urlencode() to allow UTF-8. I'll post back results here.
Xeoncross
Xeoncross
I award this answer because it got me thinking the most and also included a useful link to a project I never heard of that is worth looking into. I'll post once I find a the answer though.
Xeoncross
A: 

I've always thought Kohana did a pretty good job of it.

 public static function title($title, $separator = '-', $ascii_only = FALSE)
{
if ($ascii_only === TRUE)
{
// Transliterate non-ASCII characters
$title = UTF8::transliterate_to_ascii($title);

// Remove all characters that are not the separator, a-z, 0-9, or whitespace
$title = preg_replace('![^'.preg_quote($separator).'a-z0-9\s]+!', '', strtolower($title));
}
else
{
// Remove all characters that are not the separator, letters, numbers, or whitespace
$title = preg_replace('![^'.preg_quote($separator).'\pL\pN\s]+!u', '', UTF8::strtolower($title));
}

// Replace all separator characters and whitespace by a single separator
$title = preg_replace('!['.preg_quote($separator).'\s]+!u', $separator, $title);

// Trim separators from the beginning and end
return trim($title, $separator);
}

The handy UTF8::transliterate_to_ascii() will turn stuff like ñ => n.

Of course, you could replace the other UTF8::* stuff with mb_* functions.

alex
+1  A: 

Depending on how you will use it, you might want to add a length limit to protect against buffer overflows.

Tgr
Yes, testing for mb_strlen() is always an important thing!
Xeoncross
A: 

In terms of file uploads, you would be safest to prevent the user from controlling the file name. As has already been hinted at, store the canonicalised filename in a database along with a randomly chosen and unique name which you'll use as the actual filename.

Using OWASP ESAPI, these names could be generated thus:

$userFilename   = ESAPI::getEncoder()->canonicalize($input_string);
$safeFilename   = ESAPI::getRandomizer()->getRandomFilename();

You could append a timestamp to the $safeFilename to help ensure that the randomly generated filename is unique without even checking for an existing file.

In terms of encoding for URL, and again using ESAPI:

$safeForURL     = ESAPI::getEncoder()->encodeForURL($input_string);

This method performs canonicalisation before encoding the string and will handle all character encodings.

jah
A: 

why not simply use php's urlencode? it replaces "dangerous" characters with their hex representation for urls (i.e. %20 for a space)

knittl
The % character is not recommended for filenames and hex encoded characters do not look as nice in the URL. Browsers can support UTF-8 strings which are much nicer and easier for non-ascii languages.
Xeoncross
+2  A: 

This should make your filenames safe...

$string = preg_replace(array('/\s/', '/\.[\.]+/', '/[^\w_\.\-]/'), array('_', '.', ''), $string);

and a more deeper solution to this is:

// Remove special accented characters - ie. sí.
$clean_name = strtr($string, 'ŠŽšžŸÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ', 'SZszYAAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy');
$clean_name = strtr($clean_name, array('Þ' => 'TH', 'þ' => 'th', 'Ð' => 'DH', 'ð' => 'dh', 'ß' => 'ss', 'Œ' => 'OE', 'œ' => 'oe', 'Æ' => 'AE', 'æ' => 'ae', 'µ' => 'u'));

$clean_name = preg_replace(array('/\s/', '/\.[\.]+/', '/[^\w_\.\-]/'), array('_', '.', ''), $clean_name);

This assumes that you want a dot in the filename. if you want it transferred to lowercase, just use

$clean_name = strtolower($clean_name);

for the last line.

SoLoGHoST