I am trying to come up with a function that does a good job of sanitizing certain strings so that they are safe to use in the URL (like a post slug) and also safe to use as file names. For example, when someone uploads a file I want to make sure that I remove all dangerous characters from the name.
So far I have come up with the following function which I hope solves this problem and allows foreign UTF-8 data also.
/**
* Convert a string to the file/URL safe "slug" form
*
* @param string $string the string to clean
* @param bool $is_filename TRUE will allow additional filename characters
* @return string
*/
function sanitize($string = '', $is_filename = FALSE)
{
// Replace all weird characters with dashes
$string = preg_replace('/[^\w\-'. ($is_filename ? '~_\.' : ''). ']+/u', '-', $string);
// Only allow one dash separator at a time (and make string lowercase)
return mb_strtolower(preg_replace('/--+/u', '-', $string), 'UTF-8');
}
Does anyone have any tricky sample data I can run against this - or know of a better way to safeguard our apps from bad names?
$is-filename allows some additional characters like temp vim files
update: removed the star character since I could not think of a valid use
I found this larger function in the Chyrp code:
/**
* Function: sanitize
* Returns a sanitized string, typically for URLs.
*
* Parameters:
* $string - The string to sanitize.
* $force_lowercase - Force the string to lowercase?
* $anal - If set to *true*, will remove all non-alphanumeric characters.
*/
function sanitize($string, $force_lowercase = true, $anal = false) {
$strip = array("~", "`", "!", "@", "#", "$", "%", "^", "&", "*", "(", ")", "_", "=", "+", "[", "{", "]",
"}", "\\", "|", ";", ":", "\"", "'", "‘", "’", "“", "”", "–", "—",
"—", "–", ",", "<", ".", ">", "/", "?");
$clean = trim(str_replace($strip, "", strip_tags($string)));
$clean = preg_replace('/\s+/', "-", $clean);
$clean = ($anal) ? preg_replace("/[^a-zA-Z0-9]/", "", $clean) : $clean ;
return ($force_lowercase) ?
(function_exists('mb_strtolower')) ?
mb_strtolower($clean, 'UTF-8') :
strtolower($clean) :
$clean;
}
and this one in the wordpress code
/**
* Sanitizes a filename replacing whitespace with dashes
*
* Removes special characters that are illegal in filenames on certain
* operating systems and special characters requiring special escaping
* to manipulate at the command line. Replaces spaces and consecutive
* dashes with a single dash. Trim period, dash and underscore from beginning
* and end of filename.
*
* @since 2.1.0
*
* @param string $filename The filename to be sanitized
* @return string The sanitized filename
*/
function sanitize_file_name( $filename ) {
$filename_raw = $filename;
$special_chars = array("?", "[", "]", "/", "\\", "=", "<", ">", ":", ";", ",", "'", "\"", "&", "$", "#", "*", "(", ")", "|", "~", "`", "!", "{", "}");
$special_chars = apply_filters('sanitize_file_name_chars', $special_chars, $filename_raw);
$filename = str_replace($special_chars, '', $filename);
$filename = preg_replace('/[\s-]+/', '-', $filename);
$filename = trim($filename, '.-_');
return apply_filters('sanitize_file_name', $filename, $filename_raw);
}
Alix Axel has done some incredible work in this area. His phunction framework includes several great text filters and transformations.