We need to generate a unique URL from the title of a book - where the title can contain any character. How can we search-replace all the 'invalid' characters so that a valid and neat lookoing URL is generated?
For instance:
"The Great Book of PHP"
www.mysite.com/book/12345/the-great-book-of-php
"The Greatest !@#$ Book of PHP"
www.mysite.com/book/12345/the-greatest-book-of-php
"Funny title "
www.mysite.com/book/12345/funny-title
Ah, slugification
// This function expects the input to be UTF-8 encoded.
function slugify($text)
{
// Swap out Non "Letters" with a -
$text = preg_replace('/[^\\pL\d]+/u', '-', $text);
// Trim out extra -'s
$text = trim($text, '-');
// Convert letters that we have left to the closest ASCII representation
$text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
// Make text lowercase
$text = strtolower($text);
// Strip out anything we haven't been able to convert
$text = preg_replace('/[^-\w]+/', '', $text);
return $text;
}
This works fairly well, as it first uses the unicode properties of each character to determine if it's a letter (or \d against a number) - then it converts those that aren't to -'s - then it transliterates to ascii, does another replacement for anything else, and then cleans up after itself. (Fabrik's test returns "arvizturo-tukorfurogep")
I also tend to add in a list of stop words - so that those are removed from the slug. "the" "of" "or" "a", etc (but don't do it on length, or you strip out stuff like "php")