I am writing a basic function to convert millions of names, in a one-time batch process, from their current uppercase form to a proper mixed case. I came up with the following function:
public string ConvertToProperNameCase(string input)
{
    char[] chars = CultureInfo.CurrentCulture.TextInfo.ToTitleCase(input.ToLower()).ToCharArray();
    for (int i = 0; i + 1 < chars.Length; i++)
    {
        if ((chars[i].Equals('\'')) ||
            (chars[i].Equals('-')))
        {
            chars[i + 1] = Char.ToUpper(chars[i + 1]);
        }
    }
    return new string(chars);
}
It works in most cases such as:
There are some edge cases that do not work:
My function does not capture these, and I am not sure I can handle all of these odd edge cases. How can I change or extend it to capture more of them? I am sure there are tons of edge cases I am not even thinking of as well. All casing should follow North American conventions: if another country expects a different capitalization format, the North American format takes precedence.
I think you'll run into a wall here, because you usually won't be able to judge correctly whether a given conversion is reasonable or not.
Consider one of your edge cases:
JASON MCDONALD -> Jason Mcdonald (Correct: Jason McDonald)
You could simply check for Mc at the beginning of the name and then apply your correction, right? But what if the person is named Mcizck (I made that up, of course), and that name should not be corrected to McIzck but left as is?
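To make the trade-off concrete, here is a minimal sketch of such a Mc check with an exception list. The class name `McPrefixSketch`, the helper `FixMcPrefix`, and the contents of the exception set are all hypothetical; the sketch assumes the word has already been title-cased (for example "Mcdonald").

```csharp
using System;
using System.Collections.Generic;

public static class McPrefixSketch
{
    // Hypothetical list of names that start with "Mc" but must be left as-is.
    // In practice this list would have to be maintained by hand.
    private static readonly HashSet<string> Exceptions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "Mcizck" };

    // Upper-cases the letter after a leading "Mc" unless the word is a known exception.
    public static string FixMcPrefix(string word)
    {
        if (word.Length > 2
            && word.StartsWith("Mc", StringComparison.Ordinal)
            && !Exceptions.Contains(word))
        {
            return "Mc" + char.ToUpperInvariant(word[2]) + word.Substring(3);
        }
        return word;
    }
}
```

With this rule, `FixMcPrefix("Mcdonald")` returns "McDonald", while `FixMcPrefix("Mcizck")` is left unchanged only because someone added it to the list. Every name that is not on the list gets the rule applied, whether or not that is right.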
There is no 100% perfect solution to this problem. What you have here is a natural language problem, and those are really difficult to solve, especially for a computer. Cultures differ too much to be modeled correctly. Even if you say North American conventions take precedence, you'll get a high percentage of "false positives". Our society is a huge mix of cultures, so it is simply not adequate to say "North American takes precedence".
Without handling the edge cases, I guess your current solution will work 99% of the time. All further edge cases should be corrected manually if 100% correct names are really required.
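If manual correction is the fallback, the batch job could at least set aside the names that are likely to need it. The following is only a sketch: the class `ReviewQueueSketch`, the method `NeedsReview`, and the pattern list are assumptions for illustration, not a complete catalogue of problematic names.

```csharp
using System.Text.RegularExpressions;

public static class ReviewQueueSketch
{
    // Hypothetical patterns that plain title-casing tends to get wrong:
    // "Mc"/"Mac" prefixes, and standalone particles or generational suffixes.
    private static readonly Regex Suspicious = new Regex(
        @"\b(MC|MAC)\w|\b(VAN|VON|DE|LA|II|III|IV)\b",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Returns true when the name should be checked by a person after conversion.
    public static bool NeedsReview(string name)
    {
        return Suspicious.IsMatch(name);
    }
}
```

Names for which `NeedsReview` returns true (such as "JASON MCDONALD") would go to a review list, and everything else would pass through the automatic conversion. The patterns will also flag names that are actually fine, which is acceptable here because a person makes the final call.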