How can this method to convert a name to proper case be improved?

Kelsey · Apr 30, 2010 · Viewed 8.1k times

I am writing a basic function to convert millions of names, in a one-time batch process, from their current uppercase form to a proper mixed case. I came up with the following function:

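using System;
using System.Globalization;

public string ConvertToProperNameCase(string input)
{
    // Lowercase first: ToTitleCase leaves all-uppercase words unchanged.
    char[] chars = CultureInfo.CurrentCulture.TextInfo.ToTitleCase(input.ToLower()).ToCharArray();

    // Capitalize the letter following an apostrophe or hyphen (O'Brian, Doe-Smith).
    for (int i = 0; i + 1 < chars.Length; i++)
    {
        if (chars[i] == '\'' || chars[i] == '-')
        {
            chars[i + 1] = Char.ToUpper(chars[i + 1]);
        }
    }
    return new string(chars);
}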

It works in most cases such as:

  1. JOHN SMITH → John Smith
  2. SMITH, JOHN T → Smith, John T
  3. JOHN O'BRIAN → John O'Brian
  4. JOHN DOE-SMITH → John Doe-Smith

There are some edge cases that do not work:

  1. JASON MCDONALD → Jason Mcdonald (Correct: Jason McDonald)
  2. OSCAR DE LA HOYA → Oscar De La Hoya (Correct: Oscar de la Hoya)
  3. MARIE DIFRANCO → Marie Difranco (Correct: Marie DiFranco)

These cases are not handled, and I am not sure I can cover every odd edge case. What can I change or add to capture more of them? I am sure there are many edge cases I have not thought of as well. All casing should follow North American conventions, meaning that if certain countries expect a different capitalization format, the North American format takes precedence.

Answer

Johannes Rudolph · Apr 30, 2010

I think you'll run into a wall here, because usually you won't be able to judge correctly whether a conversion is reasonable or not.

Consider one of your edge cases:

JASON MCDONALD → Jason Mcdonald (Correct: Jason McDonald)

You could simply check for "Mc" at the beginning of the name and then apply your correction, right? But what if the person is named Mcizck (I made that up, of course), and that name should not be corrected to McIzck but left as is?
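To make the trade-off concrete, here is a minimal sketch of such a prefix rule, assuming a hand-maintained exception list. The names `McPrefixFixer`, `FixMcPrefix`, and the contents of `Exceptions` are hypothetical and only for illustration; a real list would have to be built from your own data.

```csharp
using System;
using System.Collections.Generic;

public static class McPrefixFixer
{
    // Hypothetical exception list: names that start with "Mc" but must keep
    // a lowercase third letter. "Mcizck" is the made-up example from above.
    private static readonly HashSet<string> Exceptions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "Mcizck" };

    // Expects a single word that is already title-cased, e.g. "Mcdonald".
    public static string FixMcPrefix(string word)
    {
        // Too short to have a letter after the prefix.
        if (string.IsNullOrEmpty(word) || word.Length < 3)
            return word;
        if (!word.StartsWith("Mc", StringComparison.Ordinal))
            return word;
        // Known exceptions are left exactly as they are.
        if (Exceptions.Contains(word))
            return word;

        // Capitalize the letter that follows the "Mc" prefix.
        return "Mc" + char.ToUpperInvariant(word[2]) + word.Substring(3);
    }
}
```

With this sketch, `McPrefixFixer.FixMcPrefix("Mcdonald")` returns `"McDonald"`, while `"Mcizck"` and `"Smith"` come back unchanged. The rule is only as good as the exception list, which is exactly the problem: every name missing from the list gets "corrected" whether it should be or not.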

There is no 100% perfect solution to this problem. What you have here is a natural-language problem, and those are really difficult to solve, especially for a computer. Cultures are too different to be modeled correctly. Even if you say North American conventions take precedence, you'll have a high percentage of "false positives". Our society consists of a huge mix of cultures; it is simply not adequate to say "North American takes precedence".

Without handling the edge cases, I guess your current solution will work 99% of the time. All further edge cases should be corrected manually if 100% correct names are really required.
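One way to organize that manual pass in a batch run is to convert every name automatically and flag the ones matching a known risky pattern for a person to check. The sketch below assumes a small pattern list taken from the edge cases in the question; `ReviewFlagger`, `NeedsManualReview`, and both arrays are hypothetical names and values, not a complete rule set.

```csharp
using System;
using System.Linq;

public static class ReviewFlagger
{
    // Hypothetical patterns drawn from the question's edge cases.
    private static readonly string[] RiskyPrefixes = { "MC", "DI" };
    private static readonly string[] RiskyParticles = { "DE", "LA" };

    // Takes the original uppercase name and reports whether the automatic
    // conversion is likely to need a manual check.
    public static bool NeedsManualReview(string upperName)
    {
        if (string.IsNullOrWhiteSpace(upperName))
            return false;

        // Split on the separators that appear in the sample names.
        string[] words = upperName.Split(
            new[] { ' ', ',', '-' }, StringSplitOptions.RemoveEmptyEntries);

        // Flag a name if any word is a particle or starts with a risky prefix.
        return words.Any(w =>
            RiskyParticles.Contains(w, StringComparer.Ordinal) ||
            RiskyPrefixes.Any(p =>
                w.Length > p.Length && w.StartsWith(p, StringComparison.Ordinal)));
    }
}
```

Under these assumptions, `JASON MCDONALD`, `OSCAR DE LA HOYA`, and `MARIE DIFRANCO` are flagged, while `JOHN SMITH` and `JOHN DOE-SMITH` are not. The check deliberately over-flags (a name such as `DIANE` also matches the `DI` prefix), since a person makes the final call on each flagged name.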