Why is capitalizing the first letter of a string so convoluted in Rust?

marshallm picture marshallm · Jul 16, 2016 · Viewed 9.9k times · Source

I'd like to capitalize the first letter of a &str. It's a simple problem and I hope for a simple solution. Intuition tells me to do something like this:

let mut s = "foobar";
s[0] = s[0].to_uppercase();

But &strs can't be indexed like this. The only way I've been able to do it seems overly convoluted. I convert the &str to an iterator, convert the iterator to a vector, upper case the first item in the vector, which creates an iterator, which I index into, creating an Option, which I unwrap to give me the upper-cased first letter. Then I convert the vector into an iterator, which I convert into a String, which I convert to a &str.

let s1 = "foobar";
let mut v: Vec<char> = s1.chars().collect();
v[0] = v[0].to_uppercase().nth(0).unwrap();
let s2: String = v.into_iter().collect();
let s3 = &s2;

Is there an easier way than this, and if so, what? If not, why is Rust designed this way?

Similar question

Answer

Shepmaster picture Shepmaster · Jul 16, 2016

Why is it so convoluted?

Let's break it down, line-by-line

let s1 = "foobar";

We've created a literal string that is encoded in UTF-8. UTF-8 allows us to encode the 1,114,112 code points of Unicode in a manner that's pretty compact if you come from a region of the world that types in mostly characters found in ASCII, a standard created in 1963. UTF-8 is a variable length encoding, which means that a single code point might take from 1 to 4 bytes. The shorter encodings are reserved for ASCII, but many Kanji take 3 bytes in UTF-8.

let mut v: Vec<char> = s1.chars().collect();

This creates a vector of characters. A character is a 32-bit number that directly maps to a code point. If we started with ASCII-only text, we've quadrupled our memory requirements. If we had a bunch of characters from the astral plane, then maybe we haven't used that much more.

v[0] = v[0].to_uppercase().nth(0).unwrap();

This grabs the first code point and requests that it be converted to an uppercase variant. Unfortunately for those of us who grew up speaking English, there's not always a simple one-to-one mapping of a "small letter" to a "big letter". Side note: we call them upper- and lower-case because one box of letters was above the other box of letters back in the day.

This code will panic when a code point has no corresponding uppercase variant. I'm not sure if those exist, actually. It could also semantically fail when a code point has an uppercase variant that has multiple characters, such as the German ß. Note that ß may never actually be capitalized in The Real World, this is the just example I can always remember and search for. As of 2017-06-29, in fact, the official rules of German spelling have been updated so that both "ẞ" and "SS" are valid capitalizations!

let s2: String = v.into_iter().collect();

Here we convert the characters back into UTF-8 and require a new allocation to store them in, as the original variable was stored in constant memory so as to not take up memory at run time.

let s3 = &s2;

And now we take a reference to that String.

It's a simple problem

Unfortunately, this is not true. Perhaps we should endeavor to convert the world to Esperanto?

I presume char::to_uppercase already properly handles Unicode.

Yes, I certainly hope so. Unfortunately, Unicode isn't enough in all cases. Thanks to huon for pointing out the Turkish I, where both the upper (İ) and lower case (i) versions have a dot. That is, there is no one proper capitalization of the letter i; it depends on the locale of the the source text as well.

why the need for all data type conversions?

Because the data types you are working with are important when you are worried about correctness and performance. A char is 32-bits and a string is UTF-8 encoded. They are different things.

indexing could return a multi-byte, Unicode character

There may be some mismatched terminology here. A char is a multi-byte Unicode character.

Slicing a string is possible if you go byte-by-byte, but the standard library will panic if you are not on a character boundary.

One of the reasons that indexing a string to get a character was never implemented is because so many people misuse strings as arrays of ASCII characters. Indexing a string to set a character could never be efficient - you'd have to be able to replace 1-4 bytes with a value that is also 1-4 bytes, causing the rest of the string to bounce around quite a lot.

to_uppercase could return an upper case character

As mentioned above, ß is a single character that, when capitalized, becomes two characters.

Solutions

See also trentcl's answer which only uppercases ASCII characters.

Original

If I had to write the code, it'd look like:

fn some_kind_of_uppercase_first_letter(s: &str) -> String {
    let mut c = s.chars();
    match c.next() {
        None => String::new(),
        Some(f) => f.to_uppercase().chain(c).collect(),
    }
}

fn main() {
    println!("{}", some_kind_of_uppercase_first_letter("joe"));
    println!("{}", some_kind_of_uppercase_first_letter("jill"));
    println!("{}", some_kind_of_uppercase_first_letter("von Hagen"));
    println!("{}", some_kind_of_uppercase_first_letter("ß"));
}

But I'd probably search for uppercase or unicode on crates.io and let someone smarter than me handle it.

Improved

Speaking of "someone smarter than me", Veedrac points out that it's probably more efficient to convert the iterator back into a slice after the first capital codepoints are accessed. This allows for a memcpy of the rest of the bytes.

fn some_kind_of_uppercase_first_letter(s: &str) -> String {
    let mut c = s.chars();
    match c.next() {
        None => String::new(),
        Some(f) => f.to_uppercase().collect::<String>() + c.as_str(),
    }
}