How to convert Rust strings to UTF-16?

Gigih Aji Ibrahim · Aug 8, 2014 · Viewed 7.6k times

Editor's note: This code example is from a version of Rust prior to 1.0 and is not valid Rust 1.0 code, but the answers still contain valuable information.

I want to pass a string literal to a Windows API. Many Windows functions use UTF-16 as the string encoding while Rust's native strings are UTF-8.

I know Rust has utf16_units() to produce an iterator over UTF-16 code units, but I don't know how to use that function to produce a UTF-16 string with a zero as the last element.

I'm producing the UTF-16 string like this, but I am sure there is a better method to produce it:

extern "system" {
    pub fn MessageBoxW(hWnd: int, lpText: *const u16, lpCaption: *const u16, uType: uint) -> int;
}

pub fn main() {
    let s1 = [
        'H' as u16, 'e' as u16, 'l' as u16, 'l' as u16, 'o' as u16, 0 as u16,
    ];
    unsafe {
        MessageBoxW(0, s1.as_ptr(), 0 as *const u16, 0);
    }
}

Answer

Vladimir Matveev · Aug 8, 2014

Rust 1.8+

str::encode_utf16 is the stable iterator over UTF-16 code units.

You just need to call collect() on that iterator to construct a Vec<u16>, and then push(0) onto that vector:

pub fn main() {
    let s = "Hello";

    // Encode to UTF-16 code units, then append the terminating zero.
    let mut v: Vec<u16> = s.encode_utf16().collect();
    v.push(0);
}
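
The same steps can be wrapped in a small helper. This is only a sketch: the name to_wide_null is hypothetical (it is not a standard library function), and it appends the terminator with chain() instead of a separate push(0). It also shows that a character outside the Basic Multilingual Plane becomes two code units (a surrogate pair), which is why casting each char with `as u16` does not work in general.

```rust
use std::iter::once;

/// Hypothetical helper: encode `s` as UTF-16 and append a terminating 0.
fn to_wide_null(s: &str) -> Vec<u16> {
    s.encode_utf16().chain(once(0)).collect()
}

fn main() {
    let wide = to_wide_null("Hello");
    // `wide.as_ptr()` is a `*const u16` that could be handed to a function
    // such as MessageBoxW, provided `wide` stays alive for the whole call.
    println!("{:?}", wide);

    // U+1F600 lies outside the BMP, so it encodes as a surrogate pair.
    let emoji = to_wide_null("\u{1F600}");
    println!("{:x?}", emoji);
}
```

Keeping the Vec<u16> in a named variable matters here: taking as_ptr() of a temporary would leave a dangling pointer by the time the foreign function reads it.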

Rust 1.0+

Before Rust 1.8, str::utf16_units() and str::encode_utf16 are unstable. The alternatives are to switch to nightly (a viable option if you're writing a program, not a library) or to use an external crate like encoding:

extern crate encoding;

use std::slice;

use encoding::all::UTF_16LE;
use encoding::{Encoding, EncoderTrap};

fn main() {
    let s = "Hello";

    let mut v: Vec<u8> = UTF_16LE.encode(s, EncoderTrap::Strict).unwrap();
    v.push(0); v.push(0);
    let s: &[u16] = unsafe { slice::from_raw_parts(v.as_ptr() as *const _, v.len()/2) };
    println!("{:?}", s);
}

(or you can use from_raw_parts_mut if you want a &mut [u16]).

However, be careful with endianness in this particular example. The UTF_16LE encoding gives you a vector of bytes representing u16 values in little-endian byte order, while the from_raw_parts trick "views" those bytes as a slice of u16 values in your platform's byte order, which may well be big-endian. The cast also assumes the byte buffer is suitably aligned for u16, which a Vec<u8> does not guarantee. Using a crate like byteorder may be helpful here if you want complete portability.
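
As one possible way around both the byte-order and the alignment concern, the bytes can be decoded pair by pair instead of reinterpreting the buffer. The sketch below uses only the standard library (u16::from_le_bytes, available in later Rust versions) rather than the byteorder crate mentioned above, and the function name le_bytes_to_u16 is made up for illustration. It assumes the input really is UTF-16LE; a trailing odd byte is silently ignored.

```rust
/// Hypothetical helper: decode little-endian byte pairs into u16 values.
/// Works the same on little- and big-endian hosts and needs no `unsafe`.
fn le_bytes_to_u16(bytes: &[u8]) -> Vec<u16> {
    bytes
        .chunks_exact(2) // a trailing odd byte, if any, is dropped
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect()
}

fn main() {
    // "Hi" encoded as UTF-16LE, followed by a two-byte terminator.
    let bytes = [0x48, 0x00, 0x69, 0x00, 0x00, 0x00];
    let units = le_bytes_to_u16(&bytes);
    println!("{:?}", units);
}
```

This copies the data into a new Vec<u16> instead of borrowing the original bytes, which costs an allocation but avoids relying on how the byte buffer happens to be laid out.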

This discussion on Reddit may also be helpful.