How to get the number of Characters in a String?

Ammar · Oct 1, 2012 · Viewed 96.4k times

How can I get the number of characters of a string in Go?

For example, if I have the string "hello", the method should return 5. I saw that len(str) returns the number of bytes, not the number of characters, so len("£") returns 2 instead of 1 because £ is encoded with two bytes in UTF-8.

Answer

VonC · Oct 1, 2012

You can try RuneCountInString from the unicode/utf8 package. It is the string version of RuneCount, which the documentation describes as follows:

returns the number of runes in p

As this script illustrates, the byte length of "World" can be 6 (when written in Chinese: "世界"), while its rune count is 2:

package main
    
import "fmt"
import "unicode/utf8"
    
func main() {
    fmt.Println("Hello, 世界", len("世界"), utf8.RuneCountInString("世界"))
}

Phrozen adds in the comments:

Actually you can do len() over runes by just type casting.
len([]rune("世界")) will print 2. At least in Go 1.3.


And with CL 108985 (May 2018, for Go 1.11), len([]rune(string)) is now optimized. (Fixes issue 24923)

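The compiler now detects the len([]rune(string)) pattern automatically and replaces it with a rune-counting call that does the same work as a for range s loop, without allocating the slice. From the CL description: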

Adds a new runtime function to count runes in a string. Modifies the compiler to detect the pattern len([]rune(string)) and replaces it with the new rune counting runtime function.

name                                old time/op  new time/op  delta
RuneCount/lenruneslice/ASCII        27.8ns ± 2%  14.5ns ± 3%  -47.70%
RuneCount/lenruneslice/Japanese      126ns ± 2%    60ns ± 2%  -52.03%
RuneCount/lenruneslice/MixedLength   104ns ± 2%    50ns ± 1%  -51.71%
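
To make the equivalence concrete, here is a minimal sketch comparing the three ways of counting runes mentioned so far. The helper name countByRange is hypothetical and only illustrates what a range-based count looks like; it is not the runtime function added by the CL.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// countByRange is a hypothetical helper that counts runes by ranging
// over the string; range decodes one UTF-8 sequence per iteration.
func countByRange(s string) int {
	n := 0
	for range s {
		n++
	}
	return n
}

func main() {
	for _, s := range []string{"hello", "£", "世界"} {
		// Columns: byte length, then three rune counts that should agree.
		fmt.Println(len(s), utf8.RuneCountInString(s), len([]rune(s)), countByRange(s))
	}
}
```

For "世界" this prints 6 2 2 2: six bytes, and two runes whichever way they are counted.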

Stefan Steiger points to the blog post "Text normalization in Go":

What is a character?

As was mentioned in the strings blog post, characters can span multiple runes.
For example, an 'e' and '◌́' (acute "\u0301") can combine to form 'é' ("e\u0301" in NFD). Together these two runes are one character.

The definition of a character may vary depending on the application.
For normalization we will define it as a sequence of runes that starts with a starter, a rune that does not modify or combine backwards with any other rune, followed by a possibly empty sequence of non-starters, that is, runes that do (typically accents).

The normalization algorithm processes one character at a time.
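
The gap between runes and characters can be seen with the standard library alone. The sketch below is only a rough approximation of the definition quoted above: it treats every rune in the Unicode category Mn (nonspacing mark) as a non-starter, which is an assumption rather than what the normalization algorithm actually does. The helper name approxChars is hypothetical.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// approxChars is a hypothetical helper: it counts the runes that are not
// nonspacing combining marks (category Mn), as a crude stand-in for
// "a starter followed by non-starters".
func approxChars(s string) int {
	n := 0
	for _, r := range s {
		if !unicode.Is(unicode.Mn, r) {
			n++
		}
	}
	return n
}

func main() {
	composed := "\u00e9"    // 'é' as a single code point
	decomposed := "e\u0301" // 'e' followed by a combining acute accent

	// Rune counts differ even though both strings render as 'é'.
	fmt.Println(utf8.RuneCountInString(composed), utf8.RuneCountInString(decomposed))
	// The approximation treats both as one character.
	fmt.Println(approxChars(composed), approxChars(decomposed))
}
```

The first line prints 1 2 and the second prints 1 1. For real text, prefer a proper normalization or segmentation library over this shortcut.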

Using the golang.org/x/text/unicode/norm package and its Iter type, the actual number of "characters" would be:

package main
    
import "fmt"
import "golang.org/x/text/unicode/norm"
    
func main() {
    var ia norm.Iter
    ia.InitString(norm.NFKD, "école")
    nc := 0
    for !ia.Done() {
        nc = nc + 1
        ia.Next()
    }
    fmt.Printf("Number of chars: %d\n", nc)
}

This uses the Unicode normalization form NFKD ("Compatibility Decomposition").


Oliver's answer points to Unicode Text Segmentation as the only reliable way to determine default boundaries between certain significant text elements: user-perceived characters, words, and sentences.

For that, you need an external library like rivo/uniseg, which does Unicode Text Segmentation.

That will actually count "grapheme clusters", where multiple code points may be combined into one user-perceived character.

package main
    
import (
    "fmt"
    
    "github.com/rivo/uniseg"
)
    
func main() {
    gr := uniseg.NewGraphemes("👍🏼!")
    for gr.Next() {
        fmt.Printf("%x ", gr.Runes())
    }
    // Output: [1f44d 1f3fc] [21]
}

Two graphemes, even though there are three runes (Unicode code points).
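
For comparison, here is a minimal standard-library sketch showing that the rune count of the same string is indeed three. The helper name runeAndByteCount is hypothetical, and the emoji is written with escapes so the code points are explicit.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeAndByteCount is a hypothetical helper that reports the number of
// code points in s and its length in UTF-8 bytes.
func runeAndByteCount(s string) (runes, bytes int) {
	return utf8.RuneCountInString(s), len(s)
}

func main() {
	// Thumbs up (U+1F44D), skin tone modifier (U+1F3FC), then '!'.
	s := "\U0001F44D\U0001F3FC!"
	r, b := runeAndByteCount(s)
	fmt.Printf("%d runes, %d bytes\n", r, b)
}
```

This prints "3 runes, 9 bytes", so neither len nor a rune count matches the two graphemes reported by uniseg.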

You can see other examples in "How to manipulate strings in Go to reverse them?"

👩🏾‍🦰 alone is one grapheme but, according to a Unicode-to-code-points converter, four runes: