Is there a function to count the number of words in a string? For example:
str1 <- "How many words are in this sentence"
This should return a result of 7.
Use the regular expression \\W to match non-word characters, with + to indicate one or more in a row, along with gregexpr to find all matches in a string. The number of words is then the number of word separators plus 1.
lengths(gregexpr("\\W+", str1)) + 1
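Since the question asks for a function, a minimal wrapper around the same idiom might look like this (count_words is just an illustrative name, and it inherits the caveats discussed below):

count_words <- function(x) lengths(gregexpr("\\W+", x)) + 1
count_words(str1)
# [1] 7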
This will fail with blank strings at the beginning or end of the character vector, and when a "word" doesn't satisfy \\W's notion of non-word (one could work with other regular expressions, such as \\S+ or [[:alpha:]], but there will always be edge cases with a regex approach). It is likely more efficient than strsplit solutions, which allocate memory for each word. Regular expressions are described in ?regex.
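One way to check that efficiency claim on your own data is a quick benchmark; this is just a sketch, assuming the microbenchmark package is installed, and timings will vary with the number and length of strings:

library(microbenchmark)
x <- rep(str1, 1000)
microbenchmark(
  gregexpr = lengths(gregexpr("\\W+", x)) + 1,
  strsplit = lengths(strsplit(x, "\\W+"))
)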
Update: As noted in the comments and in a different answer by @Andri, the approach fails with zero- and one-word strings, and with trailing punctuation:
str1 = c("", "x", "x y", "x y!" , "x y! z")
lengths(gregexpr("[A-z]\\W+", str1)) + 1L
# [1] 2 2 2 3 3
Many of the other answers also fail in these or similar cases (e.g., multiple spaces). I think the caveat about the 'notion of a word' in my original answer covers problems with punctuation (solution: choose a different regular expression, e.g., [[:space:]]+), but the zero- and one-word cases are a problem; @Andri's solution fails to distinguish between zero and one words. So, taking a 'positive' approach to finding words, one might count the matches directly:
sapply(gregexpr("[[:alpha:]]+", str1), function(x) sum(x > 0))
# [1] 0 1 2 2 3
Again the regular expression might be refined for different notions of 'word'.
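For instance, [[:alpha:]]+ treats digits as non-words; if numbers should count as words, [[:alnum:]]+ is one possible refinement (my example, not from the original answer):

sapply(gregexpr("[[:alpha:]]+", "version 2 beta"), function(x) sum(x > 0))
# [1] 2
sapply(gregexpr("[[:alnum:]]+", "version 2 beta"), function(x) sum(x > 0))
# [1] 3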
I like the use of gregexpr() because it's memory efficient. An alternative using strsplit() (like @user813966's, but with a regular expression to delimit words) and making use of the original notion of delimiting words is
lengths(strsplit(str1, "\\W+"))
# [1] 0 1 2 2 3
This needs to allocate new memory for each word that is created, and for the intermediate list-of-words. This could be relatively expensive when the data is 'big', but it's probably effective and understandable for most purposes.
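One remaining edge case worth noting (my addition, easy to verify at the console): strsplit() keeps a leading empty string when the input starts with a separator, inflating the count by one; trimming the input first avoids this.

lengths(strsplit(c(" x y", "x y"), "\\W+"))
# [1] 3 2
lengths(strsplit(trimws(" x y"), "\\W+"))
# [1] 2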