I have a problem with inconsistent encoding of a character vector in R.
The text file from which I read the table is encoded (via Notepad++) in UTF-8 (I also tried UTF-8 without BOM).
I want to read a table from this text file, convert it to a data.table, set a key, and make use of binary search. When I tried to do so, the following warning appeared:
Warning message:
In `[.data.table`(poli.dt, "żżonymi", mult = "first") :
  A known encoding (latin1 or UTF-8) was detected in a join column. data.table compares the bytes currently, so doesn't support mixed encodings well; i.e., using both latin1 and UTF-8, or if any unknown encodings are non-ascii and some of those are marked known and others not. But if either latin1 or UTF-8 is used exclusively, and all unknown encodings are ascii, then the result should be ok. In future we will check for you and avoid this warning if everything is ok. The tricky part is doing this without impacting performance for ascii-only cases.
and binary search does not work.
I realised that the key column of my data.table contains strings with both "unknown" and "UTF-8" encoding marks:
> table(Encoding(poli.dt$word))
unknown UTF-8
2061312 2739122
I tried to convert this column (before creating the data.table object) using:
Encoding(word) <- "UTF-8"
word <- enc2utf8(word)
but with no effect.
I also tried a few different ways of reading the file into R (setting all relevant parameters, e.g. encoding = "UTF-8"):
data.table::fread
utils::read.table
base::scan
colbycol::cbc.read.table
but with no effect.
My R.version:
> R.version
               _
platform       x86_64-w64-mingw32
arch           x86_64
os             mingw32
system         x86_64, mingw32
status
major          3
minor          0.3
year           2014
month          03
day            06
svn rev        65126
language       R
version.string R version 3.0.3 (2014-03-06)
nickname       Warm Puppy
My session info:
> sessionInfo()
R version 3.0.3 (2014-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=Polish_Poland.1250 LC_CTYPE=Polish_Poland.1250 LC_MONETARY=Polish_Poland.1250
[4] LC_NUMERIC=C LC_TIME=Polish_Poland.1250
base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.9.2 colbycol_0.8 filehash_2.2-2 rJava_0.9-6
loaded via a namespace (and not attached):
[1] plyr_1.8.1 Rcpp_0.11.1 reshape2_1.2.2 stringr_0.6.2 tools_3.0.3
The Encoding function returns "unknown" if a character string has a "native encoding" mark (CP-1250 in your case) or if it is in ASCII.
To discriminate between these two cases, call:
library(stringi)
stri_enc_mark(poli.dt$word)
To check whether each string is a valid UTF-8 byte sequence, call:
all(stri_enc_isutf8(poli.dt$word))
If this returns FALSE, your file is definitely not in UTF-8.
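As a small illustration of the checks above, here is a sketch on a made-up two-element vector (not the poster's data). The variable names are hypothetical, and the only assumption is that the stringi package is installed. It also shows why declaring an encoding cannot remove every "unknown" from the Encoding() output: R never attaches an encoding mark to pure-ASCII strings.

```r
library(stringi)

# Made-up sample: a pure-ASCII word and a Polish word written with \u escapes,
# which R marks as UTF-8 regardless of the session locale.
x <- c("abc", "\u017c\u00f3\u0142w")

marks_base    <- Encoding(x)         # base R: ASCII shows up as "unknown"
marks_stringi <- stri_enc_mark(x)    # stringi: ASCII is reported separately
valid_utf8    <- stri_enc_isutf8(x)  # byte-level validity check per string

# Declaring an encoding does not touch ASCII strings, so they stay "unknown".
y <- x
Encoding(y) <- "UTF-8"
marks_after <- Encoding(y)

print(table(marks_base))
print(table(marks_stringi))
print(all(valid_utf8))
print(marks_after)
```

If every "unknown" entry from table(Encoding(poli.dt$word)) turns out to be "ASCII" in stri_enc_mark, the mix should be harmless according to the warning text itself. Entries reported as "native" that contain non-ASCII characters are the ones worth re-encoding.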
I suspect that you haven't forced UTF-8 mode in the function that reads the data (try inspecting the contents of poli.dt$word to verify this). If my guess is right, try:
read.csv2(file("filename", encoding="UTF-8"))
or
poli.dt$word <- stri_encode(poli.dt$word, "", "UTF-8") # re-mark encodings
If data.table still complains about "mixed" encodings, you may want to transliterate the non-ASCII characters, e.g.:
stri_trans_general("Zażółć gęślą jaźń", "Latin-ASCII")
## [1] "Zazolc gesla jazn"