I'm running R on a Windows machine that connects directly to a PostgreSQL database (I'm not using RODBC). The connection's client encoding is UTF-8, as confirmed by the following R command:
dbGetQuery(con, "SHOW CLIENT_ENCODING")
# client_encoding
# 1 UTF8
However, some text is garbled when it is read into R.
For example, the following text is shown in my PostgreSQL database: "Stéphane"
After exporting to R it's shown as: "StÃ©phane" (the é is displayed as Ã©)
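This kind of garbling is what appears when UTF-8 bytes are interpreted as a single-byte encoding such as Latin-1 or Windows-1252. It can be reproduced without a database. The following is only a minimal sketch: the variable names are made up for illustration, and it assumes nothing about how the driver handles the data internally.

```r
# Start from the correct string, stored as UTF-8.
original <- "St\u00e9phane"

# Take its raw UTF-8 bytes; the e-acute is the two bytes c3 a9.
utf8_bytes <- charToRaw(enc2utf8(original))

# Rebuild a string from those bytes, but (wrongly) declare them to be Latin-1,
# so each byte is treated as one character.
misread <- rawToChar(utf8_bytes)
Encoding(misread) <- "latin1"

# Converting to UTF-8 for display now yields two characters instead of one.
garbled <- enc2utf8(misread)
print(garbled)
print(nchar(original))
print(nchar(garbled))
```

The bytes themselves are intact. Only the declared encoding is wrong, which is why the problem can be fixed without a find and replace.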
When importing to R, I use the dbConnect command to establish a connection and the dbGetQuery command to query data using SQL. I do not specify any text encoding anywhere when connecting to the database or when running a query.
I've searched online and can't find a direct resolution to my issue. I found this link, but their issue is with RODBC, which I'm not using.
This link is helpful in identifying the symbols, but I don't just want to do a find & replace in R... way too much data.
I tried running the following commands and got a warning.
Sys.setlocale("LC_ALL", "en_US.UTF-8")
# [1] ""
# Warning message:
# In Sys.setlocale("LC_ALL", "en_US.UTF-8") :
# OS reports request to set locale to "en_US.UTF-8" cannot be honored
Sys.setenv(LANG="en_US.UTF-8")
Sys.setenv(LC_CTYPE="UTF-8")
The warning occurs on the Sys.setlocale("LC_ALL", "en_US.UTF-8") command. My intuition is that this is a Windows-specific issue and doesn't occur with Mac/Linux/Unix.
As Craig Ringer said, setting client_encoding to windows-1252 is probably not the best thing to do. Indeed, if the data you're retrieving contains a single character that windows-1252 cannot represent, the query fails:
Error in postgresqlExecStatement(conn, statement, ...) : RS-DBI driver: (could not Retrieve the result : ERROR: character 0xcca7 of encoding "UTF8" has no equivalent in "WIN1252" )
On the other hand, getting your R environment to use Unicode could be impossible (I have the same problem as you with Sys.setlocale, and the same happens in this question too).
A workaround is to manually declare UTF-8 encoding on all your data, using a function like this one:
set_utf8 <- function(x) {
  # Declare UTF-8 encoding on all character columns:
  chr <- sapply(x, is.character)
  x[, chr] <- lapply(x[, chr, drop = FALSE], `Encoding<-`, "UTF-8")
  # Same on column names:
  Encoding(names(x)) <- "UTF-8"
  x
}
And you have to use this function in all your queries:
set_utf8(dbGetQuery(con, "SELECT myvar FROM mytable"))
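To see what the function does without a database connection, here is a small sketch that simulates a query result. It assumes the driver hands back UTF-8 bytes with no declared encoding. The data frame and its columns (myvar, id) are made up for illustration.

```r
# set_utf8() as defined above, repeated so the example runs on its own.
set_utf8 <- function(x) {
  chr <- sapply(x, is.character)
  x[, chr] <- lapply(x[, chr, drop = FALSE], `Encoding<-`, "UTF-8")
  Encoding(names(x)) <- "UTF-8"
  x
}

# Simulated result (assumption): UTF-8 bytes whose encoding is not declared.
undeclared <- rawToChar(charToRaw(enc2utf8("St\u00e9phane")))
Encoding(undeclared) <- "unknown"
result <- data.frame(myvar = undeclared, id = 1L, stringsAsFactors = FALSE)

before <- Encoding(result$myvar)  # encoding mark before the fix
fixed <- set_utf8(result)
after <- Encoding(fixed$myvar)    # encoding mark after the fix

print(before)
print(after)
print(fixed$myvar)
```

Only the encoding mark changes. The bytes are untouched and non-character columns such as id are left alone.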
EDIT: Another possibility is to use RPostgres instead of RPostgreSQL. I tested it (with the same config as in your question), and as far as I can see all declared encodings are automatically set to UTF-8.