I'm trying to use the tm package in R to perform some text analysis. I tried the following:
require(tm)
dataSet <- Corpus(DirSource('tmp/'))
dataSet <- tm_map(dataSet, tolower)
Error in FUN(X[[6L]], ...) : invalid input 'RT @noXforU Erneut riesiger (Alt-)�lteppich im Golf von Mexiko (#pics vom Freitag) http://bit.ly/bw1hvU http://bit.ly/9R7JCf #oilspill #bp' in 'utf8towcs'
The problem is some characters are not valid. I'd like to exclude the invalid characters from analysis either from within R or before importing the files for processing.
I tried using iconv to convert all files to UTF-8, dropping anything that can't be converted, as follows:
find . -type f -exec iconv -t utf-8 "{}" -c -o tmpConverted/"{}" \;
as suggested in the question "Batch convert latin-1 files to utf-8 using iconv".
But I still get the same error.
I'd appreciate any help.
None of the above answers worked for me. The only workaround I found was to replace every non-graphical character, i.e. anything outside the [:graph:] class (see http://stat.ethz.ch/R-manual/R-patched/library/base/html/regex.html), with a space.
The code is this simple:
library(stringr)  # provides str_replace_all
usableText <- str_replace_all(tweets$text, "[^[:graph:]]", " ")
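If you would rather stay in base R, another option that may be worth trying is iconv with sub = "", which drops bytes that cannot be converted instead of failing. This is only a sketch: raw_text is a made-up sample string containing a Latin-1 byte, and I am assuming the text is meant to be UTF-8. It is not wired into a tm corpus here.

```r
# Hypothetical sample: "Alt-" followed by the Latin-1 byte 0xD6 and "l".
# The lone 0xD6 byte is not valid UTF-8, similar to the tweet in the question.
raw_text <- rawToChar(as.raw(c(0x41, 0x6c, 0x74, 0x2d, 0xd6, 0x6c)))

# Convert UTF-8 to UTF-8; sub = "" removes any byte that is not valid UTF-8.
clean_text <- iconv(raw_text, from = "UTF-8", to = "UTF-8", sub = "")

# tolower now runs without the 'utf8towcs' error.
lower_text <- tolower(clean_text)
print(lower_text)
```

Unlike the [:graph:] replacement, this removes only the offending bytes and leaves valid whitespace and punctuation untouched.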