I am using this example to conduct sentiment analysis of a collection of txt documents in R. The code is:
library(tm)
library(tidyverse)
library(tidytext)
library(glue)
library(stringr)
library(dplyr)
library(wordcloud)
require(reshape2)
files <- list.files(inputdir,pattern="*.txt")
GetNrcSentiment <- function(file){
  fileName <- glue(inputdir, file, sep = "")
  fileName <- trimws(fileName)
  fileText <- glue(read_file(fileName))
  fileText <- gsub("\\$", "", fileText)
  tokens <- data_frame(text = fileText) %>% unnest_tokens(word, text)
  # get the sentiment from the first text:
  sentiment <- tokens %>%
    inner_join(get_sentiments("nrc")) %>% # pull out only sentiment words
    count(sentiment) %>% # count the # of positive & negative words
    spread(sentiment, n, fill = 0) %>% # made data wide rather than narrow
    mutate(sentiment = positive - negative) %>% # positive - negative
    mutate(file = file) %>% # add the name of our file
    mutate(year = as.numeric(str_match(file, "\\d{4}"))) %>% # add the year
    mutate(city = str_match(file, "(.*?).2")[2])
  return(sentiment)
}
The .txt files are stored in inputdir and have names of the form AB-City.0000, where AB is an abbreviation of a country, City is a city name, and 0000 is the year (ranging from 2000 to 2017).
The function works as expected for a single file, i.e. GetNrcSentiment(files[1]) gives me a tibble with proper counts per sentiment. However, when I try to run it for the whole set, i.e.
nrc_sentiments <- data_frame()
for(i in files){
nrc_sentiments <- rbind(nrc_sentiments, GetNrcSentiment(i))
}
I get the following error message:
Joining, by = "word"
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
The exact same code works well with longer documents, but gives an error when dealing with shorter texts. It seems that not all sentiments are found in small documents, so the number of columns varies from document to document, which might lead to this error, but I am not sure. I would appreciate any advice on how to fix the problem. If a sentiment is not found, I would want its entry to be zero (if that is indeed the cause of my problem).
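That suspicion can be checked on toy data. The sketch below is only an illustration: long_doc, short_doc and widen are made-up names, and the sentiment labels stand in for the result of the inner_join rather than coming from real files. It applies the same count-then-spread step and shows that a label absent from a document produces no column at all, since fill = 0 only fills gaps among labels that do occur.

```r
library(dplyr)
library(tidyr)

# Made-up sentiment labels standing in for two tokenised documents.
long_doc  <- tibble(sentiment = c("positive", "negative", "joy", "positive"))
short_doc <- tibble(sentiment = c("positive", "joy"))  # no negative words

# Same count-then-spread step as in GetNrcSentiment().
widen <- function(doc) {
  doc %>%
    count(sentiment) %>%
    spread(sentiment, n, fill = 0)
}

names(widen(long_doc))   # "joy" "negative" "positive"
names(widen(short_doc))  # "joy" "positive"

# Base rbind() refuses to stack data frames with different column counts.
attempt <- try(rbind(widen(long_doc), widen(short_doc)), silent = TRUE)
inherits(attempt, "try-error")  # TRUE
```

If the real files behave like short_doc, the later mutate(sentiment = positive - negative) would also fail for any document without negative words, because the negative column never exists.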
As an aside, the bing sentiment function runs through about two dozen files and then gives a different error, which seems to point to the same problem (negative sentiment not found?):
GetBingSentiment <- function(file){
  fileName <- glue(inputdir, file, sep = "")
  fileName <- trimws(fileName)
  fileText <- glue(read_file(fileName))
  fileText <- gsub("\\$", "", fileText)
  tokens <- data_frame(text = fileText) %>% unnest_tokens(word, text)
  # get the sentiment from the first text:
  sentiment <- tokens %>%
    inner_join(get_sentiments("bing")) %>% # pull out only sentiment words
    count(sentiment) %>% # count the # of positive & negative words
    spread(sentiment, n, fill = 0) %>% # made data wide rather than narrow
    mutate(sentiment = positive - negative) %>%
    mutate(file = file) %>% # add the name of our file
    mutate(year = as.numeric(str_match(file, "\\d{4}"))) %>% # add the year
    mutate(city = str_match(file, "(.*?).2")[2])
  # return our sentiment dataframe
  return(sentiment)
}
Error in mutate_impl(.data, dots) :
Evaluation error: object 'negative' not found.
EDIT: Following the recommendation by David Klotz, I changed the loop to
for(i in files){ nrc_sentiments <- dplyr::bind_rows(nrc_sentiments, GetNrcSentiment(i)) }
As a result, instead of throwing an error, the nrc function now generates NA when no words from a certain sentiment are found. However, after 22 joins I get a different error:
Error in mutate_impl(.data, dots) : Evaluation error: object 'negative' not found.
The same error shows up when I run the bing function with dplyr. By the time the function reaches the 22nd document, both data frames contain columns for all sentiments. What may cause the error, and how can I diagnose it?
dplyr's bind_rows function is more flexible than rbind, at least when it comes to missing columns:
nrc_sentiments <- dplyr::bind_rows(nrc_sentiments, GetNrcSentiment(i))
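bind_rows fills columns that are missing from one input with NA, but it cannot help with a failure inside the function itself: if a document contains no negative words, spread() never creates a negative column and mutate(sentiment = positive - negative) stops with "object 'negative' not found". One possible way around this is to add the missing columns as zeros before computing the score. The following is a minimal sketch on made-up data; add_missing_cols, needed and score are hypothetical names, not part of dplyr or tidyr.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper (not a dplyr/tidyr function): add each missing column as 0.
add_missing_cols <- function(df, cols) {
  for (col in setdiff(cols, names(df))) {
    df[[col]] <- 0
  }
  df
}

# Columns the later mutate() relies on; extend with other categories as needed.
needed <- c("positive", "negative")

# Count, widen, pad missing columns, then compute the net score.
score <- function(doc) {
  doc %>%
    count(sentiment) %>%
    spread(sentiment, n, fill = 0) %>%
    add_missing_cols(needed) %>%
    mutate(sentiment = positive - negative)
}

# Made-up labels standing in for the joined tokens of two documents.
long_doc  <- tibble(sentiment = c("positive", "negative", "positive"))
short_doc <- tibble(sentiment = c("positive"))  # no negative words at all

combined <- bind_rows(score(long_doc), score(short_doc))
```

Because every per-document result now has the same sentiment columns, bind_rows has nothing to fill with NA, and a sentiment that is not found shows up as zero. In the original functions the same padding step would go between spread() and the first mutate().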