Time Series Forecasting using Random Forest in R

Shivam Sarin · Apr 3, 2017 · Viewed 7.9k times

I'm trying to do a time series analysis using randomForest. Please find my code below.

Subsales<-read.csv('Sales.csv')
head(Subsales)

Sample Data:

        Date                                 SKU    City Sales
      <date>                               <chr>   <chr> <dbl>
1 2014-08-11 Vaseline Petroleum Jelly Pure 60 ml Jeddah1   378
2 2014-08-18 Vaseline Petroleum Jelly Pure 60 ml Jeddah1   348
3 2014-08-25 Vaseline Petroleum Jelly Pure 60 ml Jeddah1   314
4 2014-09-01 Vaseline Petroleum Jelly Pure 60 ml Jeddah1   324
5 2014-09-08 Vaseline Petroleum Jelly Pure 60 ml Jeddah1   352
6 2014-09-15 Vaseline Petroleum Jelly Pure 60 ml Jeddah1   453


#### Length of training & testing sets (80-20 split) ####

library(dplyr)  # provides slice()

train_len=round(nrow(Subsales)*0.8)
test_len=nrow(Subsales)

###### Splitting dataset into training and testing #####

#### Training Set
training<-slice(Subsales,1:train_len)
#### Testing Set
# Parentheses matter: train_len+1:test_len parses as train_len+(1:test_len)
testing<-slice(Subsales,(train_len+1):test_len)

# Keep only the Date and Sales columns
training=training[c(1,4)]
testing=testing[c(1,4)]

library(randomForest)
set.seed(1234)
regressor = randomForest(formula=Sales~.,
                data=training,
                ntree=100)

y_pred = predict(regressor,newdata = testing)

I'm getting a constant (flat) result when I use the predict function on the test data set: all predicted values are 369. I tried another data set and got the same result. Can anyone tell me what I am doing wrong here?

Answer

JanLauGe · Apr 4, 2017

Let me try to rephrase your question to make sure I accurately understand what you want to do.

You have sales by date for a product (the sample rows are one week apart), and you would like to predict future sales as a function of the date alone. You do NOT have any other predictive variables, such as number of customers, amount spent on advertising, or anything else. Your input data looks like this:

Date        Sales
2014-08-11  378
2014-08-18  348
2014-08-25  314
2014-09-01  324
2014-09-08  352
2014-09-15  453
...

I think your RandomForest is behaving as expected. Random forest is a supervised machine learning algorithm that tries to predict y (response, here: Sales) given input variables x (predictors). Here, the only x you supply is the date. However, every date in the test set is completely new to the random forest and lies after the last training date. Tree-based models cannot extrapolate beyond the range of the predictor values they were trained on, so the algorithm can only return the same guess for every future date: an average of sales values it saw in training.
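To make this concrete, here is a minimal sketch on synthetic data. The series, the data frame name `toy`, and the numeric column `t` are made up for illustration and are not the asker's data. It fits a forest on the date alone and predicts for dates after the training period:

```r
library(randomForest)

set.seed(1234)
# Synthetic weekly series (made-up numbers, not the asker's data)
dates <- seq(as.Date("2014-08-11"), by = "week", length.out = 100)
sales <- 300 + 0.5 * seq_along(dates) + rnorm(100, sd = 20)
toy <- data.frame(t = as.numeric(dates), Sales = sales)

# Chronological 80-20 split, as in the question
train <- toy[1:80, ]
test  <- toy[81:100, ]

fit  <- randomForest(Sales ~ t, data = train, ntree = 100)
pred <- predict(fit, newdata = test)

# Every test date is later than every training date, so each tree sends
# all test rows to the same leaf and the forest returns a single value.
print(unique(pred))
```

Printing `unique(pred)` should show a single value, which mirrors the flat prediction described in the question.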

You have two options:

Option 1) Stick with your approach of only using dates as predictors. You will need a different method, perhaps an autoregression approach such as ARIMA. This approach tries to detect trends in the data. Are sales more or less static, growing, or going down? Is there a weekly trend, a monthly trend, an annual trend? An example to get you started can be found here
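As a rough illustration of Option 1, the sketch below fits an ARIMA model with base R's `arima()` from the stats package. The series is synthetic, and the order `c(1, 1, 0)` is an arbitrary assumption chosen for illustration rather than a tuned choice; on real data you would select the order by inspecting the series or comparing candidate models.

```r
set.seed(1234)
# Synthetic weekly sales that drift like a random walk (made-up numbers)
sales <- 350 + cumsum(rnorm(100, sd = 10))

# Chronological split: first 80 weeks for fitting, last 20 held out
train <- ts(sales[1:80])
test  <- sales[81:100]

# ARIMA(1,1,0): one autoregressive term on the first-differenced series.
# The order is an illustrative assumption, not a recommendation.
fit <- arima(train, order = c(1, 1, 0))

# Forecast the 20 held-out weeks; $pred holds the point forecasts
# and $se their standard errors.
fc <- predict(fit, n.ahead = 20)
print(head(fc$pred))
```

Comparing `fc$pred` against `test` (for example with a mean absolute error) gives a first impression of how well such a model tracks the held-out weeks.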

Option 2) Use data collection and feature engineering to create features that help your RandomForest predict values for new dates. For example, try to get data on how many customers came to the store on any given day, or extract the day of the week (Monday, Tuesday, ...) and keep that as a separate variable. The R package lubridate will help you do this. A brief example below:

library(dplyr)      # mutate()
library(lubridate)  # wday()
Subsales <- mutate(Subsales, Weekday = wday(Date, label = TRUE))
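
A slightly fuller sketch of the same idea follows, again on synthetic data. Because the sample rows in the question are seven days apart, a weekday feature alone would take the same value for every row, so this sketch assumes month and ISO week as calendar features instead. The data frame `toy` and the columns `Month` and `Week` are hypothetical names introduced for illustration.

```r
library(lubridate)
library(randomForest)

set.seed(1234)
# Synthetic weekly series with a yearly cycle (made-up numbers)
dates <- seq(as.Date("2014-08-11"), by = "week", length.out = 156)
sales <- 350 + 40 * sin(2 * pi * yday(dates) / 365) + rnorm(156, sd = 10)
toy <- data.frame(Date = dates, Sales = sales)

# Calendar features: their values recur in future dates, so the forest
# has seen them before even though the dates themselves are new.
toy$Month <- month(toy$Date)
toy$Week  <- isoweek(toy$Date)

# Chronological split: roughly the first 80% for training
train <- toy[1:125, ]
test  <- toy[126:156, ]

fit  <- randomForest(Sales ~ Month + Week, data = train, ntree = 100)
pred <- predict(fit, newdata = test)

# Predictions now follow the seasonal pattern instead of being constant
print(range(pred))
```

Note that such features only capture patterns that repeat over the calendar; a long-term upward or downward trend would still not be extrapolated by the forest.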

Hope this helps!