How to monitor progress of an apply function?

ChiseledAbs picture ChiseledAbs · May 24, 2016 · Viewed 9.7k times · Source

I need to work out a 2886*2886 correlation matrix, problem is that building an intermediary datatable (RESULT) takes a long time for it to be binded together so I would like to be able to do the following things while calling the last line RESULT=rbindlist(apply(COMB, 1, append)) in the code below :

  1. Estimate the time it will take for the apply function to complete
  2. Monitor its progress
  3. Be able to pause and continue at later time

Here is the code :

SOURCE=data.table(NAME=rep(paste0("NAME", as.character(1:2889)), each=600), VALUE=sample(c(TRUE,FALSE), 600, TRUE) )
> SOURCE
            NAME VALUE
      1:   NAME1  TRUE
      2:   NAME1  TRUE
      3:   NAME1  TRUE
      4:   NAME1  TRUE
      5:   NAME1  TRUE
     ---              
1733396: NAME999  TRUE
1733397: NAME999  TRUE
1733398: NAME999  TRUE
1733399: NAME999  TRUE
1733400: NAME999 FALSE

setkey(SOURCE,NAME)
a=SOURCE[,unique(NAME)]
COMB=data.table(expand.grid(a,a, stringsAsFactors=FALSE))
> COMB
             Var1    Var2
      1:    NAME1   NAME1
      2:   NAME10   NAME1
      3:  NAME100   NAME1
      4: NAME1000   NAME1
      5: NAME1001   NAME1
     ---                 
8346317:  NAME995 NAME999
8346318:  NAME996 NAME999
8346319:  NAME997 NAME999
8346320:  NAME998 NAME999
8346321:  NAME999 NAME999

append <- function(X) {
data.table(NAME1=X[1], VALUE1=SOURCE[X[1], VALUE], 
    NAME2=X[2], VALUE2=SOURCE[X[2], VALUE] )
}

RESULT=rbindlist(apply(COMB, 1, append))

Any idea ?

Also do you know if there is a faster way to generate the datatable RESULT from SOURCE ? RESULTis an intermediary datatable to work out the correlation values between VALUE1 and VALUE2 for each couple of NAME.

With a subset of SOURCE RESULTlooks like that :

SOURCE=SOURCE[sample(1:nrow(SOURCE), 3)]
setkey(SOURCE,NAME)
a=SOURCE[,unique(NAME)]
COMB=data.table(expand.grid(a,a, stringsAsFactors=FALSE))
RESULT=rbindlist(apply(COMB, 1, append))
> RESULT
      NAME1 VALUE1    NAME2 VALUE2
1: NAME1859   TRUE NAME1859   TRUE
2:  NAME768  FALSE NAME1859   TRUE
3:  NAME795   TRUE NAME1859   TRUE
4: NAME1859   TRUE  NAME768  FALSE
5:  NAME768  FALSE  NAME768  FALSE
6:  NAME795   TRUE  NAME768  FALSE
7: NAME1859   TRUE  NAME795   TRUE
8:  NAME768  FALSE  NAME795   TRUE
9:  NAME795   TRUE  NAME795   TRUE

Later on I'm going to do RESULT[,VALUE3:=(VALUE1==VALUE2)] to finally get the correlation values : RESULT[, mean(VALUE3), by=c("NAME1", "NAME2")] So maybe the whole process can be done more efficiently, who knows.

Answer

rafa.pereira picture rafa.pereira · May 24, 2016

You can use the library pbapply(git), which shows a time estimate and a progress bar to any function in the '*apply' family.

In the case of your question:

library(pbapply)      

result <- rbindlist( pbapply(COMB, 1, append) )

ps. This answer solves your two initial points. Regarding the third point, I'm not sure if it's possible to pause the function. In any case, your operation is indeed taking too long, so I would recommend you post a separate question asking how to optimize your task.