Filling in missing (blanks) in a data table, per category - backwards and forwards

Farrel picture Farrel · Sep 26, 2012 · Viewed 11.7k times · Source

I am working with a large data set of billing records for my clinical practice over 11 years. Quite a few of the rows are missing the referring physician. However, using some rules I can quite easily fill them in but do not know how to implement it in data.table under R. I know that there are things such as na.locf in the zoo package and self rolling join in the data.table package. The examples that I have seen are too simplistic and do not help me.

Here is some fictitious data to orient you (as a dput ASCII text representation)

    structure(list(patient.first.name = structure(c(1L, 1L, 1L, 1L, 
1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L), .Label = c("John", "Kathy", 
"Timothy"), class = "factor"), patient.last.name = structure(c(3L, 
3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("Jones", 
"Martinez", "Squeal"), class = "factor"), medical.record.nr = c(4563455, 
4563455, 4563455, 4563455, 4563455, 2663775, 2663775, 2663775, 
2663775, 2663775, 3330956, 3330956, 3330956, 3330956), date.of.service = c(39087, 
39112, 39112, 39130, 39228, 39234, 39244, 39244, 39262, 39360, 
39184, 39194, 39198, 39216), procedure.code = c(44750, 38995, 
40125, 44720, 44729, 44750, 38995, 40125, 44720, 44729, 44750, 
44729, 44729, 44729), diagnosis.code.1 = c(456.87, 456.87, 456.87, 
456.87, 456.87, 521.37, 521.37, 521.37, 521.37, 356.36, 456.87, 
456.87, 456.87, 456.87), diagnosis.code.2 = c(413, 413, 413, 
413, 413, 532.23, NA, NA, NA, NA, NA, NA, NA, NA), referring.doctor.first = structure(c(1L, 
1L, 1L, 1L, 1L, 2L, 2L, 2L, NA, NA, NA, 1L, 1L, NA), .Label = c("Abe", 
"Mark"), class = "factor"), referring.doctor.last = structure(c(1L, 
1L, 1L, 1L, 1L, 2L, 2L, 2L, NA, NA, NA, 1L, 1L, NA), .Label = c("Newstead", 
"Wydell"), class = "factor"), referring.docotor.zip = c(15209, 
15209, 15209, 15209, 15209, 15222, 15222, 15222, NA, NA, NA, 
15209, 15209, NA), some.other.stuff = structure(c(1L, 1L, 1L, 
NA, 3L, NA, NA, 4L, NA, 6L, NA, 2L, 5L, NA), .Label = c("alkjkdkdio", 
"cheerios", "ddddd", "dddddd", "dogs", "lkjljkkkkk"), class = "factor")), .Names = c("patient.first.name", 
"patient.last.name", "medical.record.nr", "date.of.service", 
"procedure.code", "diagnosis.code.1", "diagnosis.code.2", "referring.doctor.first", 
"referring.doctor.last", "referring.docotor.zip", "some.other.stuff"
), row.names = c(NA, 14L), class = "data.frame")

The obvious solution is to use some sort of last observation carried forward (LOCF) algorithm on referring.doctor.last and referring.doctor.first. However, it must stop when it gets to a new patient. In other words the LOCF must only be applied to one patient who is identified by the combination of patient.first.name, patient.last.name, medical.record.nr. Also note how some patients are missing the referring doctor on their very first visit so that means that some observations have to be carried backwards. To complicate matters some patients change primary care physicians and so there may be one referring doctor earlier on and another one later on. The alogorithm therefore needs to be aware of the date order of the rows with missing values.

In zoo na.locf I do not see an easy way to group the LOCF per patient. The rolling join examples that I have seen, would not work here becasuse I cannot simply take out the rows with the missing referring.doctor information since I would then loose date.of.service and procedure.code etcetera. I would love your help in learning how R can fill in my missing data.

Answer

Matt Dowle picture Matt Dowle · Sep 27, 2012

A more concise example would have been easier to answer. For example you've included quite a few columns that appear to be redundant. Does it really need to be by first name and last name, or can we use the patient number?

Since you already have NAs in the data, that you wish to fill, it's not roll in data.table really. A rolling join is more for when your data has no NA but you have another time series (for example) that joins to positions inbetween the data. (One efficiency advantage there is the very fact you don't create NA first which you then have to fill in a 2nd step.) Or, in other words, in your question you just have one dataset; you aren't joining two.

So you do need na.locf as @Joshua suggested. I'm not aware of a function that fills NA forward and then the first value backwards, though.

In data.table, to use na.locf by group it's just :

require(data.table)
require(zoo)
DT[,doctor:=na.locf(doctor),by=patient]

which has the efficiency advantages of fast aggregation and update by reference. You would have to write a new small function on top of na.locf to roll the first non NA backwards.

Ensure the data is sorted by patient then date, first. Then the above will cope with changes in doctor over time, since by maintains the order of rows within each group.

Hope that gives you some hints.