R: Generic flattening of JSON to data.frame

Sim · Jul 19, 2012 · Viewed 9.9k times

This question is about a generic mechanism for converting any collection of non-cyclical homogeneous or heterogeneous data structures into a dataframe. This can be particularly useful when dealing with the ingestion of many JSON documents or with a large JSON document that is an array of dictionaries.

There are several SO questions that deal with manipulating deeply nested JSON structures and turning them into dataframes using functionality such as plyr, lapply, etc. All the questions and answers I have found are about specific cases as opposed to offering a general approach for dealing with collections of complex JSON data structures.

In Python and Ruby I've been well-served by implementing a generic data structure flattening utility that uses the path to a leaf node in a data structure as the name of the value at that node in the flattened data structure. For example, the value my_data[['x']][[2]][['y']] would appear as result[['x.2.y']].

If one has a collection of these data structures that may not be entirely homogeneous, the key to a successful flattening into a data.frame is to discover the names of all possible columns, e.g., by taking the union of all keys/names of the values in the individually flattened data structures.
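To make that concrete, the assembly step I have in mind would look something like the sketch below (flat_docs and the padding code are hypothetical, and assume every document has already been flattened to a named list of length-one leaves):

# Two already-flattened documents with overlapping but not identical keys
flat_docs <- list(
    list(x.1 = 1, x.2.y = "e"),
    list(x.1 = 2, x.3 = 10)
)
# Column discovery: union of all keys across the flattened documents
all_cols <- Reduce(union, lapply(flat_docs, names))
# Pad each document out to the full column set with NA, then bind the rows
rows <- lapply(flat_docs, function(d) {
    row <- setNames(as.list(rep(NA, length(all_cols))), all_cols)
    row[names(d)] <- d
    as.data.frame(row, stringsAsFactors = FALSE)
})
do.call(rbind, rows)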

This seems like a common pattern, so I'm wondering whether someone has already built this for R. If not, I'll build it, but given R's unique promise-based data structures, I'd appreciate advice on an implementation approach that minimizes heap thrashing.

Answer

shhhhimhuntingrabbits · Aug 12, 2012

Hi @Sim, I had cause to reflect on your problem yesterday. Define:

flatten <- function(x) {
    # Build the dotted path names up front, before the nesting is collapsed
    dumnames <- unlist(getnames(x, TRUE))
    # Naive clean-up: strip the trailing ".1" that nametree appends to scalar leaves
    dumnames <- gsub("\\.1$", "", dumnames)
    repeat {
        # c() on a list removes one level of nesting per pass without coercing leaves
        x <- do.call(.Primitive("c"), x)
        if (!any(vapply(x, is.list, logical(1)))) {
            # Fully flat: reattach the path names (assumes one element per leaf name)
            names(x) <- dumnames
            return(x)
        }
    }
}

getnames <- function(x, recursive) {
    # Recursively build dotted path names; unnamed elements are referred to
    # by their positional index
    nametree <- function(x, parent_name, depth) {
        if (length(x) == 0)
            return(character(0))
        x_names <- names(x)
        if (is.null(x_names)) {
            x_names <- seq_along(x)
            x_names <- paste(parent_name, x_names, sep = "")
        } else {
            # Fill in any missing names with positional indices
            x_names[x_names == ""] <- seq_along(x)[x_names == ""]
            x_names <- paste(parent_name, x_names, sep = "")
        }
        if (!is.list(x) || (!recursive && depth >= 1L))
            return(x_names)
        x_names <- paste(x_names, ".", sep = "")
        lapply(seq_len(length(x)), function(i) nametree(x[[i]], x_names[i], depth + 1L))
    }
    nametree(x, "", 0L)
}

(getnames is adapted from AnnotationDbi:::make.name.tree)

(flatten is adapted from the discussion in How to flatten a list to a list without coercion?)

As a simple example:

my_data<-list(x=list(1,list(1,2,y='e'),3))

> my_data[['x']][[2]][['y']]
[1] "e"

> out<-flatten(my_data)
> out
$x.1
[1] 1

$x.2.1
[1] 1

$x.2.2
[1] 2

$x.2.y
[1] "e"

$x.3
[1] 3

> out[['x.2.y']]
[1] "e"

So the result is a flattened list with roughly the naming structure you suggest. Coercion is also avoided, which is a plus.
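Since every element of the result is a length-one leaf, a single flattened document can also be dropped straight into a one-row data.frame (assuming the leaves really are all length one):

# One flattened document -> one data.frame row; the dotted paths become column names
as.data.frame(out, stringsAsFactors = FALSE)
# columns: x.1, x.2.1, x.2.2, x.2.y, x.3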

A more complicated example:

library(RJSONIO)
library(RCurl)
json.data<-getURL("http://www.reddit.com/r/leagueoflegends/.json")
dumdata<-fromJSON(json.data)
out<-flatten(dumdata)
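To go from here to the per-post data.frame the question is ultimately after, one option (a sketch only; it assumes the listing keeps reddit's usual data$children layout, that JSON nulls have been replaced beforehand, and that the leaves are length one) is to flatten each child separately and take the union of the flattened names as the column set:

# Flatten each post rather than the whole listing at once
# (any NULL or vector-valued leaves would need handling first)
posts <- lapply(dumdata[["data"]][["children"]], flatten)
# Column discovery: union of all flattened names across the posts
all_cols <- Reduce(union, lapply(posts, names))

Each post can then be padded out to all_cols with NA and the rows bound together, along the lines sketched in the question.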

UPDATE

A naive way to remove the trailing .1:

my_data<-list(x=list(1,list(1,2,y='e'),3))
gsub("(*.)\\.1","\\1",unlist(getnames(my_data,T)))

> gsub("(*.)\\.1","\\1",unlist(getnames(my_data,T)))
[1] "x.1"   "x.2.1" "x.2.2" "x.2.y" "x.3"