R - describe() output to a data frame

skumar · Jun 19, 2016 · Viewed 28.8k times

I want to create a data frame from the output of the describe() function. The dataset under consideration is iris. The data frame should look like this:

    Variable      n    missing  unique  Info  Mean   0.05   0.1  0.25  0.5   0.75  0.9   0.95
    Sepal.Length  150  0        35      1     5.843  4.6    4.8  5.1   5.8   6.4   6.9   7.255
    Sepal.Width   150  0        23      0.99  3.057  2.345  2.5  2.8   3     3.3   3.61  3.8
    Petal.Length  150  0        43      1     3.758  1.3    1.4  1.6   4.35  5.1   5.8   6.1
    Petal.Width   150  0        22      0.99  1.199  0.2    0.2  0.3   1.3   1.8   2.2   2.3
    Species       150  0        3

Is there a way to coerce the output of describe() to a data.frame? When I try to coerce it, I get the error shown below:

library(Hmisc)
statistics <- describe(iris)
statistics[1]
first_vec <- statistics[1]$Sepal.Length
as.data.frame(first_vec)
#Error in as.data.frame.default(first_vec) : cannot coerce class ""describe"" to a data.frame

Thanks

Answer

gung - Reinstate Monica · Jun 19, 2016

The way to figure this out is to examine the objects with str():

data(iris)
library(Hmisc)
di <- describe(iris)
di
# iris 
# 
# 5  Variables      150  Observations
# -------------------------------------------------------------
# Sepal.Length 
#       n missing  unique    Info    Mean     .05     .10     .25     .50     .75     .90     .95 
#     150       0      35       1   5.843   4.600   4.800   5.100   5.800   6.400   6.900   7.255
# 
# lowest : 4.3 4.4 4.5 4.6 4.7, highest: 7.3 7.4 7.6 7.7 7.9 
# -------------------------------------------------------------
# ...
# -------------------------------------------------------------
# Species 
#       n missing  unique 
#     150       0       3 
# 
# setosa (50, 33%), versicolor (50, 33%) 
# virginica (50, 33%) 
# -------------------------------------------------------------
str(di)
# List of 5
# $ Sepal.Length:List of 6
# ..$ descript    : chr "Sepal.Length"
# ..$ units       : NULL
# ..$ format      : NULL
# ..$ counts      : Named chr [1:12] "150" "0" "35" "1" ...
# .. ..- attr(*, "names")= chr [1:12] "n" "missing" "unique" "Info" ...
# ..$ intervalFreq:List of 2
# .. ..$ range: atomic [1:2] 4.3 7.9
# .. .. ..- attr(*, "Csingle")= logi TRUE
# .. ..$ count: int [1:100] 1 0 3 0 0 1 0 0 4 0 ...
# ..$ values      : Named chr [1:10] "4.3" "4.4" "4.5" "4.6" ...
# .. ..- attr(*, "names")= chr [1:10] "L1" "L2" "L3" "L4" ...
# ..- attr(*, "class")= chr "describe"
# $ Sepal.Width :List of 6
# ...
# $ Species     :List of 5
# ..$ descript: chr "Species"
# ..$ units   : NULL
# ..$ format  : NULL
# ..$ counts  : Named num [1:3] 150 0 3
# .. ..- attr(*, "names")= chr [1:3] "n" "missing" "unique"
# ..$ values  : num [1:2, 1:3] 50 33 50 33 50 33
# .. ..- attr(*, "dimnames")=List of 2
# .. .. ..$ : chr [1:2] "Frequency" "%"
# .. .. ..$ : chr [1:3] "setosa" "versicolor" "virginica"
# ..- attr(*, "class")= chr "describe"
# - attr(*, "descript")= chr "iris"
# - attr(*, "dimensions")= int [1:2] 150 5
# - attr(*, "class")= chr "describe"

We see that di is a list of lists. We can take it apart by looking at just the first sublist. You can convert that into a vector:

unlist(di[[1]])
#             descript              counts.n 
#       "Sepal.Length"                 "150" 
#       counts.missing         counts.unique 
#                  "0"                  "35" 
#          counts.Info           counts.Mean 
#                  "1"               "5.843" 
#           counts..05            counts..10 
#              "4.600"               "4.800" 
#           counts..25            counts..50 
#              "5.100"               "5.800" 
#           counts..75            counts..90 
#              "6.400"               "6.900" 
#           counts..95   intervalFreq.range1 
#              "7.255"                 "4.3" 
#  intervalFreq.range2   intervalFreq.count1 
#                "7.9"                   "1" 
#  ...
#            values.H3             values.H2 
#                "7.6"                 "7.7" 
#            values.H1 
#                 "7.9" 
str(unlist(di[[1]]))
# Named chr [1:125] "Sepal.Length" "150" "0" "35" ...
# - attr(*, "names")= chr [1:125] "descript" "counts.n" "counts.missing" "counts.unique" ...
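As a lighter-weight alternative to reading the full str() output, the component names can be listed directly. This is only a sketch: it assumes, as in the str() output above, that each sublist has a component called counts. Other Hmisc versions may name or arrange things differently.

```r
library(Hmisc)

di <- describe(iris)

# One sublist per variable in the data frame
names(di)

# Components stored for the first variable (descript, counts, values, ...)
names(di[[1]])

# Number of summary statistics held in `counts` for each variable;
# the factor column carries fewer than the numeric columns
sapply(di, function(x) length(x$counts))
```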

The result is very long (125 elements), and every element has been coerced to the same (most inclusive) type, namely character. It seems you want the 2nd through 13th elements, that is, the counts entries from n through .95:

unlist(di[[1]])[2:13]
#     counts.n counts.missing  counts.unique    counts.Info 
#        "150"            "0"           "35"            "1" 
#  counts.Mean     counts..05     counts..10     counts..25 
#      "5.843"        "4.600"        "4.800"        "5.100" 
#   counts..50     counts..75     counts..90     counts..95 
#      "5.800"        "6.400"        "6.900"        "7.255" 

Now you have something you can start to work with. But notice that this only seems to hold for numeric variables; the factor variable Species is different:

unlist(di[[5]])
#     descript       counts.n counts.missing  counts.unique 
#    "Species"          "150"            "0"            "3" 
#      values1        values2        values3        values4 
#         "50"           "33"           "50"           "33" 
#      values5        values6 
#         "50"           "33" 

In that case, it seems you only want elements two through four.
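Because the positions differ between numeric and factor variables, it may be less brittle to pull the counts component out by name and line the variables up on the statistic names. The following is a minimal sketch, not a definitive solution. It assumes each sublist has a named counts vector, as in the str() output above, and the names counts_list, all_names, rows and stats_df are made up for illustration. Statistics that a variable lacks are filled with NA.

```r
library(Hmisc)

di <- describe(iris)

# Named `counts` vector for every variable (assumed to exist in each sublist)
counts_list <- lapply(di, function(x) x$counts)

# All statistic names, in order of first appearance
all_names <- unique(unlist(lapply(counts_list, names)))

# Align each variable to the same set of columns; absent statistics stay NA
rows <- lapply(counts_list, function(cnt) {
  out <- setNames(rep(NA_character_, length(all_names)), all_names)
  out[names(cnt)] <- as.character(cnt)
  out
})

stats_df <- data.frame(
  Variable = names(counts_list),
  do.call(rbind, rows),
  check.names = FALSE,       # keep column names such as ".05" unchanged
  stringsAsFactors = FALSE,
  row.names = NULL
)

# The statistics are stored as text; convert columns to numbers where possible
stats_df[-1] <- lapply(stats_df[-1], type.convert, as.is = TRUE)

stats_df
```

The column names come from whatever the installed Hmisc version puts in counts, so they may not match the desired table exactly.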

Using this process of discovery and problem solving, you can see how you'd take the output of describe apart and put the information you want into a data frame. However, this will take a lot of work. You'll presumably need to use loops and lots of if(){ ... } else{ ... } blocks. You might just want to code your own dataset description function from scratch.
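For the from-scratch route, here is a rough base R sketch. The function name describe_df and its probs argument are invented for this example. It does not reproduce Hmisc's Info statistic, and it assumes the default quantile() method is acceptable. Non-numeric columns get NA for the mean and quantiles.

```r
# Hypothetical helper: one row of summary statistics per column of `data`
describe_df <- function(data,
                        probs = c(0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95)) {
  rows <- lapply(names(data), function(v) {
    x <- data[[v]]

    # Statistics that make sense for any column type
    base <- data.frame(
      Variable = v,
      n        = sum(!is.na(x)),
      missing  = sum(is.na(x)),
      unique   = length(unique(x[!is.na(x)])),
      stringsAsFactors = FALSE
    )

    # Mean and quantiles only for numeric columns; NA otherwise
    if (is.numeric(x)) {
      stats <- c(mean(x, na.rm = TRUE),
                 quantile(x, probs = probs, na.rm = TRUE, names = FALSE))
    } else {
      stats <- rep(NA_real_, length(probs) + 1)
    }
    names(stats) <- c("Mean", as.character(probs))

    cbind(base, data.frame(as.list(stats), check.names = FALSE))
  })

  do.call(rbind, rows)
}

iris_summary <- describe_df(iris)
iris_summary
```

Because every row is built with the same columns, no if/else handling is needed when the rows are combined; the only branch is the numeric check inside the helper.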