spread() function in tidyr

HLD25 picture HLD25 · Feb 12, 2015 · Viewed 12.5k times · Source

I have a CRSP list of stock price as follow

    PERMNO  date        TICKER  RETX
1   10138   2007-01-03  TROW    0.045236
2   10138   2007-01-04  TROW    0.008743
3   10138   2007-01-05  TROW    -0.001950
4   10138   2007-01-08  TROW    0.018237
5   10138   2007-01-09  TROW    0.004051
6   10138   2007-01-10  TROW    0.005734
7   10138   2007-01-11  TROW    0.019637
8   10138   2007-01-12  TROW    0.005591
...
1   10145   2007-01-03  HON -0.003095
2   10145   2007-01-04  HON -0.000443
3   10145   2007-01-05  HON -0.009539
4   10145   2007-01-08  HON 0.006047
5   10145   2007-01-09  HON 0.007124
6   10145   2007-01-10  HON -0.006189
7   10145   2007-01-11  HON 0.016681
8   10145   2007-01-12  HON -0.003282
9   10145   2007-01-16  HON 0.001317
10  10145   2007-01-17  HON -0.001754
11  10145   2007-01-18  HON -0.010979
...

Once I use tidyr::spread(x,TICKER,RETX), it return a matrix with most of value to be NA. Is there is any other function to rearrange the matrix, listing each stock price in one column? or how to achieve it by a few line?

update: I figure out it is the PERMNO column causing the problem. After I get rid of the PERMNO column another problem appear:

> spread(A1[,2:4],TICKER,RETX)
Error: Duplicate identifiers for rows (129717, 143815), (129718, 143816), ...

So, I just randomly select two row mentioned in the message

       PERMNO       date TICKER     RETX
129717  75104 2007-01-03    CBS 0.012172
> A1[143815,]
       PERMNO       date TICKER    RETX
143815  76226 2007-01-03    CBS 0.01347

Turn out the data set is quite dirty and it contains duplicated series. A better solution would be using PERMNO as key. Here is what I get

    date        10225       10516       10909       ...
1   2007-01-03  0.005738    0.003129    -0.006593   ...
2   2007-01-04  -0.011062   -0.005615   0.028761    ...
3   2007-01-05  0.000824    -0.001568   -0.022366   ...
4   2007-01-08  -0.005059   0.005027    -0.003520   ...
5   2007-01-09  0.002956    -0.024383   0.000883    ...
6   2007-01-10  -0.003301   -0.008651   -0.010587   ...
...

It's frustrating, but I finally get something. Is there anyway to replace the numerical column name with the matching TICKER. Here is a demo

    PERMNO  date        FO          HON        ...
1   10225   2007-01-03  0.005738    -0.003095  ...
2   10225   2007-01-04  -0.011062   -0.000443  ...
3   10225   2007-01-05  0.000824    -0.009539  ...
4   10225   2007-01-08  -0.005059   0.006047   ...
5   10225   2007-01-09  0.002956    0.007124   ...
6   10225   2007-01-10  -0.003301   -0.006189  ...
7   10225   2007-01-11  0.007925    0.016681   ...
8   10225   2007-01-12  -0.010914   -0.003282  ...

Answer

shirewoman2 picture shirewoman2 · Feb 12, 2015

If you have duplicated data in places, you'll first need to get rid of those values because otherwise, if you use tidyr::spread, it will replace the value with the length. Anyway, assuming that you've removed duplicates using unique or something similar, here's how I would do it with tidyr, since that's what you asked and since tidyr is pretty nifty and concise:

 A1 <- spread(A1[, c("date", "TICKER", "RETX")], TICKER, RETX)

If you include PERMNO, you will get NAs for every line where that particular value of TICKER doesn't have a matching value in PERMNO.