I have a CRSP list of stock price as follow
PERMNO date TICKER RETX
1 10138 2007-01-03 TROW 0.045236
2 10138 2007-01-04 TROW 0.008743
3 10138 2007-01-05 TROW -0.001950
4 10138 2007-01-08 TROW 0.018237
5 10138 2007-01-09 TROW 0.004051
6 10138 2007-01-10 TROW 0.005734
7 10138 2007-01-11 TROW 0.019637
8 10138 2007-01-12 TROW 0.005591
...
1 10145 2007-01-03 HON -0.003095
2 10145 2007-01-04 HON -0.000443
3 10145 2007-01-05 HON -0.009539
4 10145 2007-01-08 HON 0.006047
5 10145 2007-01-09 HON 0.007124
6 10145 2007-01-10 HON -0.006189
7 10145 2007-01-11 HON 0.016681
8 10145 2007-01-12 HON -0.003282
9 10145 2007-01-16 HON 0.001317
10 10145 2007-01-17 HON -0.001754
11 10145 2007-01-18 HON -0.010979
...
Once I use tidyr::spread(x,TICKER,RETX)
, it return a matrix with most of value to be NA.
Is there is any other function to rearrange the matrix, listing each stock price in one column? or how to achieve it by a few line?
update: I figure out it is the PERMNO column causing the problem. After I get rid of the PERMNO column another problem appear:
> spread(A1[,2:4],TICKER,RETX)
Error: Duplicate identifiers for rows (129717, 143815), (129718, 143816), ...
So, I just randomly select two row mentioned in the message
PERMNO date TICKER RETX
129717 75104 2007-01-03 CBS 0.012172
> A1[143815,]
PERMNO date TICKER RETX
143815 76226 2007-01-03 CBS 0.01347
Turn out the data set is quite dirty and it contains duplicated series. A better solution would be using PERMNO as key. Here is what I get
date 10225 10516 10909 ...
1 2007-01-03 0.005738 0.003129 -0.006593 ...
2 2007-01-04 -0.011062 -0.005615 0.028761 ...
3 2007-01-05 0.000824 -0.001568 -0.022366 ...
4 2007-01-08 -0.005059 0.005027 -0.003520 ...
5 2007-01-09 0.002956 -0.024383 0.000883 ...
6 2007-01-10 -0.003301 -0.008651 -0.010587 ...
...
It's frustrating, but I finally get something. Is there anyway to replace the numerical column name with the matching TICKER. Here is a demo
PERMNO date FO HON ...
1 10225 2007-01-03 0.005738 -0.003095 ...
2 10225 2007-01-04 -0.011062 -0.000443 ...
3 10225 2007-01-05 0.000824 -0.009539 ...
4 10225 2007-01-08 -0.005059 0.006047 ...
5 10225 2007-01-09 0.002956 0.007124 ...
6 10225 2007-01-10 -0.003301 -0.006189 ...
7 10225 2007-01-11 0.007925 0.016681 ...
8 10225 2007-01-12 -0.010914 -0.003282 ...
If you have duplicated data in places, you'll first need to get rid of those values because otherwise, if you use tidyr::spread
, it will replace the value with the length. Anyway, assuming that you've removed duplicates using unique or something similar, here's how I would do it with tidyr, since that's what you asked and since tidyr is pretty nifty and concise:
A1 <- spread(A1[, c("date", "TICKER", "RETX")], TICKER, RETX)
If you include PERMNO
, you will get NAs for every line where that particular value of TICKER
doesn't have a matching value in PERMNO
.