Edit: I condensed this question given that it was probably too involved to begin with. The meat of the question is in bold below.
I'd like to know more about the object that is actually created when using DataFrame.rolling or Series.rolling:
print(type(df.rolling(3)))
<class 'pandas.core.window.Rolling'>
Some background: consider the oft-used alternative with np.lib.stride_tricks.as_strided. This code snippet itself isn't important, but its result is my reference point in asking this question.
import numpy as np

def rwindows(a, window):
    # Promote 1d input to a single column so both cases share one code path
    if a.ndim == 1:
        a = a.reshape(-1, 1)
    shape = a.shape[0] - window + 1, window, a.shape[-1]
    strides = (a.strides[0],) + a.strides
    windows = np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
    return np.squeeze(windows)
Here rwindows will take a 1d or 2d ndarray and build rolling "blocks" equal to the specified window size (as below). How does a .rolling object compare to the ndarray output below? Is it an iterator, with certain attributes stored for each block? Or something else entirely? I've tried playing around with tab completion on the object with attributes/methods such as __dict__ and _get_index() and they're not telling me much. I've also seen a _create_blocks method in pandas--does it at all resemble the strided method?
# as_strided version
a = np.arange(5)
print(rwindows(a, 3))  # 1d input
[[0 1 2]
 [1 2 3]
 [2 3 4]]
b = np.arange(10).reshape(5, 2)
print(rwindows(b, 4))  # 2d input
[[[0 1]
  [2 3]
  [4 5]
  [6 7]]

 [[2 3]
  [4 5]
  [6 7]
  [8 9]]]
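For comparison, NumPy 1.20 and later ship np.lib.stride_tricks.sliding_window_view, which builds the same kind of windowed view without hand-computed strides. This is only a sketch and assumes that NumPy version is available; the transpose in the 2d case is an addition to match the (window, column) layout that rwindows returns.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

a = np.arange(5)
win_1d = sliding_window_view(a, 3)  # shape (3, 3)

b = np.arange(10).reshape(5, 2)
# The window axis is appended last, giving shape (2, 2, 4);
# swap the last two axes to get (n_windows, window, n_columns).
win_2d = sliding_window_view(b, 4, axis=0).transpose(0, 2, 1)

print(win_1d)
print(win_2d)
```

The returned arrays are read-only views, which avoids the accidental out-of-bounds writes that raw as_strided allows.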
Using the NumPy approach above (OLS implementation here) is necessitated by the fact that func within pandas.core.window.Rolling.apply must "produce a single value from an ndarray input" (*args and **kwargs are passed to the function). So the argument can't be another rolling object. I.e.
def prod(a, b):
    return a * b

df.rolling(3).apply(prod, args=((df + 2).rolling(3),))
-----------------------------------------------------------------------
...
TypeError: unsupported operand type(s) for *: 'float' and 'Rolling'
So this is really where my question above stems from. Why is it that the passed function must use a NumPy array and produce a single scalar value, and what does this have to do with the layout of a .rolling object?
I suggest you have a look at the source code to get into the nitty-gritty of what rolling does. In particular, look at the rolling functions in generic.py and window.py. From there you can look at the Window class, which is used if you specify a window type, or the default Rolling class. The latter inherits from _Rolling_and_Expanding and ultimately from _Rolling and _Window.
That said, I'll give my two cents: Pandas' whole rolling mechanism relies on the numpy function apply_along_axis. In particular it is used here in pandas. It is used in conjunction with the window.pyx cython module. In goes your series, out comes the aggregated rolling window. For typical aggregation functions it handles them for you efficiently, but for custom ones (using apply()) it uses roll_generic() in window.pyx.
The rolling function in pandas operates on each DataFrame column independently. The object it returns is not a Python iterator, and it is lazily evaluated: nothing is computed until you apply an aggregation function to it. The functions that actually slide the window over the data aren't called until right before an aggregation is done.
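A small sketch of that laziness, using a toy frame of my own: creating the rolling object only records the settings, and the numbers appear once an aggregation such as sum() is called.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(5)})

roll = df.rolling(3)          # no aggregation has run yet
print(type(roll).__name__)    # class name of the rolling object
print(roll.window)            # the window size it was configured with

result = roll.sum()           # the actual computation happens here
print(result)
```

The result is an ordinary DataFrame with the same index as the input, with NaN in the rows where the window is not yet full.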
A source of confusion might be that you're thinking of the rolling object as a dataframe. (You have named the rolling object df in your last code snippet.) It really isn't. It is an object which can produce dataframes by applying aggregations over the window logic it houses.
The function you supply is applied once for each cell of the new dataframe. It takes a window looking backwards (along each column) in the old dataframe and aggregates it into one single cell of the new dataframe. The aggregation can be things like sum, mean, something custom you've made, etc., over some window size, say 3. Here are some examples:
a = np.arange(5)
df = pd.DataFrame(a, columns=['a'])
df.rolling(3).mean().dropna()
... which can also be done by:
df.rolling(3).apply(np.mean).dropna()
... and produces:
     a
2  1.0
3  2.0
4  3.0
(The first column is the index value and can be ignored here, and for the next examples.)
Notice how we supplied an existing numpy aggregation function. That's the idea. We're supposed to be able to supply anything we want as long as it conforms to what aggregation functions do, i.e., take a vector of values and produce a single value from it. Here is another one where we create a custom aggregation function, in this case the L2 norm of the window:
df.rolling(3).apply(lambda x: np.sqrt(x.dot(x))).dropna()
If you're not familiar with lambda functions, this is the same as:
def euclidean_dist(x):
    return np.sqrt(x.dot(x))

df.rolling(3).apply(euclidean_dist).dropna()
... yielding:
          a
2  2.236068
3  3.741657
4  5.385165
Just to make sure, we can manually check that np.sqrt(0**2 + 1**2 + 2**2) is indeed 2.236068.
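The same check can be done for every window at once. This is a sketch with names of my own (l2_norm, manual); it assumes a pandas version that accepts raw=True, so that each window arrives as a plain NumPy array.

```python
import numpy as np
import pandas as pd

x = np.arange(5, dtype=float)
s = pd.Series(x)

def l2_norm(window):
    # window is a 1d float array holding one full window
    return np.sqrt(window.dot(window))

rolled = s.rolling(3).apply(l2_norm, raw=True).dropna()

# Slice the windows out by hand and apply the same function
manual = [l2_norm(x[i:i + 3]) for i in range(len(x) - 3 + 1)]

print(rolled.tolist())
print(manual)
```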
In the last code snippet of your original edit, your code is probably failing earlier than you expect. It is failing before the invocation of df.apply(...). You are trying to add a rolling object named df to the number 2 before it is passed to df.apply(...). The rolling object isn't something you do operations on. The function you have supplied also doesn't behave like an aggregation function in general. The a is an array with the values of a window; b would be a constant extra parameter you pass in. It can be a rolling object if you want, but that wouldn't typically be something you would want to do. To make it clearer, here is something which is similar to what you were doing in your original edit but works:
a = np.arange(8)
df = pd.DataFrame(a, columns=['a'])
n = 4
rol = df.rolling(n)

def prod(window_list, constant_rol):
    return window_list.dot(constant_rol.sum().dropna().head(n))

rol.apply(prod, args=(rol,)).dropna()
# [92.0, 140.0, 188.0, 236.0, 284.0]
It is a contrived example, but I'm showing it to make the point that you can pass in whatever you want as a constant, even the rolling object you are using itself. The dynamic part is the first argument, a in your case or window_list in my case. All defined windows, in the form of individual arrays, are passed into that function one by one.
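A less contrived sketch of the same idea: the window is the changing first argument and a fixed weight vector is passed through args. The weights and the function name are made up for illustration, and raw=True assumes a pandas version that supports it.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5, dtype=float))
weights = np.array([0.2, 0.3, 0.5])  # hypothetical constant weights

def weighted_sum(window, w):
    # window changes on every call; w is the same object each time
    return float(np.dot(window, w))

out = s.rolling(3).apply(weighted_sum, raw=True, args=(weights,))
print(out)
```

The function still returns one scalar per window, which is what apply() requires.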
Based on your followup comments this might be what you're looking for:
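import numpy as np
import pandas as pd

n = 3
a = np.arange(5)
df = pd.DataFrame(a, columns=['a'])

def keep(window, windows):
    # Store a copy of each full window; the return value only satisfies apply()
    windows.append(window.copy())
    return window[-1]

windows = list()
# raw=True hands each window to keep() as a NumPy array (pandas >= 0.23)
df['a'].rolling(n).apply(keep, raw=True, args=(windows,))
# One stored window per full window, so keep exactly that many trailing rows
df = df.tail(len(windows)).copy()
df['a_window'] = windows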
which stores each rolling block as an array next to the row it ends on, producing:
   a         a_window
2  2  [0.0, 1.0, 2.0]
3  3  [1.0, 2.0, 3.0]
4  4  [2.0, 3.0, 4.0]
Note that it only works if you do it on one column at a time. If you want to do some math on the window before you store it away in keep, that is fine too.
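If you are on a newer pandas (1.1 or later, as far as I know), the rolling object can also be iterated over directly, which gives the windows without going through apply(). A sketch under that assumption:

```python
import numpy as np
import pandas as pd

n = 3
s = pd.Series(np.arange(5), name='a')

# Iteration also yields the shorter leading windows, so keep only full ones
windows = [w.to_numpy() for w in s.rolling(n) if len(w) == n]

for w in windows:
    print(w)
```

Each item yielded by the iteration is a Series slice of the original data, so the original dtype is preserved rather than being converted to float.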
That said, without more input on exactly what you are trying to achieve it is hard to construct an example which suits your needs.
If your ultimate goal is to create a dataframe of lagging variables, then I'd create real columns with shift():
import numpy as np
import pandas as pd

a = np.arange(5)
df = pd.DataFrame(a, columns=['a'])
for i in range(1, 3):
    df['a-%s' % i] = df['a'].shift(i)
df.dropna()
... giving:
   a  a-1  a-2
2  2  1.0  0.0
3  3  2.0  1.0
4  4  3.0  2.0
(There might be some more beautiful way of doing it, but it gets the job done.)
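One possible alternative, sketched here with a hypothetical n_lags variable, builds the shifted columns in a list and joins them with pd.concat so the original frame is not modified:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5), columns=['a'])

n_lags = 2  # hypothetical number of lagged columns
columns = [df['a']] + [
    df['a'].shift(i).rename('a-%s' % i) for i in range(1, n_lags + 1)
]
lagged = pd.concat(columns, axis=1).dropna()
print(lagged)
```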
Regarding your variable b in your first code snippet, remember DataFrames in pandas aren't typically handled as tensors of arbitrary dimensions/objects. You can probably stuff whatever you want into one, but ultimately strings, time objects, ints and floats are what is expected. That might be the reason the designers of pandas haven't bothered with allowing rolling aggregation to non-scalar values. It doesn't even seem like a simple string is allowed as output of the aggregation function.
Anyway, I hope this answers some of your questions. If not, let me know and I'll try to help you out in the comments or in an update.
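Final note on the _create_blocks() function of rolling objects.
The _create_blocks() function handles the reindexing and binning when you use the freq argument of rolling. (Both are tied to the pandas version this was written against: _create_blocks() is a private method, and freq is not accepted by rolling in newer releases.)
If you use freq with, say, weeks such that freq='W':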
import numpy as np
import pandas as pd

a = np.arange(50)
df = pd.DataFrame(a, columns=['a'])
df.index = pd.to_datetime('2016-01-01') + pd.to_timedelta(df['a'], 'D')
blocks, obj, index = df.rolling(4, freq='W')._create_blocks(how=None)
for b in blocks:
    print(b)
... then we get the binned (not rolling) original data week-by-week:
a
a
2016-01-03 2.0
2016-01-10 9.0
2016-01-17 16.0
2016-01-24 23.0
2016-01-31 30.0
2016-02-07 37.0
2016-02-14 44.0
2016-02-21 NaN
Notice that this isn't the output of the aggregated rolling. These are simply the new blocks it works on. After this, we do an aggregation like sum and get:
a
a
2016-01-03 NaN
2016-01-10 NaN
2016-01-17 NaN
2016-01-24 50.0
2016-01-31 78.0
2016-02-07 106.0
2016-02-14 134.0
2016-02-21 NaN
... which checks out with a test summation: 50 = 2 + 9 + 16 + 23.
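On pandas versions where rolling no longer takes freq, a roughly equivalent result can be sketched by changing the frequency first and rolling afterwards. Using asfreq('W') here is my assumption about the closest public replacement; unlike the output above, it does not produce the trailing all-NaN week.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2016-01-01', periods=50, freq='D')
df = pd.DataFrame({'a': np.arange(50)}, index=idx)

weekly = df.asfreq('W')           # keeps the value on each Sunday: 2, 9, 16, ...
rolled = weekly.rolling(4).sum()  # rolling sum over four weekly rows
print(rolled)
```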
If you don't use freq as an argument, it simply returns the original data structure:
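import numpy as np
import pandas as pd

a = np.arange(5)
df = pd.DataFrame(a, columns=['a'])
# Same date index as in the previous example, so the printed output matches
df.index = pd.to_datetime('2016-01-01') + pd.to_timedelta(df['a'], 'D')
blocks, obj, index = df.rolling(3)._create_blocks(how=None)
for b in blocks:
    print(b)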
... which produces ...
a
a
2016-01-01 0
2016-01-02 1
2016-01-03 2
2016-01-04 3
2016-01-05 4
... and is used for rolling window aggregation.