I want to slice a NumPy nxn array. I want to extract an arbitrary selection of m rows and columns of that array (i.e. without any pattern in the numbers of rows/columns), making it a new, mxm array. For this example let us say the array is 4x4 and I want to extract a 2x2 array from it.
Here is our array:
from numpy import *
x = range(16)
x = reshape(x,(4,4))
print x
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]]
The line and columns to remove are the same. The easiest case is when I want to extract a 2x2 submatrix that is at the beginning or at the end, i.e. :
In [33]: x[0:2,0:2]
Out[33]:
array([[0, 1],
[4, 5]])
In [34]: x[2:,2:]
Out[34]:
array([[10, 11],
[14, 15]])
But what if I need to remove another mixture of rows/columns? What if I need to remove the first and third lines/rows, thus extracting the submatrix [[5,7],[13,15]]
? There can be any composition of rows/lines. I read somewhere that I just need to index my array using arrays/lists of indices for both rows and columns, but that doesn't seem to work:
In [35]: x[[1,3],[1,3]]
Out[35]: array([ 5, 15])
I found one way, which is:
In [61]: x[[1,3]][:,[1,3]]
Out[61]:
array([[ 5, 7],
[13, 15]])
First issue with this is that it is hardly readable, although I can live with that. If someone has a better solution, I'd certainly like to hear it.
Other thing is I read on a forum that indexing arrays with arrays forces NumPy to make a copy of the desired array, thus when treating with large arrays this could become a problem. Why is that so / how does this mechanism work?
To answer this question, we have to look at how indexing a multidimensional array works in Numpy. Let's first say you have the array x
from your question. The buffer assigned to x
will contain 16 ascending integers from 0 to 15. If you access one element, say x[i,j]
, NumPy has to figure out the memory location of this element relative to the beginning of the buffer. This is done by calculating in effect i*x.shape[1]+j
(and multiplying with the size of an int to get an actual memory offset).
If you extract a subarray by basic slicing like y = x[0:2,0:2]
, the resulting object will share the underlying buffer with x
. But what happens if you acces y[i,j]
? NumPy can't use i*y.shape[1]+j
to calculate the offset into the array, because the data belonging to y
is not consecutive in memory.
NumPy solves this problem by introducing strides. When calculating the memory offset for accessing x[i,j]
, what is actually calculated is i*x.strides[0]+j*x.strides[1]
(and this already includes the factor for the size of an int):
x.strides
(16, 4)
When y
is extracted like above, NumPy does not create a new buffer, but it does create a new array object referencing the same buffer (otherwise y
would just be equal to x
.) The new array object will have a different shape then x
and maybe a different starting offset into the buffer, but will share the strides with x
(in this case at least):
y.shape
(2,2)
y.strides
(16, 4)
This way, computing the memory offset for y[i,j]
will yield the correct result.
But what should NumPy do for something like z=x[[1,3]]
? The strides mechanism won't allow correct indexing if the original buffer is used for z
. NumPy theoretically could add some more sophisticated mechanism than the strides, but this would make element access relatively expensive, somehow defying the whole idea of an array. In addition, a view wouldn't be a really lightweight object anymore.
This is covered in depth in the NumPy documentation on indexing.
Oh, and nearly forgot about your actual question: Here is how to make the indexing with multiple lists work as expected:
x[[[1],[3]],[1,3]]
This is because the index arrays are broadcasted to a common shape. Of course, for this particular example, you can also make do with basic slicing:
x[1::2, 1::2]