Hi python enthusiasts!
I'm currently working with signal filtering for research purposes and decided to use SciPy. Nothing special, just automation of routine work.
So, here is the code
from scipy.signal import medfilt
print(medfilt([2,6,5,4,0,3,5,7,9,2,0,1], 5))
But the matter is that returned sequense is calculated wrong
SciPy: [ 2. 4. 4. 4. 4. 4. 5. 5. 5. 2. 1. 0.]
Me : [ 5. 4.5 4. 4. 4. 4. 5. 5. 5. 2. 1.5 1.]
It seems to be, that developers of package messed up one detail. When aperture (kernel in terms of SciPy) is greater than window to analyze, there is another rule of filtering.
For example with kernel=5
filtered subsequence of [2, 6, 5]
has median 5 and not 2 as SciPy calculated isn't it? And in the same way, if kernel=5
for subsequence [2,6,5,4]
medians are 5 and 4 we need to take average between them, so, the median is 4.5.
Can someone explain me who's got the right result in that case?
I believe that both you and SciPy have correct results. The difference is in what happens at the boundaries, but I believe that both you and SciPy have made valid choices.
The question is what should happen when your sliding window is at the edges, and there is no valid data to use to fill in your sliding window.
You chose to take the median of the valid part of the sliding window, which makes sense, but may add some bias because your edge points are overrepresented compared to all other points.
SciPy instead chose to extend the signal at either edge by padding zeros. So, on the boundaries, SciPy is essentially calculating
>>> np.median([0, 0, 2, 6, 5])
2.0
>>> np.median([0, 2, 6, 5, 4])
4.0
>>> np.median([9, 2, 0, 1, 0])
1.0
>>> np.median([2, 0, 1, 0, 0])
0.0
The reason why SciPy does this is almost definitely speed related: it is optimized for doing the same thing many times over, and it is much easier to optimize median
for a whole bunch of 5-element arrays than it is to optimize it for a whole bunch of 5-element arrays, and also two 4-element arrays and two 3-element arrays. There is definitely an argument to be made that it should not pad with zeros, but instead with the boundary values, but it should be noted that no boundary strategy is going to be perfect; the ideal way to deal with boundary issues is going to depend on your particular signal.
If you see Wikipedia's description of median filters, they extend the signal at either edge by padding it with the value at the edges, which also seems reasonable. They also note these three other ways of dealing with boundary issues:
In the end, you really need to try different options and see what works best for your signal. A core assumption of this kind of filtering is that your signal is going to be pretty large, and the boundary issue should never be that critical (since the majority of the signal does not exist on the boundary). It would be nice if SciPy allowed you to select what it should do at the boundaries, though!