So you have an array:
1 2 3 60 70 80 100 220 230 250
For a better understanding:
How would you group/cluster the three areas into arrays in Python (v2.6), so that in this case you get three arrays containing
[1 2 3] [60 70 80 100] [220 230 250]
Background:
The y-axis is frequency, the x-axis is the number. These numbers are the ten highest amplitudes, represented by their frequencies. I want to create three discrete numbers from them for pattern recognition. There could be many more points, but all of them are grouped by a relatively big frequency difference, as you can see in this example between about 0 and about 50 and between about 100 and about 220. Note that what counts as big and what counts as small changes, but the difference between clusters remains significant compared to the difference between elements of a group/cluster.
This is a simple algorithm implemented in Python that checks whether or not a value is too far (in terms of standard deviation) from the mean of the current cluster:
from math import sqrt

def stat(lst):
    """Calculate mean and (population) std deviation from the input list."""
    n = float(len(lst))
    mean = sum(lst) / n
    # max() guards against a tiny negative value caused by rounding
    stdev = sqrt(max((sum(x*x for x in lst) / n) - (mean * mean), 0.0))
    return mean, stdev

def parse(lst, n):
    """Yield clusters; a value more than n std deviations from the mean starts a new one."""
    cluster = []
    for i in lst:
        if len(cluster) <= 1:    # the first two values are going directly in
            cluster.append(i)
            continue
        mean, stdev = stat(cluster)
        if abs(mean - i) > n * stdev:    # check the "distance"
            yield cluster
            cluster = []    # new list; clearing in place would also empty the cluster just yielded
        cluster.append(i)
    yield cluster    # yield the last cluster
This will return what you expect in your example with 5 < n < 9:
>>> array = [1, 2, 3, 60, 70, 80, 100, 220, 230, 250]
>>> for cluster in parse(array, 7):
...     print(cluster)
[1, 2, 3]
[60, 70, 80, 100]
[220, 230, 250]
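Since the result depends on n, it may help to see how the number of clusters changes with the threshold. The following is only a rough sketch: it restates the splitting rule from above so that it runs on its own, and cluster_counts is a hypothetical helper name introduced here for illustration, not part of the original code.

```python
from math import sqrt

def stat(values):
    """Return (mean, population standard deviation) of a non-empty list."""
    n = float(len(values))
    mean = sum(values) / n
    # max() guards against a tiny negative value caused by rounding
    variance = max(sum(x * x for x in values) / n - mean * mean, 0.0)
    return mean, sqrt(variance)

def parse(values, n):
    """Same splitting rule as above, restated so this block runs on its own."""
    cluster = []
    for value in values:
        if len(cluster) > 1:  # the first two values always go in unchecked
            mean, stdev = stat(cluster)
            if abs(mean - value) > n * stdev:
                yield cluster
                cluster = []  # start a new list so yielded clusters stay intact
        cluster.append(value)
    if cluster:
        yield cluster

def cluster_counts(values, thresholds):
    """Hypothetical helper: map each threshold n to the number of clusters found."""
    return dict((n, len(list(parse(values, n)))) for n in thresholds)

array = [1, 2, 3, 60, 70, 80, 100, 220, 230, 250]
counts = cluster_counts(array, [4, 7, 10])
for n in sorted(counts):
    print("n=%d -> %d clusters" % (n, counts[n]))
# n=4 -> 4 clusters
# n=7 -> 3 clusters
# n=10 -> 2 clusters
```

With a threshold that is too small, 250 is split off from [220, 230]; with one that is too large, 220 is absorbed into the middle cluster. So a suitable n has to be chosen for the data at hand.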