I have a conceptual question on building a histogram on the fly with Python. I am trying to figure out if there is a good algorithm or maybe an existing package.
I wrote a function, which runs a Monte Carlo simulation, gets called 1,000,000,000 times, and returns a 64 bit floating number at the end of each run. Below is the said function:
def MonteCarlo(df,head,span):
# Pick initial truck
rnd_truck = np.random.randint(0,len(df))
full_length = df['length'][rnd_truck]
full_weight = df['gvw'][rnd_truck]
# Loop using other random trucks until the bridge is full
while True:
rnd_truck = np.random.randint(0,len(df))
full_length += head + df['length'][rnd_truck]
if full_length > span:
break
else:
full_weight += df['gvw'][rnd_truck]
# Return average weight per feet on the bridge
return(full_weight/span)
df
is a Pandas dataframe object having columns labeled as 'length'
and 'gvw'
, which are truck lengths and weights, respectively. head
is the distance between two consecutive trucks, span
is the bridge length. The function randomly places trucks on the bridge as long as the total length of the truck train is less than the bridge length. Finally, calculates the average weight of the trucks existing on the bridge per foot (total weight existing on the bridge divided by the bridge length).
As a result I would like to build a tabular histogram showing the distribution of the returned values, which can be plotted later. I had some ideas in mind:
Keep collecting the returned values in a numpy vector, then use existing histogram functions once the MonteCarlo analysis is completed. This would not be feasable, since if my calculation is correct, I would need 7.5 GB of memory for that vector only (1,000,000,000 64 bit floats ~ 7.5 GB)
Initialize a numpy array with a given range and number of bins. Increase the number of items in the matching bin by one at the end of each run. The problem is, I do not know the range of values I would get. Setting up a histogram with a range and an appropriate bin size is an unknown. I also have to figure out how to assign values to the correct bins, but I think it is doable.
Do it somehow on the fly. Modify ranges and bin sizes each time the function returns a number. This would be too tricky to write from scratch I think.
Well, I bet there may be a better way to handle this problem. Any ideas would be welcome!
On a second note, I tested running the above function for 1,000,000,000 times only to get the largest value that is computed (the code snippet is below). And this takes around an hour when span = 200
. The computation time would increase if I run it for longer spans (the while loop runs longer to fill the bridge with trucks). Is there a way to optimize this you think?
max_w = 0
i = 1
while i < 1000000000:
if max_w < MonteCarlo(df_basic, 15., 200.):
max_w = MonteCarlo(df_basic, 15., 200.)
i += 1
print max_w
Thanks!
Here is a possible solution, with fixed bin size, and bins of the form [k * size, (k + 1) * size[. The function finalizebins returns two lists: one with bin counts (a), and the other (b) with bin lower bounds (the upper bound is deduced by adding binsize).
import math, random
def updatebins(bins, binsize, x):
i = math.floor(x / binsize)
if i in bins:
bins[i] += 1
else:
bins[i] = 1
def finalizebins(bins, binsize):
imin = min(bins.keys())
imax = max(bins.keys())
a = [0] * (imax - imin + 1)
b = [binsize * k for k in range(imin, imax + 1)]
for i in range(imin, imax + 1):
if i in bins:
a[i - imin] = bins[i]
return a, b
# A test with a mixture of gaussian distributions
def check(n):
bins = {}
binsize = 5.0
for i in range(n):
if random.random() > 0.5:
x = random.gauss(100, 50)
else:
x = random.gauss(-200, 150)
updatebins(bins, binsize, x)
return finalizebins(bins, binsize)
a, b = check(10000)
# This must be 10000
sum(a)
# Plot the data
from matplotlib.pyplot import *
bar(b,a)
show()