Charting massive amounts of data

SomethingBetter picture SomethingBetter · Jan 27, 2011 · Viewed 8.9k times · Source

We are currently using ZedGraph to draw a line chart of some data. The input data comes from a file of arbitrary size, therefore, we do not know what the maximum number of datapoints in advance. However, by opening the file and reading the header, we can find out how many data points are in the file.

The file format is essentially [time (double), value (double)]. However, the entries are not uniform in the time axis. There may not be any points between say t = 0 sec and t = 10 sec, but there might be 100K entires between t = 10 sec and t = 11 sec, and so on.

As an example, our test dataset file is ~2.6 GB and it has 324M points. We'd like to show the entire graph to the user and let her navigate through the chart. However, loading up 324M points to ZedGraph not only is impossible (we're on a 32-bit machine), but also not useful since there is no point of having so many points on the screen.

Using the FilteredPointList feature of ZedGraph also appears to be out of question, since that requires loading the entire data first and then performing filtering on that data.

So, unless we're missing anything, it appears that our only solution is to -somehow- decimate the data, however as we keep working on it, we're running into a lot of problems:

1- How do we decimate data that is not arriving uniformly in time?

2- Since the entire data can't be loaded into memory, any algorithm needs to work on the disk and so needs to be designed carefully.

3- How do we handle zooming in and out, especially, when the data is not uniform on the x-axis.

If data was uniform, upon initial load of the graph, we could Seek() by predefined amount of entries in the file, and choose every N other samples and feed it to ZedGraph. However, since the data is not uniform, we have to be more intelligent in choosing the samples to display, and we can't come up with any intelligent algorithm that would not have to read the entire file.

I apologize since the question does not have razor-sharp specificity, but I hope I could explain the nature and scope of our problem.

We're on Windows 32-bit, .NET 4.0.

Answer

Gabriel Magana picture Gabriel Magana · Jan 27, 2011

I've needed this before, and it's not easy to do. I ended up writing my own graph component because of this requirement. It turned out better in the end, because I put in all the features we needed.

Basically you need to get the range of data (min and max possible/needed index values), subdivide into segments (let's say 100 segments), and then determine a value for each segment by some algorithm (average value, median value, etc.). Then you plot based on those summarized 100 elements. This is much faster than trying to plot millions of points :-).

So what I am saying is similar to what you are saying. You mention you do not want to plot every X elements because there might be a long stretch of time (index values on the x axis) between elements. What I am saying is that for each subdivision of data determine what is the best value, and take that as the data point. My method is index value based, so in your example of no data between the 0 sec and 10 sec index values I would still put data points there, they would just have the same values among themselves.

The point is to summarize the data before you plot it. Think through your algorithms to do that carefully, there are lots of ways to do so, choose the one that works for your application.

You might get away with not writing your own graph component and just write the data summarization algorithm.