I've been working on a graphing/data processing application (you can see a screenshot here) using Clojure (though, oftentimes, it feels like I'm using more Java than Clojure), and have started testing my application with bigger datasets. I have no problem with around 100k points, but when I start getting higher than that, I run into heap space problems.
Now, theoretically, about half a GB should be enough to hold around 70 million doubles. Granted, I'm doing many things that require some overhead, and I may in fact be holding 2-3 copies of the data in memory at the same time, but I haven't optimized much yet, and 500k or so is still orders of magnitude less than that I should be able to load.
I understand that Java has artificial restrictions (that can be changed) on the size of the heap, and I understand those can be changed, in part, with options you can specify as the JVM starts. This leads me to my first questions:
Can I change the maximum allowed heap space if I am using Swank-Clojure (via Leiningen) the JVM has on startup?
If I package this application (like I plan to) as an Uberjar, would I be able to ensure my JVM has some kind of minimum heap space?
But I'm not content with just relying on the heap of the JVM to power my application. I don't know the size of the data I may eventually be working with, but it could reach millions of points, and perhaps the heap couldn't accommodate that. Therefore, I'm interesting in finding alternatives to just piling the data on. Here are some ideas I had, and questions about them:
Would it be possible to read in only parts of a large (text) file at a time, so I could import and process the data in "chunks", e.g,
n
lines at a time? If so, how?Is there some faster way of accessing the file I'd be reading from (potentially rapidly, depending on the implementation), other than simply reading from it a bit at a time? I guess I'm asking here for any tips/hacks that have worked for you in the past, if you've done a similar thing.
Can I "sample" from the file; e.g. read only every
z
lines, effectively downsampling my data?
Right now I plan on, if there are answers to the above (I'll keep searching!), or insights offered that lead to equivalent solutions, read in a chunk of data at a time, graph it to the timeline (see the screenshot–the timeline is green), and allowed the user to interact with just that bit until she clicks next chunk
(or something), then I'd save changes made to a file and load the next "chunk" of data and display it.
Alternatively, I'd display the whole timeline of all the data (downsampled, so I could load it), but only allow access to one "chunk" of it at a time in the main window (the part that is viewed above the green timeline, as outlined by the viewport rectangle in the timeline).
Most of all, though, is there a better way? Note that I cannot downsample the primary window's data, as I need to be able to process it and let the user interact with it (e.g, click a point or near one to add a "marker" to that point: that marker is drawn as a vertical rule over that point).
I'd appreciate any insight, answers, suggestions or corrections! I'm also willing to expound on my question in any way you'd like.
This will hopefully, at least in part, be open-sourced; I'd like a simple-to-use yet fast way to make xy-plots of lots of data in the Clojure world.
Thank you so much,
Isaac
EDIT Downsampling is possible only when graphing, and not always then, depending on the parts being graphed. I need access to all the data to perform analysis on. (Just clearing that up!) Though I should definitely look into downsampling, I don't think that will solve my memory issues in the least, as all I'm doing to graph is drawing on a BufferedImage.