views: 270
answers: 4
I am writing a PHP function that will need to loop over an array of pointers and, for each item, pull in that data (be it from a MySQL database or a flat file). Does anyone have any ideas for optimizing this, as there could potentially be thousands and thousands of iterations?

My first idea was to have a static array of cached data to work on; any modifications would just change that cached array, and at the end I could flush it to disk. However, in a loop of over 1000 items, this is useless if I only keep around 30 in the array. Each item isn't too big, but 1000+ of them in memory is way too much, hence the need for disk storage.

The data is just gzipped, serialized objects. Currently I am using a database to store the data, but I am thinking flat files might be quicker (I don't care about concurrency issues, and I don't need to parse anything, just unzip and unserialize). I already have a custom iterator that pulls in 5 items at a time (to cut down on DB connections) and stores them in this cache. But again, a cache of 30 when I need to iterate over thousands is fairly useless.
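For reference, here is a stripped-down sketch of the kind of batching iterator I mean (the table and column names are simplified placeholders, and it assumes the blobs were written with gzcompress() on a serialize()'d object):

    <?php
    // Stripped-down batching iterator: pulls a handful of rows per query and
    // keeps only the current batch in memory. Table/column names are placeholders.
    class BatchItemIterator implements Iterator
    {
        private $pdo;
        private $ids;               // ids of the items to iterate over
        private $batchSize;
        private $pos = 0;           // position within $ids
        private $cache = array();   // id => unserialized object, current batch only

        public function __construct(PDO $pdo, array $ids, $batchSize = 5)
        {
            $this->pdo       = $pdo;
            $this->ids       = array_values($ids);
            $this->batchSize = $batchSize;
        }

        public function rewind() { $this->pos = 0; $this->cache = array(); }
        public function valid()  { return $this->pos < count($this->ids); }
        public function key()    { return $this->ids[$this->pos]; }
        public function next()   { $this->pos++; }

        public function current()
        {
            $id = $this->ids[$this->pos];
            if (!isset($this->cache[$id])) {
                $this->loadBatch($this->pos);   // one query for the next few items
            }
            return isset($this->cache[$id]) ? $this->cache[$id] : null;
        }

        private function loadBatch($from)
        {
            $batch = array_slice($this->ids, $from, $this->batchSize);
            $in    = implode(',', array_fill(0, count($batch), '?'));
            $stmt  = $this->pdo->prepare("SELECT id, data FROM items WHERE id IN ($in)");
            $stmt->execute($batch);

            $this->cache = array();   // drop the previous batch
            foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
                $this->cache[$row['id']] = unserialize(gzuncompress($row['data']));
            }
        }
    }

It gets used as foreach (new BatchItemIterator($pdo, $idList, 5) as $id => $object) { ... }.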

Basically I just need a way to iterate over these many items quickly.

A: 

Clearly, keeping it in memory is faster than anything else. How big is each item? Even if they are 1 KB each, ten thousand of them is only 10 MB.

wallyk
If they are images or something similar, you could see sizes on the order of tens of megabytes or more, depending on the compression level. If you have 100 15 MB image objects in memory, you've just used ~1.5 GB of memory!
San Jacinto
Each item is probably around 1-5 KB, but with, say, 100 concurrent users, that gets high quickly.
Louis
A: 

You can always break out of the loop once you have the data you need, so that it doesn't keep looping. If you are storing flat files, your server's HDD will suffer from holding thousands or millions of files of differing sizes. But if you are talking about storing the whole actual file in a DB, it is much better to store the file in a folder and just save its path in the DB. Also try putting the pulled items into an XML document, so that they are easier to access and can carry many attributes describing each item, e.g. name, date uploaded, etc.
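Something like this is what I mean by keeping the file on disk and only the path in the DB (the folder, table, and column names are just examples):

    <?php
    // Rough sketch of the "blob on disk, path in the DB" idea.
    // Directory, table, and column names are placeholder examples.
    function saveItem(PDO $pdo, $id, $object, $dir = '/var/data/items')
    {
        $path = sprintf('%s/%d.ser.gz', $dir, (int) $id);
        file_put_contents($path, gzcompress(serialize($object)));

        $stmt = $pdo->prepare('REPLACE INTO items (id, path) VALUES (?, ?)');
        $stmt->execute(array($id, $path));
    }

    function loadItem(PDO $pdo, $id)
    {
        $stmt = $pdo->prepare('SELECT path FROM items WHERE id = ?');
        $stmt->execute(array($id));
        $path = $stmt->fetchColumn();

        return $path ? unserialize(gzuncompress(file_get_contents($path))) : null;
    }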

Treby
+1  A: 

Well, you haven't given a whole lot to go on. You don't describe your data, and you don't describe what your data is doing or when you need one object as opposed to another, and how those objects get released temporarily, and under what circumstances you need it back, and...

So anything anybody says here is going to be a complete shot in the dark.

...so along those lines, here's a shot in the dark.

If you are only comfortable holding x items in memory at any one time, set aside space for x items. Then, every time you access an object, make a note of the time (this might not mean clock time so much as the order in which you access them). Keep the items in a list (it may not be implemented as a list, but rather as a heap-like structure) so that the most recently used items appear sooner in the list. When you need to bring a new one into memory, you replace the one that was used the longest time ago, and then you move that item to the front of the list.

You may need to keep another index of the items so that you know exactly where they are in the list when you need them. What you do then is look up where the item is located, link its parent and child pointers as appropriate, then move it to the front of the list. There are probably other ways to optimize lookup time, too.

This is called the LRU algorithm. It's a page replacement scheme for virtual memory. What it does is delay your bottleneck (the disk I/O) until it is probably impossible to avoid. It is worth noting that this algorithm does not guarantee optimal replacement, but it performs pretty well nonetheless.
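Here is a minimal sketch of the idea in PHP. PHP arrays keep their insertion order, so you can fake the recency list by removing and re-adding keys; the loader callback and the capacity of 30 are placeholders for however you actually pull one item from disk or the DB:

    <?php
    // Minimal LRU cache sketch. $loader is any callable that fetches a single
    // item from the DB or a flat file; the capacity is whatever you can afford.
    class LruCache
    {
        private $capacity;
        private $loader;
        private $items = array();   // key => item, ordered oldest-first

        public function __construct($capacity, $loader)
        {
            $this->capacity = $capacity;
            $this->loader   = $loader;
        }

        public function get($key)
        {
            if (array_key_exists($key, $this->items)) {
                // Hit: move the item to the most-recently-used end.
                $item = $this->items[$key];
                unset($this->items[$key]);
                return $this->items[$key] = $item;
            }

            // Miss: evict the least recently used item if full, then load.
            if (count($this->items) >= $this->capacity) {
                reset($this->items);
                unset($this->items[key($this->items)]);
            }
            return $this->items[$key] = call_user_func($this->loader, $key);
        }
    }

    // e.g. $cache = new LruCache(30, function ($id) { /* fetch, gzuncompress(), unserialize() */ });

If items can be modified, the eviction step is also where you would flush a dirty item back to disk (or the database) before dropping it.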

Beyond that, I would recommend parallelizing your code to a large degree (if possible) so that when one item needs to hit the hard disk to load or to dump, you can keep that processor busy doing real work.

< edit > Based on your comment, you are working on a neural network. In the case of your initial feeding of the data (before the correction stage), or when you are actively using it to classify, I don't see how the algorithm is a bad idea, unless there is just no possible way to fit the most commonly used nodes in memory.

In the correction stage (perhaps back-prop?), it should be apparent what nodes you MUST keep in memory... because you've already visited them!

If your network is large, you aren't going to get away with no disk I/O. The trick is to find a way to minimize it. < /edit >

San Jacinto
Yeah, sorry, it's hard to explain. The data is a serialized object, so when I unserialize it, it will be in memory. That is what I am doing at the moment. But the problem comes when I need to re-iterate over everything: the ones in cache will be the last ones used, so when another function starts a new iteration, all those in cache won't be needed until the end. For the record, I am working on neural networks, hence this insanity ;)
Louis
You are right. The problem is that, for training, I have to run the network forwards with some inputs, which means iterating over the neurons in one layer, then the neurons in the next, etc. Then I run it backwards by calculating the error in the recently run network. I'm trying hard to optimize this so only one iteration occurs, but I don't think it's possible. Either way, disk I/O is going to be needed, but maybe bringing in many items at a time is the answer. Memory is cheap, after all.
Louis
A: 

You could use memcached to store objects the first time they are read, then use the cached version on subsequent calls. Memcached uses RAM to store objects, so as long as you have enough memory you will get a great acceleration. There is a PHP API for memcached.
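A rough sketch with the pecl Memcached extension (the server address, key prefix, TTL, and the SQL fallback are just placeholders):

    <?php
    // Check memcached first; on a miss, rebuild the object from the DB row
    // holding the gzipped serialized blob and cache it for the next caller.
    $memcached = new Memcached();
    $memcached->addServer('127.0.0.1', 11211);

    function getItem(Memcached $memcached, PDO $pdo, $id)
    {
        $key    = 'item_' . (int) $id;
        $object = $memcached->get($key);

        if ($object === false && $memcached->getResultCode() === Memcached::RES_NOTFOUND) {
            $stmt = $pdo->prepare('SELECT data FROM items WHERE id = ?');
            $stmt->execute(array($id));
            $object = unserialize(gzuncompress($stmt->fetchColumn()));
            $memcached->set($key, $object, 3600);   // keep it for an hour
        }
        return $object;
    }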

solidgumby