I have to use thousands of data points read from a file, and use that data hundreds of times repetitively in order to train and test my AI algorithm. Right now, I have two possible solutions. One is to keep reading directly from the file every time I need the thousands of data points. The other is to read from the file once, store the data in an ArrayList, and then repeatedly use the ArrayList by looping through it. Which way is faster? If possible, could someone also provide me with the big-O notation for each of the two methods? Also, could there be a completely new way of approaching this problem that could reduce the time it takes to read this overwhelming amount of data?

Thanks, Hee

+2  A: 

You should write a simple performance test for both, but I am pretty certain that reading from the disk once and caching the results in memory in your ArrayList will win every time. The overhead and latency of file I/O will cause the two approaches to diverge further and further as the number of items you read increases.
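
A rough harness along these lines could serve as that test. This is only a sketch: the file name training_data.txt and the one-value-per-line format are assumptions, since the question doesn't say how the data is stored.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    public class ReadBenchmark {

        // Hypothetical data file: one numeric value per line.
        private static final String DATA_FILE = "training_data.txt";

        // Approach 1: re-read the file on every pass.
        static double sumFromFile() throws IOException {
            double sum = 0;
            try (BufferedReader in = new BufferedReader(new FileReader(DATA_FILE))) {
                String line;
                while ((line = in.readLine()) != null) {
                    sum += Double.parseDouble(line);
                }
            }
            return sum;
        }

        // Approach 2: load the file once into an ArrayList, then iterate in memory.
        static List<Double> loadOnce() throws IOException {
            List<Double> data = new ArrayList<>();
            try (BufferedReader in = new BufferedReader(new FileReader(DATA_FILE))) {
                String line;
                while ((line = in.readLine()) != null) {
                    data.add(Double.parseDouble(line));
                }
            }
            return data;
        }

        public static void main(String[] args) throws IOException {
            int passes = 100;

            long t0 = System.nanoTime();
            for (int i = 0; i < passes; i++) {
                sumFromFile();
            }
            long fileTime = System.nanoTime() - t0;

            List<Double> cached = loadOnce();
            long t1 = System.nanoTime();
            double sum = 0;
            for (int i = 0; i < passes; i++) {
                for (double d : cached) {
                    sum += d;
                }
            }
            long memTime = System.nanoTime() - t1;

            System.out.printf("File re-read: %d ms, in-memory: %d ms (checksum %.2f)%n",
                    fileTime / 1_000_000, memTime / 1_000_000, sum);
        }
    }

Run a few warm-up passes before timing if you want more trustworthy numbers, since JIT compilation and filesystem caching will otherwise skew the first measurements.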

Athens
If local memory is the limitation, one solution is a distributed cache such as memcache, or implementing your own LRU cache (which may be overkill when memcache is available); see the sketch below.
Athens
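
For illustration, a minimal LRU cache can be built on LinkedHashMap's access-order mode. This is only a sketch; the key and value types and the capacity in the usage note are placeholders, not recommendations.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Minimal LRU cache sketch: evicts the least-recently-used entry once capacity is exceeded.
    public class LruCache<K, V> extends LinkedHashMap<K, V> {
        private final int capacity;

        public LruCache(int capacity) {
            super(16, 0.75f, true); // true = access-order iteration, which LRU eviction relies on
            this.capacity = capacity;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > capacity;
        }
    }

Usage would be something like new LruCache<Integer, double[]>(10_000): look each record up in the cache first and fall back to the file on a miss, so only the most recently used records stay in memory.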
+1  A: 

I think that:

  • reading from the ArrayList is much faster.
  • the big-O is the same; it is the unit cost of each operation that differs.

The problem arises if your memory is not big enough to hold all of this. Then you have to resort to using the file, trading speed for (memory) space.

KLE
A: 

No need for big-O analysis. Memory I/O always outperforms disk I/O (moving parts). Just compare memory-based sorting algorithms with disk-based (external) sorting algorithms, and you'll see.

Disk I/O should be considered when you have so much data that it won't fit into memory.

Mike Yam
Not really, now that we have solid state drives.
Joel
Solid-state drives still aren't faster than RAM.
Mike Yam
Upvoting to cancel the poorly-reasoned downvote. Do some research, would you?
Platinum Azure
Thanks Platinum. As soon as I saw the down vote on your perfectly valid answer, I wanted to do the same, but I'm new here and don't have enough points to vote.
Mike Yam
Platinum, I returned the favor to reverse the unreasonable downvote.
Mike Yam
+2  A: 

Do you access the data serially or randomly? If access is random, then it may be faster to load the data into memory once, since you won't have to move the file pointer around for every lookup. There would be a big-O penalty if you needed to allocate memory to perform the operation on the data on each iteration, but without more information I couldn't say what it would be.
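
To make the file-pointer point concrete, here is a sketch of the two kinds of random access. It assumes, purely for illustration, a binary file of fixed-width 8-byte doubles; the question doesn't specify the file format.

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.List;

    public class RandomAccessExample {

        // File-backed lookup: every access has to reposition the file pointer first.
        static double readFromDisk(RandomAccessFile raf, int index) throws IOException {
            raf.seek((long) index * Double.BYTES);
            return raf.readDouble();
        }

        // Memory-backed lookup: constant-time indexing, no seeking.
        static double readFromMemory(List<Double> data, int index) {
            return data.get(index);
        }
    }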

If you access the data serially, then there is no difference in big-O between the two methods. It is completely OS- and physical-architecture-dependent. On a good OS with good filesystem caching, the two methods should be similar, with the speed advantage going to caching in an ArrayList and the space advantage going to reading from the file, since you don't have to keep the memory allocated.

My best advice is to implement and time both methods on your target OS and CPU. Because of the order-of-magnitude differences in speed between CPU processing, CPU caches, RAM, and disk access, performance on modern architectures is extremely hard to predict when two algorithms have identical big-O.

David Gladfelter
+1  A: 

As others have said, big-O analysis will be the same.

This is because you're always reading through all the data the first time, and then reusing the data in the same way each time.

This is a good example of why asymptotic analysis is not always enough: Here your difference is going to be due to memory vs disk I/O. Disk I/O tends to take milliseconds; memory will take microseconds, possibly approaching nanoseconds if your data can be cached in just the right way.

If not everything will fit in memory, though, you really have no choice but to go with the file-reading approach. And it will be slow. But that's just how it goes sometimes, unfortunately.

Platinum Azure
If he spills over into virtual memory, there's no performance loss relative to reading from the disk. And if he loads it all into memory, he gains by adding more memory.
Joel
Okay, my answer assumes his physical memory is fairly high, sure. But as you point out, he gains if he has lots of memory and loads it all into memory. That's precisely what my answer says. Assuming you're the one who gave the -1, I think you're being a little overly persnickety. IN GENERAL memory is faster than disk, and that was the entire point of what I was trying to say.
Platinum Azure