I am working on a (database-ish) project where data is stored in a flat file. For reading and writing I'm using the RandomAccessFile class. Will I gain anything from multithreading and giving each thread its own instance of RandomAccessFile, or will one thread/instance be just as fast? Is there any difference between reading and writing, given that you can create instances that can only read and not write?
Oops: I first wrote that RandomAccessFile is synchronised, so that sharing an instance would leave only one thread running at a time anyway. That was wrong. RandomAccessFile is not synchronised, and sharing an instance between threads is not entirely safe. You will, as ever, need to be careful when you have multiple threads accessing the same mutable data structure, particularly when the vagaries of operating systems are involved.
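Small operations on RandomAccessFile are hideously slow, because the class does no buffering and every small read or write goes down to the operating system.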
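To illustrate the point about small operations, here is a minimal sketch that compares many tiny reads with one bulk read decoded in memory. It assumes a hypothetical file layout of consecutive 4-byte big-endian ints; the class and method names (BulkRead, readIntsSlow, readIntsBulk) are made up for this example and are not part of any library.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;

class BulkRead {
    // Slow variant: one unbuffered readInt() call per value.
    static int[] readIntsSlow(RandomAccessFile raf, long offset, int count) throws IOException {
        int[] out = new int[count];
        raf.seek(offset);
        for (int i = 0; i < count; i++) {
            out[i] = raf.readInt();
        }
        return out;
    }

    // Usually faster: fetch the whole block with one readFully(), then decode in memory.
    static int[] readIntsBulk(RandomAccessFile raf, long offset, int count) throws IOException {
        byte[] block = new byte[count * Integer.BYTES];
        raf.seek(offset);
        raf.readFully(block);
        ByteBuffer buf = ByteBuffer.wrap(block); // big-endian by default, same as readInt()
        int[] out = new int[count];
        for (int i = 0; i < count; i++) {
            out[i] = buf.getInt();
        }
        return out;
    }
}
```

Both methods return the same values; how much faster the bulk variant is depends on the JDK and the OS, so it is worth measuring on your own data.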
For maximum performance, you are probably better off going straight for java.nio, although I would suggest getting something working before getting it to work fast. OTOH, keep performance in mind.
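As a rough idea of what the java.nio route can look like: FileChannel offers a positional read that takes the file offset as an argument and leaves the channel's own position untouched, so several threads can share one channel without fighting over seek(). This is only a sketch; the class PositionalRead and the helper readAt are hypothetical names.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

class PositionalRead {
    // Reads up to 'length' bytes starting at 'position'. The channel's own
    // position is not changed, so callers do not share a file pointer.
    static byte[] readAt(FileChannel channel, long position, int length) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(length);
        while (buf.hasRemaining()) {
            int n = channel.read(buf, position + buf.position());
            if (n < 0) {
                break; // end of file reached before 'length' bytes were available
            }
        }
        byte[] out = new byte[buf.position()]; // number of bytes actually read
        buf.flip();
        buf.get(out);
        return out;
    }
}
```

A channel for this can be opened with FileChannel.open(path, StandardOpenOption.READ), and a call such as readAt(channel, 128, 64) then fetches 64 bytes from offset 128.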
Looking at the JavaDoc for RandomAccessFile, the class itself is not synchronized. It does offer synchronous modes ("rws" and "rwd"), but these only force each update to be written through to the storage device; they do not make the class thread-safe. Either way you are going to have to manage the locks on reading and writing yourself, which is far from trivial. The same is going to be true for straight java.io when using multiple threads.
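The simplest form of managing the locks yourself is to wrap the shared instance so that seek() and the following read or write always run as one unit. The sketch below is an assumption about how that could look, not a recommended design; SharedFile and its methods are hypothetical names.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

class SharedFile implements AutoCloseable {
    private final RandomAccessFile raf;

    SharedFile(String path) throws IOException {
        this.raf = new RandomAccessFile(path, "rw");
    }

    // seek() and readFully() must happen together; otherwise another
    // thread could move the file pointer in between the two calls.
    synchronized byte[] read(long offset, int length) throws IOException {
        byte[] data = new byte[length];
        raf.seek(offset);
        raf.readFully(data);
        return data;
    }

    // Same reasoning for writes: position and write under one lock.
    synchronized void write(long offset, byte[] data) throws IOException {
        raf.seek(offset);
        raf.write(data);
    }

    @Override
    public synchronized void close() throws IOException {
        raf.close();
    }
}
```

Note that this serialises all access, so only one thread touches the file at a time. It makes sharing safe, but it does not make the I/O any more parallel.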
If at all possible you probably want to look at using a database since a database provides this kind of multi-threaded abstraction. You might also look at what syslog options are available for Java or even log4j.
There is an option to memory-map your flat file with NIO. In that case the OS memory manager becomes responsible for paging sections of the file in and out. You can also apply region locks for writers.
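A minimal sketch of the memory-mapping idea, assuming a hypothetical file of fixed-size 16-byte records (MappedRecords, RECORD_SIZE, writeRecord and readRecord are names invented for this example). One caveat worth stating: a FileLock is held on behalf of the whole JVM, so region locks coordinate between processes, not between threads inside the same JVM.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MappedRecords {
    static final int RECORD_SIZE = 16; // hypothetical fixed record size

    // Overwrites one record through a memory mapping while holding an
    // exclusive lock on just that record's byte range.
    static void writeRecord(Path file, int index, byte[] record) throws IOException {
        if (record.length != RECORD_SIZE) {
            throw new IllegalArgumentException("record must be " + RECORD_SIZE + " bytes");
        }
        long offset = (long) index * RECORD_SIZE;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ,
                     StandardOpenOption.WRITE, StandardOpenOption.CREATE);
             FileLock lock = ch.lock(offset, RECORD_SIZE, false)) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_WRITE, offset, RECORD_SIZE);
            map.put(record);
            map.force(); // ask the OS to write the mapped region back to the file
        }
    }

    // Reads one record through a read-only mapping.
    static byte[] readRecord(Path file, int index) throws IOException {
        long offset = (long) index * RECORD_SIZE;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, offset, RECORD_SIZE);
            byte[] out = new byte[RECORD_SIZE];
            map.get(out);
            return out;
        }
    }
}
```

In a real program you would normally map a large region once and keep it, rather than remapping per record as this sketch does for brevity.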
A fairly common question. Basically, using multiple threads will not make your hard drive go any faster. Instead, issuing concurrent requests can make it slower.
Disk subsystems, especially IDE, EIDE and SATA, are designed to be fastest when reading and writing sequentially.
In my experience from C++ development the answer is: yes, using multiple threads can improve performance when reading files. This applies to both sequential and random access. I have shown this more than once, although I always found that the real bottlenecks were somewhere else.
The reason is that a thread doing disk access is suspended until the disk operation has completed. But most disks today support Native Command Queuing (SAS as well as SATA drives, e.g. from Seagate, and most RAID systems too) and therefore do not have to handle requests in the order you make them.
Thus if you read 4 file chunks sequentially, your program has to wait for the first chunk, then request the second one, and so on. If you request the 4 chunks with 4 threads, they may be returned all at once. This kind of optimization has limits, but it works (although I only have experience with C++ here). I measured that multiple threads can improve sequential read performance by more than 100%.
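Since the question is about Java, here is a rough Java sketch of the "one chunk per thread" idea. It is not the benchmark used for the numbers in this answer; ParallelChunkRead, readChunk and readAll are hypothetical names, and it assumes a shared FileChannel with positional reads so that the threads do not share a file pointer.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelChunkRead {
    // Reads the byte range [start, end) with positional reads and
    // returns the number of bytes actually read.
    static long readChunk(FileChannel ch, long start, long end) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(64 * 1024);
        long pos = start;
        while (pos < end) {
            buf.clear();
            buf.limit((int) Math.min(buf.capacity(), end - pos));
            int n = ch.read(buf, pos);
            if (n < 0) {
                break; // end of file
            }
            pos += n;
        }
        return pos - start;
    }

    // Splits the file into 'threads' chunks and reads them concurrently.
    static long readAll(String path, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try (FileChannel ch = FileChannel.open(Paths.get(path), StandardOpenOption.READ)) {
            long size = ch.size();
            long chunk = (size + threads - 1) / threads; // round up so no bytes are skipped
            List<Future<Long>> parts = new ArrayList<>();
            for (int i = 0; i < threads; i++) {
                long start = Math.min(size, i * chunk);
                long end = Math.min(size, start + chunk);
                parts.add(pool.submit(() -> readChunk(ch, start, end)));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get(); // wait for each chunk to finish
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.out.println("Usage: <InputFile> <threadCount>");
            return;
        }
        int threads = args.length > 1 ? Integer.parseInt(args[1]) : 1;
        long t0 = System.nanoTime();
        long bytes = readAll(args[0], threads);
        System.out.printf("Read %d bytes with %d threads in %.2f s%n",
                bytes, threads, (System.nanoTime() - t0) / 1e9);
    }
}
```

Whether this is any faster than a single thread depends on the disk, the OS file cache and the thread count, so treat any timing from it as specific to your machine.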
I have now run a benchmark with the code below (excuse me, it's in C++). The code reads a 5 MB text file with the number of threads passed as a command line argument.
The results show that a moderate number of threads speeds up the program, while too many threads slow it down again:
Update: It occurred to me that file caching will play quite a role here. So I made copies of the test data file, rebooted and used a different file for each run. Updated results below (old ones in brackets). The conclusion remains the same.
Runtime in Seconds
Machine A (Dual Quad Core XEON running XP x64 with 4 10k SAS Drives in RAID 5)
- 1 Thread: 0.61s (0.61s)
- 2 Threads: 0.44s (0.43s)
- 4 Threads: 0.31s (0.28s) (Fastest)
- 8 Threads: 0.53s (0.63s)
Machine B (Dual Core Laptop running XP with one fragmented 2.5 Inch Drive)
- 1 Thread: 0.98s (1.01s)
- 2 Threads: 0.67s (0.61s) (Fastest)
- 4 Threads: 1.78s (0.63s)
- 8 Threads: 2.06s (0.80s)
Source code (Windows):
// FileReadThreads.cpp : Defines the entry point for the console application.
//
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>    // atoi, exit
#include <sys/timeb.h> // ftime, struct timeb
///////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////
int threadCount = 1;
char *fileName = 0;
int fileSize = 0;
double GetSecs(void);
///////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////
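DWORD WINAPI FileReadThreadEntry(LPVOID lpThreadParameter)
{
    char tx[255];
    // The thread index is passed in the pointer-sized parameter.
    int index = (int)(INT_PTR)lpThreadParameter;
    // Binary mode, so that fseek/ftell work on plain byte offsets.
    FILE *file = fopen(fileName, "rb");
    if(! file)
    {
        printf("THREAD %4lu unable to open %s\n", GetCurrentThreadId(), fileName);
        return 1;
    }
    int start = (fileSize / threadCount) * index;
    // The last thread also reads the remainder left over by the integer division.
    int end = (index == threadCount - 1) ? fileSize : (fileSize / threadCount) * (index + 1);
    fseek(file, start, SEEK_SET);
    printf("THREAD %4lu started: Bytes %d-%d\n", GetCurrentThreadId(), start, end);
    // Read line by line until this thread's chunk is exhausted.
    for(;;)
    {
        if(! fgets(tx, sizeof(tx), file))
            break;
        if(ftell(file) >= end)
            break;
    }
    fclose(file);
    printf("THREAD %4lu done\n", GetCurrentThreadId());
    return 0;
}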
///////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////
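int main(int argc, char* argv[])
{
    if(argc <= 1)
    {
        printf("Usage: <InputFile> <threadCount>\n");
        exit(-1);
    }
    if(argc > 2)
        threadCount = atoi(argv[2]);
    // WaitForMultipleObjects accepts at most MAXIMUM_WAIT_OBJECTS handles,
    // and a thread count below 1 would divide by zero in the worker.
    if(threadCount < 1)
        threadCount = 1;
    if(threadCount > MAXIMUM_WAIT_OBJECTS)
        threadCount = MAXIMUM_WAIT_OBJECTS;
    fileName = argv[1];
    FILE *file = fopen(fileName, "rb");
    if(! file)
    {
        printf("Unable to open %s\n", argv[1]);
        exit(-1);
    }
    // Determine the file size by seeking to the end.
    fseek(file, 0, SEEK_END);
    fileSize = ftell(file);
    fclose(file);
    printf("Starting to read file %s with %d threads\n", fileName, threadCount);
    ///////////////////////////////////////////////////////////////////////////
    // Start threads
    ///////////////////////////////////////////////////////////////////////////
    double start = GetSecs();
    HANDLE mWorkThread[MAXIMUM_WAIT_OBJECTS];
    for(int i = 0; i < threadCount; i++)
    {
        mWorkThread[i] = CreateThread(
            NULL,                   // default security attributes
            0,                      // default stack size
            FileReadThreadEntry,    // thread entry point
            (LPVOID)(INT_PTR) i,    // thread index as parameter
            0,                      // start immediately
            NULL);                  // thread id not needed
    }
    WaitForMultipleObjects(threadCount, mWorkThread, TRUE, INFINITE);
    // GetSecs() returns milliseconds, hence the division by 1000.
    printf("Runtime %.2f Secs\nDone\n", (GetSecs() - start) / 1000.);
    for(int i = 0; i < threadCount; i++)
        CloseHandle(mWorkThread[i]);
    return 0;
}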
///////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////
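// Despite its name, this returns the current time in milliseconds.
double GetSecs(void)
{
    struct timeb timebuffer;
    ftime(&timebuffer);
    // The timezone offset (in minutes) cancels out when two values are subtracted.
    return (double)timebuffer.millitm +
           ((double)timebuffer.time * 1000.) -
           ((double)timebuffer.timezone * 60. * 1000.);
}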