Searching a file for a string and writing the matching lines to another file takes 15 to 20 minutes for a single 70 MB zip file (compressed size). Is there any way to reduce this?

My source code:

Getting the zip file entries:

    zipFile = new ZipFile(source_file_name);
    entries = zipFile.entries();

    while (entries.hasMoreElements())
    {
        ZipEntry entry = (ZipEntry) entries.nextElement();
        if (entry.isDirectory())
        {
            continue;
        }
        searchString(Thread.currentThread(), entry.getName(), new BufferedInputStream(zipFile.getInputStream(entry)), Out_File, search_string, stats);
    }

    zipFile.close();

Searching String

    public void searchString(Thread CThread, String Source_File, BufferedInputStream in, File outfile, String search, String stats) throws IOException
    {
        int count = 0;
        int countw = 0;
        int countl = 0;
        String s;
        String[] str;
        BufferedReader br2 = new BufferedReader(new InputStreamReader(in));
        System.out.println(CThread.currentThread());

        while ((s = br2.readLine()) != null)
        {
            str = s.split(search);
            count = str.length - 1;
            countw += count; // word count
            if (s.contains(search))
            {
                countl++; // line count
                WriteFile(CThread, s, outfile.toString(), search);
            }
        }

        br2.close();
        in.close();
    }

--------------------------------------------------------------------------------

    public void WriteFile(Thread CThread, String line, String out, String search) throws IOException
    {
        BufferedWriter bufferedWriter = null;
        System.out.println("writre thread" + CThread.currentThread());
        bufferedWriter = new BufferedWriter(new FileWriter(out, true));
        bufferedWriter.write(line);
        bufferedWriter.newLine();
        bufferedWriter.flush();
    }

Please help me. It really takes 40 minutes for 10 files using threads, and 15 to 20 minutes for a single file that is 70 MB compressed. Is there any way to reduce the time?

+3  A: 

I'm not sure whether the cost you are seeing comes from disk operations or from string manipulation. I'll assume for now that the problem is the strings; you can check that by writing a test driver that runs your code on the same line over and over.

I can tell you that split() is going to be very expensive in your case, because you are producing strings you don't need and then discarding them, which creates a lot of overhead. You may want to increase the amount of heap available to your JVM with -Xmx.

If you merely separate words by the presence of whitespace, then you would do much better by using a regular expression matcher that you create before the loop and apply to each string. The number of matches when applied to a given string will be your word count, and that should not create an array of strings (which is very wasteful and which you don't use). You will see in the JavaDocs that split does work via regular expressions; that is true, but split does the extra step of creating separate strings, and that's where your waste might be.

You can also use a regular expression to search for the match instead of contains(), though that may not be significantly faster.
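A minimal sketch of the matcher idea above, in case it helps. The class name `MatchCounter` is made up for illustration, and the sketch assumes the search term should be treated as literal text (hence `Pattern.quote`), whereas the original `split(search)` call interprets it as a regular expression.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: counts occurrences of a search term without split().
class MatchCounter {
    // One Matcher is created up front and reused for every line.
    private final Matcher matcher;

    MatchCounter(String search) {
        // Assumption: the search term is literal text, not a regex.
        this.matcher = Pattern.compile(Pattern.quote(search)).matcher("");
    }

    // Returns the number of non-overlapping occurrences in the line.
    int count(String line) {
        matcher.reset(line);
        int occurrences = 0;
        while (matcher.find()) {
            occurrences++;
        }
        return occurrences;
    }
}
```

Inside the read loop this would replace both the `split` and the `contains` call: a count greater than zero means the line matched. Note that the result can differ from `split(search).length - 1`, because `split` drops trailing empty strings and therefore undercounts a match at the end of a line. A single `Matcher` is not thread-safe, so each thread would need its own `MatchCounter`.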

You could make things parallel by using multiple threads. However, if split() is the cause of your grief, your problem is the overhead and running out of heap space, so you won't necessarily benefit from it.

More generally, if you need to do this a lot, you may want to write a script in a language more "friendly" to string manipulation. A 10-line script in Python can do this much faster.

Uri
`split() is going to be very expensive` +1 for that
Rakesh Juyal
A: 

One problem here might be that you stop reading when you write. I would probably use one thread for reading and another thread for writing the file. As an extra optimization the thread writing the results could buffer them into memory and write them to the file as a batch, say every ten entries or something.

In the writing thread you should queue the incoming entries before handling them.
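The reader/writer split described above might look roughly like this. It is only a sketch under assumptions: `QueuedLineWriter`, the queue capacity of 1024, and the `END` marker are all invented for the example, and the input is taken as an `Iterable<String>` rather than a zip entry to keep it short.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: one thread searches, a second thread writes.
class QueuedLineWriter {
    // Unique marker object telling the writer that no more lines will come.
    private static final String END = new String("END");

    static void writeMatches(Iterable<String> lines, String search, Path out)
            throws IOException, InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024); // arbitrary capacity
        AtomicReference<IOException> failure = new AtomicReference<>();

        Thread writer = new Thread(() -> {
            // The output file is opened once, in append mode like the original code.
            try (BufferedWriter w = Files.newBufferedWriter(out,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
                String line;
                while ((line = queue.take()) != END) { // identity check against the marker
                    w.write(line);
                    w.newLine();
                }
            } catch (IOException e) {
                failure.set(e);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.start();

        // The calling thread keeps reading and searching while the writer drains the queue.
        for (String line : lines) {
            if (line.contains(search)) {
                hand(queue, line, writer);
            }
        }
        hand(queue, END, writer);
        writer.join();
        if (failure.get() != null) {
            throw failure.get();
        }
    }

    // Queues an item, giving up if the writer thread has already died.
    private static void hand(BlockingQueue<String> queue, String item, Thread writer)
            throws IOException, InterruptedException {
        while (!queue.offer(item, 100, TimeUnit.MILLISECONDS)) {
            if (!writer.isAlive()) {
                throw new IOException("writer thread stopped before all lines were queued");
            }
        }
    }
}
```

The bounded queue keeps memory use limited if the searching side gets ahead of the disk. No extra batching is added on the writing side, since `BufferedWriter` already buffers internally.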

Of course, you should probably first find out where the time is actually spent: is it the I/O or something else?

fish
"As an extra optimization the thread writing the results could buffer them into memory" - This is exactly what BufferedWriter already does internally.
Adamski
Yes, of course, you are right. That was more like a general comment.
fish
+3  A: 

You are reopening the file output handle for every single line you write.

This is likely to have a massive performance impact, far outweighing other performance issues. Instead I would recommend creating the BufferedWriter once (e.g. upon the first match) and then keeping it open, writing each matching line and then closing the Writer upon completion.

Also, remove the call to flush(); there is no need to flush each line as the call to Writer.close() will automatically flush any unwritten data to disk.

Finally, as a side note your variable and method naming style does not follow the Java camel case convention; you might want to consider changing it.
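For what it's worth, here is one way the advice above could be applied. This is a sketch, not a drop-in replacement: the class `ZipLineSearch` and the method names are invented for the example, the word counting is left out, and it assumes Java 7 or later for try-with-resources.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical rewrite: the output file is opened once for the whole zip.
class ZipLineSearch {
    // Copies every line containing the search term to the already-open writer.
    static int copyMatchingLines(BufferedReader in, BufferedWriter out, String search)
            throws IOException {
        int matchedLines = 0;
        String line;
        while ((line = in.readLine()) != null) {
            if (line.contains(search)) {
                out.write(line);
                out.newLine(); // no flush() here; close() flushes at the end
                matchedLines++;
            }
        }
        return matchedLines;
    }

    // Searches all non-directory entries and returns the number of matching lines.
    static int searchZip(String zipName, String outName, String search) throws IOException {
        int total = 0;
        try (ZipFile zip = new ZipFile(zipName);
             BufferedWriter out = new BufferedWriter(new FileWriter(outName, true))) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue;
                }
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(zip.getInputStream(entry)))) {
                    total += copyMatchingLines(in, out, search);
                }
            }
        }
        return total;
    }
}
```

The writer is created lazily in the answer above ("upon the first match"); the sketch opens it eagerly instead, which is simpler but creates an empty output file when nothing matches.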

Adamski
A: 

Wow, what are you doing in this method?

WriteFile(CThread,s, outfile.toString(), search);

Every time you find a line containing your text, you create a new BufferedWriter(new FileWriter(out, true)).

Just create a bufferedWriter in your searchString method and use that to insert lines. No need to open that again and again. It will drastically improve the performance.

Rakesh Juyal
A: 

There are too many potential bottlenecks in this code for anyone to be sure what the critical ones are. Therefore you should profile the application to determine what is causing it to be slow.

Armed with that information, decide whether the problem is in reading the ZIP file, doing the searching, or writing the matches to the output file.

(Repeatedly opening and closing the output file is a bad idea, but if you only get a tiny number of search hits it won't make much difference to the overall performance.)
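If a full profiler is not at hand, a crude per-phase timer can give a first hint about which of the three phases dominates. The following is only an illustrative sketch; `PhaseTimer` and the phase names are hypothetical, and it assumes Java 8 or later.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;

// Hypothetical helper: accumulates wall-clock time per named phase.
class PhaseTimer {
    private final Map<String, Long> nanosByPhase = new LinkedHashMap<>();

    // Runs the work, adds its elapsed time to the phase total, and returns its result.
    <T> T time(String phase, Callable<T> work) throws Exception {
        long start = System.nanoTime();
        try {
            return work.call();
        } finally {
            nanosByPhase.merge(phase, System.nanoTime() - start, Long::sum);
        }
    }

    // Total nanoseconds recorded for a phase, or 0 if it never ran.
    long nanos(String phase) {
        return nanosByPhase.getOrDefault(phase, 0L);
    }

    // One "phase: N ms" line per phase, in first-use order.
    String report() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Long> e : nanosByPhase.entrySet()) {
            sb.append(e.getKey()).append(": ")
              .append(e.getValue() / 1_000_000).append(" ms\n");
        }
        return sb.toString();
    }
}
```

In the read loop this could wrap each step, for example `timer.time("read", br2::readLine)` and `timer.time("search", () -> s.contains(search))`, with `timer.report()` printed at the end. Timing every line adds overhead of its own, so the numbers are only a rough guide; a real profiler will give a more reliable picture.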

Stephen C