views: 574

answers: 2

At the place where I work, we used to have files with more than a million rows per file. Even though the server has more than 10 GB of memory, with 8 GB allocated to the JVM, the server sometimes hangs for a few moments and chokes the other tasks.

I profiled the code and found that while reading the files, memory use frequently rises into the gigabytes (1 GB to 3 GB) and then suddenly comes back to normal. It seems that this frequent swing between high and low memory use hangs my servers. Of course this was due to garbage collection.

Which API should I use to read the files for better performance?

Right now I am using BufferedReader(new FileReader(...)) to read these CSV files.

Process: How am I reading the file?

  1. I read files line by line.
  2. Every line has a few columns. Based on their types I parse them accordingly (cost column as double, visit column as int, keyword column as String, etc.).
  3. I push the eligible content (visit > 0) into a HashMap and finally clear that Map at the end of the task (see the sketch below).
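
For illustration, a minimal sketch of that process might look like the code below. The column order, the tab delimiter, and the Row class are assumptions made for the example, not my actual code:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class MonthlyFileReader {

    // Hypothetical stand-in for the real per-row data class.
    static class Row {
        final double cost;
        final int visits;
        Row(double cost, int visits) { this.cost = cost; this.visits = visits; }
    }

    static Map<String, Row> readEligible(String path) throws IOException {
        Map<String, Row> eligible = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {      // 1. read line by line
                String[] cols = line.split("\t", -1);         // 2. split into columns
                String keyword = cols[0];                     //    (column positions assumed)
                double cost = Double.parseDouble(cols[1]);
                int visits = Integer.parseInt(cols[2]);
                if (visits > 0) {                             // 3. keep only eligible rows
                    eligible.put(keyword, new Row(cost, visits));
                }
            }
        }
        return eligible;                                      // caller clears the map afterwards
    }
}
```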

Update

I do this reading for 30 or 31 files (one month's data) and store the eligible rows in a Map. Later this map is used to find some culprits in different tables. Therefore reading is a must, and storing that data is also a must. I have switched the HashMap part to BerkeleyDB now, but the issue at file-reading time is the same or even worse.

+5  A: 

I profiled the code and found that while reading the files, memory use frequently rises into the gigabytes (1 GB to 3 GB) and then suddenly comes back to normal. It seems that this frequent swing between high and low memory use hangs my servers. Of course this was due to garbage collection.

Using BufferedReader(new FileReader(...)) won't cause that.

I suspect that the problem is that you are reading the lines/rows into an array or list, processing them and then discarding the array/list. This will cause the memory usage to increase and then decrease again. If this is the case, you can reduce memory usage by processing each line/row as you read it.
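
As a rough sketch of that alternative (the `process` callback and class name are hypothetical), each row would be handled and then dropped instead of being kept in a collection:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.function.Consumer;

public class StreamingReader {

    // Each line is handed to the callback and then becomes garbage immediately,
    // so the heap never holds more than one row of the file at a time.
    static void forEachLine(String path, Consumer<String> process) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                process.accept(line);   // process here instead of adding to a list
            }
        }
    }
}
```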

EDIT: We are agreed that the problem is about the space used to represent the file content in memory. An alternative to a huge in-memory hashtable is to go back to the old "sort merge" approach we used when computer memory was measured in Kbytes. (I'm assuming that the processing is dominated by a step where you are doing a lookup with keys K to get the associated row R.)

  1. If necessary, preprocess each of the input files so that they can be sorted on the key K.

  2. Use an efficient file sort utility to sort all of the input files into order on K. You want to use a utility that uses a classical merge sort algorithm. This will split each file into smaller chunks that can be sorted in memory, sort the chunks, write them to temporary files, then merge the sorted temporary files. The UNIX / Linux sort utility is a good option.

  3. Read the sorted files in parallel, reading all rows that relate to each key value from all files, processing them, and then stepping on to the next key value (a sketch of this merge step follows the list).
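
A rough sketch of step 3, assuming tab-separated rows with the key in the first column (the `process` hook and class names are placeholders):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class SortedFileMerge {

    // One open reader per sorted input file, positioned on its current row.
    static class Cursor {
        final BufferedReader reader;
        String current;

        Cursor(String path) throws IOException {
            reader = new BufferedReader(new FileReader(path));
            current = reader.readLine();
        }

        String key() {
            return current.split("\t", -1)[0];   // key assumed to be the first column
        }

        void advance() throws IOException {
            current = reader.readLine();
        }
    }

    static void mergeByKey(List<String> sortedPaths) throws IOException {
        PriorityQueue<Cursor> heap = new PriorityQueue<>(Comparator.comparing(Cursor::key));
        for (String path : sortedPaths) {
            Cursor c = new Cursor(path);
            if (c.current != null) {
                heap.add(c);
            }
        }
        while (!heap.isEmpty()) {
            String key = heap.peek().key();
            List<String> rowsForKey = new ArrayList<>();
            // Pull every row with this key, from every file, before moving on.
            while (!heap.isEmpty() && heap.peek().key().equals(key)) {
                Cursor c = heap.poll();
                rowsForKey.add(c.current);
                c.advance();
                if (c.current != null) {
                    heap.add(c);
                } else {
                    c.reader.close();
                }
            }
            process(key, rowsForKey);   // hypothetical per-key processing hook
        }
    }

    static void process(String key, List<String> rows) {
        // ... look up / aggregate the rows for this key, then let them be collected ...
    }
}
```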

Actually, I'm a bit surprised that using BerkeleyDB didn't help. However, if profiling tells you that most of the time was going into building the DB, you may be able to speed it up by sorting the input file (as above!) into ascending key order before you build the DB. (When creating a large file-based index, you get better performance if the entries are added in key order.)
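
For what it's worth, a rough sketch of a key-ordered load, assuming Berkeley DB Java Edition with string keys and values (the environment path, database name, and value format are placeholders):

```java
import com.sleepycat.bind.tuple.StringBinding;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.DatabaseException;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;

import java.io.File;
import java.util.Map;
import java.util.TreeMap;

public class SortedBdbLoad {

    static void load(Map<String, String> eligible) throws DatabaseException {
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        Environment env = new Environment(new File("bdb-env"), envConfig);   // placeholder path

        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        Database db = env.openDatabase(null, "monthly-data", dbConfig);      // placeholder name

        // A TreeMap iterates in ascending key order, so the puts walk the
        // B-tree sequentially instead of jumping around in it.
        Map<String, String> sorted = new TreeMap<>(eligible);
        DatabaseEntry key = new DatabaseEntry();
        DatabaseEntry data = new DatabaseEntry();
        for (Map.Entry<String, String> e : sorted.entrySet()) {
            StringBinding.stringToEntry(e.getKey(), key);     // reuse the same entry objects
            StringBinding.stringToEntry(e.getValue(), data);
            db.put(null, key, data);
        }
        db.close();
        env.close();
    }
}
```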

Stephen C
added my further comments
DKSRathore
Yes, you seem right, Stephen. But sorting is not helpful for me here, since out of approx 2.8M rows, 2.4M rows have distinct keys and only approx 200K cross the eligibility criteria to get a place in the Map/BDB. The culprit I found seems to be the split function and new object creation all over the code. I shall try to minimize new object creation as well as reuse the same object variables.
DKSRathore
+9  A: 

BufferedReader is one of the two best APIs to use for this. If you really had trouble with file reading, an alternative might be to use the stuff in NIO to memory-map your files and then read the contents directly out of memory.
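
Roughly, memory-mapping and scanning for line breaks might look like this (assumes the file fits in a single mapping, is UTF-8, and that every line ends with '\n'; the class name is a placeholder):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;

public class MappedLineReader {

    // Maps the whole file and cuts it on '\n'. A single mapping is limited
    // to about 2 GB, so large files would need to be mapped in chunks.
    static void readLines(String path) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r");
             FileChannel channel = file.getChannel()) {
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            int lineStart = 0;
            for (int i = 0; i < buffer.limit(); i++) {
                if (buffer.get(i) == '\n') {
                    byte[] bytes = new byte[i - lineStart];
                    buffer.position(lineStart);
                    buffer.get(bytes);
                    String line = new String(bytes, StandardCharsets.UTF_8);
                    // ... parse the line here ...
                    lineStart = i + 1;
                }
            }
        }
    }
}
```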

But your problem is not with the reader. Your problem is that every read operation creates a bunch of new objects, most likely in the stuff you do just after reading.

You should consider cleaning up your input processing with an eye on reducing the number and/or size of objects you create, or simply getting rid of objects more quickly once no longer needed. Would it be possible to process your file one line or chunk at a time rather than inhaling the whole thing into memory for processing?

Another possibility would be to fiddle with garbage collection. You have two mechanisms:

  • Explicitly call the garbage collector every once in a while, say every 10 seconds or every 1000 input lines. This increases the total amount of work done by the GC, but each individual collection takes less time, your memory won't swell as much, and so hopefully there will be less impact on the rest of the server (see the sketch after this list).

  • Fiddle with the JVM's garbage collector options. These differ between JVMs, but java -X should give you some hints.
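
A bare-bones sketch of the first option (the interval and the per-line handler are arbitrary placeholders):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class PeriodicGcReader {

    // Requests a collection every 1000 input lines. The interval is arbitrary,
    // and System.gc() is only a hint that the JVM is free to ignore
    // (or that can be switched off entirely with -XX:+DisableExplicitGC).
    static void read(String path) throws IOException {
        int linesRead = 0;
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                processLine(line);                // hypothetical per-line handler
                if (++linesRead % 1000 == 0) {
                    System.gc();
                }
            }
        }
    }

    static void processLine(String line) {
        // ... parse and aggregate ...
    }
}
```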

Update: Most promising approach:

Do you really need the whole dataset in memory at one time for processing?

Carl Smotricz
Helpful comment. I already call the GC once memory breaches a certain threshold, but this does not seem to be helping me. You pointed out that object creation is the culprit. It might be, as I saw some BigDecimal classes in the stack trace along with the classes I created.
DKSRathore
If there are data types in the profile that you didn't know about previously, chances are you're using a 3rd-party library for something. That could be a possible culprit for leaking memory!
Carl Smotricz
No, not at all. All I am using is Integer, Long, Double, File, FileReader, BufferedReader, line.split("\t", -1), Map, HashMap, all from the Java standard library. I use my own classes Key and Data; each class has 4 variables. And I aggregate data by Key and store it in a HashMap.
DKSRathore
Explicit GC is a bad idea. But -XX:+DisableExplicitGC will remove that problem. Instead, monitor the GC statistics using "jstat" or more advanced tools. Adjust the new vs. old ratio, adjust the eden vs. survivor ratio.
Christian
Thanks for the clarification. Since your HashMap will contain lots of elements, you can save a *little* pain by pre-allocating a large enough HashMap (use the constructor with the `initialCapacity` argument).
Carl Smotricz
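
For illustration, pre-sizing along those lines might look like this (the 200K figure comes from an earlier comment, and the value type and class name are placeholders):

```java
import java.util.HashMap;
import java.util.Map;

public class PreSizedMap {

    // Roughly 200K eligible entries per the earlier comment; the exact number
    // is an assumption. Dividing by the default load factor (0.75) means the
    // map never has to resize (and re-hash) on the way to that size.
    private static final int EXPECTED_ENTRIES = 200_000;

    static Map<String, Object> newEligibleMap() {
        return new HashMap<>((int) (EXPECTED_ENTRIES / 0.75f) + 1);
    }
}
```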
No, Carl, that was the first thing I thought of. Setting the initialCapacity did not help me significantly.
DKSRathore
Next tip: I think `java.util.Scanner` will make your job simpler and a little faster, and probably use less memory. I don't like `line.split()` as it uses a regular expression, with perhaps far too much overhead. Scanner does too, but you just create it once for the whole file and the regexp can be reused.
Carl Smotricz
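
A rough sketch of the Scanner approach, assuming tab-separated columns in a fixed (hypothetical) order:

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Locale;
import java.util.Scanner;

public class ScannerReader {

    // One Scanner (and one compiled delimiter pattern) for the whole file,
    // instead of a String.split() call per line.
    static void read(String path) throws FileNotFoundException {
        try (Scanner scanner = new Scanner(new File(path))) {
            scanner.useDelimiter("[\t\r\n]+");   // collapses tabs and line breaks
            scanner.useLocale(Locale.US);        // so nextDouble() accepts "1.23"
            while (scanner.hasNext()) {
                String keyword = scanner.next();       // column order is assumed
                double cost = scanner.nextDouble();
                int visits = scanner.nextInt();
                // ... aggregate by keyword if visits > 0 ...
            }
        }
    }
}
```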
I have tried the split with StringUtils.split as well, with no improvement. Now I shall give Scanner a try. Let me see if this solves my problems. Thanks, Carl.
DKSRathore
These are all interesting things to try, but they will not significantly affect the memory used by your process. BufferedReader's memory usage will be constant. line.split()'s temporary storage is directly proportional to the length of a single line. If you are storing your entire file's contents in memory until it has been read and then flushing it to Berkeley DB or something, then your scalability will always be limited by file size. You can move the size around, but you will always be limited. The key is to read what you need and store what you can as soon as possible.
PSpeed
Heh, I told him that too, but he won't listen to me :) Maybe his requirements are such that he can't. I'm betting on the possibility that a lot of unnecessary garbage objects are being (temporarily) constructed in the parsing process, faster than the GC can (efficiently) clean them up. But I agree that this is less likely than that he's simply buffering a lot of data.
Carl Smotricz
@DKSRathore: If you continue to have problems, please just show us the code! You can use PasteBin if it's too big to sensibly post here. Also a small handful of sample data, and some idea of how big (in lines) the file for a given day may become. Finally: if you're using BerkeleyDB in in-memory mode, that doesn't solve the problem, it only moves it to a different area of memory :)
Carl Smotricz
OK, Carl. BerkeleyDB may be one of the places where the memory consumption is happening, with too many new objects being created while storing to BerkeleyDB or getting data back from it. I think creating fewer objects and reusing the same object variables is the only solution, rather than looking for another API.
DKSRathore