After two days of trying to figure out this problem and reading countless articles, I finally decided to ask for some advice (my first time here).
Now to the issue at hand: I am writing a program that parses API data from a game, namely battle logs. There will be a lot of entries in the database (20+ million), so the parsing speed for each battle log page matters quite a bit.
The pages to be parsed look like this: http://api.erepublik.com/v1/feeds/battle_logs/10000/0 (view the page source if you are using Chrome, as it doesn't display the page correctly). Each page has 1000 hit entries followed by a little battle info (the last page will have fewer than 1000, obviously). On average, a page contains 175,000 characters in UTF-8 encoding, XML format (version 1.0). The program will run locally on a good PC, and memory is virtually unlimited (so creating a byte[250000] is fine).
The format never changes, which is quite convenient.
Now, I started off as usual:
//global vars,class declaration skipped
public WebObject(String url_string, int connection_timeout, int read_timeout, boolean redirects_allowed, String user_agent)
throws java.net.MalformedURLException, java.io.IOException {
// Open a URL connection
java.net.URL url = new java.net.URL(url_string);
java.net.URLConnection uconn = url.openConnection();
if (!(uconn instanceof java.net.HttpURLConnection)) {
throw new java.lang.IllegalArgumentException("URL protocol must be HTTP");
}
conn = (java.net.HttpURLConnection) uconn;
conn.setConnectTimeout(connection_timeout);
conn.setReadTimeout(read_timeout);
conn.setInstanceFollowRedirects(redirects_allowed);
conn.setRequestProperty("User-agent", user_agent);
}
public void executeConnection() throws IOException {
try {
is = conn.getInputStream(); //global var
l = conn.getContentLength(); //global var
} catch (Exception e) {
//handling code skipped
}
}
//getContentStream and getContentLength methods, which just return 'is' and 'l', are skipped
Here is where the fun part began. I ran some profiling (using System.currentTimeMillis()) to find out what takes long and what doesn't. The call to this method takes only 200 ms on average:
public InputStream getWebPageAsStream(int battle_id, int page) throws Exception {
String url = "http://api.erepublik.com/v1/feeds/battle_logs/" + battle_id + "/" + page;
WebObject wobj = new WebObject(url, 10000, 10000, true, "Mozilla/5.0 "
+ "(Windows; U; Windows NT 5.1; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3 ( .NET CLR 3.5.30729)");
wobj.executeConnection();
l = wobj.getContentLength(); // global variable
return wobj.getContentStream(); //returns 'is' stream
}
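The timings quoted in this post came from System.currentTimeMillis() calls around each step. A minimal sketch of that kind of measurement follows, assuming a hypothetical helper class `PhaseTimer` (it is not part of the program above) and using System.nanoTime(), which is better suited to measuring elapsed intervals:

```java
import java.util.concurrent.Callable;

final class PhaseTimer {
    /**
     * Runs the task, prints how long it took in milliseconds,
     * and returns whatever the task returned.
     */
    static <T> T timed(String label, Callable<T> task) throws Exception {
        long start = System.nanoTime();
        try {
            return task.call();
        } finally {
            // Reported even if the task throws, so failed phases are visible too.
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(label + ": " + elapsedMs + " ms");
        }
    }
}
```

For example, `InputStream is = PhaseTimer.timed("fetch", () -> getWebPageAsStream(10000, 0));` would print the time until the stream is handed back, which is not necessarily the time until the whole response body has been read.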
200 ms is quite expected for a network operation, and I am fine with it. BUT when I process the InputStream in any way (read it into a String, use the Java XML parser, or copy it into a byte[] for a ByteArrayInputStream), the process takes over 1000 ms!
For example, this code takes 1000 ms if I pass the stream 'is' that I got above from getContentStream() directly to this method:
public static Document convertToXML(InputStream is) throws ParserConfigurationException, IOException, SAXException {
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(is);
doc.getDocumentElement().normalize();
return doc;
}
This code also takes around 920 ms if the initial InputStream 'is' is passed in (don't read too much into the code itself; it just extracts the data I need by directly counting characters, which can be done thanks to the rigid API feed format):
public static parsedBattlePage convertBattleToXMLWithoutDOM(InputStream is) throws IOException {
// Point A
// The feed is UTF-8, so name the charset instead of relying on the platform default
BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"));
LinkedList ll = new LinkedList();
String str = br.readLine();
while (str != null) {
ll.add(str);
str = br.readLine();
}
if (((String) ll.get(1)).indexOf("error") != -1) {
return new parsedBattlePage(null, null, true, -1);
}
//Point B
Iterator it = ll.iterator();
it.next();
it.next();
it.next();
it.next();
String[][] hits_arr = new String[1000][4];
String t_str = (String) it.next();
String tmp = null;
int j = 0;
for (int i = 0; t_str.indexOf("time") != -1; i++) {
hits_arr[i][0] = t_str.substring(12, t_str.length() - 11);
tmp = (String) it.next();
hits_arr[i][1] = tmp.substring(14, tmp.length() - 9);
tmp = (String) it.next();
hits_arr[i][2] = tmp.substring(15, tmp.length() - 10);
tmp = (String) it.next();
hits_arr[i][3] = tmp.substring(18, tmp.length() - 13);
it.next();
it.next();
t_str = (String) it.next();
j++;
}
String[] b_info_arr = new String[9];
int[] space_nums = {13, 10, 13, 11, 11, 12, 5, 10, 13};
for (int i = 0; i < space_nums.length; i++) {
tmp = (String) it.next();
b_info_arr[i] = tmp.substring(space_nums[i] + 4, tmp.length() - space_nums[i] - 1);
}
//Point C
return new parsedBattlePage(hits_arr, b_info_arr, false, j);
}
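The hard-coded substring offsets above depend on the exact indentation and tag names of the feed. A less brittle variant is sketched below, under the assumption that each line holds a single `<tag>value</tag>` element. The class name `LineFields` and the method `textOf` are hypothetical and not part of the program above. It locates the delimiters instead of counting characters:

```java
final class LineFields {
    /**
     * Returns the text between the first '>' and the next '<' on the line,
     * or null if the line does not look like <tag>value</tag>.
     * Does not decode XML entities or handle CDATA sections.
     */
    static String textOf(String line) {
        int open = line.indexOf('>');
        if (open < 0) {
            return null;
        }
        int close = line.indexOf('<', open + 1);
        if (close < 0) {
            return null;
        }
        return line.substring(open + 1, close);
    }
}
```

This costs two indexOf scans per line instead of none, so whether it is worth it depends on how stable the feed layout really is.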
I tried replacing the default BufferedReader with:
BufferedReader br = new BufferedReader(new InputStreamReader(is), 250000);
This didn't change much. My second try was to replace the code between A and B with: Iterator it = IOUtils.lineIterator(is, "UTF-8");
Same result, except this time A-B was 0 ms and B-C was 1000 ms, so every call to it.next() must have been consuming significant time (IOUtils is from the Apache Commons IO library).
And here is the culprit: the time taken to read the stream into strings, whether by an iterator or a BufferedReader, was about 1000 ms in ALL cases, while the rest of the code took 0 ms (i.e. negligible). This means that reading the stream into a LinkedList, or iterating over it, was for some reason eating up a lot of time. The question was: why? Is it just the way Java is made? That seemed unlikely, so I did another experiment.
In my main method, after the call to getWebPageAsStream(), I added:
//Point A
ba = new byte[l]; // 'l' comes from wobj.getContentLength above
offset = 0;
// 'is' is our original URLConnection InputStream; read() may return fewer bytes than requested
while (offset < l) {
bytesRead = is.read(ba, offset, l - offset);
if (bytesRead == -1) {
break; // stream ended before 'l' bytes arrived
}
offset += bytesRead;
}
//Point B
InputStream is2 = new ByteArrayInputStream(ba);
//Now just working with 'is2' - the "copied" stream
The InputStream->byte[] conversion again took 1000 ms. This is the way many people suggested reading an InputStream, and still it is slow. And guess what: the two parser methods above (convertToXML() and convertBattleToXMLWithoutDOM()), when passed 'is2' instead of 'is', took under 50 ms to complete in all 4 cases.
I read a suggestion that the stream waits for the connection to close before unblocking, so I tried using HttpComponents Client 4.0 (http://hc.apache.org/httpcomponents-client/index.html) instead, but the initial InputStream took just as long to parse. For example, this code:
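The manual copy loop above depends on the content length being known up front, and getContentLength() can return -1 when the server does not send one. A sketch of a length-independent alternative is shown here. The class name `StreamCopy`, the initial capacity, and the 8 KB chunk size are arbitrary choices for illustration:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

final class StreamCopy {
    /** Reads the stream until end-of-stream and returns all bytes. */
    static byte[] readFully(InputStream in) throws IOException {
        // Initial capacity sized for a ~175,000-character page; it grows if needed.
        ByteArrayOutputStream out = new ByteArrayOutputStream(256 * 1024);
        byte[] chunk = new byte[8192];
        int n;
        // read() returns the number of bytes actually read, or -1 at end-of-stream
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }
}
```

The result can then be wrapped as before, e.g. `InputStream is2 = new ByteArrayInputStream(StreamCopy.readFully(is));`.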
public InputStream getWebPageAsStream2(int battle_id, int page) throws Exception {
String url = "http://api.erepublik.com/v1/feeds/battle_logs/" + battle_id + "/" + page;
HttpClient httpclient = new DefaultHttpClient();
HttpGet httpget = new HttpGet(url);
HttpParams p = new BasicHttpParams();
HttpConnectionParams.setSocketBufferSize(p, 250000);
HttpConnectionParams.setStaleCheckingEnabled(p, false);
HttpConnectionParams.setConnectionTimeout(p, 5000);
httpget.setParams(p);
HttpResponse response = httpclient.execute(httpget);
HttpEntity entity = response.getEntity();
l = (int) entity.getContentLength();
return entity.getContent();
}
took even longer to process (50 ms more for just the network), and the stream parsing times remained the same. Obviously the HttpClient and its parameters could be created once and reused instead of on every call (for faster network time), but the stream issue won't be affected by that.
So we come to the central problem: why does the initial URLConnection InputStream (or HttpClient InputStream) take so long to process, while any stream of the same size and content created locally is orders of magnitude faster? As far as I understand, the initial response is already somewhere in RAM, and I can't see any good reason why it is processed so slowly compared to when the same stream is just created from a byte[].
Considering I have to parse millions of entries and thousands of pages like that, a total processing time of almost 1.5 s/page seems way too long.
Any ideas?
P.S. Please ask if any more code is required. The only thing I do after parsing is create a PreparedStatement and insert the entries into JavaDB in batches of 1000+, and the performance is OK (~200 ms per 1000 entries). It could probably be optimized with more cache, but I didn't look into it much.