Hello SO

Well, it is often said that Java is about 10x faster than Python in terms of performance, and that's what I see in benchmarks too. But what really drags Java down is the JVM's startup time, which is painfully slow.

Here is a test I ran:

$ time xlsx2csv.py Types\ of\ ESI\ v2.doc-emb-Package-9
...
<output skipped>
real    0m0.085s
user    0m0.072s
sys     0m0.013s


$ time java -jar -client /usr/local/bin/tika-app-0.7.jar -m Types\ of\ ESI\ v2.doc-emb-Package-9

real    0m2.055s
user    0m2.433s
sys     0m0.078s

Same file, a 12 KB MS XLSX file embedded inside a DOCX, and Python is roughly 24x faster!! WTH!!

It takes 2.055 seconds for Java, which is a joke.

I know it is all down to startup time, but I need to call it from a script to parse some documents, and I do not want to reinvent the wheel in Python.

But for parsing 10k+ files, that is just not practical: at about 2 seconds each, 10,000 files would take more than five hours.

Is there any way to speed it up? (I already tried the -client option, and it only sped things up a little, about 20%.)

Another idea: run it as a long-running daemon and communicate with it locally over UDP or Linux IPC sockets?
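
To make the daemon idea concrete, here is a minimal sketch of the request/reply pattern over a Unix domain socket. Both ends are written in Python purely so the example is self-contained; in the real setup the server side would be the long-running JVM process. The names `ParseHandler`, `start_daemon`, and `request_parse`, the one-line-per-request protocol, and the placeholder reply are all assumptions made for illustration.

```python
import os
import socket
import socketserver
import tempfile
import threading


class ParseHandler(socketserver.StreamRequestHandler):
    """Handle one request: read a file path, reply with one result line."""

    def handle(self):
        path = self.rfile.readline().decode("utf-8").rstrip("\n")
        # Placeholder: a real daemon would run the expensive parser here.
        reply = "parsed:" + path
        self.wfile.write((reply + "\n").encode("utf-8"))


def start_daemon(socket_path):
    """Bind the server to socket_path and serve in a background thread."""
    server = socketserver.UnixStreamServer(socket_path, ParseHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server


def request_parse(socket_path, file_path):
    """Client side: send one path and return the daemon's one-line reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall((file_path + "\n").encode("utf-8"))
        with sock.makefile("r", encoding="utf-8") as reader:
            return reader.readline().rstrip("\n")
```

The daemon would be started once, and the calling script would then invoke `request_parse(socket_path, some_file)` for each document, so the startup cost is paid a single time rather than per file.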

+2  A: 

Try Nailgun.

Note: I don't use it personally.

Zan Lynx
Sounds perfect!! That's what I need!! Let me try it out and I'll let you know.
V3ss0n
PERFECT solution for me. I tested it and was amazed at how simple it is: without ever needing to write a single line of Java code, it directly gives you a client-server long-running process! Nailgun rocks!
V3ss0n
+2  A: 

Um... write the documents to a directory (if they're not already) and have the Java program process all of them in one go?
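
Seen from the calling script, that idea could look roughly like the sketch below: collect the file paths and start one JVM per batch instead of one per file. It assumes the Java tool accepts several file paths in a single run, which is not verified here for tika-app-0.7; `chunked`, `build_batch_command`, `run_in_batches`, and the batch size of 500 are hypothetical names and values chosen for illustration.

```python
import subprocess


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_batch_command(jar_path, files):
    """Build one JVM invocation covering several files.

    Assumption: the Java tool accepts multiple file paths per run.
    """
    return ["java", "-jar", jar_path, "-m"] + list(files)


def run_in_batches(jar_path, files, batch_size=500, runner=subprocess.run):
    """Start one JVM per batch, so startup cost is paid once per batch."""
    results = []
    for batch in chunked(files, batch_size):
        command = build_batch_command(jar_path, batch)
        results.append(runner(command, capture_output=True, text=True))
    return results
```

The output of each batch would still have to be split back into per-file results, which this sketch leaves out.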

Michael Borgwardt
The problem is that every time a file is parsed, the result needs to be communicated back (for processing and storing in the DB), so that doesn't work for me. Thanks though, I had already considered this option.
V3ss0n
A: 

There are lots of ways to do this - basically anything will work providing it keeps the JVM alive for the duration of all of your batch processing.

e.g., why not just alter the Java program to loop through all the files and process them all in one invocation of the JVM?

Or you could build a simple GUI application in Swing and have some visual way to run the batch (e.g. select target directories, then press "Process All..." button).

Or you could use a Clojure REPL as a way to script the execution of the appropriate Java job....

Or you could create a server process with something like Netty and send all your files through that....

mikera
Thanks, but what I am doing is server-side: an AJAX web app. Yes, I already have a "process all" button, a directory browser, a search engine, everything already written in Python (the search engine is Sphinx, in C).
V3ss0n
+1  A: 

For everyone who is having Java startup performance trouble in their scripts, Nailgun fixes it for you! Nailgun saved my day!
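
For reference, calling the Nailgun client from a Python script might look something like this rough sketch. It assumes the `ng` client is on the PATH, that a Nailgun server is already running with the Tika jar on its classpath, and that the Tika command-line main class is `org.apache.tika.cli.TikaCli`; the helper names `build_ng_command` and `extract_metadata` are made up for this example.

```python
import subprocess

# Assumption: main class of the Tika command-line app inside tika-app-0.7.jar.
TIKA_MAIN_CLASS = "org.apache.tika.cli.TikaCli"


def build_ng_command(file_path, main_class=TIKA_MAIN_CLASS, ng_binary="ng"):
    """Build the client command: ng <main class> <args...>."""
    return [ng_binary, main_class, "-m", file_path]


def extract_metadata(file_path, runner=subprocess.run):
    """Ask the already-running JVM for metadata and return its stdout."""
    completed = runner(
        build_ng_command(file_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the client exits non-zero
    )
    return completed.stdout
```

Each call only starts the small native client, so the JVM startup cost is no longer paid per file.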

V3ss0n