I'm running a nightly CPU-intensive Java application on an EC2 server (c1.xlarge), which has 8 cores and 7.5 GB RAM (running Linux / Ubuntu 9.10 Karmic, 64-bit).
The application is architected in such a way that a variable number of workers are constructed (each in their own thread) and fetch messages from a queue to process them.
Throughput is the main concern here, and performance is measured in processed msg / second. The app is NOT RAM-bound and, as far as I can see, not IO-bound (although I'm not a star in Linux; I'm using dstat to check IO load, which is pretty low, and CPU wait signals, which are almost non-existent).
I'm seeing the following when spawning different numbers of workers (worker threads):
1 worker: throughput 1.3 msg / sec / worker
2 workers: ~ throughput 0.8 msg / sec / worker
3 workers: ~ throughput 0.5 msg / sec / worker
4 workers: ~ throughput 0.05 msg / sec / worker
I was expecting a near-linear increase in throughput, but reality proves otherwise. 3 questions:
1. What might be causing the sub-linear performance going from 1 worker --> 2 workers and from 2 workers --> 3 workers?
2. What might be causing the (almost) complete halt when going from 3 workers to 4 workers? It looks like some kind of deadlock situation... (can this happen due to heavy context switching?)
3. How would I start measuring where the problems occur? My dev box has 2 CPUs and runs under Windows. What I normally do is attach a GUI profiler and check for threading issues, but the problem only really starts to manifest itself with more than 2 threads.
Some more background info:
Workers are spawned using Executors.newScheduledThreadPool.
A worker thread does calculations based on the msg (CPU-intensive). Each worker thread contains a separate persistQueue used to offload the writing to disk (and thus make use of CPU / IO concurrency):
    persistQueue = new ThreadPoolExecutor(1, 1, 100, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<Runnable>(maxAsyncQueueSize), new ThreadPoolExecutor.AbortPolicy());
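For reference, the workers themselves are started roughly like this (a simplified sketch, not the actual code; numWorkers, Worker and messageQueue are placeholder names):

    // sketch: one worker Runnable per thread, each owning its own persistQueue
    ScheduledExecutorService workerPool = Executors.newScheduledThreadPool(numWorkers);
    for (int i = 0; i < numWorkers; i++) {
        workerPool.submit(new Worker(messageQueue));
    }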
The flow (per worker) goes like this:
1. The worker thread puts the result of a msg in the persistQueue and gets on with processing the next msg.
2. The ThreadPoolExecutor (of which we have 1 per worker thread) contains only 1 thread, which processes all incoming data (waiting in the persistQueue) and writes it to disk (Berkeley DB + Apache Lucene).
The idea is that 1. and 2. can run concurrently for the most part, since 1. is CPU-heavy and 2. is IO-heavy.
It's possible for persistQueue to become full. It's bounded because otherwise a slow IO system might cause flooding of the queues and result in OOM errors (yes, it's a lot of data). In that case the worker thread pauses until it can write its content to persistQueue. A full queue hasn't happened yet on this setup (which is another reason I think the app is definitely not IO-bound).
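To make the flow concrete, here is a simplified sketch of one worker's loop (illustrative only, not the actual code; Message, Result, processMessage and writeToDisk are placeholders):

    public void run() {
        try {
            while (running) {
                Message msg = messageQueue.take();          // fetch the next msg
                final Result result = processMessage(msg);  // 1. CPU-intensive calculations

                boolean handedOff = false;
                while (!handedOff) {
                    try {
                        // 2. hand the result off to this worker's single persist thread
                        persistQueue.execute(new Runnable() {
                            public void run() {
                                writeToDisk(result);        // IO-heavy: Berkeley DB + Lucene
                            }
                        });
                        handedOff = true;
                    } catch (RejectedExecutionException e) {
                        // persistQueue is full (AbortPolicy): pause until there is room again
                        Thread.sleep(100);
                    }
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }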
Some last bits of info:
Workers are isolated from each other with respect to their data, except that:
They share some heavily used static final maps (used as caches; the maps are memory-intensive, so I can't keep them local to a worker even if I wanted to). The operations workers perform on these caches are iteration, lookups and contains checks (no writes, deletes, etc.).
These shared maps are accessed without synchronization (no need, right?).
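For illustration only (the map type and names are placeholders, not the real structures), the access pattern is roughly:

    // shared cache, populated once at startup, read-only afterwards
    static final Map<String, CachedEntry> CACHE = new HashMap<String, CachedEntry>();

    // typical unsynchronized access from a worker thread:
    CachedEntry entry = CACHE.get(key);           // lookup
    boolean known = CACHE.containsKey(key);       // contains check
    for (CachedEntry e : CACHE.values()) {        // iteration
        // read-only use of e
    }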
Workers populate their local data by selecting data from MySQL (based on keys in the received msg), so this is a potential bottleneck. However, it's mostly reads, the queried tables are optimized with indexes, and again it doesn't look IO-bound.
I have to admit that I haven't done much MySQL server tuning yet (in terms of config params), but I just don't think that's the problem.
Output is written to:
- Berkeley DB (using a memcached(b) client). All workers share 1 server.
- Lucene (using a home-grown low-level indexer). Each worker has a separate indexer.
Even with output writing disabled, the problems occur.
This is a huge post, I realize that, but I hope you can give me some pointers as to what this might be, or how to start monitoring / deducing where the problem lies.
Thanks, Geert-Jan